Linguistic Annotation of Slovene Corpora

01 Tokenization

Tokenization is the process of dividing text into individual tokens (words, digits, punctuation). For the machine annotation of...

02 Segmentation

Segmentation is the process of dividing text into individual sentences. For the machine annotation of corpora in the Slovenian ...

03 Normalization

Computer-mediated communication (CMC) language significantly diverges from the standard language, posing challenges for current...

04 MULTEXT-East Morphosyntax

The MULTEXT-East framework for morphosyntactic annotation of text corpora defines character codes, referred to as MSD-tags (wit...

05 Lemmatization

When tagging text, each word form is assigned a lemma (the base form of the word), facilitating further processing in a unified...

06 JOS-SYN Syntax

The JOS-SYN system, which was crafted during the Linguistic Annotation of Slovene: Methods and Resources project (Erjavec et al...

07 Universal Dependencies

Universal Dependencies (UD) is an internationally harmonised annotation framework that aims to standardize the morphological an...

08 Named Entities

Named entities (NEs) are nouns and noun phrases that specifically designate a person, location, organisation or other distinct ...

09 Coreferences

Coreference occurs when several elements within a text—be it words, phrases, or entire sentences—point to the same entity in th...

10 Semantic Role Labeling

Semantic role labeling (SRL), also known as semantic annotation, is the process of attributing semantic roles, such as agent, p...

11 Developmental corpus Šolar

The Šolar annotation system, developed alongside the Slovene Šolar developmental corpus (Arhar Holdt et al. 2022), is designed ...

12 Slovene learner corpus KOST

The KOST annotation system was developed together with the KOST corpus of Slovene as a foreign language (Stritar Kučuk 2022) an...

13 Relations

Relation extraction refers to the process of identifying and categorizing semantic relationships between entities within a text...
