Linguistic Annotation of Slovene Corpora
Overview of annotating Slovene corpora supported by CLARIN.SI
01 Tokenization
Tokenization is the process of dividing text into individual tokens (words, digits, punctuation). For the machine annotation of...
02 Segmentation
Segmentation is the process of dividing text into individual sentences. For the machine annotation of corpora in the Slovenian ...
03 Normalization
Computer-mediated communication (CMC) language significantly diverges from the standard language, posing challenges for current...
04 MULTEXT-East Morphosyntax
The MULTEXT-East framework for morphosyntactic annotation of text corpora defines character codes, referred to as MSD-tags (wit...
05 Lemmatization
When tagging text, each word form is assigned a lemma (the base form of the word), facilitating further processing in a unified...
06 JOS-SYN Syntax
The JOS-SYN system, developed during the Linguistic Annotation of Slovene: Methods and Resources project (Erjavec et al...
07 Universal Dependencies
Universal Dependencies (UD) is an internationally harmonised annotation framework that aims to standardize the morphological an...
08 Named Entities
Named entities (NEs) are nouns and noun phrases that specifically designate a person, location, organisation or other distinct ...
09 Coreferences
Coreference occurs when several elements within a text (words, phrases, or entire sentences) point to the same entity in th...
10 Semantic Role Labeling
Semantic role labeling (SRL), also known as semantic annotation, is the process of attributing semantic roles, such as agent, p...
11 Developmental corpus Šolar
The Šolar annotation system, developed alongside the Slovene Šolar developmental corpus (Arhar Holdt et al. 2022), is designed ...
12 Slovene learner corpus KOST
The KOST annotation system was developed together with the KOST corpus of Slovene as a foreign language (Stritar Kučuk 2022) an...
13 Relations
Relation extraction refers to the process of identifying and categorizing semantic relationships between entities within a text...
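
To make the first two annotation levels above concrete, the following is a minimal sketch of tokenization (splitting text into words, digits and punctuation) followed by segmentation (grouping tokens into sentences). It is an illustration only and is not the tooling used for CLARIN.SI corpora: the regular expression, the function names `tokenize` and `segment`, and the rule that a sentence ends at ".", "!" or "?" are all assumptions made for this example. Real text needs more care with abbreviations, ordinals, URLs and similar cases.

```python
import re

# Assumed token pattern (hypothetical, for illustration only):
# a run of letters, a run of digits, or one punctuation character.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")

# Assumed sentence-final punctuation; abbreviations are not handled.
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Split text into word, digit and punctuation tokens."""
    return TOKEN_RE.findall(text)


def segment(tokens):
    """Group tokens into sentences, closing a sentence at ., ! or ?"""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    tokens = tokenize("Leta 2022 je izšel korpus. Kaj sledi?")
    for sentence in segment(tokens):
        print(sentence)
```

Running the tokenizer before the segmenter keeps the sentence rule simple, since it only has to inspect whole tokens rather than raw characters.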