Linguistic Annotation of Slovene Corpora

Overview of annotating Slovene corpora supported by CLARIN.SI

01 Tokenization

Tokenization is the process of dividing text into individual tokens (words, digits, punctuation). For the machine annotation of...

02 Segmentation

Segmentation is the process of dividing text into individual sentences. For the machine annotation of corpora in the Slovenian ...

03 Normalization

Computer-mediated communication (CMC) language significantly diverges from the standard language, posing challenges for current...

04 MULTEXT-East Morphosyntax

The MULTEXT-East framework for morphosyntactic annotation of text corpora defines character codes, referred to as MSD-tags (wit...

05 Lemmatization

When tagging text, each word form is assigned a lemma (the base form of the word), facilitating further processing in a unified...

06 JOS-SYN Syntax

The JOS-SYN system, which was crafted during the Linguistic Annotation of Slovene: Methods and Resources project (Erjavec et al...

07 Universal Dependencies

Universal Dependencies (UD) is an internationally harmonised annotation framework that aims to standardise the morphological an...

08 Named Entities

Named entities (NEs) are nouns and noun phrases that specifically designate a person, location, organisation or other distinct ...

09 Coreferences

Coreference occurs when several elements within a text (words, phrases, or entire sentences) point to the same entity in th...

10 Semantic Role Labeling

Semantic role labeling (SRL), also known as semantic annotation, is the process of attributing semantic roles, such as agent, p...

11 Developmental corpus Šolar

The Šolar annotation system, developed alongside the Slovene Šolar developmental corpus (Arhar Holdt et al. 2022), is designed ...

12 Slovene learner corpus KOST

The KOST annotation system was developed together with the KOST corpus of Slovene as a foreign language (Stritar Kučuk 2022) an...

13 Relations

Relation extraction refers to the process of identifying and categorizing semantic relationships between entities within a text...