Search Results
13 total results found
01 Tokenization
Tokenization is the process of dividing text into individual tokens (words, digits, punctuation). For the machine annotation of corpora in the Slovenian context, we currently use the CLASSLA-Stanza tagger, more precisely the Obeliks tokeniser included in it. T...
02 Segmentation
Segmentation is the process of dividing text into individual sentences. For the machine annotation of corpora in the Slovenian context, we currently use the CLASSLA-Stanza tagger, more precisely the Obeliks segmenter included in it. The rules guiding the aut...
03 Normalization
Computer-mediated communication (CMC) language significantly diverges from the standard language, posing challenges for current automatic text annotation tools. Normalization is essential for enhancing further text processing because it provides a standard equ...
04 MULTEXT-East Morphosyntax
The MULTEXT-East framework for morphosyntactic annotation of text corpora defines character codes, referred to as MSD-tags (with 'MSD' standing for morphosyntactic description). For example, the "Ncmsn" tag represents a set of grammatical features "Noun Type=c...
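The positional nature of MSD-tags can be illustrated with a small decoder. This is only a sketch: the function name `decode_noun_msd` and the lookup table are hypothetical, the table covers just the first four noun attributes as commonly given for Slovene in the MULTEXT-East specifications, and it is not part of any tagger mentioned here.

```python
# Hypothetical lookup table: attribute name and value codes for each position
# after the category letter "N". Assumed subset of the Slovene noun attributes.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "l": "locative", "i": "instrumental",
    }),
]


def decode_noun_msd(tag):
    """Expand a noun MSD-tag such as "Ncmsn" into a feature dictionary."""
    if not tag or tag[0] != "N":
        raise ValueError(f"not a noun MSD-tag: {tag!r}")
    features = {"Category": "Noun"}
    # Each character after the category letter fills one positional attribute.
    for code, (name, values) in zip(tag[1:], NOUN_POSITIONS):
        if code == "-":
            continue  # a hyphen marks an attribute that does not apply
        features[name] = values.get(code, f"unknown({code})")
    return features


print(decode_noun_msd("Ncmsn"))
```

Under these assumptions, "Ncmsn" expands to a common masculine singular noun in the nominative case. A full decoder would need one table per part of speech, taken from the official specifications.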
05 Lemmatization
When tagging text, each word form is assigned a lemma (the base form of the word), facilitating further processing in a unified way. The lemmatization system was developed in the project JOS: Linguistic Annotation of Slovene (Holozan et al. 2008) and follows t...
06 JOS-SYN Syntax
The JOS-SYN system, which was crafted during the Linguistic Annotation of Slovene: Methods and Resources project (Erjavec et al. 2010) and later applied in the Communication in Slovene initiative (Krek et al. 2020), is designed to mark syntactic relations in S...
07 Universal Dependencies
Universal Dependencies (UD) is an internationally harmonised annotation framework that aims to standardize the morphological and syntactic tagging of texts across languages in order to foster the development of multilingual language technologies and contrastiv...
08 Named Entities
Named entities (NEs) are nouns and noun phrases that specifically designate a person, location, organisation or other distinct object existing in real space and time. In a broader sense, they can also include (possessive) adjectives derived from a person's nam...
09 Coreferences
Coreference occurs when several elements within a text—be it words, phrases, or entire sentences—point to the same entity in the real world, outside of language itself. This entity, known as the referent, can represent a wide array of things, including but not...
10 Semantic Role Labeling
Semantic role labeling (SRL), also known as semantic annotation, is the process of attributing semantic roles, such as agent, patient, or location, to the semantic arguments defined by a predicate or verb within a sentence. For Slovene, the system of semantic-...
11 Developmental corpus Šolar
The Šolar annotation system, developed alongside the Slovene Šolar developmental corpus (Arhar Holdt et al. 2022), is designed for categorizing language corrections in texts written by pupils in Slovene primary schools and students in Slovene secondary schools...
12 Slovene learner corpus KOST
The KOST annotation system was developed together with the KOST corpus of Slovene as a foreign language (Stritar Kučuk 2022) and is designed for categorizing teachers' corrections in texts written by speakers of Slovene as a second or foreign language. The tag...
13 Relations
Relation extraction refers to the process of identifying and categorizing semantic relationships between entities within a text. This task is vital for understanding the structure and meaning of complex language data, and it has significant applications in var...