Search Results

13 total results found

01 Tokenization

Tokenization is the process of dividing text into individual tokens (words, digits, punctuation). For the machine annotation of corpora in the Slovenian context, we currently use the CLASSLA-Stanza tagger, more precisely the Obeliks tokeniser included in it. T...

02 Segmentation

Segmentation is the process of dividing text into individual sentences. For the machine annotation of corpora in the Slovenian context, we currently use the CLASSLA-Stanza tagger, more precisely the Obeliks segmenter included in it. The rules guiding the aut...

03 Normalization

Computer-mediated communication (CMC) language significantly diverges from the standard language, posing challenges for current automatic text annotation tools. Normalization is essential for enhancing further text processing because it provides a standard equ...

04 MULTEXT-East Morphosyntax

The MULTEXT-East framework for morphosyntactic annotation of text corpora defines character codes, referred to as MSD-tags (with 'MSD' standing for morphosyntactic description). For example, the "Ncmsn" tag represents a set of grammatical features "Noun Type=c...

05 Lemmatization

When tagging text, each word form is assigned a lemma (the base form of the word), facilitating further processing in a unified way. The lemmatization system was developed in the project JOS: Linguistic Annotation of Slovene (Holozan et al. 2008) and follows t...

06 JOS-SYN Syntax

The JOS-SYN system, which was crafted during the Linguistic Annotation of Slovene: Methods and Resources project (Erjavec et al. 2010) and later applied in the Communication in Slovene initiative (Krek et al. 2020), is designed to mark syntactic relations in S...

07 Universal Dependencies

Universal Dependencies (UD) is an internationally harmonised annotation framework that aims to standardize the morphological and syntactic tagging of texts across languages in order to foster the development of multilingual language technologies and contrastiv...

08 Named Entities

Named entities (NEs) are nouns and noun phrases that specifically designate a person, location, organisation or other distinct object existing in real space and time. In a broader sense, they can also include (possessive) adjectives derived from a person's nam...

09 Coreferences

Coreference occurs when several elements within a text—be it words, phrases, or entire sentences—point to the same entity in the real world, outside of language itself. This entity, known as the referent, can represent a wide array of things, including but not...

10 Semantic Role Labeling

Semantic role labeling (SRL), also known as semantic annotation, is the process of attributing semantic roles, such as agent, patient, or location, to the semantic arguments defined by a predicate or verb within a sentence. For Slovene, the system of semantic-...

11 Developmental corpus Šolar

The Šolar annotation system, developed alongside the Slovene Šolar developmental corpus (Arhar Holdt et al. 2022), is designed for categorizing language corrections in texts written by pupils in Slovene primary schools and students in Slovene secondary schools...

12 Slovene learner corpus KOST

The KOST annotation system was developed together with the KOST corpus of Slovene as a foreign language (Stritar Kučuk 2022) and is designed for categorizing teachers' corrections in texts written by speakers of Slovene as a second or foreign language. The tag...

13 Relations

Relation extraction refers to the process of identifying and categorizing semantic relationships between entities within a text. This task is vital for understanding the structure and meaning of complex language data, and it has significant applications in var...