12 Slovene learner corpus KOST
The KOST annotation system was developed together with the KOST corpus of Slovene as a foreign language (Stritar Kučuk 2022) and is designed for categorizing teachers' corrections in texts written by speakers of Slovene as a second or foreign language. The tagging system is organized hierarchically in two tiers: each correction is first assigned to a linguistic level and then characterized by the general type of correction or by the part of speech. The two-tier annotations allow for a robust analysis, which then has to be followed by a more detailed manual revision.
Introduction to Tags
This chapter summarizes the KOST tags. A more detailed presentation is given in the guidelines listed in the Annotation Guidelines chapter.
Tag | Linguistic level | Type of correction/part of speech |
---|---|---|
Z-LOC | orthography | punctuation |
Z-CRK | orthography | spelling |
Z-SN | orthography | joined or divided words |
Z-MV | orthography | capitalization |
Z-KR | orthography | abbreviation |
B-SAM | vocabulary | noun |
B-GLAG | vocabulary | verb |
B-ZAIM | vocabulary | pronoun |
B-PRID | vocabulary | adjective |
B-PRISL | vocabulary | adverb |
B-PRED | vocabulary | preposition |
B-VEZ | vocabulary | conjunction |
B-OST | vocabulary | other |
O-SAM | word form | noun |
O-GLAG | word form | verb |
O-ZAIM | word form | pronoun |
O-PRID | word form | adjective |
O-PRISL | word form | adverb |
O-OST | word form | other |
S-STR | syntax | structure |
S-BR | syntax | word order |
S-IZP | syntax | omission |
S-ODV | syntax | insertion |
POV | / | related correction |
[???] | / | incomprehensible, unclear correction |
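The two-tier structure of the table can be sketched as a simple lookup in Python. This is only an illustrative sketch, not part of the KOST tooling: the names `KOST_TAGS`, `describe` and `count_by_level` are hypothetical, and representing the "/" level of the POV and [???] tags as `None` is an assumption made here for convenience.

```python
from collections import Counter

# Tag inventory transcribed from the table above: tag -> (linguistic level, type).
# The two tags without a linguistic level ("/") use None (an assumption of this sketch).
KOST_TAGS = {
    "Z-LOC": ("orthography", "punctuation"),
    "Z-CRK": ("orthography", "spelling"),
    "Z-SN": ("orthography", "joined or divided words"),
    "Z-MV": ("orthography", "capitalization"),
    "Z-KR": ("orthography", "abbreviation"),
    "B-SAM": ("vocabulary", "noun"),
    "B-GLAG": ("vocabulary", "verb"),
    "B-ZAIM": ("vocabulary", "pronoun"),
    "B-PRID": ("vocabulary", "adjective"),
    "B-PRISL": ("vocabulary", "adverb"),
    "B-PRED": ("vocabulary", "preposition"),
    "B-VEZ": ("vocabulary", "conjunction"),
    "B-OST": ("vocabulary", "other"),
    "O-SAM": ("word form", "noun"),
    "O-GLAG": ("word form", "verb"),
    "O-ZAIM": ("word form", "pronoun"),
    "O-PRID": ("word form", "adjective"),
    "O-PRISL": ("word form", "adverb"),
    "O-OST": ("word form", "other"),
    "S-STR": ("syntax", "structure"),
    "S-BR": ("syntax", "word order"),
    "S-IZP": ("syntax", "omission"),
    "S-ODV": ("syntax", "insertion"),
    "POV": (None, "related correction"),
    "[???]": (None, "incomprehensible, unclear correction"),
}


def describe(tag):
    """Return (linguistic level, type) for a tag; raises KeyError if unknown."""
    return KOST_TAGS[tag.strip()]


def count_by_level(tags):
    """Count corrections per linguistic level (the first tier only)."""
    return Counter(KOST_TAGS[tag.strip()][0] for tag in tags)
```

For example, `describe("S-BR")` returns `("syntax", "word order")`, and `count_by_level` gives a coarse first-tier overview of a list of tags before any finer manual revision.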
Annotation Guidelines
This chapter summarizes the annotation guidelines for the KOST system as applied to Slovene texts. The guidelines are arranged from the latest, up-to-date version to the oldest version.
Version 1.0 (04-2022)
Project Development of Slovene in a Digital Environment
STRITAR KUČUK, Mojca, 2023: KOST 1.0: Priročnik za označevanje napak, delovna verzija. Različica 1.0. [PDF] - only in Slovene
References and Links
This chapter compiles relevant references and provides links to projects where the KOST system has been developed and applied to Slovene texts.
Projects in which the system has been developed:
Development of Slovene in a Digital Environment
Corpora containing manually revised KOST tags:
STRITAR KUČUK, Mojca, ŠTER, Helena, PISEK, Staša, PETRIC LASNIK, Ivana, KETE MATIČIČ, Jana, PIRIH SVETINA, Nataša, PREGLAU, Daniela, ARHAR HOLDT, Špela, KRSNIK, Luka, ERJAVEC, Tomaž, 2023, Slovene learner corpus KOST 1.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1753.
The CJVT Svala tool for manual annotation following the KOST system:
ARHAR HOLDT, Špela, KOSEM, Iztok, STRITAR KUČUK, Mojca, KRSNIK, Luka, JOVAN, Leon Noe, 2022: CJVT Svala (Kazalnik projekta Razvoj slovenščine v digitalnem okolju), v1.0, https://orodja.cjvt.si/svala/, accessed 2 March 2023.
References:
STRITAR KUČUK, Mojca, 2022: KOST med korpusi usvajanja tujega jezika. Obdobja 41: Na stičišču svetov: slovenščina kot drugi in tuji jezik. 323–334. https://centerslo.si/wp-content/uploads/2022/11/Stritar-Kucuk_Obdobja-41.pdf
ARHAR HOLDT, Špela, KOSEM, Iztok, STRITAR KUČUK, Mojca, 2022: Metode in orodja za lažjo pripravo korpusov usvajanja jezika. Obdobja 41: Na stičišču svetov: slovenščina kot drugi in tuji jezik. 23–30. https://centerslo.si/wp-content/uploads/2022/11/Arhar-Holdt-et-al_Obdobja-41.pdf
STRITAR KUČUK, Mojca, 2020: Modul Leto plus – prvi korak do korpusa slovenščine kot tujega jezika. Zbornik konference Jezikovne tehnologije in digitalna humanistika 2020. 131–135. http://nl.ijs.si/jtdh20/pdf/JT-DH_2020_StritarKucuk_Modul-Leto-plus%e2%80%93prvi-korak-do-korpusa-slovenscine-kot-tujega-jezika.pdf