01 Tokenization
Tokenization is the process of segmenting text into individual tokens (words, numbers, punctuation marks). For the machine annotation of corpora in the Slovenian context, we currently use the CLASSLA-Stanza tagger, more precisely the Obeliks tokenizer included in it. The rules guiding the automatic tokenization are also adhered to during manual revision.
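As an illustration, the sketch below shows how the CLASSLA-Stanza pipeline can be run with only the tokenization step enabled, so that the output reflects the Obeliks tokenization used in annotation. The pipeline options and the sample sentence are assumptions for illustration; exact behaviour may differ between library and model versions.

```python
# Minimal sketch: tokenizing Slovenian text with CLASSLA-Stanza.
# Assumes the `classla` package and the Slovenian models are installed;
# option names may vary between versions.
import classla

# classla.download('sl')  # download the Slovenian models on first use

# Build a pipeline with only the tokenization processor enabled.
nlp = classla.Pipeline('sl', processors='tokenize')

# Hypothetical example sentence chosen for illustration.
doc = nlp('Dr. Novak je prišel ob 9.30 in nas pozdravil.')

# Print the tokens of each detected sentence.
for sentence in doc.sentences:
    print([token.text for token in sentence.tokens])
```

In a typical workflow, this automatically tokenized output is what annotators then review and correct manually according to the same rules.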