# Annotation Guidelines

This chapter summarizes the annotation guidelines for tokenization.

⬥ Space is the principal separator for tokens.

⬥ Sequences of words that can be written either with or without a space without a change in meaning (e.g. **kdorkoli**, **kdor koli** “anybody, any body”) follow the same principle and become one or two tokens depending on the use of space.

⬥ During tokenization, all characters are assigned to one of two categories: word characters (W) and punctuation- or symbol-like characters (C).

⬥ C tokens are recognized on the basis of a predefined list of punctuation- and symbol-like characters included in the tokenizer (depending on the annotation system, e.g. Universal Dependencies or JOS/MULTEXT-East) and consist of single characters only. Sequences of two or more C characters (e.g. **?!**) are treated as sequences of separate C tokens.

⬥ If a string of alphanumeric characters between two spaces includes C characters, it is usually split into several tokens (e.g. **AC/DC** and **Micro$oft** are split into three tokens 'AC' '/' 'DC' and 'Micro' '$' 'oft').

⬥ However, the following exceptions, in which C characters become parts of W tokens, apply:

&nbsp;&nbsp;&nbsp;&nbsp;￭ Apostrophe becomes part of a W token if used without space on both sides (e.g. **O’Brian** "O’Brian", **mor’va** "we have to").

&nbsp;&nbsp;&nbsp;&nbsp;￭ Comma and colon become part of a W token if used without space on both sides and if the string contains only digits (e.g. **30:00**, **200,000,000**).

&nbsp;&nbsp;&nbsp;&nbsp;￭ Hyphen becomes part of a W token if used without space on both sides and if:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;• the left part is an acronym (in capital letters), a single letter or a digit, and

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;• the right part is an affix or an inflectional ending; a finite list of possible affixes and endings is integrated in the tokenizer, e.g. **OZN-ovski** "similar to United Nations", **a-ju** "to the letter a", **15-i** "the 15th".

&nbsp;&nbsp;&nbsp;&nbsp;￭ Dot becomes part of a W token if it is:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;• used without space on both sides and the string contains only digits, e.g. **1.2**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;• used without space on the left and is part of an abbreviation or ordinal number (e.g. **dr.**, **4.**, **IV.**); a finite list of possible abbreviations is integrated in the tokenizer.

&nbsp;&nbsp;&nbsp;&nbsp;￭ All C characters become part of a single W token in strings recognized as URLs or addresses using a regular expression.
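
The rules above can be illustrated with a minimal Python sketch. It is not the actual tokenizer: the names `SUFFIXES`, `ABBREVIATIONS`, `URL_RE`, `tokenize_chunk` and `tokenize` are hypothetical, the two lists are tiny stand-ins for the lists integrated in the tokenizer, every non-alphanumeric character is assumed to be a C character, "addresses" is assumed to mean e-mail addresses, and the exceptions are only checked against whole space-delimited strings.

```python
import re

# Hypothetical stand-ins for the finite lists integrated in the real tokenizer.
SUFFIXES = ("ovski", "ju", "i")
ABBREVIATIONS = {"dr.", "prof.", "npr."}

# Assumed, simplified patterns (the real regular expressions are not given here).
URL_RE = re.compile(r"(?:https?://|www\.)\S+|\S+@\S+\.\S+")
NUMBER_RE = re.compile(r"\d+(?:[.,:]\d+)+")           # 1.2, 30:00, 200,000,000
ORDINAL_RE = re.compile(r"(?:\d+|[IVXLCDM]+)\.")      # 4., IV.
# Left part: acronym, single letter or digits; right part: a listed affix/ending.
HYPHEN_RE = re.compile(
    r"(?:[A-ZČŠŽ]+|[^\W\d_]|\d+)-(?:" + "|".join(SUFFIXES) + ")"
)


def tokenize_chunk(chunk):
    """Split one space-delimited string into W and C tokens."""
    # Exceptions under which the whole string stays a single W token.
    if (URL_RE.fullmatch(chunk) or NUMBER_RE.fullmatch(chunk)
            or HYPHEN_RE.fullmatch(chunk) or ORDINAL_RE.fullmatch(chunk)
            or chunk.lower() in ABBREVIATIONS):
        return [chunk]
    tokens, word = [], ""
    for i, ch in enumerate(chunk):
        # An apostrophe with no space on either side stays inside the W token.
        inner_apostrophe = ch in "'’" and 0 < i < len(chunk) - 1
        if ch.isalnum() or inner_apostrophe:
            word += ch
        else:
            if word:
                tokens.append(word)
                word = ""
            tokens.append(ch)  # every C character is its own token
    if word:
        tokens.append(word)
    return tokens


def tokenize(text):
    """Space is the principal separator; each chunk is then split further."""
    return [tok for chunk in text.split() for tok in tokenize_chunk(chunk)]
```

Under these assumptions, `tokenize("AC/DC")` returns `['AC', '/', 'DC']`, while `tokenize("OZN-ovski 30:00 dr.")` keeps each of the three strings as a single token.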

Whether a token is followed by a space (e.g. **d.o.o.** vs. **d. o. o.**) is recorded in the MISC column: a token that is not followed by a space receives the SpaceAfter=No feature.
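
As a rough illustration of how this feature could be derived, the sketch below aligns a token list with the original text and returns one MISC value per token. The function name `misc_column` is hypothetical, the token split of **d.o.o.** into 'd.' 'o.' 'o.' is assumed from the abbreviation rule above, and treating the end of the text like a following space is an assumption of this sketch. The underscore is the CoNLL-U placeholder for an empty column.

```python
def misc_column(text, tokens):
    """Return 'SpaceAfter=No' or '_' for each token, in order.

    Tokens are located in the text from left to right; str.index raises
    ValueError if a token cannot be found, i.e. if tokens and text disagree.
    """
    misc, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        pos = start + len(tok)
        # Assumption: the end of the text counts as "followed by a space".
        followed_by_space = pos >= len(text) or text[pos].isspace()
        misc.append("_" if followed_by_space else "SpaceAfter=No")
    return misc


# The two spellings yield the same tokens but different MISC values.
print(misc_column("d.o.o.", ["d.", "o.", "o."]))
print(misc_column("d. o. o.", ["d.", "o.", "o."]))
```

The first call marks the first two tokens with SpaceAfter=No, whereas the second call leaves all three MISC values empty.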