Introduction to Normalization

This chapter summarizes the process of normalizing non-standard Slovene words. A more detailed description is given in the Annotation Guidelines chapter.

For Slovene tweets, normalization was carried out at the same time as tokenization, as shown in Table 1.
During the manual revision, five types of corrections were identified:

Token    Tokenization    Normalization
zato
tukó                     tako
nauta                    ne l bosta
s        s’m             sem
         $0              $0
m        $0              $0
pršva                    prišla

Table 1: Normalization and tokenization of a tweet.
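
The rows of Table 1 could be read programmatically along the following lines. This is only a minimal sketch, not the tooling used for the corpus. It assumes that an empty cell means the column was left unchanged, and that `$0` marks a token already covered by the entry of an earlier row (as in `s`, `s’m`, `sem`). Both are interpretations of the table rather than documented rules. The names `ROWS`, `PLACEHOLDER` and `normalized_forms` are hypothetical, and the apostrophe token in the fourth row is assumed, since the table does not show it. The `nauta` row is left out.

```python
# Hypothetical encoding of some rows of Table 1 as
# (token, tokenization, normalization); "" stands for an empty cell.
ROWS = [
    ("zato", "", ""),
    ("tukó", "", "tako"),
    ("s", "s’m", "sem"),
    ("’", "$0", "$0"),   # token assumed; the table does not show it
    ("m", "$0", "$0"),
    ("pršva", "", "prišla"),
]

# Assumption: "$0" means the token is covered by an earlier row.
PLACEHOLDER = "$0"


def normalized_forms(rows):
    """Return the normalized word forms, one per resulting word."""
    forms = []
    for token, _tokenization, normalization in rows:
        if normalization == PLACEHOLDER:
            # Skip tokens that were merged into a preceding row.
            continue
        # An empty normalization cell is taken to mean "keep the token".
        forms.append(normalization or token)
    return forms


print(" ".join(normalized_forms(ROWS)))
```

Under these assumptions, the three original tokens `s`, `’` and `m` contribute a single normalized word, `sem`, while `zato` passes through unchanged.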


Revision #2
Created 13 November 2023 12:56:45 by Tina Munda
Updated 30 November 2023 12:12:49 by Tina Munda