
Introduction to Segmentation

This chapter summarizes the annotation guidelines for sentence segmentation.

The main guideline for demarcating sentences is a combination of final punctuation, a space, and a capitalized word. Additional rules cover abbreviations, which are written with a period. That period may also serve as final punctuation when the abbreviation ends a sentence (e.g., 'itd.'), or it may not when the abbreviation occurs mid-sentence (e.g., 'itj.'). The definitive list of abbreviations in each category is included in the Obeliks tool.
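The main guideline can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Obeliks implementation: the function name `split_sentences` and the small abbreviation set are hypothetical, and the real abbreviation lists are the ones shipped with Obeliks.

```python
import re

# Hypothetical sample of abbreviations whose period does NOT end a sentence.
# The real lists are maintained in the Obeliks tool.
NON_FINAL_ABBREVIATIONS = {"itj.", "npr."}

# Candidate boundary: final punctuation, whitespace, then a capitalized word.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZČŠŽ])")


def split_sentences(text):
    """Split text at candidate boundaries, skipping mid-sentence abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        last_token = candidate.split()[-1].lower()
        if last_token in NON_FINAL_ABBREVIATIONS:
            continue  # the period belongs to the abbreviation, not the sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences


for sentence in split_sentences("To je npr. Ljubljana. Kupil je kruh itd. Potem je odšel."):
    print(sentence)
```

Abbreviations such as 'itd.' need no special handling here: when they are followed by a space and a capitalized word, the general rule already treats their period as final punctuation.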

For segmenting Slovene non-standard texts, additional rules apply:

• Check that the automatic sentence segmentation is correct across the entire tweet. In these guidelines, the symbol ¶ marks the end of a sentence for easier understanding.

• If part of the tweet functions as an independent sentence, it is treated as such (“@multikultivator Najbrž ne . ¶ :) ¶ Kot rečeno : bolje BO . ¶ Zrihtamo , ko utegnemo . ¶ ( PS : tudi v veselje " konkurence " ;)”).

• The main criterion for the end of a sentence is a punctuation mark that functions as final punctuation, e.g., a period, exclamation point, question mark, quotation mark, or ellipsis (“Kaj praviš ? ¶ Aha !”).

• Unless there is a good reason to treat something as two sentences, it should remain one (“@urosgruber pri meni naloži CSS .. kar pa ne pomeni , da stran zgleda lepo :)” → one sentence, because the dots act more like a comma than a period).

• The end of a tweet is automatically also the end of a sentence, so this is not marked.

Complex cases:

• Three dots:

■ Usually they are final punctuation (“@SLO_Super_Visor po moje se jo izogiba kot hudič križa. ¶ Glavn da on spet laja … ¶ :-)))))”).
■ Sometimes they merely mark an ellipsis or a pause in the middle of a sentence; in that case they are not final punctuation (“To se mi zdi ... neumno.”).

• Names (@name), emoticons (\o/) or emojis (👳), and hashtags (#hashtag):

■ If they occur in the middle of a sentence, they are part of the sentence (“neka baka :) uleti pa praša če loh gre kr naprej”, “sej #tarca je pa dons kr ok”, “sej je rekla @Sandra d je treba to drgac”).

■ If they occur at the beginning of a tweet, they are treated as part of the first sentence (“@TadejTrcekTITO @lucijausaj @JJansaSDS titek, ne seri. odv. častno razsodišče je JE zgolj za odvetnike.”).

■ If they replace final punctuation, they mark the end of the sentence (“kot da je to važn :)) ¶ nobenga to ne briga vec sploh”).

■ If they follow final punctuation, they are treated as a separate sentence (“Sonce, sneg in pot pod noge! ¶ :) ¶ Gremo v hribe!”).

■ If several names, emoticons, or hashtags are strung together at the end of a sentence, the last element counts as the end of the sentence (“itak ne morm sploh keša dvignt :) @tibonalta #broke” → the sentence ends at the hashtag #broke).
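The mechanical part of the rules for names, emoticons, and hashtags can be sketched on pre-tokenized tweets. This is a rough illustration under several assumptions, not the annotation tool itself: the names `is_special`, `segment_tokens`, and `mark_boundaries` are hypothetical, the emoticon pattern is deliberately narrow (no emojis), and a run of such elements after final punctuation is assumed to form one standalone sentence. Whether an emoticon replaces final punctuation, or whether three dots end a sentence, remains an annotator's judgment and is not decided here.

```python
import re

# Tokens treated as unambiguous final punctuation in this sketch.
FINAL_PUNCTUATION = {".", "!", "?"}

# Assumed, narrow emoticon pattern, e.g. :) ;) :-))) :D
EMOTICON = re.compile(r"[:;=]-?[()DP]+")


def is_special(token):
    """True for @names, #hashtags, and simple emoticons."""
    return token.startswith(("@", "#")) or EMOTICON.fullmatch(token) is not None


def segment_tokens(tokens):
    """Group tokens into sentences (lists of tokens)."""
    sentences, current = [], []
    in_special_run = False
    for index, token in enumerate(tokens):
        special = is_special(token)
        if in_special_run and not special:
            # A standalone run of special tokens ends at its last element.
            sentences.append(current)
            current, in_special_run = [], False
        if special and not current and index > 0:
            # Special token right after final punctuation: separate sentence.
            in_special_run = True
        # Otherwise the token joins the current sentence; this covers special
        # tokens at the start of the tweet or in the middle of a sentence.
        current.append(token)
        if token in FINAL_PUNCTUATION:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)  # the end of the tweet ends the sentence
    return sentences


def mark_boundaries(tokens):
    """Render sentences with ¶ between them; the tweet end is left unmarked."""
    return " ¶ ".join(" ".join(sentence) for sentence in segment_tokens(tokens))


tweet = "Sonce , sneg in pot pod noge ! :) Gremo v hribe !"
print(mark_boundaries(tweet.split()))
```

Output from a sketch like this would still need the manual check described above, since cases such as “kot da je to važn :)) ¶ nobenga to ne briga vec sploh” cannot be separated from mid-sentence emoticons by these rules alone.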