
Introduction to Segmentation

This chapter summarizes the annotation guidelines for sentence segmentation.

The main guideline for demarcating sentences is a combination of final punctuation, space, and a capitalized word. This is supplemented with additional rules that cover abbreviations. These are written with a period, which can also serve as final punctuation (when the abbreviation is at the end of a sentence, e.g., 'itd.') or not (when the abbreviation is in the middle of a sentence, for instance 'itj.'). The final list of abbreviations that fall into either category is included in the Obeliks tool.
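The baseline rule above can be sketched in a few lines of Python. This is only an illustration, not the Obeliks implementation: the function name `split_sentences`, the regular expression, and the tiny abbreviation set are assumptions made for the example ('itj.' comes from the text above; 'npr.' is added hypothetically), and the real abbreviation lists are those included in Obeliks.

```python
import re

# Hypothetical, illustrative list of abbreviations that do NOT end a sentence.
# The authoritative lists are the ones shipped with the Obeliks tool.
NON_FINAL_ABBREVIATIONS = {"itj.", "npr."}

# Candidate boundary: final punctuation, then whitespace, then a capitalized word.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZČŠŽ])")


def split_sentences(text):
    """Split text at candidate boundaries unless the last token is a non-final abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        last_token = candidate.split()[-1].lower()
        if last_token in NON_FINAL_ABBREVIATIONS:
            # The period belongs to a mid-sentence abbreviation: no boundary here.
            continue
        sentences.append(candidate)
        start = match.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Kupil je kruh, mleko itd. Potem je šel domov."))
print(split_sentences("Videl je npr. Ljubljano in Maribor."))
```

In this sketch 'itd.' is allowed to close a sentence, while 'npr.' followed by a capitalized word is not treated as a boundary.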

For segmenting Slovene non-standard texts, additional rules apply:

• Check whether the automatic sentence segmentation is correct across the entire tweet. In these guidelines, the end of a sentence is marked with the symbol ¶ for easier understanding.

• If part of the tweet functions as an independent sentence, it is treated as such (“@multikultivator Najbrž ne . ¶ :) ¶ Kot rečeno : bolje BO . ¶ Zrihtamo , ko utegnemo . ¶ ( PS : tudi v veselje " konkurence " ;)”).

• The main criterion for the end of a sentence is sentence-final punctuation, e.g., a period, exclamation point, question mark, quotation mark, or ellipsis (“Kaj praviš ? ¶ Aha !”).

• Unless there is a good reason to treat something as two sentences, it should remain one (“@urosgruber pri meni naloži CSS .. kar pa ne pomeni , da stran zgleda lepo :)” → one sentence, because the dots act more like a comma than a period).

• The end of a tweet is automatically also the end of a sentence, so this is not marked.
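For anyone processing the examples in these guidelines programmatically, the ¶ convention is easy to read back into a list of sentences. The helper below is a minimal sketch; the name `sentences_from_marked` is hypothetical and not part of any tool mentioned here. Because the end of a tweet is not marked, the text after the last ¶ is simply taken as the final sentence.

```python
def sentences_from_marked(example, marker="¶"):
    """Return the sentences of a guideline example annotated with the end-of-sentence marker."""
    # The end of the tweet is an implicit boundary, so the last chunk needs no marker.
    return [part.strip() for part in example.split(marker) if part.strip()]


print(sentences_from_marked("Kaj praviš ? ¶ Aha !"))
```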

Complex cases:

• Three dots:

■ Usually they are final punctuation (“@SLO_Super_Visor po moje se jo izogiba kot hudič križa. ¶ Glavn da on spet laja … ¶ :-)))))”).

■ Sometimes three dots indicate just an ellipsis or a pause in the middle of a sentence; in such a case, they are not final punctuation (“To se mi zdi ... neumno.”).

• Names (@name), emoticons (\o/) or emojis (👳), and hashtags (#hashtag):

■ If they appear in the middle of a sentence, they are part of the sentence (“neka baka :) uleti pa praša če loh gre kr naprej”, “sej #tarca je pa dons kr ok”, “sej je rekla @Sandra d je treba to drgac”).

■ If they appear at the beginning of a tweet, they are treated as part of the first sentence (“@TadejTrcekTITO @lucijausaj @JJansaSDS titek, ne seri. odv. častno razsodišče je JE zgolj za odvetnike.”).

■ If they replace the final punctuation, they mark the end of a sentence (“kot da je to važn :)) ¶ nobenga to ne briga vec sploh”).

■ If they follow the final punctuation, they are treated as an independent sentence (“Sonce, sneg in pot pod noge! ¶ :) ¶ Gremo v hribe!”).

■ If several names, emoticons, or hashtags are strung together at the end of a sentence, the last element is considered the end of the sentence (“itak ne morm sploh keša dvignt :) @tibonalta #broke” → the end of the sentence is the hashtag #broke).
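The mechanical part of these rules can be sketched as follows. This is a hedged illustration over whitespace-separated tokens, not the procedure used by any existing tool: the names `SPECIAL`, `is_special`, and `segment_tokens` are hypothetical, the emoticon pattern covers only a few common shapes (no emojis), and every token ending in `.`, `!`, `?`, or `…` is assumed to be final punctuation. Cases that need an annotator's judgment, such as an emoticon that replaces final punctuation or dots that act like a comma, are deliberately not handled.

```python
import re

# Hypothetical pattern for "special" tokens: @names, #hashtags, a few emoticon shapes.
SPECIAL = re.compile(r"^(?:@\w+|#\w+|[:;=]-?[()DPp]+|\\o/)$")


def is_special(token):
    return bool(SPECIAL.match(token))


def segment_tokens(tokens):
    """Group tokens into sentences using only the deterministic rules."""
    sentences = []
    current = []
    standalone = False   # current is a run of special tokens that followed final punctuation
    prev_final = False   # previous token ended with final punctuation
    for token in tokens:
        special = is_special(token)
        if prev_final or (standalone and not special):
            # Close the sentence (or the standalone run) and start a new unit.
            sentences.append(" ".join(current))
            current = []
            standalone = prev_final and special
        current.append(token)
        # Simplifying assumption: any token ending in . ! ? … is final punctuation.
        prev_final = (not special) and token[-1] in ".!?…"
    if current:
        sentences.append(" ".join(current))
    return sentences


print(segment_tokens("Sonce, sneg in pot pod noge! :) Gremo v hribe!".split()))
print(segment_tokens("itak ne morm sploh keša dvignt :) @tibonalta #broke".split()))
```

Special tokens at the start or in the middle of a sentence stay attached to it, a run that follows final punctuation becomes its own sentence, and a trailing run without preceding final punctuation closes the sentence at its last element.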

Segmentation of spoken Slovene is currently done manually based on prosodically or semantically completed units.