08 Named Entities
Named entities (NEs) are nouns and noun phrases that specifically designate a person, location, organisation or other distinct object existing in real space and time, In a broader sense, they can also include (possessive) adjectives derived from a person's name, such as DERIV-PER[Obamova] izvolitev). In Slovene, named entities are typically indicated orthographically by capitalization (e.g., "Slovenska tiskovna agencija") or abbreviations (e.g., "STA"). It's important to note, however, that a capital letter or an abbreviation doesn't always signify a named entity (e.g. the Slovene acronym BDP, translated to 'GDP' in English, represents a common noun phrase). The ability to accurately identify named entities in text plays a crucial role in numerous natural language processing tasks, including information extraction, coreference resolution, sentiment analysis, and more.
Introduction to Labels
This chapter summarises labels for named entities (NEs). A more detailed presentation can be found in the guidelines in the Annotation Guidelines chapter.
Category | Subcategory | Examples | Doesn't belong in the category |
---|---|---|---|
PER some white text | Person (name and/or surname) | Janez Novak, da Vinci, Ludvik XIV. | dr., gospa, sv. |
Pet name | Fifi | ||
Artistic name, pseudonym | Madonna, mati Terez(ij)a, Banksy | ||
Fictional characters (from books, films etc.) | Ana Karenina, Rdeča kapica | ||
Nicknames | (Boštjan Gorenc -) Pižama, Zvezdica89 | ||
Named group of people (placerelated or family name) | Angleži, Nemec, Ljubljančan; Novakovi | ||
Twitter mentions | @pizama, @Nike | ||
DERIV-PER some white text | Personal possessive adjectives | Novakov (pes) | Alzheimerjeva (bolezen) |
ORG | Organizations | EU, Nato, Rimskokatoliška cerkev | parlament, vlada |
Companies | Microsoft, Pasadena d.o.o. | ||
Airport operators | Aerodrom Ljubljana | Letališče Jožeta Pučnika | |
Educational institutions | Filozofska fakulteta | ||
Institutes | Institut “Jožef Stefan” | ||
Museums, libraries | Prirodoslovni muzej | ||
Theatres, cinemas etc. | MGL, Kinodvor | ||
Media (TV, radio, newspaper etc.) | Dnevnik, Delo, Radio Center | ||
Restaurants, hotels, bars, pubs etc. | Kavarna Zvezda, [hH]otel Lev | ||
Healthcare facilities | [zZ]dravstveni dom Ribnica | ||
Music bands and other art-related groups | U2, Beatli, [aA]nsambel Avsenik | ||
Other public and private institutions | [oO]bčina Piran, NPK | ||
Political parties, civic societies, NGOs | DeSUS, Zveza potrošnikov Slovenije | ||
Sports clubs, associations | (HDD SIJ) Acroni Jesenice, (FC) Barcelona | ||
Cultural organizations (also amateur) | [mM]ešani pevski zbor Divača | ||
LOC | Celestial bodies (planets, comets etc.) | Mars, Andromeda, Halleyjev komet | |
Continents | Južna Amerika | ||
Countries, provinces, lands (historic and modern) | Slovenija, Združene države (Amerike) | EU | |
Regions | Primorska, Valonija, Nova Anglija | ||
Cities and settlements (including parts) | Ljubljana, Šiška, Vrhnika, Na klancu | ||
Streets, squares | Jamova cesta 39 | A2, gorenjska AC | |
Shopping centres | Citypark, Supernova | ||
Airports | Letališče Jožeta Pučnika | ||
Churches (named building) | [cC]erkev sv. Nikolaja | Rimskokatoliška cerkev | |
Local sights (cultural, natural) | Tromostovje, Triglavski narodni park | ||
Other named buildings (without org. structure) | [kK]ulturni dom Ljubno, WTC 2 | Cankarjev dom (ima org. strukturo, npr. direktorja) | |
Mountains, lakes, rivers and other named geographical objects | Triglav, Blejsko jezero, Sava, Logarska dolina | ||
MISC | Computer systems, programs, apps | Windows 10, Word, Android 5.1 Lollipop | .docx, pdf, OCR |
Titles of books, films, paintings and other works of art; titles of documents | Vojna in mir, Ko jagenjčki obmolknejo, Sopranovi, Guernica; Uradni list RS | ||
Registered names or models of products (cars, mobile phones, computers, games etc.) and other commercial products (brands) | Galaxy Note 7, Nokia Lumia 950, Toyota RAV4, Minecraft, Človek ne jezi se | ||
Titles of events | Oskarji, Zlata lisica, 10. mednarodna konferenca Jezikovne tehnologije | shod nacifašistov | |
Project names | Obzorje 2020 | ||
Stock market indices | SBI20, Dow Jones, Nasdaq | Bonitetne ocene (AAA) |
Annotation Guidelines
This chapter summarizes the annotation guidelines for named entity recognition (NER) as applied to Slovene texts. The guidelines are arranged from the latest, up-to-date version to the oldest version.
Version 1.1
Project Development of Slovene in a Digital Environment
ZUPAN, Katja; LJUBEŠIĆ, Nikola in ERJAVEC, Tomaž, 2023: Annotation guidelines for Slovenian named entities
Janes-NER: Version 1.1. Clean copy for the Development of Slovene in a Digital Environment project. [PDF]
References and Links
This chapter compiles relevant references and provides links to projects where named entity recognition (NER) has been developed and applied to Slovene texts.
Projects, in which the system has been developed
MUC-6 Named Entity Task Definition
CONLL 2003
BSNLP 2017 shared task
Janes - Resources, Tools and Methods for the Research of Nonstandard Internet Slovene
References
Marc Reznicek: Linguistische Annotation von Nichtstandardvarietäten / Guidelines und
„Best Practices" Guidelines NER (version 1.5). https://www.linguistik.huberlin.de/de/institut/professuren/korpuslinguistik/forschung/nosta-d/nosta-d-ner-1.5
LDC - Linguistic Data Consortium: ACE (Automatic Content Extraction) English Annotation Guidelines for Entities, Version 6.6 2008.06.13, http://projects.ldc.upenn.edu/ace (Accessed on 2 November 2020).