Slovenian Lexicographic Datasets

Slovenian Lexicographic Datasets — `dict-conversions`

Two LM-ready datasets built from 15 legacy CJVT/DZS Slovenian dictionaries (14 bilingual + 1 monolingual encyclopedia), converted to OASIS DMLex 1.0 and HuggingFace Parquet, then enriched. This document describes both datasets, the pipeline that produced them, their schemas, and what to look for when using them.

Status: 805,279 entries · 1,019,685 senses · 2,174,880 translation pairs · 113,868 relations. All 15 dictionaries validate against the official DMLex XSD 1.1 + JSON Schema with 0 leaks, 0 parse errors, 0 residual glyph markers. 52 unit tests pass.

1. The two datasets at a glance

	Collection 1 — Core	Collection 2 — Enriched
Path	`dist/core/` (+ canonical `lexidma/`, `parquet/`)	`dist/enriched/` (extends Core)
Provenance	Intrinsic — derived only from the project's own dictionaries	Extrinsic — Core + external resources
External tools	none	CLASSLA-Stanza, sloWNet/OMW, English WordNet (oewn)
Size	~2.1 GB	~184 MB (layers only; use with Core)
Reproducible offline	yes	needs the external resources (one-time download)
Contents	DMLex XML/JSON, derived tables, 12 LM task JSONLs	silver morphology + MSD task, synset/ILI links, imported antonyms, candidate scoring, sloWNet-enriched KNAUR DMLex

Use them together. Enriched is a thin layer of external-resource columns/files that extends Core; it does not duplicate it.

2. Source corpus

sh = Serbo-Croatian (legacy unified tag). KNAUR is monolingual (definitions + cross-references, no translations). DVRUSL is reconstructed from a Word .doc (OCR-grade).

Code	Languages	Family	Entries	Senses	Pairs	Relations
DVANSL	en→sl	block_blankline	68,761	75,208	264,454	0
DVFRSL	fr→sl	line	39,980	70,817	175,458	0
DVITSL	it→sl	block_blankline	60,282	98,444	197,449	0
DVRUSL	ru→sl	doc_runs	32,844	32,844	80,364	0
DVSHSL	sh→sl	block_blankline	92,302	92,302	150,228	5,563
DVSPSL	es→sl	dzs_utf16	36,460	66,722	114,691	414
DVSLAN	sl→en	block_blankline	42,088	42,227	159,625	0
DVSLFR	sl→fr	line	33,866	44,092	128,561	34
DVSLNE	sl→de	german	84,951	90,374	252,074	120
DVSLSH	sl→sh	block_blankline	72,690	87,140	152,068	2,694
DVSLSP	sl→es	dzs_utf16	33,676	66,456	93,562	349
DRSLAN	sl→en	dzs_nested	25,559	33,273	75,374	457
VSIS	sl→it	block_geslo	90,675	116,223	275,043	7,764
LAT_AZ	la→sl	block_geslo	11,521	20,315	55,929	1,852
KNAUR	sl (mono)	encyclopedia	79,624	83,248	0	94,621

Several pairs exist in both directions (DVSLFR/DVFRSL, DVSLAN/DVANSL, DVSLSH/DVSHSL, DVSLSP/DVSPSL, VSIS+DVITSL) — exploited for translation-pivot synonyms and reverse-dictionary tasks.

3. The pipeline

                 ┌───────── BASE CONVERSION (intrinsic, faithful) ──────────┐
 source dicts ─► parser family ─► IR (model.py) ─► DMLex XML/JSON  (lexidma/)
   (8 families)   + textproc                     └► Parquet         (parquet/)  + reports/
                  (specialchars, markup,
                   controlled vocab)
                          │
                          ▼
      ┌──────── COLLECTION 1 · CORE (build/, intrinsic) ────────┐
      │ derive.py  : accent-folded lemma · dedup · leak-free split
      │ synsets.py : in-sense synonyms · translation pivots · KNAUR hypernyms · GWA relation typing
      │ tasks.py   : 12 LM task JSONLs                                   → dist/core/
      └─────────────────────────────────────────────────────────┘
                          │
                          ▼
      ┌──────── COLLECTION 2 · ENRICHED (build/enrich.py, extrinsic) ───────┐
      │ CLASSLA silver lemma/UPOS/MSD (+ msd_tagging task)
      │ sloWNet/OMW synset+ILI links · ILI-bridged antonyms
      │ candidate scoring vs sloWNet · sloWNet-typed DMLex relations (KNAUR) → dist/enriched/
      └─────────────────────────────────────────────────────────────────────┘

Stage 1 — Base conversion (src/dictconv/, command dictconv convert all). Each dictionary's publisher typesetting markup is tokenized (not XML-parsed — it is malformed), {…} escapes decoded to Unicode (side-aware accents; undecodable font/template codes removed and logged), qualifiers classified into a controlled vocabulary, and emitted as a source-agnostic intermediate representation. Serializers produce DMLex 1.0 XML+JSON (validated against the official OASIS XSD 1.1 / JSON Schema) and three Parquet artifacts. The conversion is faithful: it never invents content and flags rather than silently drops.

Stage 2 — Core (dictconv build-core). Adds derived, ML-oriented layers computable from our data alone: a normalized lemma/UPOS layer, exact/near-duplicate collapse, a leak-free train/dev/test split, synonym/pivot/hypernym candidate tables, and 12 instruction-style task JSONLs.

Stage 3 — Enriched (dictconv enrich). Adds layers that need outside resources, kept strictly separate from the intrinsic data (gold-vs-silver provenance is explicit).

4. Base conversion artifacts

`lexidma/<CODE>.xml` and `.json` — OASIS DMLex 1.0

Faithful lexicographic record. Camel-case elements; text of every object except headword in a nested <text>; crosslingual module (headwordTranslation, exampleTranslation); the linking module (relation + relationType definitions that carry a Global WordNet sameAs URI, e.g. see → https://globalwordnet.github.io/schemas/wn#also). Validated with a real XSD 1.1 processor (all identity constraints + assertions, except 3 cardinality-defective ones the published schema cannot satisfy). KNAUR uses the monolingual schema.

`parquet/<CODE>.entries.parquet` — one row per entry (nested)

dict_code, entry_id, headword,
lemma,            # accent-folded join key (Sloleks/Gigafida/CLASSLA/sloWNet keying)
accented_form,    # original tonal/accented display form (null if == headword)
homograph_number, source_lang, target_lang,
meta_lang,        # editorial metalanguage = "sl"
parts_of_speech[str], upos,    # UPOS from the entry's first POS (intrinsic static map)
frequency_band,   # DRSLAN corpus band 0..3 (null elsewhere)
labels[str], collocates[str],  # DRSLAN <KO> collocates
pronunciations[{text, scheme}],
inflected_forms[{text, tag}],
senses[{ sense_id, indicator, labels[str], definitions[str],
         headword_translations[{text, lang_code, parts_of_speech[str], labels[str]}],
         headword_explanations[{text, lang_code}],
         examples[{text, labels[str], translations[{text, lang_code, labels[str]}]}] }],
has_content,      # False => no senses / all senses empty (filter before LM use)
source_ref, raw   # provenance

`parquet/<CODE>.pairs.parquet` — one row per (source,target) unit (flat)

dict_code, source_lang, target_lang, entry_id, sense_id, homograph_number,
pair_type (headword|example), source_text, target_text,
source_lemma,    # accent-folded entry headword (dedup + leak-free split key)
part_of_speech, labels[str], domain, register

`parquet/<CODE>.relations.parquet` — the full cross-reference graph

dict_code, source_lang, target_lang, relation_index, type, description,
members[{ref, headword, role, target_id}],
serialized       # True => >=2 members resolved => present in DMLex XML/JSON

This is the lossless home of the cross-ref graph: it keeps cross-references whose target never resolved to an entry id (which the DMLex XML/JSON must drop).

`reports/<CODE>.report.json` + `reports/_summary.json`

Per-dictionary stats, validation results, the controlled-value inventory, flagged-token counts, and the aggregate summary + artifact manifest (sha256 of every output).

5. Collection 1 — Core (`dist/core/`)

5.1 Derived tables (`dist/core/derived/`)

File	Rows	What it is
`lemmas.parquet`	805,279	one row per entry: lemma, accented_form, upos, frequency_band, cluster_id, split
`pairs_dedup.parquet`	2,078,214	de-duplicated translation pairs + `occurrence_count`, `canonical_id`, split
`synonym_sets.parquet`	378,783	in-sense (target-language) near-synonym sets + gloss
`synonym_pairs.parquet`	2,131,996	in-sense + pivot synonym pairs (evidence, confidence_tier)
`pivot_synonyms.parquet`	152,655	Slovene synonym candidates from translation pivots (GOLD 67,296 / SILVER 85,359)
`hypernym_candidates.parquet`	4,169	KNAUR genus-differentia hypernym candidates (confidence)
`relations_typed.parquet`	113,868	the cross-ref graph, GWA-typed (`gwa_relType`)

Pivot-synonym yield: 152,655 SILVER+GOLD pairs (≥2 agreeing pivots) materialized; 632,231 single-pivot BRONZE pairs were counted but not materialized (large, low precision); 339,415 distinct pivots used.

5.2 LM tasks (`dist/core/tasks/*.jsonl`)

Each row: {id, task, split, input:{…}, output:{…}, metadata:{…}}. Split is train/dev/test (≈90/5/5), leak-free (§5.3). Marker policy drop (undecodable-glyph rows are cleaned/omitted).

Task	Rows	input → output
`translation`	3,006,506	`{source_text, source_lang, target_lang, part_of_speech, labels, domain, register}` → `{target_text}` (both directions)
`example_translation`	574,961	example phrase → its translation
`definition`	145,633	`{headword, lang, indicator}` → `{definition}` (KNAUR)
`reverse_dictionary`	145,633	`{definition, lang}` → `{headword}`
`wsd`	22,036	`{word, context, lang}` → `{sense_gloss, sense_id}` (polysemous only; bare-number glosses dropped)
`example_usage`	85,155	`{headword, lang}` → `{example}` (monolingual usage sentences)
`morphology`	332,393	`{headword, lang}` → `{form, tag}` (dictionary inflected forms)
`pronunciation`	157,578	`{headword, lang}` → `{transcription, scheme}`
`synonyms_of`	358,573	`{word, lang}` → `{synonyms[]}` (a real set; 60% have >1)
`hypernym_of`	4,169	`{word, lang}` → `{hypernym_candidate, confidence}`
`relation`	113,373	a relation's first member → `{relation_type, members[…]}`
`relation_classify`	113,327	`{a, b, lang}` → `{relation_type}` (unordered-pair split)

5.3 Leak-free split (important)

Translation / sense / synonym tasks key on the folded Slovene lemma, so a lemma and its reverse-direction twin (hiša in sl→fr and maison→hiša in fr→sl) are always in the same split. Verified: 0 of 172,081 Slovene headword lemmas straddle splits. (e.g. translation split: train 2,711,054 / dev 144,596 / test 150,856.)
Morphology & pronunciation (about the headword form) split by headword form; relation_classify by the unordered member pair — so foreign homographs don't straddle either.
The legacy dict_code:entry_id key (now superseded) leaked ~26 % of multi-dict lemmas.

5.4 Cleaning applied (Core)

Dedup before split (with occurrence_count); degenerate targets dropped (punct/digit-only, single-char, src==tgt); unbalanced parentheses balanced; PUA sentinels + control chars stripped; undecodable glyph markers removed (marker_policy=drop). manifest.json carries a content hash (e76f2766…) and per-task split counts; dataset_card.md is the in-tree card.

6. Collection 2 — Enriched (`dist/enriched/`)

File	Rows	What it is
`silver_morphology.parquet`	208,715	CLASSLA lemma / UPOS / JOS-MULTEXT-East MSD / feats per Slovene lemma; `morph_provenance="silver_tool"`
`tasks/msd_tagging.jsonl`	208,715	`{lemma, lang}` → `{upos, msd, feats}` (the morphology/POS task; silver)
`synset_links.parquet`	64,413	Slovene lemma → sloWNet/OMW `synset_id` + ILI (join key to Princeton WN / OMW)
`antonyms.parquet`	6,107	imported Slovene antonyms (ILI-bridged through the English WordNet)
`scored_synonyms.parquet`	364,334	every Core synonym candidate + `wordnet_confirmed` + `source_count`
`scored_hypernyms.parquet`	396	checkable hypernym candidates + `wordnet_confirmed`
`lexidma/KNAUR.{xml,json}`	97,418 rel	KNAUR re-serialized as DMLex with sloWNet antonym (142) + synonym (2,655) relations, ILI/synset in `relation/description`, GWA-typed

6.1 Candidate scoring vs sloWNet (measured precision — lower bounds; sloWNet is incomplete)

Synonyms: 364,334 checkable, 17.0 % confirmed — in-sense 14.6 %, pivot 27.1 %, pivot-GOLD 38.5 %.
Hypernyms: 396 checkable, 47.5 % confirmed (ILI-bridged through the English WordNet).
Use wordnet_confirmed=True (and/or confidence_tier=GOLD) to extract a higher-precision subset.

6.2 External resources & how to reproduce

sloWNet/antonyms run in the main .venv (pip install -e '.[enrich]' → wn). CLASSLA needs Python ≤ 3.13 (its pinned numpy fails to build on 3.14), so run the silver morphology from a 3.12 env:

uv venv --python 3.12 .venv-enrich
uv pip install -p .venv-enrich/bin/python -e '.[enrich]'
python -m wn download omw-sl ; python -m wn download oewn:2021
.venv-enrich/bin/python -c "import classla; classla.download('sl')"
.venv-enrich/bin/python -m dictconv.cli enrich --in dist/core --out dist/enriched --sample-limit 0

7. What to look for (usage guidance & caveats)

Filter before training

has_content — drop entries with no usable content (≈1.2 % of entries) for entry-level tasks.
marker_policy — task JSONLs are already built with drop; never train on the keep variant (it would teach the model to emit placeholder glyphs). Corpus markers are currently 0.
Degenerate rows — already removed from the tasks; if you build your own from parquet/, apply the same filters (punct/digit-only, src==tgt, unbalanced parens).
Use the provided split. Re-shuffling by row re-introduces lemma leakage; the cluster split is the point. Hold out whole lemma clusters, not rows.

Candidates are candidates, not gold

synonym_*, pivot_synonyms, hypernym_candidates are induced and noisy. Gate with the enriched scoring: scored_synonyms.wordnet_confirmed / pivot confidence_tier=GOLD (38.5 % precision) for synonyms; scored_hypernyms.wordnet_confirmed for hypernyms. The 47.5 % hypernym figure is measured on a small checkable slice and is optimistic for the full pool (genus heads are not lemmatized — ~25 % are oblique forms; lemmatize with CLASSLA before use).
Antonyms are imported, not mined (synonyms/antonyms are translationally indistinguishable).
Precision is measured only on vocabulary sloWNet already has (~11 %). The extension value (the ~89 % of members not yet in sloWNet) is unproven — commission a small human eval before treating those as silver.

Gold vs silver

The gold lemma layer (lemmas.parquet) is 100 % coverage, accent-folded, NFC-clean.
The silver morphology (silver_morphology / msd_tagging) is CLASSLA tool output (morph_provenance="silver_tool"). Keep it filterable; measure MSD accuracy on a hand-tagged sample before training a morphological analyzer on it. The dictionary's own inflected_forms are mostly ending fragments, not full words.

Per-dictionary quality

DVRUSL (Russian) is OCR-grade (Word .doc reconstruction). Cleanups were applied (brace→paren, |/bullet stripping, space collapse) but residual noise is inherent — down-weight or exclude for high-precision work.
Definitions / reverse-dictionary come only from KNAUR (monolingual Slovene). There are no definitions from the 14 bilingual dicts.
Conditioning labels are sparse (POS on ~18 % of translation rows, domain ~4 %, register ~4 %).

Provenance / audit

Every removed escape token is logged in data/reference/removed_markers.tsv.
reports/_summary.json carries the per-file sha256 manifest; each collection's manifest.json carries a content_hash and split counts. Pin these with any eval run.

8. Loading

from datasets import load_dataset

# flat translation pairs (all dicts)
pairs = load_dataset("parquet", data_files="parquet/*.pairs.parquet", split="train")

# nested per-entry records (one dictionary)
entries = load_dataset("parquet", data_files="parquet/VSIS.entries.parquet", split="train")
entries = entries.filter(lambda r: r["has_content"])          # drop empty entries

# a Core LM task, by split
import json
train = [json.loads(l) for l in open("dist/core/tasks/translation.jsonl") if json.loads(l)["split"]=="train"]

# wordnet-confirmed synonyms only (Enriched gate)
import pyarrow.parquet as pq
syn = pq.read_table("dist/enriched/scored_synonyms.parquet").to_pylist()
gold = [r for r in syn if r["wordnet_confirmed"]]

# DMLex (faithful lexicographic view)
import json; res = json.load(open("lexidma/VSIS.json"))       # OASIS DMLex 1.0

9. Reproduce end-to-end

pip install -e .                 # base deps (pyarrow, lxml, jsonschema, xmlschema)
dictconv convert all --write-summary     # base: lexidma/, parquet/, reports/
dictconv build-core                       # Collection 1 -> dist/core/
dictconv enrich --sample-limit 0          # Collection 2 -> dist/enriched/  (see §6.2 for CLASSLA env)
dictconv audit --write-summary            # readiness audit + artifact manifest
pytest -q                                 # 52 tests

Slovenian Lexicographic Datasets

Slovenian Lexicographic Datasets — dict-conversions