Slovenian Lexicographic Datasets
dict-conversions
Two LM-ready datasets built from 15 legacy CJVT/DZS Slovenian dictionaries (14 bilingual + 1 monolingual encyclopedia), converted to OASIS DMLex 1.0 and Hugging Face Parquet, then enriched. This document describes both datasets, the pipeline that produced them, their schemas, and what to look for when using them.
Status: 805,279 entries · 1,019,685 senses · 2,174,880 translation pairs · 113,868 relations. All 15 dictionaries validate against the official DMLex 1.0 schemas (XSD 1.1 and JSON Schema) with 0 leaks, 0 parse errors, 0 residual glyph markers. 52 unit tests pass.
1. The two datasets at a glance
| | Collection 1 — Core | Collection 2 — Enriched |
|---|---|---|
| Path | dist/core/ (+ canonical lexidma/, parquet/) | dist/enriched/ (extends Core) |
| Provenance | Intrinsic — derived only from the project's own dictionaries | Extrinsic — Core + external resources |
| External tools | none | CLASSLA-Stanza, sloWNet/OMW, English WordNet (oewn) |
| Size | ~2.1 GB | ~184 MB (layers only; use with Core) |
| Reproducible offline | yes | needs the external resources (one-time download) |
| Contents | DMLex XML/JSON, derived tables, 12 LM task JSONLs | silver morphology + MSD task, synset/ILI links, imported antonyms, candidate scoring, sloWNet-enriched KNAUR DMLex |
Use them together. Enriched is a thin layer of external-resource columns/files that extends Core; it does not duplicate it.
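Because Enriched only adds layers on top of Core, using the two together usually comes down to a join on a shared key. The following is a minimal sketch on in-memory rows. It assumes that lemma is the join key between lemmas.parquet and silver_morphology.parquet (suggested by the column descriptions, not confirmed here); the function name, the silver_ prefix, and the sample values are illustrative.

```python
def left_join_on_lemma(core_rows, enriched_rows):
    """Attach Enriched columns to Core rows; Core rows without a match are kept."""
    by_lemma = {}
    for row in enriched_rows:
        # Assumption: one silver row per lemma; if there are more, the first wins.
        by_lemma.setdefault(row["lemma"], row)

    joined = []
    for row in core_rows:
        merged = dict(row)
        for key, value in by_lemma.get(row["lemma"], {}).items():
            if key != "lemma":
                # Prefix keeps silver (tool) values apart from intrinsic ones.
                merged[f"silver_{key}"] = value
        joined.append(merged)
    return joined


# Illustrative rows, not taken from the datasets.
core = [{"lemma": "hiša", "upos": "NOUN"}, {"lemma": "teči", "upos": None}]
silver = [{"lemma": "teči", "upos": "VERB"}]
result = left_join_on_lemma(core, silver)
print(result)
```

Keeping the tool-derived values under a separate prefix preserves the gold-vs-silver distinction after the join.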
2. Source corpus
sh = Serbo-Croatian (legacy unified tag). KNAUR is monolingual (definitions + cross-references, no
translations). DVRUSL is reconstructed from a Word .doc (OCR-grade).
| Code | Languages | Family | Entries | Senses | Pairs | Relations |
|---|---|---|---|---|---|---|
| DVANSL | en→sl | block_blankline | 68,761 | 75,208 | 264,454 | 0 |
| DVFRSL | fr→sl | line | 39,980 | 70,817 | 175,458 | 0 |
| DVITSL | it→sl | block_blankline | 60,282 | 98,444 | 197,449 | 0 |
| DVRUSL | ru→sl | doc_runs | 32,844 | 32,844 | 80,364 | 0 |
| DVSHSL | sh→sl | block_blankline | 92,302 | 92,302 | 150,228 | 5,563 |
| DVSPSL | es→sl | dzs_utf16 | 36,460 | 66,722 | 114,691 | 414 |
| DVSLAN | sl→en | block_blankline | 42,088 | 42,227 | 159,625 | 0 |
| DVSLFR | sl→fr | line | 33,866 | 44,092 | 128,561 | 34 |
| DVSLNE | sl→de | german | 84,951 | 90,374 | 252,074 | 120 |
| DVSLSH | sl→sh | block_blankline | 72,690 | 87,140 | 152,068 | 2,694 |
| DVSLSP | sl→es | dzs_utf16 | 33,676 | 66,456 | 93,562 | 349 |
| DRSLAN | sl→en | dzs_nested | 25,559 | 33,273 | 75,374 | 457 |
| VSIS | sl→it | block_geslo | 90,675 | 116,223 | 275,043 | 7,764 |
| LAT_AZ | la→sl | block_geslo | 11,521 | 20,315 | 55,929 | 1,852 |
| KNAUR | sl (mono) | encyclopedia | 79,624 | 83,248 | 0 | 94,621 |
Several pairs exist in both directions (DVSLFR/DVFRSL, DVSLAN/DVANSL, DVSLSH/DVSHSL, DVSLSP/DVSPSL, VSIS+DVITSL) — exploited for translation-pivot synonyms and reverse-dictionary tasks.
3. The pipeline
┌───────── BASE CONVERSION (intrinsic, faithful) ──────────┐
source dicts ─► parser family ─► IR (model.py) ─► DMLex XML/JSON (lexidma/)
(8 families) + textproc └► Parquet (parquet/) + reports/
(specialchars, markup,
controlled vocab)
│
▼
┌──────── COLLECTION 1 · CORE (build/, intrinsic) ────────┐
│ derive.py : accent-folded lemma · dedup · leak-free split
│ synsets.py : in-sense synonyms · translation pivots · KNAUR hypernyms · GWA relation typing
│ tasks.py : 12 LM task JSONLs → dist/core/
└─────────────────────────────────────────────────────────┘
│
▼
┌──────── COLLECTION 2 · ENRICHED (build/enrich.py, extrinsic) ───────┐
│ CLASSLA silver lemma/UPOS/MSD (+ msd_tagging task)
│ sloWNet/OMW synset+ILI links · ILI-bridged antonyms
│ candidate scoring vs sloWNet · sloWNet-typed DMLex relations (KNAUR) → dist/enriched/
└─────────────────────────────────────────────────────────────────────┘
Stage 1 — Base conversion (src/dictconv/, command dictconv convert all). Each dictionary's
publisher typesetting markup is tokenized (not XML-parsed — it is malformed), {…} escapes decoded
to Unicode (side-aware accents; undecodable font/template codes removed and logged), qualifiers
classified into a controlled vocabulary, and emitted as a source-agnostic intermediate
representation. Serializers produce DMLex 1.0 XML+JSON (validated against the official OASIS XSD
1.1 / JSON Schema) and three Parquet artifacts. The conversion is faithful: it never invents
content and flags rather than silently drops.
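The escape-decoding step of Stage 1 can be pictured with a small sketch. The escape codes in the table are hypothetical (the real publisher codes are not listed in this document), and side-aware accent handling is not modeled; the point is only the "decode what is known, remove and log the rest" behaviour.

```python
import re

# Hypothetical escape table; the actual publisher codes are an assumption here.
ESCAPES = {"c^": "č", "s^": "š", "z^": "ž"}
ESCAPE_RE = re.compile(r"\{([^{}]*)\}")


def decode_escapes(text, removed_log):
    """Replace known {code} escapes with Unicode; remove and log unknown ones."""
    def substitute(match):
        code = match.group(1)
        if code in ESCAPES:
            return ESCAPES[code]
        removed_log.append(code)  # flag rather than silently drop
        return ""

    return ESCAPE_RE.sub(substitute, text)


log = []
decoded = decode_escapes("hi{s^}a {font7}velika", log)
print(decoded, log)
```

A regex pass of this kind works on malformed markup because it never needs the surrounding tags to be well-formed.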
Stage 2 — Core (dictconv build-core). Adds derived, ML-oriented layers computable from our
data alone: a normalized lemma/UPOS layer, exact/near-duplicate collapse, a leak-free
train/dev/test split, synonym/pivot/hypernym candidate tables, and 12 instruction-style task JSONLs.
Stage 3 — Enriched (dictconv enrich). Adds layers that need outside resources, kept strictly
separate from the intrinsic data (gold-vs-silver provenance is explicit).
4. Base conversion artifacts
lexidma/<CODE>.xml and .json — OASIS DMLex 1.0
Faithful lexicographic record. Camel-case elements; text of every object except headword in a
nested <text>; crosslingual module (headwordTranslation, exampleTranslation); the linking
module (relation + relationType definitions that carry a Global WordNet sameAs URI, e.g.
see → https://globalwordnet.github.io/schemas/wn#also). Validated with a real XSD 1.1 processor
(all identity constraints + assertions, except 3 cardinality-defective ones the published schema
cannot satisfy). KNAUR uses the monolingual schema.
parquet/<CODE>.entries.parquet — one row per entry (nested)
dict_code, entry_id, headword,
lemma, # accent-folded join key (Sloleks/Gigafida/CLASSLA/sloWNet keying)
accented_form, # original tonal/accented display form (null if == headword)
homograph_number, source_lang, target_lang,
meta_lang, # editorial metalanguage = "sl"
parts_of_speech[str], upos, # UPOS from the entry's first POS (intrinsic static map)
frequency_band, # DRSLAN corpus band 0..3 (null elsewhere)
labels[str], collocates[str], # DRSLAN <KO> collocates
pronunciations[{text, scheme}],
inflected_forms[{text, tag}],
senses[{ sense_id, indicator, labels[str], definitions[str],
headword_translations[{text, lang_code, parts_of_speech[str], labels[str]}],
headword_explanations[{text, lang_code}],
examples[{text, labels[str], translations[{text, lang_code, labels[str]}]}] }],
has_content, # False => no senses / all senses empty (filter before LM use)
source_ref, raw # provenance
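The lemma column is described as an accent-folded join key, with the accented display form kept separately. One plausible way to fold, sketched here under stated assumptions, is to decompose the string, drop combining marks, and keep the caron so that č, š, ž survive. Whether the project folds exactly this way is not stated; the keep-set and function name are assumptions, and the sketch targets Slovene lemmas only.

```python
import unicodedata

# Assumption: only the combining caron is kept (it forms č, š, ž).
KEEP_MARKS = {"\u030C"}


def fold_accents(text):
    """Strip accent marks but keep base letters that carry a caron."""
    kept = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and ch not in KEEP_MARKS:
            continue  # tonal or stress accent: drop it
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))


# Illustrative accented forms.
folded = [fold_accents(w) for w in ("híša", "žêna", "mesto")]
print(folded)
```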
parquet/<CODE>.pairs.parquet — one row per (source,target) unit (flat)
dict_code, source_lang, target_lang, entry_id, sense_id, homograph_number,
pair_type (headword|example), source_text, target_text,
source_lemma, # accent-folded entry headword (dedup + leak-free split key)
part_of_speech, labels[str], domain, register
parquet/<CODE>.relations.parquet — the full cross-reference graph
dict_code, source_lang, target_lang, relation_index, type, description,
members[{ref, headword, role, target_id}],
serialized # True => >=2 members resolved => present in DMLex XML/JSON
This is the lossless home of the cross-ref graph: it keeps cross-references whose target never resolved to an entry id (which the DMLex XML/JSON must drop).
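The serialized flag can be read as a simple predicate over members. The sketch below assumes that an unresolved member has a null or empty target_id; the sample relations are illustrative.

```python
def is_serialized(relation):
    """True when at least two members resolved to an entry id."""
    resolved = [m for m in relation["members"] if m.get("target_id")]
    return len(resolved) >= 2


# Illustrative relations, not taken from the datasets.
relations = [
    {"type": "see", "members": [
        {"ref": "a", "headword": "a", "role": None, "target_id": "E1"},
        {"ref": "b", "headword": "b", "role": None, "target_id": "E2"},
    ]},
    {"type": "see", "members": [
        {"ref": "a", "headword": "a", "role": None, "target_id": "E1"},
        {"ref": "x", "headword": "x", "role": None, "target_id": None},
    ]},
]

in_dmlex = [r for r in relations if is_serialized(r)]
parquet_only = [r for r in relations if not is_serialized(r)]
print(len(in_dmlex), len(parquet_only))
```

Rows in parquet_only are the dangling cross-references that exist only in the Parquet graph.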
reports/<CODE>.report.json + reports/_summary.json
Per-dictionary stats, validation results, the controlled-value inventory, flagged-token counts, and the aggregate summary + artifact manifest (sha256 of every output).
5. Collection 1 — Core (dist/core/)
5.1 Derived tables (dist/core/derived/)
| File | Rows | What it is |
|---|---|---|
| lemmas.parquet | 805,279 | one row per entry: lemma, accented_form, upos, frequency_band, cluster_id, split |
| pairs_dedup.parquet | 2,078,214 | de-duplicated translation pairs + occurrence_count, canonical_id, split |
| synonym_sets.parquet | 378,783 | in-sense (target-language) near-synonym sets + gloss |
| synonym_pairs.parquet | 2,131,996 | in-sense + pivot synonym pairs (evidence, confidence_tier) |
| pivot_synonyms.parquet | 152,655 | Slovene synonym candidates from translation pivots (GOLD 67,296 / SILVER 85,359) |
| hypernym_candidates.parquet | 4,169 | KNAUR genus-differentia hypernym candidates (confidence) |
| relations_typed.parquet | 113,868 | the cross-ref graph, GWA-typed (gwa_relType) |
Pivot-synonym yield: 152,655 SILVER+GOLD pairs (≥2 agreeing pivots) materialized; 632,231 single-pivot BRONZE pairs were counted but not materialized (large, low precision); 339,415 distinct pivots used.
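The pivot idea can be sketched as follows: two Slovene lemmas become synonym candidates when they share foreign translations, and only pairs supported by at least two distinct pivots are kept. This is an illustration under assumptions, not the project's synsets.py; the input shape and function name are hypothetical, and the GOLD/SILVER boundary is not modeled because its threshold is not stated here.

```python
from collections import defaultdict
from itertools import combinations


def pivot_synonym_candidates(pairs, min_pivots=2):
    """pairs: iterable of (slovene_lemma, pivot_lang, pivot_text)."""
    by_pivot = defaultdict(set)
    for lemma, lang, text in pairs:
        by_pivot[(lang, text)].add(lemma)

    support = defaultdict(set)  # (lemma_a, lemma_b) -> pivots they share
    for pivot, lemmas in by_pivot.items():
        for a, b in combinations(sorted(lemmas), 2):
            support[(a, b)].add(pivot)

    kept = {p: v for p, v in support.items() if len(v) >= min_pivots}
    # Single-pivot pairs are only counted, mirroring the BRONZE handling.
    bronze_count = sum(1 for v in support.values() if len(v) < min_pivots)
    return kept, bronze_count


# Illustrative translation pairs.
sample = [
    ("hiša", "fr", "maison"), ("dom", "fr", "maison"),
    ("hiša", "en", "house"), ("dom", "en", "house"),
    ("dom", "en", "home"), ("domovina", "en", "home"),
]
kept, bronze = pivot_synonym_candidates(sample)
print(sorted(kept), bronze)
```

Requiring agreement across pivots is what trades recall for precision: a single shared translation is often just a polysemous foreign word.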
5.2 LM tasks (dist/core/tasks/*.jsonl)
Each row: {id, task, split, input:{…}, output:{…}, metadata:{…}}. Split is train/dev/test
(≈90/5/5) and leak-free (§5.3). The marker policy is drop: rows with undecodable-glyph markers are cleaned or omitted.
| Task | Rows | input → output |
|---|---|---|
| translation | 3,006,506 | {source_text, source_lang, target_lang, part_of_speech, labels, domain, register} → {target_text} (both directions) |
| example_translation | 574,961 | example phrase → its translation |
| definition | 145,633 | {headword, lang, indicator} → {definition} (KNAUR) |
| reverse_dictionary | 145,633 | {definition, lang} → {headword} |
| wsd | 22,036 | {word, context, lang} → {sense_gloss, sense_id} (polysemous only; bare-number glosses dropped) |
| example_usage | 85,155 | {headword, lang} → {example} (monolingual usage sentences) |
| morphology | 332,393 | {headword, lang} → {form, tag} (dictionary inflected forms) |
| pronunciation | 157,578 | {headword, lang} → {transcription, scheme} |
| synonyms_of | 358,573 | {word, lang} → {synonyms[]} (a real set; 60% have >1) |
| hypernym_of | 4,169 | {word, lang} → {hypernym_candidate, confidence} |
| relation | 113,373 | a relation's first member → {relation_type, members[…]} |
| relation_classify | 113,327 | {a, b, lang} → {relation_type} (unordered-pair split) |
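To turn a task row into a training example, the input fields have to be rendered into text. A minimal sketch for the translation task follows; the row shape comes from the description above, while the prompt wording, the function name, and the sample values are assumptions. Optional conditioning fields are skipped when empty.

```python
import json


def format_translation_example(row):
    """Render one translation task row as a prompt/completion pair."""
    inp = row["input"]
    hints = [f"{key}={inp[key]}"
             for key in ("part_of_speech", "domain", "register")
             if inp.get(key)]
    header = f"Translate from {inp['source_lang']} to {inp['target_lang']}"
    if hints:
        header += " (" + ", ".join(hints) + ")"
    return {
        "prompt": f"{header}: {inp['source_text']}",
        "completion": row["output"]["target_text"],
    }


# Illustrative JSONL line; field values are not taken from the dataset.
line = json.dumps({
    "id": "example-0", "task": "translation", "split": "train",
    "input": {"source_text": "hiša", "source_lang": "sl", "target_lang": "fr",
              "part_of_speech": "noun", "labels": [], "domain": None,
              "register": None},
    "output": {"target_text": "maison"}, "metadata": {},
}, ensure_ascii=False)

example = format_translation_example(json.loads(line))
print(example)
```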
5.3 Leak-free split (important)
- Translation / sense / synonym tasks key on the folded Slovene lemma, so a lemma and its reverse-direction twin (hiša in sl→fr and maison→hiša in fr→sl) are always in the same split. Verified: 0 of 172,081 Slovene headword lemmas straddle splits (e.g. translation split: train 2,711,054 / dev 144,596 / test 150,856).
- Morphology & pronunciation (about the headword form) split by headword form; relation_classify by the unordered member pair, so foreign homographs don't straddle either.
- The legacy dict_code:entry_id key (now superseded) leaked ~26 % of multi-dict lemmas.
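One way to obtain a split with this property is to derive it deterministically from the key itself, so every row sharing a key lands in the same partition. The sketch below hashes the folded Slovene lemma; this is an assumption made for illustration (the actual pipeline may assign splits per cluster_id by another method), and the function name and ratios are hypothetical defaults.

```python
import hashlib


def assign_split(folded_lemma, dev=0.05, test=0.05):
    """Map a key to train/dev/test deterministically (about 90/5/5)."""
    digest = hashlib.sha256(folded_lemma.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1]
    if bucket < test:
        return "test"
    if bucket < test + dev:
        return "dev"
    return "train"


# Both directions key on the Slovene lemma, not on the row or the foreign word.
rows = [
    {"dict_code": "DVSLFR", "source_text": "hiša", "target_text": "maison", "key": "hiša"},
    {"dict_code": "DVFRSL", "source_text": "maison", "target_text": "hiša", "key": "hiša"},
]
splits = {assign_split(row["key"]) for row in rows}
print(splits)
```

Keying on dict_code:entry_id instead would give the two rows independent hashes, which is exactly how a lemma ends up on both sides of a split.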
5.4 Cleaning applied (Core)
Dedup before split (with occurrence_count); degenerate targets dropped (punct/digit-only,
single-char, src==tgt); unbalanced parentheses balanced; PUA sentinels + control chars stripped;
undecodable glyph markers removed (marker_policy=drop). manifest.json carries a content hash
(e76f2766…) and per-task split counts; dataset_card.md is the in-tree card.
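The cleaning rules above can be approximated in a few lines when building custom data from parquet/. This is a sketch, not the Core implementation: "punct/digit-only" is approximated as "no alphabetic character", and rows with unbalanced parentheses are simply dropped here, whereas Core repairs them. All function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter


def strip_noise(text):
    """Collapse whitespace, then remove private-use (Co) and control (Cc) chars."""
    text = re.sub(r"\s+", " ", text)
    kept = "".join(ch for ch in text
                   if unicodedata.category(ch) not in ("Co", "Cc"))
    return kept.strip()


def parens_balanced(text):
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def is_degenerate(source, target):
    """src==tgt, single-char, or no letters at all (punct/digit-only)."""
    if source == target or len(target) <= 1:
        return True
    return not any(ch.isalpha() for ch in target)


def clean_and_dedup(pairs):
    counts = Counter()
    for source, target in pairs:
        source, target = strip_noise(source), strip_noise(target)
        if is_degenerate(source, target) or not parens_balanced(target):
            continue
        counts[(source, target)] += 1
    return [{"source_text": s, "target_text": t, "occurrence_count": n}
            for (s, t), n in counts.items()]


# Illustrative pairs covering each rule.
raw = [("hiša", "maison"), ("hiša", "maison\x07"), ("hiša", "hiša"),
       ("in", "&"), ("vrt", "(jardin"), ("leto", "1990")]
cleaned = clean_and_dedup(raw)
print(cleaned)
```

Counting after normalization matters: the second sample row only becomes a duplicate once its control character is stripped.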
6. Collection 2 — Enriched (dist/enriched/)
| File | Rows | What it is |
|---|---|---|
| silver_morphology.parquet | 208,715 | CLASSLA lemma / UPOS / JOS-MULTEXT-East MSD / feats per Slovene lemma; morph_provenance="silver_tool" |
| tasks/msd_tagging.jsonl | 208,715 | {lemma, lang} → {upos, msd, feats} (the morphology/POS task; silver) |
| synset_links.parquet | 64,413 | Slovene lemma → sloWNet/OMW synset_id + ILI (join key to Princeton WN / OMW) |
| antonyms.parquet | 6,107 | imported Slovene antonyms (ILI-bridged through the English WordNet) |
| scored_synonyms.parquet | 364,334 | every Core synonym candidate + wordnet_confirmed + source_count |
| scored_hypernyms.parquet | 396 | checkable hypernym candidates + wordnet_confirmed |
| lexidma/KNAUR.{xml,json} | 97,418 rel | KNAUR re-serialized as DMLex with sloWNet antonym (142) + synonym (2,655) relations, ILI/synset in relation/description, GWA-typed |
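"ILI-bridged" means that a relation known in one wordnet is carried over to another through shared interlingual index identifiers. The sketch below shows the idea with plain dictionaries instead of a wordnet library; the data shapes, the ILI values, and the function name are hypothetical. Note that antonymy in WordNet holds between word senses, so lifting it to whole synsets, as done here, is an approximation.

```python
def bridge_antonyms(sl_lemmas_by_ili, en_antonym_ilis):
    """Project English antonym links onto Slovene lemmas that share an ILI.

    sl_lemmas_by_ili: {ili: [slovene lemmas]}
    en_antonym_ilis:  iterable of (ili_a, ili_b) antonym links
    """
    found = set()
    for ili_a, ili_b in en_antonym_ilis:
        for a in sl_lemmas_by_ili.get(ili_a, []):
            for b in sl_lemmas_by_ili.get(ili_b, []):
                if a != b:
                    found.add(tuple(sorted((a, b))))  # unordered pair
    return sorted(found)


# Illustrative data with made-up ILI identifiers.
sl_lemmas_by_ili = {"i1": ["topel"], "i2": ["hladen", "mrzel"]}
en_antonym_ilis = [("i1", "i2"), ("i1", "i9")]  # i9 has no Slovene synset
antonym_pairs = bridge_antonyms(sl_lemmas_by_ili, en_antonym_ilis)
print(antonym_pairs)
```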
6.1 Candidate scoring vs sloWNet (measured precision — lower bounds; sloWNet is incomplete)
- Synonyms: 364,334 checkable, 17.0 % confirmed — in-sense 14.6 %, pivot 27.1 %, pivot-GOLD 38.5 %.
- Hypernyms: 396 checkable, 47.5 % confirmed (ILI-bridged through the English WordNet).
- Use wordnet_confirmed=True (and/or confidence_tier=GOLD) to extract a higher-precision subset.
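The confirmation rates quoted above are ratios of confirmed to checkable candidates, optionally broken down by evidence type. A minimal sketch of that computation on in-memory rows follows. It assumes that scored_synonyms carries evidence and confidence_tier columns alongside wordnet_confirmed; the evidence values and sample rows are illustrative.

```python
from collections import defaultdict


def confirmation_rate(rows, group_key="evidence"):
    """Share of candidates with wordnet_confirmed=True, per group."""
    totals, hits = defaultdict(int), defaultdict(int)
    for row in rows:
        group = row.get(group_key, "all")
        totals[group] += 1
        hits[group] += bool(row["wordnet_confirmed"])
    return {group: hits[group] / totals[group] for group in totals}


def high_precision_subset(rows):
    """Keep rows that are confirmed and/or in the GOLD tier."""
    return [row for row in rows
            if row["wordnet_confirmed"] or row.get("confidence_tier") == "GOLD"]


# Illustrative rows, not taken from the dataset.
rows = [
    {"evidence": "pivot", "confidence_tier": "GOLD", "wordnet_confirmed": True},
    {"evidence": "pivot", "confidence_tier": "SILVER", "wordnet_confirmed": False},
    {"evidence": "in_sense", "confidence_tier": None, "wordnet_confirmed": False},
    {"evidence": "in_sense", "confidence_tier": None, "wordnet_confirmed": False},
]
rates = confirmation_rate(rows)
subset = high_precision_subset(rows)
print(rates, len(subset))
```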
6.2 External resources & how to reproduce
sloWNet/antonyms run in the main .venv (pip install -e '.[enrich]' → wn). CLASSLA needs
Python ≤ 3.13 (its pinned numpy fails to build on 3.14), so run the silver morphology from a 3.12 env:
uv venv --python 3.12 .venv-enrich
uv pip install -p .venv-enrich/bin/python -e '.[enrich]'
python -m wn download omw-sl ; python -m wn download oewn:2021
.venv-enrich/bin/python -c "import classla; classla.download('sl')"
.venv-enrich/bin/python -m dictconv.cli enrich --in dist/core --out dist/enriched --sample-limit 0
7. What to look for (usage guidance & caveats)
Filter before training
- has_content: drop entries with no usable content (≈1.2 % of entries) for entry-level tasks.
- marker_policy: task JSONLs are already built with drop; never train on the keep variant (it would teach the model to emit placeholder glyphs). Corpus markers are currently 0.
- Degenerate rows: already removed from the tasks; if you build your own from parquet/, apply the same filters (punct/digit-only, src==tgt, unbalanced parens).
- Use the provided split. Re-shuffling by row re-introduces lemma leakage; the cluster split is the point. Hold out whole lemma clusters, not rows.
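Before training on a custom split, it is cheap to check for lemma leakage. The sketch below groups rows by a key and reports every key that appears in more than one split. The column name source_lemma follows the pairs schema; that split-bearing rows also carry it is an assumption, and the sample rows are illustrative.

```python
from collections import defaultdict


def straddling_keys(rows, key="source_lemma"):
    """Return the keys whose rows fall into more than one split."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row["split"])
    return sorted(k for k, splits in seen.items() if len(splits) > 1)


# Illustrative rows: a row-level shuffle has put "hiša" in two splits.
shuffled = [
    {"source_lemma": "hiša", "target_text": "maison", "split": "train"},
    {"source_lemma": "hiša", "target_text": "house", "split": "test"},
    {"source_lemma": "vrt", "target_text": "jardin", "split": "train"},
]
leaks = straddling_keys(shuffled)
print(leaks)
```

An empty result is the property reported for the provided split (0 straddling lemmas).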
Candidates are candidates, not gold
- synonym_*, pivot_synonyms, hypernym_candidates are induced and noisy. Gate with the enriched scoring: scored_synonyms.wordnet_confirmed / pivot confidence_tier=GOLD (38.5 % precision) for synonyms; scored_hypernyms.wordnet_confirmed for hypernyms. The 47.5 % hypernym figure is measured on a small checkable slice and is optimistic for the full pool (genus heads are not lemmatized, ~25 % are oblique forms; lemmatize with CLASSLA before use).
- Antonyms are imported, not mined (synonyms/antonyms are translationally indistinguishable).
- Precision is measured only on vocabulary sloWNet already has (~11 %). The extension value (the ~89 % of members not yet in sloWNet) is unproven — commission a small human eval before treating those as silver.
Gold vs silver
- The gold lemma layer (lemmas.parquet) is 100 % coverage, accent-folded, NFC-clean.
- The silver morphology (silver_morphology / msd_tagging) is CLASSLA tool output (morph_provenance="silver_tool"). Keep it filterable; measure MSD accuracy on a hand-tagged sample before training a morphological analyzer on it. The dictionary's own inflected_forms are mostly ending fragments, not full words.
Per-dictionary quality
- DVRUSL (Russian) is OCR-grade (Word .doc reconstruction). Cleanups were applied (brace→paren, |/bullet stripping, space collapse) but residual noise is inherent, so down-weight or exclude it for high-precision work.
- Definitions / reverse-dictionary come only from KNAUR (monolingual Slovene). There are no definitions from the 14 bilingual dicts.
- Conditioning labels are sparse (POS on ~18 % of translation rows, domain ~4 %, register ~4 %).
Provenance / audit
- Every removed escape token is logged in data/reference/removed_markers.tsv.
- reports/_summary.json carries the per-file sha256 manifest; each collection's manifest.json carries a content_hash and split counts. Pin these with any eval run.
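Pinning a run to the manifest amounts to recomputing each file's sha256 and comparing it with the recorded value. The sketch below assumes a flat {relative_path: sha256} mapping, which is a hypothetical layout (the actual structure of reports/_summary.json is not shown here), and demonstrates the check on a temporary file.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large Parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(manifest, base_dir="."):
    """Return paths that are missing or whose hash differs from the manifest."""
    bad = []
    for rel_path, expected in sorted(manifest.items()):
        full = os.path.join(base_dir, rel_path)
        if not os.path.exists(full) or sha256_of(full) != expected:
            bad.append(rel_path)
    return bad


# Self-contained demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as handle:
        handle.write(b"abc")
    digest = sha256_of(sample)
    clean = mismatched_files({"sample.bin": digest}, tmp)
    tampered = mismatched_files({"sample.bin": "0" * 64}, tmp)

print(clean, tampered)
```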
8. Loading
from datasets import load_dataset
# flat translation pairs (all dicts)
pairs = load_dataset("parquet", data_files="parquet/*.pairs.parquet", split="train")
# nested per-entry records (one dictionary)
entries = load_dataset("parquet", data_files="parquet/VSIS.entries.parquet", split="train")
entries = entries.filter(lambda r: r["has_content"]) # drop empty entries
# a Core LM task, by split
import json
with open("dist/core/tasks/translation.jsonl", encoding="utf-8") as f:
    rows = (json.loads(line) for line in f)  # parse each line once
    train = [r for r in rows if r["split"] == "train"]
# wordnet-confirmed synonyms only (Enriched gate)
import pyarrow.parquet as pq
syn = pq.read_table("dist/enriched/scored_synonyms.parquet").to_pylist()
gold = [r for r in syn if r["wordnet_confirmed"]]
# DMLex (faithful lexicographic view)
import json
with open("lexidma/VSIS.json", encoding="utf-8") as f:
    res = json.load(f)  # OASIS DMLex 1.0
9. Reproduce end-to-end
pip install -e . # base deps (pyarrow, lxml, jsonschema, xmlschema)
dictconv convert all --write-summary # base: lexidma/, parquet/, reports/
dictconv build-core # Collection 1 -> dist/core/
dictconv enrich --sample-limit 0 # Collection 2 -> dist/enriched/ (see §6.2 for CLASSLA env)
dictconv audit --write-summary # readiness audit + artifact manifest
pytest -q # 52 tests