Project: Marko Kokol

Slovenian Lexicographic Datasets
dict-conversions

Two LM-ready datasets built from 15 legacy CJVT/DZS Slovenian dictionaries (14 bilingual + 1 monolingual encyclopedia), converted to OASIS DMLex 1.0 and HuggingFace Parquet, then enriched. This document describes both datasets, the pipeline that produced them, their schemas, and what to look for when using them.

Status: 805,279 entries · 1,019,685 senses · 2,174,880 translation pairs · 113,868 relations. All 15 dictionaries validate against the official DMLex XSD 1.1 + JSON Schema with 0 leaks, 0 parse errors, 0 residual glyph markers. 52 unit tests pass.

1. The two datasets at a glance

| | Collection 1: Core | Collection 2: Enriched |
|---|---|---|
| Path | dist/core/ (+ canonical lexidma/, parquet/) | dist/enriched/ (extends Core) |
| Provenance | Intrinsic: derived only from the project's own dictionaries | Extrinsic: Core + external resources |
| External tools | none | CLASSLA-Stanza, sloWNet/OMW, English WordNet (oewn) |
| Size | ~2.1 GB | ~184 MB (layers only; use with Core) |
| Reproducible offline | yes | needs the external resources (one-time download) |
| Contents | DMLex XML/JSON, derived tables, 12 LM task JSONLs | silver morphology + MSD task, synset/ILI links, imported antonyms, candidate scoring, sloWNet-enriched KNAUR DMLex |

Use them together. Enriched is a thin layer of external-resource columns/files that extends Core; it does not duplicate it.

2. Source corpus

sh = Serbo-Croatian (legacy unified tag). KNAUR is monolingual (definitions + cross-references, no translations). DVRUSL is reconstructed from a Word .doc (OCR-grade).
| Code | Languages | Family | Entries | Senses | Pairs | Relations |
|---|---|---|---|---|---|---|
| DVANSL | en→sl | block_blankline | 68,761 | 75,208 | 264,454 | 0 |
| DVFRSL | fr→sl | line | 39,980 | 70,817 | 175,458 | 0 |
| DVITSL | it→sl | block_blankline | 60,282 | 98,444 | 197,449 | 0 |
| DVRUSL | ru→sl | doc_runs | 32,844 | 32,844 | 80,364 | 0 |
| DVSHSL | sh→sl | block_blankline | 92,302 | 92,302 | 150,228 | 5,563 |
| DVSPSL | es→sl | dzs_utf16 | 36,460 | 66,722 | 114,691 | 414 |
| DVSLAN | sl→en | block_blankline | 42,088 | 42,227 | 159,625 | 0 |
| DVSLFR | sl→fr | line | 33,866 | 44,092 | 128,561 | 34 |
| DVSLNE | sl→de | german | 84,951 | 90,374 | 252,074 | 120 |
| DVSLSH | sl→sh | block_blankline | 72,690 | 87,140 | 152,068 | 2,694 |
| DVSLSP | sl→es | dzs_utf16 | 33,676 | 66,456 | 93,562 | 349 |
| DRSLAN | sl→en | dzs_nested | 25,559 | 33,273 | 75,374 | 457 |
| VSIS | sl→it | block_geslo | 90,675 | 116,223 | 275,043 | 7,764 |
| LAT_AZ | la→sl | block_geslo | 11,521 | 20,315 | 55,929 | 1,852 |
| KNAUR | sl (mono) | encyclopedia | 79,624 | 83,248 | 0 | 94,621 |

Several pairs exist in both directions (DVSLFR/DVFRSL, DVSLAN/DVANSL, DVSLSH/DVSHSL, DVSLSP/DVSPSL, VSIS/DVITSL). These are exploited for translation-pivot synonyms and reverse-dictionary tasks.

3. The pipeline

    ┌───────── BASE CONVERSION (intrinsic, faithful) ──────────┐
    source dicts ─► parser family ─► IR (model.py) ─► DMLex XML/JSON (lexidma/)
                    (8 families)     + textproc     └► Parquet (parquet/) + reports/
                                     (specialchars, markup, controlled vocab)
                                  │
                                  ▼
    ┌──────── COLLECTION 1 · CORE (build/, intrinsic) ────────┐
    │ derive.py  : accent-folded lemma · dedup · leak-free split
    │ synsets.py : in-sense synonyms · translation pivots · KNAUR hypernyms · GWA relation typing
    │ tasks.py   : 12 LM task JSONLs → dist/core/
    └─────────────────────────────────────────────────────────┘
                                  │
                                  ▼
    ┌──────── COLLECTION 2 · ENRICHED (build/enrich.py, extrinsic) ───────┐
    │ CLASSLA silver lemma/UPOS/MSD (+ msd_tagging task)
    │ sloWNet/OMW synset+ILI links · ILI-bridged antonyms
    │ candidate scoring vs sloWNet · sloWNet-typed DMLex relations (KNAUR) → dist/enriched/
    └─────────────────────────────────────────────────────────────────────┘

Stage 1: Base conversion (src/dictconv/, command dictconv convert all).
Each dictionary's publisher typesetting markup is tokenized (not XML-parsed, because it is malformed). {…} escapes are decoded to Unicode (side-aware accents; undecodable font/template codes are removed and logged), qualifiers are classified into a controlled vocabulary, and the result is emitted as a source-agnostic intermediate representation. Serializers produce DMLex 1.0 XML+JSON (validated against the official OASIS XSD 1.1 / JSON Schema) and three Parquet artifacts. The conversion is faithful: it never invents content, and it flags problems rather than silently dropping them.

Stage 2: Core (dictconv build-core). Adds derived, ML-oriented layers computable from our data alone: a normalized lemma/UPOS layer, exact/near-duplicate collapse, a leak-free train/dev/test split, synonym/pivot/hypernym candidate tables, and 12 instruction-style task JSONLs.

Stage 3: Enriched (dictconv enrich). Adds layers that need outside resources, kept strictly separate from the intrinsic data (gold-vs-silver provenance is explicit).

4. Base conversion artifacts

lexidma/.xml and .json: OASIS DMLex 1.0

Faithful lexicographic record. It uses camel-case elements; the text of every object except headword sits in a nested child element; the crosslingual module (headwordTranslation, exampleTranslation); and the linking module (relation + relationType definitions that carry a Global WordNet sameAs URI, e.g. see → https://globalwordnet.github.io/schemas/wn#also). Validated with a real XSD 1.1 processor (all identity constraints + assertions, except 3 cardinality-defective ones the published schema cannot satisfy). KNAUR uses the monolingual schema.
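As a rough illustration of the linking module described above, the sketch below lists each relation type in a DMLex JSON resource together with its sameAs URIs. The key names `relationTypes`, `type`, and `sameAs` are assumptions based on the DMLex 1.0 JSON serialization and should be checked against the actual files. The in-memory sample, including the `unmapped` type, is invented for illustration only.

```python
def relation_type_links(resource: dict) -> dict[str, list[str]]:
    """Map each relation type name to its sameAs URIs (empty list if none)."""
    links: dict[str, list[str]] = {}
    for rel_type in resource.get("relationTypes", []):
        same_as = rel_type.get("sameAs", [])
        # Tolerate a single string as well as a list of strings.
        if isinstance(same_as, str):
            same_as = [same_as]
        links[rel_type["type"]] = list(same_as)
    return links


# Tiny stand-in for a loaded lexidma JSON resource (illustrative data only).
sample = {
    "relationTypes": [
        {"type": "see", "sameAs": ["https://globalwordnet.github.io/schemas/wn#also"]},
        {"type": "unmapped"},  # hypothetical type without a GWA link
    ],
}

links = relation_type_links(sample)
for name, uris in links.items():
    print(name, uris or "(no sameAs link)")
```

With a real file, the dictionary returned by `json.load` would take the place of `sample`.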
parquet/.entries.parquet: one row per entry (nested)

    dict_code, entry_id, headword,
    lemma,                 # accent-folded join key (Sloleks/Gigafida/CLASSLA/sloWNet keying)
    accented_form,         # original tonal/accented display form (null if == headword)
    homograph_number, source_lang, target_lang,
    meta_lang,             # editorial metalanguage = "sl"
    parts_of_speech[str],
    upos,                  # UPOS from the entry's first POS (intrinsic static map)
    frequency_band,        # DRSLAN corpus band 0..3 (null elsewhere)
    labels[str],
    collocates[str],       # DRSLAN collocates
    pronunciations[{text, scheme}],
    inflected_forms[{text, tag}],
    senses[{
      sense_id, indicator, labels[str], definitions[str],
      headword_translations[{text, lang_code, parts_of_speech[str], labels[str]}],
      headword_explanations[{text, lang_code}],
      examples[{text, labels[str], translations[{text, lang_code, labels[str]}]}]
    }],
    has_content,           # False => no senses / all senses empty (filter before LM use)
    source_ref, raw        # provenance

parquet/.pairs.parquet: one row per (source,target) unit (flat)

    dict_code, source_lang, target_lang, entry_id, sense_id, homograph_number,
    pair_type (headword|example), source_text, target_text,
    source_lemma,          # accent-folded entry headword (dedup + leak-free split key)
    part_of_speech, labels[str], domain, register

parquet/.relations.parquet: the full cross-reference graph

    dict_code, source_lang, target_lang, relation_index, type, description,
    members[{ref, headword, role, target_id}],
    serialized             # True => >=2 members resolved => present in DMLex XML/JSON

This is the lossless home of the cross-ref graph: it keeps cross-references whose target never resolved to an entry id (which the DMLex XML/JSON must drop).

reports/.report.json + reports/_summary.json

Per-dictionary stats, validation results, the controlled-value inventory, flagged-token counts, and the aggregate summary + artifact manifest (sha256 of every output).

5. Collection 1: Core (dist/core/)

5.1 Derived tables (dist/core/derived/)

| File | Rows | What it is |
|---|---|---|
| lemmas.parquet | 805,279 | one row per entry: lemma, accented_form, upos, frequency_band, cluster_id, split |
| pairs_dedup.parquet | 2,078,214 | de-duplicated translation pairs + occurrence_count, canonical_id, split |
| synonym_sets.parquet | 378,783 | in-sense (target-language) near-synonym sets + gloss |
| synonym_pairs.parquet | 2,131,996 | in-sense + pivot synonym pairs (evidence, confidence_tier) |
| pivot_synonyms.parquet | 152,655 | Slovene synonym candidates from translation pivots (GOLD 67,296 / SILVER 85,359) |
| hypernym_candidates.parquet | 4,169 | KNAUR genus-differentia hypernym candidates (confidence) |
| relations_typed.parquet | 113,868 | the cross-ref graph, GWA-typed (gwa_relType) |

Pivot-synonym yield: 152,655 SILVER+GOLD pairs (≥2 agreeing pivots) materialized; 632,231 single-pivot BRONZE pairs were counted but not materialized (large, low precision); 339,415 distinct pivots used.

5.2 LM tasks (dist/core/tasks/*.jsonl)

Each row: {id, task, split, input:{…}, output:{…}, metadata:{…}}. Split is train/dev/test (≈90/5/5), leak-free (§5.3). Marker policy: drop (undecodable-glyph rows are cleaned/omitted).
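To make the idea of a lemma-keyed split concrete, here is a minimal sketch that assigns a split from a stable hash of the folded Slovene lemma, so every row sharing that lemma lands in the same split regardless of dictionary or direction. This is an assumption about how such a split could be built, not the project's confirmed implementation; `assign_split` and the `sl_lemma` field are hypothetical names, and the 90/5/5 thresholds mirror the approximate proportions stated above.

```python
import hashlib


def assign_split(folded_lemma: str, train: int = 90, dev: int = 5) -> str:
    """Deterministically map a folded lemma to train/dev/test (about 90/5/5)."""
    digest = hashlib.sha256(folded_lemma.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    if bucket < train:
        return "train"
    if bucket < train + dev:
        return "dev"
    return "test"


# Two rows from opposite directions that share one Slovene lemma
# (hypothetical field name sl_lemma; values are illustrative).
rows = [
    {"dict_code": "DVSLFR", "source_text": "hiša", "target_text": "maison", "sl_lemma": "hiša"},
    {"dict_code": "DVFRSL", "source_text": "maison", "target_text": "hiša", "sl_lemma": "hiša"},
]
splits = [assign_split(r["sl_lemma"]) for r in rows]
print(splits)
```

Because the split depends only on the lemma string, re-running the assignment or adding new dictionaries never moves an existing lemma between splits, which a row-level shuffle cannot guarantee.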
| Task | Rows | input → output |
|---|---|---|
| translation | 3,006,506 | {source_text, source_lang, target_lang, part_of_speech, labels, domain, register} → {target_text} (both directions) |
| example_translation | 574,961 | example phrase → its translation |
| definition | 145,633 | {headword, lang, indicator} → {definition} (KNAUR) |
| reverse_dictionary | 145,633 | {definition, lang} → {headword} |
| wsd | 22,036 | {word, context, lang} → {sense_gloss, sense_id} (polysemous only; bare-number glosses dropped) |
| example_usage | 85,155 | {headword, lang} → {example} (monolingual usage sentences) |
| morphology | 332,393 | {headword, lang} → {form, tag} (dictionary inflected forms) |
| pronunciation | 157,578 | {headword, lang} → {transcription, scheme} |
| synonyms_of | 358,573 | {word, lang} → {synonyms[]} (a real set; 60% have >1) |
| hypernym_of | 4,169 | {word, lang} → {hypernym_candidate, confidence} |
| relation | 113,373 | a relation's first member → {relation_type, members[…]} |
| relation_classify | 113,327 | {a, b, lang} → {relation_type} (unordered-pair split) |

5.3 Leak-free split (important)

Translation / sense / synonym tasks key on the folded Slovene lemma, so a lemma and its reverse-direction twin (hiša in sl→fr and maison→hiša in fr→sl) are always in the same split. Verified: 0 of 172,081 Slovene headword lemmas straddle splits (e.g. translation split: train 2,711,054 / dev 144,596 / test 150,856). Morphology and pronunciation tasks, which concern the headword form, split by headword form; relation_classify splits by the unordered member pair, so foreign homographs do not straddle splits either. The legacy dict_code:entry_id key (now superseded) leaked ~26 % of multi-dict lemmas.

5.4 Cleaning applied (Core)

Dedup before split (with occurrence_count); degenerate targets dropped (punct/digit-only, single-char, src==tgt); unbalanced parentheses balanced; PUA sentinels + control chars stripped; undecodable glyph markers removed (marker_policy=drop). manifest.json carries a content hash (e76f2766…) and per-task split counts; dataset_card.md is the in-tree card.

6. Collection 2: Enriched (dist/enriched/)

| File | Rows | What it is |
|---|---|---|
| silver_morphology.parquet | 208,715 | CLASSLA lemma / UPOS / JOS-MULTEXT-East MSD / feats per Slovene lemma; morph_provenance="silver_tool" |
| tasks/msd_tagging.jsonl | 208,715 | {lemma, lang} → {upos, msd, feats} (the morphology/POS task; silver) |
| synset_links.parquet | 64,413 | Slovene lemma → sloWNet/OMW synset_id + ILI (join key to Princeton WN / OMW) |
| antonyms.parquet | 6,107 | imported Slovene antonyms (ILI-bridged through the English WordNet) |
| scored_synonyms.parquet | 364,334 | every Core synonym candidate + wordnet_confirmed + source_count |
| scored_hypernyms.parquet | 396 | checkable hypernym candidates + wordnet_confirmed |
| lexidma/KNAUR.{xml,json} | 97,418 rel | KNAUR re-serialized as DMLex with sloWNet antonym (142) + synonym (2,655) relations, ILI/synset in relation/description, GWA-typed |

6.1 Candidate scoring vs sloWNet (measured precision: lower bounds; sloWNet is incomplete)

Synonyms: 364,334 checkable, 17.0 % confirmed (in-sense 14.6 %, pivot 27.1 %, pivot-GOLD 38.5 %).
Hypernyms: 396 checkable, 47.5 % confirmed (ILI-bridged through the English WordNet).
Use wordnet_confirmed=True (and/or confidence_tier=GOLD) to extract a higher-precision subset.

6.2 External resources & how to reproduce

sloWNet/antonyms run in the main .venv (pip install -e '.[enrich]' → wn). CLASSLA needs Python ≤ 3.13 (its pinned numpy fails to build on 3.14), so run the silver morphology from a 3.12 env:

    uv venv --python 3.12 .venv-enrich
    uv pip install -p .venv-enrich/bin/python -e '.[enrich]'
    python -m wn download omw-sl ; python -m wn download oewn:2021
    .venv-enrich/bin/python -c "import classla; classla.download('sl')"
    .venv-enrich/bin/python -m dictconv.cli enrich --in dist/core --out dist/enriched --sample-limit 0

7. What to look for (usage guidance & caveats)

Filter before training

- has_content: drop entries with no usable content (≈1.2 % of entries) for entry-level tasks.
- marker_policy: task JSONLs are already built with drop; never train on the keep variant (it would teach the model to emit placeholder glyphs). Corpus markers are currently 0.
- Degenerate rows: already removed from the tasks; if you build your own from parquet/, apply the same filters (punct/digit-only, src==tgt, unbalanced parens).

Use the provided split.

Re-shuffling by row re-introduces lemma leakage; the cluster split is the point. Hold out whole lemma clusters, not rows.

Candidates are candidates, not gold

- synonym_*, pivot_synonyms, hypernym_candidates are induced and noisy. Gate with the enriched scoring: scored_synonyms.wordnet_confirmed / pivot confidence_tier=GOLD (38.5 % precision) for synonyms; scored_hypernyms.wordnet_confirmed for hypernyms.
- The 47.5 % hypernym figure is measured on a small checkable slice and is optimistic for the full pool (genus heads are not lemmatized; ~25 % are oblique forms; lemmatize with CLASSLA before use).
- Antonyms are imported, not mined (synonyms/antonyms are translationally indistinguishable).
- Precision is measured only on vocabulary sloWNet already has (~11 %). The extension value (the ~89 % of members not yet in sloWNet) is unproven; commission a small human eval before treating those as silver.

Gold vs silver

- The gold lemma layer (lemmas.parquet) is 100 % coverage, accent-folded, NFC-clean.
- The silver morphology (silver_morphology / msd_tagging) is CLASSLA tool output (morph_provenance="silver_tool"). Keep it filterable; measure MSD accuracy on a hand-tagged sample before training a morphological analyzer on it.
- The dictionary's own inflected_forms are mostly ending fragments, not full words.

Per-dictionary quality

- DVRUSL (Russian) is OCR-grade (Word .doc reconstruction). Cleanups were applied (brace→paren, | /bullet stripping, space collapse) but residual noise is inherent; down-weight or exclude for high-precision work.
- Definitions / reverse-dictionary come only from KNAUR (monolingual Slovene).
  There are no definitions from the 14 bilingual dicts.
- Conditioning labels are sparse (POS on ~18 % of translation rows, domain ~4 %, register ~4 %).

Provenance / audit

- Every removed escape token is logged in data/reference/removed_markers.tsv.
- reports/_summary.json carries the per-file sha256 manifest; each collection's manifest.json carries a content_hash and split counts. Pin these with any eval run.

8. Loading

    import json

    import pyarrow.parquet as pq
    from datasets import load_dataset

    # flat translation pairs (all dicts)
    pairs = load_dataset("parquet", data_files="parquet/*.pairs.parquet", split="train")

    # nested per-entry records (one dictionary)
    entries = load_dataset("parquet", data_files="parquet/VSIS.entries.parquet", split="train")
    entries = entries.filter(lambda r: r["has_content"])  # drop empty entries

    # a Core LM task, by split (each line is parsed once; the file is closed afterwards)
    with open("dist/core/tasks/translation.jsonl", encoding="utf-8") as f:
        train = [row for row in map(json.loads, f) if row["split"] == "train"]

    # wordnet-confirmed synonyms only (Enriched gate)
    syn = pq.read_table("dist/enriched/scored_synonyms.parquet").to_pylist()
    gold = [r for r in syn if r["wordnet_confirmed"]]

    # DMLex (faithful lexicographic view)
    with open("lexidma/VSIS.json", encoding="utf-8") as f:
        res = json.load(f)  # OASIS DMLex 1.0

9. Reproduce end-to-end

    pip install -e .                      # base deps (pyarrow, lxml, jsonschema, xmlschema)
    dictconv convert all --write-summary  # base: lexidma/, parquet/, reports/
    dictconv build-core                   # Collection 1 -> dist/core/
    dictconv enrich --sample-limit 0      # Collection 2 -> dist/enriched/ (see §6.2 for CLASSLA env)
    dictconv audit --write-summary        # readiness audit + artifact manifest
    pytest -q                             # 52 tests

Converted dictionaries
OASIS DMLex 1.0 (XML + JSON)

Slovenian lexicographic datasets from dict-conversions. Every dictionary is provided in both DMLex 1.0 serializations: .xml and .json.
Two collections:

intrinsic/: Core collection (faithful conversion; the project's own sources only)

All 15 converted dictionaries: DVANSL en->sl, DVFRSL fr->sl, DVITSL it->sl, DVRUSL ru->sl, DVSHSL sh->sl, DVSPSL es->sl, DVSLAN sl->en, DVSLFR sl->fr, DVSLNE sl->de, DVSLSH sl->sh, DVSLSP sl->es, DRSLAN sl->en, VSIS sl->it, LAT_AZ la->sl, KNAUR sl (monolingual encyclopedia).

extrinsic/: Enriched collection (external resources)

KNAUR.xml + KNAUR.json: the monolingual encyclopedia re-serialized with sloWNet-derived antonym (142) and synonym (2,655) relations. Each carries its sloWNet provenance (ILI / synset id) in relation/description, and the relation types link to the Global WordNet vocabulary via relationType/sameAs.

KNAUR is the ONLY resource whose external (sloWNet) enrichment is expressible as DMLex: DMLex 1.0 allows external sameAs links only on tag definitions, not on senses/entries/relations. The other enrichment layers (CLASSLA silver lemma/UPOS/MSD, per-lemma synset/ILI links, imported antonyms, and candidate scoring) are tabular and ship as Parquet in the dist/enriched/ collection (not in this archive).

intrinsic/KNAUR.* is the base KNAUR (no sloWNet relations); extrinsic/KNAUR.* is the enriched version. Diff them to see the added relations.
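One way to carry out that diff is sketched below: it compares the relation lists of two loaded DMLex JSON resources and reports the relations present only in the enriched one, counted by type. The key names `relations`, `type`, `members`, and `ref` are assumptions about the DMLex JSON layout, `relation_key` and `added_relations` are hypothetical helpers, and the two in-memory resources are invented stand-ins for intrinsic/KNAUR.json and extrinsic/KNAUR.json.

```python
from collections import Counter


def relation_key(rel: dict) -> tuple:
    """Identify a relation by its type and the sorted refs of its members."""
    refs = tuple(sorted(m.get("ref", "") for m in rel.get("members", [])))
    return (rel.get("type"), refs)


def added_relations(base: dict, enriched: dict) -> list[dict]:
    """Return relations found in `enriched` but not in `base`."""
    seen = {relation_key(r) for r in base.get("relations", [])}
    return [r for r in enriched.get("relations", []) if relation_key(r) not in seen]


# Illustrative stand-ins; real use would json.load the two KNAUR files.
base = {"relations": [{"type": "see", "members": [{"ref": "e1"}, {"ref": "e2"}]}]}
enriched = {
    "relations": base["relations"]
    + [{"type": "antonym", "members": [{"ref": "e3"}, {"ref": "e4"}]}]
}

added = added_relations(base, enriched)
counts = Counter(r["type"] for r in added)
print(dict(counts))
```

If the assumed keys match the real files, the per-type counts for KNAUR should be comparable to the antonym and synonym totals quoted above.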