# Projekt Marko Kokol

# Slovenian Lexicographic Datasets

## `dict-conversions`

Two LM-ready datasets built from **15 legacy CJVT/DZS Slovenian dictionaries** (14 bilingual + 1 monolingual encyclopedia), converted to OASIS **DMLex 1.0** and HuggingFace **Parquet**, then enriched. This document describes both datasets, the pipeline that produced them, their schemas, and **what to look for** when using them.

> **Status:** 805,279 entries · 1,019,685 senses · 2,174,880 translation pairs · 113,868 relations. All 15 dictionaries validate against the official DMLex XSD 1.1 + JSON Schema with **0 leaks, 0 parse errors, 0 residual glyph markers**. 52 unit tests pass.

---

### 1. The two datasets at a glance

| | Collection 1 — Core | Collection 2 — Enriched |
|---|---|---|
| Path | `dist/core/` (+ canonical `lexidma/`, `parquet/`) | `dist/enriched/` (extends Core) |
| Provenance | Intrinsic — derived only from the project's own dictionaries | Extrinsic — Core + external resources |
| External tools | none | CLASSLA-Stanza, sloWNet/OMW, English WordNet (oewn) |
| Size | ~2.1 GB | ~184 MB (layers only; use with Core) |
| Reproducible offline | yes | needs the external resources (one-time download) |
| Contents | DMLex XML/JSON, derived tables, 12 LM task JSONLs | silver morphology + MSD task, synset/ILI links, imported antonyms, candidate scoring, sloWNet-enriched KNAUR DMLex |

**Use them together.** Enriched is a thin layer of external-resource columns/files that *extends* Core; it does not duplicate it.

---

### 2. Source corpus

`sh` = Serbo-Croatian (legacy unified tag). KNAUR is monolingual (definitions + cross-references, no translations). DVRUSL is reconstructed from a Word `.doc` (OCR-grade).

| Code | Languages | Family | Entries | Senses | Pairs | Relations |
|---|---|---|---:|---:|---:|---:|
| DVANSL | en→sl | block\_blankline | 68,761 | 75,208 | 264,454 | 0 |
| DVFRSL | fr→sl | line | 39,980 | 70,817 | 175,458 | 0 |
| DVITSL | it→sl | block\_blankline | 60,282 | 98,444 | 197,449 | 0 |
| DVRUSL | ru→sl | doc\_runs | 32,844 | 32,844 | 80,364 | 0 |
| DVSHSL | sh→sl | block\_blankline | 92,302 | 92,302 | 150,228 | 5,563 |
| DVSPSL | es→sl | dzs\_utf16 | 36,460 | 66,722 | 114,691 | 414 |
| DVSLAN | sl→en | block\_blankline | 42,088 | 42,227 | 159,625 | 0 |
| DVSLFR | sl→fr | line | 33,866 | 44,092 | 128,561 | 34 |
| DVSLNE | sl→de | german | 84,951 | 90,374 | 252,074 | 120 |
| DVSLSH | sl→sh | block\_blankline | 72,690 | 87,140 | 152,068 | 2,694 |
| DVSLSP | sl→es | dzs\_utf16 | 33,676 | 66,456 | 93,562 | 349 |
| DRSLAN | sl→en | dzs\_nested | 25,559 | 33,273 | 75,374 | 457 |
| VSIS | sl→it | block\_geslo | 90,675 | 116,223 | 275,043 | 7,764 |
| LAT\_AZ | la→sl | block\_geslo | 11,521 | 20,315 | 55,929 | 1,852 |
| KNAUR | sl (mono) | encyclopedia | 79,624 | 83,248 | 0 | 94,621 |

Several pairs exist in **both directions** (DVSLFR/DVFRSL, DVSLAN/DVANSL, DVSLSH/DVSHSL, DVSLSP/DVSPSL, VSIS+DVITSL) — exploited for translation-pivot synonyms and reverse-dictionary tasks.

---

### 3. The pipeline

```
             ┌───────── BASE CONVERSION (intrinsic, faithful) ──────────┐
source dicts ─► parser family ─► IR (model.py) ─► DMLex XML/JSON (lexidma/)
                (8 families)     + textproc      └► Parquet (parquet/) + reports/
                (specialchars, markup, controlled vocab)
                          │
                          ▼
┌──────── COLLECTION 1 · CORE (build/, intrinsic) ────────┐
│ derive.py  : accent-folded lemma · dedup · leak-free split
│ synsets.py : in-sense synonyms · translation pivots · KNAUR hypernyms · GWA relation typing
│ tasks.py   : 12 LM task JSONLs → dist/core/
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌──────── COLLECTION 2 · ENRICHED (build/enrich.py, extrinsic) ───────┐
│ CLASSLA silver lemma/UPOS/MSD (+ msd_tagging task)
│ sloWNet/OMW synset+ILI links · ILI-bridged antonyms
│ candidate scoring vs sloWNet · sloWNet-typed DMLex relations (KNAUR) → dist/enriched/
└─────────────────────────────────────────────────────────────────────┘
```

**Stage 1 — Base conversion** (`src/dictconv/`, command `dictconv convert all`). Each dictionary's publisher typesetting markup is tokenized (not XML-parsed — it is malformed), `{…}` escapes decoded to Unicode (side-aware accents; undecodable font/template codes removed and logged), qualifiers classified into a controlled vocabulary, and emitted as a source-agnostic intermediate representation. Serializers produce **DMLex 1.0 XML+JSON** (validated against the official OASIS XSD 1.1 / JSON Schema) and three **Parquet** artifacts. The conversion is *faithful*: it never invents content and flags rather than silently drops.

**Stage 2 — Core** (`dictconv build-core`). Adds derived, ML-oriented layers computable from our data alone: a normalized lemma/UPOS layer, exact/near-duplicate collapse, a **leak-free** train/dev/test split, synonym/pivot/hypernym candidate tables, and 12 instruction-style task JSONLs.

**Stage 3 — Enriched** (`dictconv enrich`). Adds layers that need outside resources, kept strictly separate from the intrinsic data (gold-vs-silver provenance is explicit).

---

### 4. Base conversion artifacts

#### `lexidma/.xml` and `.json` — OASIS DMLex 1.0

Faithful lexicographic record. Camel-case elements; text of every object except `headword` in a nested ``; crosslingual module (`headwordTranslation`, `exampleTranslation`); the linking module (`relation` + **`relationType` definitions that carry a Global WordNet `sameAs` URI**, e.g. `see → https://globalwordnet.github.io/schemas/wn#also`). Validated with a real **XSD 1.1** processor (all identity constraints + assertions, except 3 cardinality-defective ones the published schema cannot satisfy). KNAUR uses the monolingual schema.

#### `parquet/.entries.parquet` — one row per entry (nested)

```
dict_code, entry_id, headword,
lemma,            # accent-folded join key (Sloleks/Gigafida/CLASSLA/sloWNet keying)
accented_form,    # original tonal/accented display form (null if == headword)
homograph_number, source_lang, target_lang,
meta_lang,        # editorial metalanguage = "sl"
parts_of_speech[str], upos,    # UPOS from the entry's first POS (intrinsic static map)
frequency_band,   # DRSLAN corpus band 0..3 (null elsewhere)
labels[str], collocates[str],  # DRSLAN  collocates
pronunciations[{text, scheme}],
inflected_forms[{text, tag}],
senses[{ sense_id, indicator, labels[str], definitions[str],
         headword_translations[{text, lang_code, parts_of_speech[str], labels[str]}],
         headword_explanations[{text, lang_code}],
         examples[{text, labels[str], translations[{text, lang_code, labels[str]}]}] }],
has_content,      # False => no senses / all senses empty (filter before LM use)
source_ref, raw   # provenance

```
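
The nested entry layout above can be flattened into translation rows with a few lines of Python. The following is a minimal sketch, not the project's own code: the function name `iter_headword_pairs` and the sample record are hypothetical, and only field names listed in the schema above are used.

```python
from typing import Iterator


def iter_headword_pairs(entry: dict) -> Iterator[dict]:
    """Yield one flat row per headword translation of a nested entry record."""
    if not entry.get("has_content"):
        return  # entries flagged as empty carry nothing to flatten
    for sense in entry.get("senses") or []:
        for tr in sense.get("headword_translations") or []:
            yield {
                "dict_code": entry["dict_code"],
                "entry_id": entry["entry_id"],
                "sense_id": sense["sense_id"],
                "source_text": entry["headword"],
                "target_text": tr["text"],
                "target_lang": tr["lang_code"],
            }


# Hand-made record for illustration only; it is not taken from the dataset.
sample_entry = {
    "dict_code": "DEMO",
    "entry_id": "e1",
    "headword": "hiša",
    "has_content": True,
    "senses": [
        {
            "sense_id": "e1_s1",
            "headword_translations": [
                {"text": "house", "lang_code": "en"},
                {"text": "home", "lang_code": "en"},
            ],
        }
    ],
}

rows = list(iter_headword_pairs(sample_entry))
```

For bulk work the ready-made pairs file is the better starting point; a helper like this is mainly useful when a sense-level field that the flat table does not carry is needed alongside each pair.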

#### `parquet/.pairs.parquet` — one row per (source,target) unit (flat)

```
dict_code, source_lang, target_lang, entry_id, sense_id, homograph_number,
pair_type (headword|example), source_text, target_text,
source_lemma,    # accent-folded entry headword (dedup + leak-free split key)
part_of_speech, labels[str], domain, register

```

#### `parquet/.relations.parquet` — the full cross-reference graph

```
dict_code, source_lang, target_lang, relation_index, type, description,
members[{ref, headword, role, target_id}],
serialized       # True => >=2 members resolved => present in DMLex XML/JSON

```

This is the **lossless** home of the cross-ref graph: it keeps cross-references whose target never resolved to an entry id (which the DMLex XML/JSON must drop).
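
To make the `serialized` rule concrete, here is a small sketch that recomputes it from a `members` list. It assumes an unresolved cross-reference is stored with a null `target_id`, which the schema comment suggests but does not state outright; the helper names and the sample relation are hypothetical.

```python
def resolved_members(members):
    """Members whose cross-reference resolved to an entry id."""
    return [m for m in members if m.get("target_id") is not None]


def is_serialized(members):
    """Mirror of the documented rule: at least two resolved members."""
    return len(resolved_members(members)) >= 2


def unresolved_refs(members):
    """Reference strings that never resolved (kept only in the Parquet graph)."""
    return [m["ref"] for m in members if m.get("target_id") is None]


# Hand-made relation for illustration only.
sample_members = [
    {"ref": "hiša", "headword": "hiša", "role": None, "target_id": "e1"},
    {"ref": "dom", "headword": "dom", "role": None, "target_id": "e2"},
    {"ref": "bajta", "headword": None, "role": None, "target_id": None},
]

ok = is_serialized(sample_members)
dangling = unresolved_refs(sample_members)
```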

#### `reports/.report.json` + `reports/_summary.json`

Per-dictionary stats, validation results, the controlled-value inventory, flagged-token counts, and the aggregate summary + artifact manifest (sha256 of every output).

---

### 5. Collection 1 — Core (`dist/core/`)

#### 5.1 Derived tables (`dist/core/derived/`)

| File | Rows | What it is |
|---|---:|---|
| `lemmas.parquet` | 805,279 | one row per entry: lemma, accented\_form, upos, frequency\_band, cluster\_id, split |
| `pairs_dedup.parquet` | 2,078,214 | de-duplicated translation pairs + `occurrence_count`, `canonical_id`, split |
| `synonym_sets.parquet` | 378,783 | in-sense (target-language) near-synonym sets + gloss |
| `synonym_pairs.parquet` | 2,131,996 | in-sense + pivot synonym pairs (evidence, confidence\_tier) |
| `pivot_synonyms.parquet` | 152,655 | Slovene synonym candidates from translation pivots (GOLD 67,296 / SILVER 85,359) |
| `hypernym_candidates.parquet` | 4,169 | KNAUR genus-differentia hypernym candidates (confidence) |
| `relations_typed.parquet` | 113,868 | the cross-ref graph, GWA-typed (`gwa_relType`) |

> Pivot-synonym yield: 152,655 SILVER+GOLD pairs (≥2 agreeing pivots) materialized; **632,231** single-pivot BRONZE pairs were counted but **not** materialized (large, low precision); 339,415 distinct pivots used.
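
The idea behind pivot synonyms can be sketched as follows: two Slovene lemmas that share a foreign translation are candidates, and the number of distinct shared pivots drives the tier. This is an illustrative reconstruction, not the code in `synsets.py`. The rule that separates GOLD from SILVER is not given here, so the sketch only distinguishes single-pivot BRONZE from the two-or-more case; the function names and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pivot_candidates(pairs):
    """Count agreeing pivots per unordered pair of Slovene lemmas.

    `pairs` is an iterable of (slovene_lemma, pivot_lang, pivot_text).
    """
    lemmas_by_pivot = defaultdict(set)
    for lemma, lang, text in pairs:
        lemmas_by_pivot[(lang, text)].add(lemma)

    support = defaultdict(set)
    for pivot, lemmas in lemmas_by_pivot.items():
        for a, b in combinations(sorted(lemmas), 2):
            support[(a, b)].add(pivot)
    return {pair: len(pivots) for pair, pivots in support.items()}


def tier(n_pivots):
    """Coarse tier; the GOLD/SILVER boundary is deliberately left open."""
    return "BRONZE" if n_pivots < 2 else "SILVER_OR_GOLD"


# Toy translation pairs, invented for the example.
toy = [
    ("hiša", "en", "house"), ("dom", "en", "house"),
    ("hiša", "fr", "maison"), ("dom", "fr", "maison"),
    ("dom", "en", "home"), ("domovina", "en", "home"),
]
counts = pivot_candidates(toy)
```

Because every pivot contributes all pairs among its lemmas, a very common pivot word generates many weak pairs, which fits the large single-pivot count reported above.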

#### 5.2 LM tasks (`dist/core/tasks/*.jsonl`)

Each row: `{id, task, split, input:{…}, output:{…}, metadata:{…}}`. Split is `train/dev/test` (≈90/5/5), leak-free (§5.3). Marker policy `drop` (undecodable-glyph rows are cleaned/omitted).

| Task | Rows | input → output |
|---|---:|---|
| `translation` | 3,006,506 | `{source_text, source_lang, target_lang, part_of_speech, labels, domain, register}` → `{target_text}` (both directions) |
| `example_translation` | 574,961 | example phrase → its translation |
| `definition` | 145,633 | `{headword, lang, indicator}` → `{definition}` (KNAUR) |
| `reverse_dictionary` | 145,633 | `{definition, lang}` → `{headword}` |
| `wsd` | 22,036 | `{word, context, lang}` → `{sense_gloss, sense_id}` (polysemous only; bare-number glosses dropped) |
| `example_usage` | 85,155 | `{headword, lang}` → `{example}` (monolingual usage sentences) |
| `morphology` | 332,393 | `{headword, lang}` → `{form, tag}` (dictionary inflected forms) |
| `pronunciation` | 157,578 | `{headword, lang}` → `{transcription, scheme}` |
| `synonyms_of` | 358,573 | `{word, lang}` → `{synonyms[]}` (a real **set**; 60% have >1) |
| `hypernym_of` | 4,169 | `{word, lang}` → `{hypernym_candidate, confidence}` |
| `relation` | 113,373 | a relation's first member → `{relation_type, members[…]}` |
| `relation_classify` | 113,327 | `{a, b, lang}` → `{relation_type}` (unordered-pair split) |

#### 5.3 Leak-free split (important)

- **Translation / sense / synonym tasks** key on the **folded Slovene lemma**, so a lemma and its reverse-direction twin (`hiša` in sl→fr and `maison→hiša` in fr→sl) are always in the **same** split. Verified: **0 of 172,081** Slovene headword lemmas straddle splits. (e.g. translation split: train 2,711,054 / dev 144,596 / test 150,856.)
- **Morphology & pronunciation** (about the headword *form*) split by **headword form**; `relation_classify` by the **unordered member pair** — so foreign homographs don't straddle either.
- The legacy `dict_code:entry_id` key (now superseded) leaked ~26 % of multi-dict lemmas.
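
A lemma-keyed split can be illustrated with a deterministic hash bucket. This is a sketch under stated assumptions, not the project's method: the shipped split is cluster-based (`cluster_id`), and the exact folding rule in `derive.py` is not reproduced here. The sketch assumes folding strips combining accents but keeps the caron so that `č`, `š`, `ž` stay distinct letters; `fold_lemma` and `assign_split` are hypothetical names.

```python
import hashlib
import unicodedata

CARON = "\u030c"  # kept so that č, š, ž survive folding (assumed rule)


def fold_lemma(text):
    """Lowercase and strip combining accents except the caron."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) or ch == CARON
    )
    return unicodedata.normalize("NFC", kept)


def assign_split(lemma):
    """Map a lemma to train/dev/test (about 90/5/5) via a stable hash."""
    digest = hashlib.sha256(fold_lemma(lemma).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < 90:
        return "train"
    if bucket < 95:
        return "dev"
    return "test"


# Accented and plain spellings of the same lemma land in the same split.
same = assign_split("híša") == assign_split("hiša")
```

Since the split depends only on the folded key, a row from sl→fr and its reverse-direction twin from fr→sl receive the same label as long as both are keyed on the Slovene lemma.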

#### 5.4 Cleaning applied (Core)

Dedup before split (with `occurrence_count`); degenerate targets dropped (punct/digit-only, single-char, `src==tgt`); unbalanced parentheses balanced; PUA sentinels + control chars stripped; undecodable glyph markers removed (`marker_policy=drop`). `manifest.json` carries a content hash (`e76f2766…`) and per-task split counts; `dataset_card.md` is the in-tree card.
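
For readers building their own rows from `parquet/`, the filters named above can be approximated as below. This is a hedged sketch: the exact rules used by the Core build are not shown in this document, so the character classes, the repair strategy for parentheses, and all function names are assumptions.

```python
import unicodedata


def is_degenerate(src, tgt):
    """True for targets that are empty, single-char, equal to the source,
    or made only of digits, punctuation, symbols and spaces."""
    t = tgt.strip()
    if len(t) <= 1:
        return True
    if src.strip() == t:
        return True
    return all(
        ch.isdigit() or unicodedata.category(ch)[0] in "PSZ" for ch in t
    )


def balance_parens(text):
    """Drop unmatched ')' and append missing ')' (one possible repair)."""
    out = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                continue  # unmatched closer: skip it
            depth -= 1
        out.append(ch)
    return "".join(out) + ")" * depth


def strip_pua_and_controls(text):
    """Remove private-use (Co) and control (Cc) characters.

    Note that this also removes tabs and newlines, which are category Cc.
    """
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Co", "Cc")
    )
```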

---

### 6. Collection 2 — Enriched (`dist/enriched/`)

| File | Rows | What it is |
|---|---:|---|
| `silver_morphology.parquet` | 208,715 | CLASSLA lemma / UPOS / **JOS-MULTEXT-East MSD** / feats per Slovene lemma; `morph_provenance="silver_tool"` |
| `tasks/msd_tagging.jsonl` | 208,715 | `{lemma, lang}` → `{upos, msd, feats}` (the morphology/POS task; **silver**) |
| `synset_links.parquet` | 64,413 | Slovene lemma → sloWNet/OMW `synset_id` + **ILI** (join key to Princeton WN / OMW) |
| `antonyms.parquet` | 6,107 | imported Slovene antonyms (ILI-bridged through the English WordNet) |
| `scored_synonyms.parquet` | 364,334 | every Core synonym candidate + `wordnet_confirmed` + `source_count` |
| `scored_hypernyms.parquet` | 396 | checkable hypernym candidates + `wordnet_confirmed` |
| `lexidma/KNAUR.{xml,json}` | 97,418 rel | KNAUR re-serialized as DMLex with sloWNet **antonym (142)** + **synonym (2,655)** relations, ILI/synset in `relation/description`, GWA-typed |

#### 6.1 Candidate scoring vs sloWNet (measured precision — lower bounds; sloWNet is incomplete)

- **Synonyms:** 364,334 checkable, **17.0 %** confirmed — in-sense 14.6 %, pivot 27.1 %, **pivot-GOLD 38.5 %**.
- **Hypernyms:** 396 checkable, **47.5 %** confirmed (ILI-bridged through the English WordNet).
- Use `wordnet_confirmed=True` (and/or `confidence_tier=GOLD`) to extract a higher-precision subset.
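
The per-group precision figures above are plain ratios of confirmed to checkable candidates. The sketch below shows that computation on toy rows; `precision_by` is a hypothetical helper, and the grouping column `evidence` is an assumption (it is documented for `synonym_pairs.parquet`, not confirmed for the scored table).

```python
from collections import defaultdict


def precision_by(rows, key):
    """Share of wordnet-confirmed rows within each value of `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        hits[row[key]] += bool(row["wordnet_confirmed"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy rows, invented for the example; real rows come from scored_synonyms.parquet.
toy_rows = [
    {"evidence": "pivot", "wordnet_confirmed": True},
    {"evidence": "pivot", "wordnet_confirmed": False},
    {"evidence": "in_sense", "wordnet_confirmed": False},
    {"evidence": "in_sense", "wordnet_confirmed": False},
    {"evidence": "in_sense", "wordnet_confirmed": True},
    {"evidence": "in_sense", "wordnet_confirmed": False},
]
precision = precision_by(toy_rows, "evidence")
```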

#### 6.2 External resources & how to reproduce

sloWNet/antonyms run in the main `.venv` (`pip install -e '.[enrich]'` → `wn`). **CLASSLA needs Python ≤ 3.13** (its pinned numpy fails to build on 3.14), so run the silver morphology from a 3.12 env:

```bash
uv venv --python 3.12 .venv-enrich
uv pip install -p .venv-enrich/bin/python -e '.[enrich]'
python -m wn download omw-sl ; python -m wn download oewn:2021
.venv-enrich/bin/python -c "import classla; classla.download('sl')"
.venv-enrich/bin/python -m dictconv.cli enrich --in dist/core --out dist/enriched --sample-limit 0

```

---

### 7. What to look for (usage guidance & caveats)

**Filter before training**

- **`has_content`** — drop entries with no usable content (≈1.2 % of entries) for entry-level tasks.
- **`marker_policy`** — task JSONLs are already built with `drop`; never train on the `keep` variant (it would teach the model to emit placeholder glyphs). Corpus markers are currently 0.
- **Degenerate rows** — already removed from the tasks; if you build your own from `parquet/`, apply the same filters (punct/digit-only, `src==tgt`, unbalanced parens).
- **Use the provided split.** Re-shuffling by row re-introduces lemma leakage; the cluster split is the point. Hold out **whole lemma clusters**, not rows.

**Candidates are candidates, not gold**

- `synonym_*`, `pivot_synonyms`, `hypernym_candidates` are **induced** and noisy. Gate with the enriched scoring: `scored_synonyms.wordnet_confirmed` / `pivot confidence_tier=GOLD` (38.5 % precision) for synonyms; `scored_hypernyms.wordnet_confirmed` for hypernyms. The 47.5 % hypernym figure is measured on a small checkable slice and is optimistic for the full pool (genus heads are not lemmatized — ~25 % are oblique forms; lemmatize with CLASSLA before use).
- **Antonyms are imported, not mined** (synonyms/antonyms are translationally indistinguishable).
- **Precision is measured only on vocabulary sloWNet already has (~11 %).** The extension value (the ~89 % of members not yet in sloWNet) is **unproven** — commission a small human eval before treating those as silver.

**Gold vs silver**

- The **gold** lemma layer (`lemmas.parquet`) is 100 % coverage, accent-folded, NFC-clean.
- The **silver** morphology (`silver_morphology` / `msd_tagging`) is CLASSLA tool output (`morph_provenance="silver_tool"`). Keep it filterable; measure MSD accuracy on a hand-tagged sample before training a morphological analyzer on it. The dictionary's own `inflected_forms` are mostly *ending fragments*, not full words.

**Per-dictionary quality**

- **DVRUSL** (Russian) is OCR-grade (Word `.doc` reconstruction). Cleanups were applied (brace→paren, `|`/bullet stripping, space collapse) but residual noise is inherent — down-weight or exclude for high-precision work.
- **Definitions / reverse-dictionary** come **only from KNAUR** (monolingual Slovene). There are no definitions from the 14 bilingual dicts.
- **Conditioning labels are sparse** (POS on ~18 % of translation rows, domain ~4 %, register ~4 %).

**Provenance / audit**

- Every removed escape token is logged in [`data/reference/removed_markers.tsv`](data/reference/removed_markers.tsv).
- `reports/_summary.json` carries the per-file sha256 manifest; each collection's `manifest.json` carries a `content_hash` and split counts. Pin these with any eval run.
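
Pinning artifacts amounts to recomputing file digests and comparing them with the recorded ones. The following sketch uses only the standard library. The internal layout of `reports/_summary.json` is not described here, so the `expected` mapping (relative path to hex digest) is an assumed shape that would need adapting, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def mismatches(expected, root="."):
    """Return the relative paths that are missing or whose digest differs.

    `expected` maps relative path -> hex digest (assumed manifest shape).
    """
    bad = []
    for rel, digest in expected.items():
        path = Path(root) / rel
        if not path.is_file() or sha256_of(path) != digest:
            bad.append(rel)
    return bad
```

An empty list from `mismatches` means every listed artifact is present and byte-identical to what the manifest recorded.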

---

### 8. Loading

```python
from datasets import load_dataset

# flat translation pairs (all dicts)
pairs = load_dataset("parquet", data_files="parquet/*.pairs.parquet", split="train")

# nested per-entry records (one dictionary)
entries = load_dataset("parquet", data_files="parquet/VSIS.entries.parquet", split="train")
entries = entries.filter(lambda r: r["has_content"])          # drop empty entries

# a Core LM task, by split
import json
with open("dist/core/tasks/translation.jsonl", encoding="utf-8") as f:
    rows = (json.loads(line) for line in f)                   # parse each line once
    train = [r for r in rows if r["split"] == "train"]

# wordnet-confirmed synonyms only (Enriched gate)
import pyarrow.parquet as pq
syn = pq.read_table("dist/enriched/scored_synonyms.parquet").to_pylist()
gold = [r for r in syn if r["wordnet_confirmed"]]

# DMLex (faithful lexicographic view)
with open("lexidma/VSIS.json", encoding="utf-8") as f:
    res = json.load(f)                                        # OASIS DMLex 1.0

```

### 9. Reproduce end-to-end

```bash
pip install -e .                 # base deps (pyarrow, lxml, jsonschema, xmlschema)
dictconv convert all --write-summary     # base: lexidma/, parquet/, reports/
dictconv build-core                       # Collection 1 -> dist/core/
dictconv enrich --sample-limit 0          # Collection 2 -> dist/enriched/  (see §6.2 for CLASSLA env)
dictconv audit --write-summary            # readiness audit + artifact manifest
pytest -q                                 # 52 tests

```

# Converted dictionaries

# OASIS DMLex 1.0 (XML + JSON)

Slovenian lexicographic datasets from `dict-conversions`. Every dictionary is provided in **both** DMLex 1.0 serializations: `.xml` and `.json`. Two collections:

### intrinsic/ — Core collection (faithful conversion; the project's own sources only)

All 15 converted dictionaries: DVANSL en->sl, DVFRSL fr->sl, DVITSL it->sl, DVRUSL ru->sl, DVSHSL sh->sl, DVSPSL es->sl, DVSLAN sl->en, DVSLFR sl->fr, DVSLNE sl->de, DVSLSH sl->sh, DVSLSP sl->es, DRSLAN sl->en, VSIS sl->it, LAT\_AZ la->sl, KNAUR sl (monolingual encyclopedia).

### extrinsic/ — Enriched collection (external resources)

KNAUR.xml + KNAUR.json — the monolingual encyclopedia re-serialized with sloWNet-derived **antonym (142)** and **synonym (2,655)** `relation`s. Each carries its sloWNet provenance (ILI / synset id) in `relation/description`, and the relation types link to the Global WordNet vocabulary via `relationType/sameAs`.

KNAUR is the ONLY resource whose external (sloWNet) enrichment is expressible *as DMLex*: DMLex 1.0 allows external `sameAs` links only on tag definitions, not on senses/entries/relations. The other enrichment layers — CLASSLA silver lemma/UPOS/MSD, per-lemma synset/ILI links, imported antonyms, and candidate scoring — are tabular and ship as Parquet in the `dist/enriched/` collection (not in this archive). `intrinsic/KNAUR.*` is the base KNAUR (no sloWNet relations); `extrinsic/KNAUR.*` is the enriched version — diff them to see the added relations.
