# Projekt Marko Kokol

# Slovenian Lexicographic Datasets

## `dict-conversions`

Two LM-ready datasets built from **15 legacy CJVT/DZS Slovenian dictionaries** (14 bilingual + 1 monolingual encyclopedia), converted to OASIS **DMLex 1.0** and HuggingFace **Parquet**, then enriched. This document describes both datasets, the pipeline that produced them, their schemas, and **what to look for** when using them.

> **Status:** 805,279 entries · 1,019,685 senses · 2,174,880 translation pairs · 113,868 relations. All 15 dictionaries validate against the official DMLex 1.0 schemas (XSD 1.1 and JSON Schema) with **0 leaks, 0 parse errors, 0 residual glyph markers**. 52 unit tests pass.

---

### 1. The two datasets at a glance

<table id="bkmrk-collection-1-%E2%80%94-core-"><thead><tr><th></th><th>**Collection 1 — Core**</th><th>**Collection 2 — Enriched**</th></tr></thead><tbody><tr><td>Path</td><td>`dist/core/` (+ canonical `lexidma/`, `parquet/`)</td><td>`dist/enriched/` (extends Core)</td></tr><tr><td>Provenance</td><td>**Intrinsic** — derived only from the project's own dictionaries</td><td>**Extrinsic** — Core + external resources</td></tr><tr><td>External tools</td><td>none</td><td>CLASSLA-Stanza, sloWNet/OMW, English WordNet (oewn)</td></tr><tr><td>Size</td><td>~2.1 GB</td><td>~184 MB (layers only; use *with* Core)</td></tr><tr><td>Reproducible offline</td><td>yes</td><td>needs the external resources (one-time download)</td></tr><tr><td>Contents</td><td>DMLex XML/JSON, derived tables, 12 LM task JSONLs</td><td>silver morphology + MSD task, synset/ILI links, imported antonyms, candidate scoring, sloWNet-enriched KNAUR DMLex</td></tr></tbody></table>

**Use them together.** Enriched is a thin layer of external-resource columns/files that *extends* Core; it does not duplicate it.

---

### 2. Source corpus

`sh` = Serbo-Croatian (legacy unified tag). KNAUR is monolingual (definitions + cross-references, no translations). DVRUSL is reconstructed from a Word `.doc` (OCR-grade).

<table id="bkmrk-code-languages-famil"><thead><tr><th>Code</th><th>Languages</th><th>Family</th><th align="right">Entries</th><th align="right">Senses</th><th align="right">Pairs</th><th align="right">Relations</th></tr></thead><tbody><tr><td>DVANSL</td><td>en→sl</td><td>block\_blankline</td><td align="right">68,761</td><td align="right">75,208</td><td align="right">264,454</td><td align="right">0</td></tr><tr><td>DVFRSL</td><td>fr→sl</td><td>line</td><td align="right">39,980</td><td align="right">70,817</td><td align="right">175,458</td><td align="right">0</td></tr><tr><td>DVITSL</td><td>it→sl</td><td>block\_blankline</td><td align="right">60,282</td><td align="right">98,444</td><td align="right">197,449</td><td align="right">0</td></tr><tr><td>DVRUSL</td><td>ru→sl</td><td>doc\_runs</td><td align="right">32,844</td><td align="right">32,844</td><td align="right">80,364</td><td align="right">0</td></tr><tr><td>DVSHSL</td><td>sh→sl</td><td>block\_blankline</td><td align="right">92,302</td><td align="right">92,302</td><td align="right">150,228</td><td align="right">5,563</td></tr><tr><td>DVSPSL</td><td>es→sl</td><td>dzs\_utf16</td><td align="right">36,460</td><td align="right">66,722</td><td align="right">114,691</td><td align="right">414</td></tr><tr><td>DVSLAN</td><td>sl→en</td><td>block\_blankline</td><td align="right">42,088</td><td align="right">42,227</td><td align="right">159,625</td><td align="right">0</td></tr><tr><td>DVSLFR</td><td>sl→fr</td><td>line</td><td align="right">33,866</td><td align="right">44,092</td><td align="right">128,561</td><td align="right">34</td></tr><tr><td>DVSLNE</td><td>sl→de</td><td>german</td><td align="right">84,951</td><td align="right">90,374</td><td align="right">252,074</td><td align="right">120</td></tr><tr><td>DVSLSH</td><td>sl→sh</td><td>block\_blankline</td><td align="right">72,690</td><td align="right">87,140</td><td align="right">152,068</td><td 
align="right">2,694</td></tr><tr><td>DVSLSP</td><td>sl→es</td><td>dzs\_utf16</td><td align="right">33,676</td><td align="right">66,456</td><td align="right">93,562</td><td align="right">349</td></tr><tr><td>DRSLAN</td><td>sl→en</td><td>dzs\_nested</td><td align="right">25,559</td><td align="right">33,273</td><td align="right">75,374</td><td align="right">457</td></tr><tr><td>VSIS</td><td>sl→it</td><td>block\_geslo</td><td align="right">90,675</td><td align="right">116,223</td><td align="right">275,043</td><td align="right">7,764</td></tr><tr><td>LAT\_AZ</td><td>la→sl</td><td>block\_geslo</td><td align="right">11,521</td><td align="right">20,315</td><td align="right">55,929</td><td align="right">1,852</td></tr><tr><td>KNAUR</td><td>sl (mono)</td><td>encyclopedia</td><td align="right">79,624</td><td align="right">83,248</td><td align="right">0</td><td align="right">94,621</td></tr></tbody></table>

Several pairs exist in **both directions** (DVSLFR/DVFRSL, DVSLAN/DVANSL, DVSLSH/DVSHSL, DVSLSP/DVSPSL, VSIS/DVITSL); these are exploited for translation-pivot synonyms and reverse-dictionary tasks.

---

### 3. The pipeline

```
                 ┌───────── BASE CONVERSION (intrinsic, faithful) ──────────┐
 source dicts ─► parser family ─► IR (model.py) ─► DMLex XML/JSON  (lexidma/)
   (8 families)   + textproc                     └► Parquet         (parquet/)  + reports/
                  (specialchars, markup,
                   controlled vocab)
                          │
                          ▼
      ┌──────── COLLECTION 1 · CORE (build/, intrinsic) ────────┐
      │ derive.py  : accent-folded lemma · dedup · leak-free split
      │ synsets.py : in-sense synonyms · translation pivots · KNAUR hypernyms · GWA relation typing
      │ tasks.py   : 12 LM task JSONLs                                   → dist/core/
      └─────────────────────────────────────────────────────────┘
                          │
                          ▼
      ┌──────── COLLECTION 2 · ENRICHED (build/enrich.py, extrinsic) ───────┐
      │ CLASSLA silver lemma/UPOS/MSD (+ msd_tagging task)
      │ sloWNet/OMW synset+ILI links · ILI-bridged antonyms
      │ candidate scoring vs sloWNet · sloWNet-typed DMLex relations (KNAUR) → dist/enriched/
      └─────────────────────────────────────────────────────────────────────┘

```

**Stage 1 — Base conversion** (`src/dictconv/`, command `dictconv convert all`). Each dictionary's publisher typesetting markup is tokenized (not XML-parsed — it is malformed), `{…}` escapes decoded to Unicode (side-aware accents; undecodable font/template codes removed and logged), qualifiers classified into a controlled vocabulary, and emitted as a source-agnostic intermediate representation. Serializers produce **DMLex 1.0 XML+JSON** (validated against the official OASIS XSD 1.1 / JSON Schema) and three **Parquet** artifacts. The conversion is *faithful*: it never invents content and flags rather than silently drops.

**Stage 2 — Core** (`dictconv build-core`). Adds derived, ML-oriented layers computable from our data alone: a normalized lemma/UPOS layer, exact/near-duplicate collapse, a **leak-free** train/dev/test split, synonym/pivot/hypernym candidate tables, and 12 instruction-style task JSONLs.

**Stage 3 — Enriched** (`dictconv enrich`). Adds layers that need outside resources, kept strictly separate from the intrinsic data (gold-vs-silver provenance is explicit).

---

### 4. Base conversion artifacts

#### `lexidma/<CODE>.xml` and `.json` — OASIS DMLex 1.0

Faithful lexicographic record. Elements are camel-case, and the text of every object except `headword` sits in a nested `<text>`. Two DMLex modules are used: the crosslingual module (`headwordTranslation`, `exampleTranslation`) and the linking module (`relation` plus **`relationType` definitions that carry a Global WordNet `sameAs` URI**, e.g. `see → https://globalwordnet.github.io/schemas/wn#also`). Files are validated with a real **XSD 1.1** processor: all identity constraints and assertions are checked, except 3 cardinality-defective ones that the published schema cannot satisfy. KNAUR uses the monolingual schema.

#### `parquet/<CODE>.entries.parquet` — one row per entry (nested)

```
dict_code, entry_id, headword,
lemma,            # accent-folded join key (Sloleks/Gigafida/CLASSLA/sloWNet keying)
accented_form,    # original tonal/accented display form (null if == headword)
homograph_number, source_lang, target_lang,
meta_lang,        # editorial metalanguage = "sl"
parts_of_speech[str], upos,    # UPOS from the entry's first POS (intrinsic static map)
frequency_band,   # DRSLAN corpus band 0..3 (null elsewhere)
labels[str], collocates[str],  # DRSLAN <KO> collocates
pronunciations[{text, scheme}],
inflected_forms[{text, tag}],
senses[{ sense_id, indicator, labels[str], definitions[str],
         headword_translations[{text, lang_code, parts_of_speech[str], labels[str]}],
         headword_explanations[{text, lang_code}],
         examples[{text, labels[str], translations[{text, lang_code, labels[str]}]}] }],
has_content,      # False => no senses / all senses empty (filter before LM use)
source_ref, raw   # provenance

```
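
The nested `senses` column can be walked row by row when the flat pairs table is not enough. The following is a minimal sketch of that traversal, assuming rows have already been loaded as plain Python dicts in the shape listed above; the helper name `iter_headword_pairs` and the sample row are illustrative, not part of the project.

```python
from typing import Iterator


def iter_headword_pairs(entry: dict) -> Iterator[tuple[str, str, str]]:
    """Yield (headword, translation text, lang_code) for each sense of one entry row."""
    if not entry.get("has_content"):
        return  # empty entries have nothing worth flattening
    for sense in entry.get("senses") or []:
        for tr in sense.get("headword_translations") or []:
            yield entry["headword"], tr["text"], tr["lang_code"]


# Hand-made row in the documented shape (illustrative values, not taken from the data).
row = {
    "headword": "hiša",
    "has_content": True,
    "senses": [
        {
            "sense_id": "s1",
            "headword_translations": [
                {"text": "house", "lang_code": "en", "parts_of_speech": [], "labels": []}
            ],
        }
    ],
}
pairs = list(iter_headword_pairs(row))
empty = list(iter_headword_pairs({"headword": "x", "has_content": False, "senses": []}))
```

For bulk work the ready-made `pairs.parquet` files are the better starting point; this traversal is mainly useful when sense-level fields such as `indicator` need to travel with each pair.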

#### `parquet/<CODE>.pairs.parquet` — one row per (source,target) unit (flat)

```
dict_code, source_lang, target_lang, entry_id, sense_id, homograph_number,
pair_type (headword|example), source_text, target_text,
source_lemma,    # accent-folded entry headword (dedup + leak-free split key)
part_of_speech, labels[str], domain, register

```

#### `parquet/<CODE>.relations.parquet` — the full cross-reference graph

```
dict_code, source_lang, target_lang, relation_index, type, description,
members[{ref, headword, role, target_id}],
serialized       # True => >=2 members resolved => present in DMLex XML/JSON

```

This is the **lossless** home of the cross-ref graph: it keeps cross-references whose target never resolved to an entry id (which the DMLex XML/JSON must drop).
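
The `serialized` flag can be recomputed from `members` as a sanity check. This is a sketch under one assumption that the schema comment suggests but does not state outright: an unresolved member has `target_id` set to null. The function name is hypothetical.

```python
def is_serialized(members: list[dict]) -> bool:
    """True when at least two members resolved to an entry id.

    Assumption: an unresolved cross-reference has target_id == None.
    """
    resolved = [m for m in members if m.get("target_id") is not None]
    return len(resolved) >= 2


# Illustrative members, not taken from the data.
both_resolved = [
    {"ref": "a", "headword": "a", "role": None, "target_id": "e1"},
    {"ref": "b", "headword": "b", "role": None, "target_id": "e2"},
]
one_dangling = [
    {"ref": "a", "headword": "a", "role": None, "target_id": "e1"},
    {"ref": "zzz", "headword": "zzz", "role": None, "target_id": None},
]
flags = (is_serialized(both_resolved), is_serialized(one_dangling))
```

Rows where the flag is false are the ones that exist only in Parquet, so they are the place to look when a cross-reference seems to be missing from the DMLex files.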

#### `reports/<CODE>.report.json` + `reports/_summary.json`

Per-dictionary stats, validation results, the controlled-value inventory, flagged-token counts, and the aggregate summary + artifact manifest (sha256 of every output).
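
A downloaded copy can be checked against the sha256 manifest with the standard library alone. The sketch below assumes the manifest has been reduced to a mapping of relative path to hex digest; the actual layout inside `_summary.json` is not documented here, so that mapping and the function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large Parquet files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest: dict[str, str], root: Path) -> list[str]:
    """Return relative paths that are missing or whose digest differs.

    `manifest` maps relative path -> expected hex digest (hypothetical layout).
    """
    return [
        rel
        for rel, expected in manifest.items()
        if not (root / rel).is_file() or sha256_of(root / rel) != expected
    ]


# Self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"abc")
    ok = find_mismatches({"a.txt": hashlib.sha256(b"abc").hexdigest()}, root)
    bad = find_mismatches({"a.txt": "0" * 64, "missing.txt": "0" * 64}, root)
```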

---

### 5. Collection 1 — Core (`dist/core/`)

#### 5.1 Derived tables (`dist/core/derived/`)

<table id="bkmrk-file-rows-what-it-is"><thead><tr><th>File</th><th align="right">Rows</th><th>What it is</th></tr></thead><tbody><tr><td>`lemmas.parquet`</td><td align="right">805,279</td><td>one row per entry: lemma, accented\_form, upos, frequency\_band, cluster\_id, split</td></tr><tr><td>`pairs_dedup.parquet`</td><td align="right">2,078,214</td><td>de-duplicated translation pairs + `occurrence_count`, `canonical_id`, split</td></tr><tr><td>`synonym_sets.parquet`</td><td align="right">378,783</td><td>in-sense (target-language) near-synonym sets + gloss</td></tr><tr><td>`synonym_pairs.parquet`</td><td align="right">2,131,996</td><td>in-sense + pivot synonym pairs (evidence, confidence\_tier)</td></tr><tr><td>`pivot_synonyms.parquet`</td><td align="right">152,655</td><td>Slovene synonym candidates from translation pivots (GOLD 67,296 / SILVER 85,359)</td></tr><tr><td>`hypernym_candidates.parquet`</td><td align="right">4,169</td><td>KNAUR genus-differentia hypernym candidates (confidence)</td></tr><tr><td>`relations_typed.parquet`</td><td align="right">113,868</td><td>the cross-ref graph, GWA-typed (`gwa_relType`)</td></tr></tbody></table>

> Pivot-synonym yield: 152,655 SILVER+GOLD pairs (≥2 agreeing pivots) materialized; **632,231** single-pivot BRONZE pairs were counted but **not** materialized (large, low precision); 339,415 distinct pivots used.
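
The pivot idea can be illustrated in a few lines: two Slovene lemmas become a synonym candidate when they share foreign-language translations, and only pairs supported by at least two distinct pivots are kept. This is a toy sketch, not the logic of `synsets.py`; the input tuple layout and the function name are assumptions, and the GOLD/SILVER boundary is not reproduced because it is not specified here.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable


def pivot_candidates(
    pairs: Iterable[tuple[str, str, str]], min_pivots: int = 2
) -> dict[tuple[str, str], int]:
    """pairs: (slovene_lemma, pivot_lang, pivot_text). Returns pair -> pivot count."""
    by_pivot: dict[tuple[str, str], set[str]] = defaultdict(set)
    for lemma, lang, text in pairs:
        by_pivot[(lang, text)].add(lemma)

    support: dict[tuple[str, str], set[tuple[str, str]]] = defaultdict(set)
    for pivot, lemmas in by_pivot.items():
        for a, b in combinations(sorted(lemmas), 2):
            support[(a, b)].add(pivot)

    # Single-pivot pairs are dropped, mirroring the "not materialized" policy.
    return {pair: len(p) for pair, p in support.items() if len(p) >= min_pivots}


# Illustrative input, not taken from the data.
toy = [
    ("hiša", "en", "house"), ("dom", "en", "house"),
    ("hiša", "fr", "maison"), ("dom", "fr", "maison"),
    ("dom", "en", "home"), ("domovina", "en", "home"),
]
candidates = pivot_candidates(toy)
```

In the toy input, `dom`/`hiša` share two pivots and survive, while `dom`/`domovina` share only one and are filtered out.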

#### 5.2 LM tasks (`dist/core/tasks/*.jsonl`)

Each row: `{id, task, split, input:{…}, output:{…}, metadata:{…}}`. Split is `train/dev/test` (≈90/5/5), leak-free (§5.3). Marker policy `drop` (undecodable-glyph rows are cleaned/omitted).

<table id="bkmrk-task-rows-input-%E2%86%92-ou"><thead><tr><th>Task</th><th align="right">Rows</th><th>input → output</th></tr></thead><tbody><tr><td>`translation`</td><td align="right">3,006,506</td><td>`{source_text, source_lang, target_lang, part_of_speech, labels, domain, register}` → `{target_text}` (both directions)</td></tr><tr><td>`example_translation`</td><td align="right">574,961</td><td>example phrase → its translation</td></tr><tr><td>`definition`</td><td align="right">145,633</td><td>`{headword, lang, indicator}` → `{definition}` (KNAUR)</td></tr><tr><td>`reverse_dictionary`</td><td align="right">145,633</td><td>`{definition, lang}` → `{headword}`</td></tr><tr><td>`wsd`</td><td align="right">22,036</td><td>`{word, context, lang}` → `{sense_gloss, sense_id}` (polysemous only; bare-number glosses dropped)</td></tr><tr><td>`example_usage`</td><td align="right">85,155</td><td>`{headword, lang}` → `{example}` (monolingual usage sentences)</td></tr><tr><td>`morphology`</td><td align="right">332,393</td><td>`{headword, lang}` → `{form, tag}` (dictionary inflected forms)</td></tr><tr><td>`pronunciation`</td><td align="right">157,578</td><td>`{headword, lang}` → `{transcription, scheme}`</td></tr><tr><td>`synonyms_of`</td><td align="right">358,573</td><td>`{word, lang}` → `{synonyms[]}` (a real **set**; 60% have &gt;1)</td></tr><tr><td>`hypernym_of`</td><td align="right">4,169</td><td>`{word, lang}` → `{hypernym_candidate, confidence}`</td></tr><tr><td>`relation`</td><td align="right">113,373</td><td>a relation's first member → `{relation_type, members[…]}`</td></tr><tr><td>`relation_classify`</td><td align="right">113,327</td><td>`{a, b, lang}` → `{relation_type}` (unordered-pair split)</td></tr></tbody></table>

#### 5.3 Leak-free split (important)

- **Translation / sense / synonym tasks** key on the **folded Slovene lemma**, so a lemma and its reverse-direction twin (`hiša` in sl→fr and `maison→hiša` in fr→sl) are always in the **same** split. Verified: **0 of 172,081** Slovene headword lemmas straddle splits (e.g. translation split: train 2,711,054 / dev 144,596 / test 150,856).
- **Morphology & pronunciation** (about the headword *form*) split by **headword form**, and `relation_classify` by the **unordered member pair**, so foreign homographs don't straddle splits either.
- The legacy `dict_code:entry_id` key (now superseded) leaked ~26 % of multi-dict lemmas.
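
One way to get a split with this property is to hash the folded lemma and bucket on the hash, so every row that shares a lemma lands in the same split regardless of dictionary or direction. The sketch below is an illustration of that principle, not the code in `derive.py`: the folding rule (strip combining marks but keep the caron of č/š/ž) and the hash-bucket scheme are assumptions.

```python
import hashlib
import unicodedata

# Assumption: tonal/stress marks are folded away, but the caron is part of the letter.
_KEEP = {"\u030C"}  # combining caron


def fold_lemma(text: str) -> str:
    """Strip combining accents except the caron, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch) or ch in _KEEP
    )
    return unicodedata.normalize("NFC", kept)


def assign_split(lemma: str, dev: float = 0.05, test: float = 0.05) -> str:
    """Deterministic split from a hash of the folded lemma (roughly 90/5/5)."""
    digest = hashlib.sha256(fold_lemma(lemma).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if bucket < test:
        return "test"
    if bucket < test + dev:
        return "dev"
    return "train"


folded = fold_lemma("híša")          # accented display form
same = assign_split("híša") == assign_split("hiša")
```

Because the split is a pure function of the lemma, adding a new dictionary later cannot move an existing lemma to a different split.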

#### 5.4 Cleaning applied (Core)

Dedup before split (with `occurrence_count`); degenerate targets dropped (punct/digit-only, single-char, `src==tgt`); unbalanced parentheses balanced; PUA sentinels + control chars stripped; undecodable glyph markers removed (`marker_policy=drop`). `manifest.json` carries a content hash (`e76f2766…`) and per-task split counts; `dataset_card.md` is the in-tree card.
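
Anyone rebuilding tasks from `parquet/` needs equivalent filters. The following is a rough sketch of the degenerate-target checks and one possible way to balance parentheses; the exact rules used by the project are not spelled out here, so both functions are hypothetical approximations.

```python
import unicodedata


def is_degenerate(source: str, target: str) -> bool:
    """True for targets that are single-char, punct/digit-only, or equal to the source."""
    t = target.strip()
    if len(t) <= 1:
        return True
    if all(
        ch.isdigit() or ch.isspace() or unicodedata.category(ch).startswith("P")
        for ch in t
    ):
        return True
    return source.strip() == t


def balance_parens(text: str) -> str:
    """Drop stray closing parentheses and append any missing closers."""
    out, depth = [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                continue  # stray closer with no opener: drop it
            depth -= 1
        out.append(ch)
    return "".join(out) + ")" * depth


checks = [
    is_degenerate("hiša", "..."),    # punctuation only
    is_degenerate("hiša", "7"),      # single character
    is_degenerate("radio", "radio"), # src == tgt
    is_degenerate("hiša", "house"),  # a normal pair
]
fixed = balance_parens("house (building")
```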

---

### 6. Collection 2 — Enriched (`dist/enriched/`)

<table id="bkmrk-file-rows-what-it-is-1"><thead><tr><th>File</th><th align="right">Rows</th><th>What it is</th></tr></thead><tbody><tr><td>`silver_morphology.parquet`</td><td align="right">208,715</td><td>CLASSLA lemma / UPOS / **JOS-MULTEXT-East MSD** / feats per Slovene lemma; `morph_provenance="silver_tool"`</td></tr><tr><td>`tasks/msd_tagging.jsonl`</td><td align="right">208,715</td><td>`{lemma, lang}` → `{upos, msd, feats}` (the morphology/POS task; **silver**)</td></tr><tr><td>`synset_links.parquet`</td><td align="right">64,413</td><td>Slovene lemma → sloWNet/OMW `synset_id` + **ILI** (join key to Princeton WN / OMW)</td></tr><tr><td>`antonyms.parquet`</td><td align="right">6,107</td><td>imported Slovene antonyms (ILI-bridged through the English WordNet)</td></tr><tr><td>`scored_synonyms.parquet`</td><td align="right">364,334</td><td>every Core synonym candidate + `wordnet_confirmed` + `source_count`</td></tr><tr><td>`scored_hypernyms.parquet`</td><td align="right">396</td><td>checkable hypernym candidates + `wordnet_confirmed`</td></tr><tr><td>`lexidma/KNAUR.{xml,json}`</td><td align="right">97,418 rel</td><td>KNAUR re-serialized as DMLex with sloWNet **antonym (142)** + **synonym (2,655)** relations, ILI/synset in `relation/description`, GWA-typed</td></tr></tbody></table>

#### 6.1 Candidate scoring vs sloWNet (measured precision — lower bounds; sloWNet is incomplete)

- **Synonyms:** 364,334 checkable, **17.0 %** confirmed — in-sense 14.6 %, pivot 27.1 %, **pivot-GOLD 38.5 %**.
- **Hypernyms:** 396 checkable, **47.5 %** confirmed (ILI-bridged through the English WordNet).
- Use `wordnet_confirmed=True` (and/or `confidence_tier=GOLD`) to extract a higher-precision subset.
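
The per-group confirmation rates above are simple ratios and can be recomputed from the scored table. A minimal sketch follows, assuming rows are dicts carrying `wordnet_confirmed` plus a grouping column; the `evidence` column is assumed to be carried over from `synonym_pairs`, and its values in the toy data are invented.

```python
from collections import defaultdict


def confirmed_rate(rows: list[dict], key: str) -> dict[str, float]:
    """Share of rows with wordnet_confirmed == True, grouped by `key`."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        hits[group] += bool(row["wordnet_confirmed"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy rows; group labels are illustrative, not the dataset's actual values.
toy_rows = [
    {"evidence": "pivot", "wordnet_confirmed": True},
    {"evidence": "pivot", "wordnet_confirmed": False},
    {"evidence": "in_sense", "wordnet_confirmed": False},
    {"evidence": "in_sense", "wordnet_confirmed": False},
]
rates = confirmed_rate(toy_rows, key="evidence")
```

Since sloWNet is incomplete, an unconfirmed row is "not found", not "wrong", which is why these rates read as lower bounds.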

#### 6.2 External resources & how to reproduce

sloWNet/antonyms run in the main `.venv` (`pip install -e '.[enrich]'` → `wn`). **CLASSLA needs Python ≤ 3.13** (its pinned numpy fails to build on 3.14), so run the silver morphology from a 3.12 env:

```bash
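# CLASSLA environment (Python 3.12) with the enrich extras
uv venv --python 3.12 .venv-enrich
uv pip install -p .venv-enrich/bin/python -e '.[enrich]'
# WordNet data via wn: Slovene (sloWNet/OMW) and English (oewn)
python -m wn download omw-sl
python -m wn download oewn:2021
# CLASSLA Slovene models, then the enrichment run
.venv-enrich/bin/python -c "import classla; classla.download('sl')"
.venv-enrich/bin/python -m dictconv.cli enrich --in dist/core --out dist/enriched --sample-limit 0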
```

---

### 7. What to look for (usage guidance & caveats)

**Filter before training**

- **`has_content`** — drop entries with no usable content (≈1.2 % of entries) for entry-level tasks.
- **`marker_policy`** — task JSONLs are already built with `drop`; never train on the `keep` variant (it would teach the model to emit placeholder glyphs). Corpus markers are currently 0.
- **Degenerate rows** — already removed from the tasks; if you build your own from `parquet/`, apply the same filters (punct/digit-only, `src==tgt`, unbalanced parens).
- **Use the provided split.** Re-shuffling by row re-introduces lemma leakage; the cluster split is the point. Hold out **whole lemma clusters**, not rows.

**Candidates are candidates, not gold**

- `synonym_*`, `pivot_synonyms`, `hypernym_candidates` are **induced** and noisy. Gate with the enriched scoring: `scored_synonyms.wordnet_confirmed` and/or pivot pairs with `confidence_tier=GOLD` (38.5 % precision) for synonyms; `scored_hypernyms.wordnet_confirmed` for hypernyms. The 47.5 % hypernym figure is measured on a small checkable slice and is optimistic for the full pool: genus heads are not lemmatized (~25 % are oblique forms), so lemmatize with CLASSLA before use.
- **Antonyms are imported, not mined** (synonyms/antonyms are translationally indistinguishable).
- **Precision is measured only on vocabulary sloWNet already has (~11 %).** The extension value (the ~89 % of members not yet in sloWNet) is **unproven** — commission a small human eval before treating those as silver.

**Gold vs silver**

- The **gold** lemma layer (`lemmas.parquet`) is 100 % coverage, accent-folded, NFC-clean.
- The **silver** morphology (`silver_morphology` / `msd_tagging`) is CLASSLA tool output (`morph_provenance="silver_tool"`). Keep it filterable; measure MSD accuracy on a hand-tagged sample before training a morphological analyzer on it. The dictionary's own `inflected_forms` are mostly *ending fragments*, not full words.

**Per-dictionary quality**

- **DVRUSL** (Russian) is OCR-grade (Word `.doc` reconstruction). Cleanups were applied (brace→paren, `|`/bullet stripping, space collapse) but residual noise is inherent — down-weight or exclude for high-precision work.
- **Definitions / reverse-dictionary** come **only from KNAUR** (monolingual Slovene). There are no definitions from the 14 bilingual dicts.
- **Conditioning labels are sparse** (POS on ~18 % of translation rows, domain ~4 %, register ~4 %).

**Provenance / audit**

- Every removed escape token is logged in [`data/reference/removed_markers.tsv`](data/reference/removed_markers.tsv).
- `reports/_summary.json` carries the per-file sha256 manifest; each collection's `manifest.json` carries a `content_hash` and split counts. Pin these with any eval run.

---

### 8. Loading

```python
import json

import pyarrow.parquet as pq
from datasets import load_dataset

# flat translation pairs (all dicts)
# note: split="train" here is the HF default split name, not the leak-free split
pairs = load_dataset("parquet", data_files="parquet/*.pairs.parquet", split="train")

# nested per-entry records (one dictionary)
entries = load_dataset("parquet", data_files="parquet/VSIS.entries.parquet", split="train")
entries = entries.filter(lambda r: r["has_content"])          # drop empty entries

# a Core LM task, by split (parse each JSONL line once; close the file when done)
with open("dist/core/tasks/translation.jsonl", encoding="utf-8") as fh:
    train = [row for row in map(json.loads, fh) if row["split"] == "train"]

# wordnet-confirmed synonyms only (Enriched gate)
syn = pq.read_table("dist/enriched/scored_synonyms.parquet").to_pylist()
gold = [r for r in syn if r["wordnet_confirmed"]]

# DMLex (faithful lexicographic view)
with open("lexidma/VSIS.json", encoding="utf-8") as fh:
    res = json.load(fh)                                       # OASIS DMLex 1.0
```

### 9. Reproduce end-to-end

```bash
pip install -e .                 # base deps (pyarrow, lxml, jsonschema, xmlschema)
dictconv convert all --write-summary     # base: lexidma/, parquet/, reports/
dictconv build-core                       # Collection 1 -> dist/core/
dictconv enrich --sample-limit 0          # Collection 2 -> dist/enriched/  (see §6.2 for CLASSLA env)
dictconv audit --write-summary            # readiness audit + artifact manifest
pytest -q                                 # 52 tests

```

# Converted dictionaries

# OASIS DMLex 1.0 (XML + JSON)

Slovenian lexicographic datasets from `dict-conversions`. Every dictionary is provided in **both** DMLex 1.0 serializations: `<CODE>.xml` and `<CODE>.json`. Two collections:

### intrinsic/ — Core collection (faithful conversion; the project's own sources only)

All 15 converted dictionaries: DVANSL en→sl, DVFRSL fr→sl, DVITSL it→sl, DVRUSL ru→sl, DVSHSL sh→sl, DVSPSL es→sl, DVSLAN sl→en, DVSLFR sl→fr, DVSLNE sl→de, DVSLSH sl→sh, DVSLSP sl→es, DRSLAN sl→en, VSIS sl→it, LAT\_AZ la→sl, KNAUR sl (monolingual encyclopedia).

### extrinsic/ — Enriched collection (external resources)

KNAUR.xml + KNAUR.json — the monolingual encyclopedia re-serialized with sloWNet-derived **antonym (142)** and **synonym (2,655)** `relation`s. Each carries its sloWNet provenance (ILI / synset id) in `relation/description`, and the relation types link to the Global WordNet vocabulary via `relationType/sameAs`.

KNAUR is the ONLY resource whose external (sloWNet) enrichment is expressible *as DMLex*: DMLex 1.0 allows external `sameAs` links only on tag definitions, not on senses/entries/relations. The other enrichment layers — CLASSLA silver lemma/UPOS/MSD, per-lemma synset/ILI links, imported antonyms, and candidate scoring — are tabular and ship as Parquet in the `dist/enriched/` collection (not in this archive). `intrinsic/KNAUR.*` is the base KNAUR (no sloWNet relations); `extrinsic/KNAUR.*` is the enriched version — diff them to see the added relations.
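
The suggested diff can be done at the level of relation types without an XML tool. The sketch below assumes the DMLex JSON file is a single object with a top-level `relations` list whose items carry a `type` string; that layout should be checked against the actual files, and the helper names are hypothetical.

```python
import json
from collections import Counter


def load_resource(path: str) -> dict:
    """Read one DMLex JSON file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def relation_type_counts(resource: dict) -> Counter:
    # Assumption: relations live in a top-level "relations" list with a "type" field.
    return Counter(rel.get("type") for rel in resource.get("relations", []))


def added_relations(base: dict, enriched: dict) -> Counter:
    """Relation types that gained members in the enriched file (positive counts only)."""
    return relation_type_counts(enriched) - relation_type_counts(base)


# Toy resources standing in for intrinsic/KNAUR.json and extrinsic/KNAUR.json.
base = {"relations": [{"type": "see"}]}
enriched = {
    "relations": [
        {"type": "see"},
        {"type": "antonym"},
        {"type": "synonym"},
        {"type": "synonym"},
    ]
}
added = added_relations(base, enriched)
```

On the real files, `added_relations(load_resource("intrinsic/KNAUR.json"), load_resource("extrinsic/KNAUR.json"))` would be expected to show only the sloWNet-derived antonym and synonym types, if the layout assumption holds.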