MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multilingual dictionaries are often available only as scans, and their complex layouts, specialized scripts, and intricate entry structures make conversion into structured digital formats difficult. This work proposes MUDIDI, a two-stage framework: Stage One evaluates the quality of character recognition and markup preservation, and Stage Two segments dictionary entries and maps them to SIL's Multi-Dictionary Formatter schema. The study presents the first systematic evaluation of optical character recognition (OCR) systems, large language models (LLMs), and vision-language models (VLMs) on this task, finding that LLMs perform best across most writing systems and languages in both stages. Supplying contextual information, such as the dictionary's introduction, can further improve the quality of the digitized output. The project is publicly released with a human-annotated dataset drawn from 30 public-domain dictionaries and a complete processing pipeline.
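
The summary does not say which metric is used to score character recognition. As a rough illustration only, the sketch below computes character error rate (CER), a common choice for OCR evaluation, after Unicode NFC normalization so that composed and decomposed diacritics compare as equal. The function names and the sample strings are hypothetical and are not taken from the MUDIDI code.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length, after NFC normalization."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: the recognizer reads the letter eng as a plain "n".
print(character_error_rate("ŋaba", "naba"))
```

Normalizing first matters for dictionaries with heavy diacritic use: without it, a visually identical transcription can be penalized just because the model emitted combining marks instead of precomposed characters.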

📝 Abstract

Multilingual dictionaries are among the most valuable documentary resources for low-resource and endangered languages, yet many remain available only as scans. For many decades, digitizing them and converting them into a machine-readable format was nearly impossible because of language-specific scripts and complex multi-column layouts full of entries with abbreviations and cross-references. Recent vision-language models offer a promising solution, but it is unclear how well they preserve characters and markup and how well they process lexicographic structure. We introduce MUDIDI, a two-stage framework for multilingual dictionary digitization. Stage One evaluates the quality of character recognition and markup preservation; Stage Two focuses on dictionary entry segmentation with subsequent mapping into a machine-readable lexicographic schema, SIL's Multi-Dictionary Formatter. We also release a dataset that consists of human-annotated lexicographic entries collected from 30 public-domain dictionaries featuring diverse writing systems, language families, and formats. We benchmark OCR systems, general-purpose Large Language Models (LLMs), and Vision Language Models (VLMs) on the dataset, demonstrating superior performance of LLMs across most writing systems and languages in both stages, and provide practical guidelines on improving the results for more challenging scenarios. Finally, we show that supplying additional information, such as the dictionary's introduction, to the LLMs can improve the quality of the digitized dictionary. GitHub: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/
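
To make the Stage Two target concrete, the sketch below serializes an already segmented entry into backslash-coded fields in the style of SIL's Multi-Dictionary Formatter (MDF), using three standard MDF markers: `\lx` (lexeme), `\ps` (part of speech), and `\ge` (English gloss). The `Entry` class, the `to_mdf` function, and the sample entry are hypothetical illustrations, not the MUDIDI pipeline's actual data model, and real dictionary entries typically need many more fields (senses, examples, cross-references).

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical minimal record for one segmented dictionary entry."""
    headword: str
    part_of_speech: str
    gloss: str


def to_mdf(entry: Entry) -> str:
    """Render an entry as MDF-style lines, skipping any empty field."""
    fields = [
        ("lx", entry.headword),        # lexeme
        ("ps", entry.part_of_speech),  # part of speech
        ("ge", entry.gloss),           # English gloss
    ]
    return "\n".join(f"\\{marker} {value}" for marker, value in fields if value)


# Made-up entry for illustration only.
print(to_mdf(Entry(headword="wai", part_of_speech="n", gloss="water")))
```

Keeping segmentation (deciding what the headword, part of speech, and gloss are) separate from serialization (writing the markers) mirrors the split the abstract describes, and it means the output format could be swapped without touching the segmentation step.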

Problem

Research questions and friction points this paper is trying to address.

multilingual dictionary digitization

low-resource languages

endangered languages

machine-readable format

lexicographic structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual dictionary digitization

two-stage framework

large language models