🤖 AI Summary
This work addresses automatic transliteration of Judeo-Arabic, which is written in Hebrew script, into standard Arabic script. The task is complicated by ambiguous character mappings, inconsistent orthography, and frequent code-switching into Hebrew and Aramaic. We propose a two-step automatic transliteration approach: a simple character-level mapping module establishes baseline correspondences, and a post-correction module then resolves the remaining grammatical and orthographic errors. We also present the first benchmark evaluation of large language models (LLMs) on Judeo-Arabic transliteration, and publicly release both the benchmark data and our models. Experiments demonstrate substantial improvements in transliteration accuracy, enabling off-the-shelf Arabic NLP tools, including morphosyntactic tagging and machine translation, to process historical Judeo-Arabic texts that they could not handle in the original Hebrew script. This supports the digitization and computational linguistic analysis of Judeo-Arabic heritage documents.
📝 Abstract
Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew and Aramaic. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would not have been feasible on the original texts.
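
The first step, character-level mapping, can be illustrated with a minimal Python sketch. The table below is a simplified set of commonly cited Hebrew-to-Arabic letter correspondences and is an assumption made for illustration, not the mapping used in the paper. The names BASE_MAP, AMBIGUOUS, map_characters, and ambiguous_positions are likewise hypothetical. The second helper only flags letters that may stand for more than one Arabic letter, which is the kind of ambiguity a post-correction step would have to resolve from context.

```python
# Illustrative one-to-one baseline table (an assumption, not the paper's table).
# Final-form Hebrew letters map to the same Arabic letter as their base form.
BASE_MAP = {
    "א": "ا", "ב": "ب", "ג": "ج", "ד": "د", "ה": "ه", "ו": "و",
    "ז": "ز", "ח": "ح", "ט": "ط", "י": "ي", "כ": "ك", "ך": "ك",
    "ל": "ل", "מ": "م", "ם": "م", "נ": "ن", "ן": "ن", "ס": "س",
    "ע": "ع", "פ": "ف", "ף": "ف", "צ": "ص", "ץ": "ص", "ק": "ق",
    "ר": "ر", "ש": "ش", "ת": "ت",
}

# Hebrew letters that may represent more than one Arabic letter when
# distinguishing dots are not written (illustrative subset, an assumption).
AMBIGUOUS = {
    "ד": ["د", "ذ"],
    "ת": ["ت", "ث"],
    "כ": ["ك", "خ"],
    "צ": ["ص", "ض"],
    "ט": ["ط", "ظ"],
    "ג": ["ج", "غ"],
}


def map_characters(text: str) -> str:
    """Step 1: replace each Hebrew letter by its baseline Arabic letter.

    Characters outside the table (spaces, punctuation, digits) pass through.
    """
    return "".join(BASE_MAP.get(ch, ch) for ch in text)


def ambiguous_positions(text: str) -> list[tuple[int, list[str]]]:
    """List (index, candidate Arabic letters) for each ambiguous source letter."""
    return [(i, AMBIGUOUS[ch]) for i, ch in enumerate(text) if ch in AMBIGUOUS]


if __name__ == "__main__":
    word = "כתאב"  # Judeo-Arabic spelling of the Arabic word for "book"
    print(map_characters(word))
    print(ambiguous_positions(word))
```

A context-free table like this cannot decide between candidates such as د and ذ, and it would also convert Hebrew or Aramaic code-switched words letter by letter, which is why the approach described above follows the mapping with a post-correction step.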