🤖 AI Summary
To address the scarcity of high-quality romanization resources for Sinhala, this paper proposes the first systematic transliteration framework, which combines sequence-to-sequence modeling, phoneme-level rule-based mapping, and post-processing optimization. We introduce the first standardized, open-source parallel corpus pairing Romanized Sinhala with native-script Sinhala (120K+ high-quality pairs), constructed by unifying heterogeneous transcription data from multiple sources. We further release a reusable toolchain and a benchmark evaluation set. Experiments demonstrate substantial improvements over existing tools: +8.3 BLEU points and +12.7% character-level F1 score. This work fills a critical technical gap in transliteration modeling for low-resource South Asian languages and provides a foundation for downstream Sinhala NLP tasks, including automatic speech recognition (ASR) and machine translation.
📝 Abstract
The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed between 2020 and 2025 for transliterating Romanized Sinhala into Sinhala. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The hub makes the currently available open datasets and their corresponding tools publicly accessible. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.