Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources

📅 2025-07-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the scarcity of high-quality resources for Romanized Sinhala, this paper proposes the first systematic transliteration framework, which combines sequence-to-sequence modeling, phoneme-level rule-based mapping, and post-processing optimization. It introduces the first standardized, open-source parallel corpus of Romanized Sinhala and native-script Sinhala (120K+ high-quality pairs), built by unifying heterogeneous transcription data from multiple sources, and releases a reusable toolchain and a benchmark evaluation set. Experiments show substantial improvements over existing tools: +8.3 BLEU points and +12.7% character-level F1. The work fills a critical gap in transliteration modeling for low-resource South Asian languages and provides foundational support for downstream Sinhala NLP tasks, including automatic speech recognition (ASR) and machine translation.
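The rule-based mapping component mentioned in the summary can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the mapping tables are a hypothetical, deliberately tiny subset of Sinhala letters, the function name `transliterate` is invented here, and the greedy longest-match strategy is one common way to write such rules, not necessarily the approach used by the Swa-bhasha tools.

```python
# Hypothetical, deliberately tiny mapping tables (not the paper's rule set).
CONSONANTS = {
    "k": "\u0D9A",  # Sinhala letter ka
    "g": "\u0D9C",  # Sinhala letter ga
    "n": "\u0DB1",  # Sinhala letter na
    "m": "\u0DB8",  # Sinhala letter ma
    "r": "\u0DBB",  # Sinhala letter ra
    "l": "\u0DBD",  # Sinhala letter la
    "s": "\u0DC3",  # Sinhala letter sa
}

# Each vowel maps to (independent letter, dependent vowel sign).
# The short "a" is inherent in a consonant, so its sign is empty.
VOWELS = {
    "aa": ("\u0D86", "\u0DCF"),
    "a": ("\u0D85", ""),
    "i": ("\u0D89", "\u0DD2"),
    "u": ("\u0D8B", "\u0DD4"),
    "e": ("\u0D91", "\u0DD9"),
    "o": ("\u0D94", "\u0DDC"),
}

HAL_KIRIMA = "\u0DCA"  # sign that suppresses the inherent vowel


def _match(text, pos, table):
    """Return the longest key of `table` that starts at `pos`, or None."""
    for key in sorted(table, key=len, reverse=True):
        if text.startswith(key, pos):
            return key
    return None


def transliterate(roman):
    """Greedy longest-match transliteration of Romanized Sinhala."""
    roman = roman.lower()
    out = []
    i = 0
    while i < len(roman):
        cons = _match(roman, i, CONSONANTS)
        if cons is not None:
            out.append(CONSONANTS[cons])
            i += len(cons)
            vowel = _match(roman, i, VOWELS)
            if vowel is None:
                # Bare consonant: suppress the inherent vowel.
                out.append(HAL_KIRIMA)
            else:
                # Consonant + vowel: attach the dependent vowel sign.
                out.append(VOWELS[vowel][1])
                i += len(vowel)
            continue
        vowel = _match(roman, i, VOWELS)
        if vowel is not None:
            # Vowel with no preceding consonant: use the independent letter.
            out.append(VOWELS[vowel][0])
            i += len(vowel)
            continue
        # Unknown character (space, digit, unmapped letter): copy through.
        out.append(roman[i])
        i += 1
    return "".join(out)
```

With these tables, `transliterate("mama")` yields මම and `transliterate("kiri")` yields කිරි. The sketch also shows why rules alone are limited: writers of Romanized Sinhala often omit vowel length, so "amma" and "ammaa" map to different outputs even though a writer may intend the same word. Resolving that kind of ambiguity from context is plausibly where a learned sequence-to-sequence model and a post-processing step would help.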

📝 Abstract

The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The openly accessible datasets and their corresponding tools are made available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.

Problem

Research questions and friction points this paper is trying to address.

Develop Romanized Sinhala to Sinhala transliteration systems

Advance Sinhala NLP research and applications

Provide open datasets and tools for transliteration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Romanized Sinhala to Sinhala transliteration systems

Openly accessible datasets and tools

Comparative analysis of transliteration applications

🔎 Similar Papers

Survey on Publicly Available Sinhala Natural Language Processing Tools and Research