Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources

📅 2025-07-12
🤖 AI Summary
To address the scarcity of high-quality romanization resources for Sinhala, this paper proposes the first systematic transliteration framework—hybridizing sequence-to-sequence modeling, phoneme-level rule-based mapping, and post-processing optimization. We introduce the first standardized, open-source Sinhala romanization–native script parallel corpus (120K+ high-quality pairs), constructed by unifying heterogeneous transcription data from multiple sources. We further release a reusable toolchain and a benchmark evaluation set. Experiments demonstrate substantial improvements over existing tools: +8.3 BLEU points and +12.7% character-level F1 score. This work fills a critical technical gap in transliteration modeling for low-resource South Asian languages and establishes foundational support for downstream Sinhala NLP tasks, including automatic speech recognition (ASR) and machine translation.
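
The summary above reports a character-level F1 gain but does not say how that score is defined. As a minimal sketch only, assuming a bag-of-characters definition (precision and recall over the multiset overlap of characters), such a metric could be computed as follows. The function name `char_f1` is hypothetical, and the paper's actual evaluation script may differ.

```python
from collections import Counter


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 under an ASSUMED bag-of-characters definition.

    Python iterates over Unicode code points, so a Sinhala consonant and
    its vowel sign count as two separate characters here.
    """
    if not prediction and not reference:
        return 1.0  # two empty strings match trivially
    # Multiset intersection: each shared character counted min(times) often.
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```

This definition ignores character order, so it rewards outputs that contain the right characters in the wrong positions; an alignment-based variant would be stricter.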

📝 Abstract
The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The current openly accessible data sets and corresponding tools are made publicly available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.
Problem

Research questions and friction points this paper is trying to address.

Develop Romanized Sinhala to Sinhala transliteration systems
Advance Sinhala NLP research and applications
Provide open datasets and tools for transliteration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Romanized Sinhala to Sinhala transliteration systems
Openly accessible datasets and tools
Comparative analysis of transliteration applications
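
To make the transliteration task concrete, here is a toy rule-based sketch that maps a few Romanized Sinhala letters to Sinhala script. It is not the authors' system: the mapping tables cover only a handful of letters, the function name `transliterate` is hypothetical, and the rules assume every vowel is typed explicitly, which real Romanized Sinhala input often does not do.

```python
# Toy mapping tables (illustrative subset only, not the hub's resources).
CONSONANTS = {
    "k": "\u0d9a",  # ක
    "m": "\u0db8",  # ම
    "n": "\u0db1",  # න
    "r": "\u0dbb",  # ර
    "l": "\u0dbd",  # ල
}
INDEPENDENT_VOWELS = {
    "a": "\u0d85",  # අ
    "i": "\u0d89",  # ඉ
    "u": "\u0d8b",  # උ
}
# Dependent vowel signs attached to a consonant; "a" is the inherent vowel.
VOWEL_SIGNS = {
    "a": "",
    "i": "\u0dd2",  # ි
    "u": "\u0dd4",  # ු
}
HAL = "\u0dca"  # ් (al-lakuna), suppresses the inherent vowel


def transliterate(word: str) -> str:
    """Map a lowercase Romanized word to Sinhala with the toy rules above."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in CONSONANTS:
            nxt = word[i + 1] if i + 1 < len(word) else ""
            if nxt in VOWEL_SIGNS:
                # Consonant followed by a vowel: attach the vowel sign.
                out.append(CONSONANTS[ch] + VOWEL_SIGNS[nxt])
                i += 2
            else:
                # Bare consonant: add al-lakuna to drop the inherent vowel.
                out.append(CONSONANTS[ch] + HAL)
                i += 1
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
            i += 1
        else:
            out.append(ch)  # pass unknown characters through unchanged
            i += 1
    return "".join(out)
```

With these rules, `transliterate("mama")` yields මම and `transliterate("kiri")` yields කිරි. Deterministic rules like these cannot resolve inputs where one Romanized spelling corresponds to several Sinhala words, which is one reason data-driven models and parallel corpora are useful for this task.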
Authors

Deshan Sumanathilaka
PhD Candidate at Swansea University
NLP, Machine Translation and Transliteration, ML, WSD
Sameera Perera
Informatics Institute of Technology, Colombo, Sri Lanka
Sachithya Dharmasiri
Informatics Institute of Technology, Colombo, Sri Lanka
Maneesha Athukorala
Informatics Institute of Technology, Colombo, Sri Lanka
Anuja Dilrukshi Herath
Informatics Institute of Technology, Colombo, Sri Lanka
Rukshan Dias
Informatics Institute of Technology, Colombo, Sri Lanka
Pasindu Gamage
Informatics Institute of Technology, Colombo, Sri Lanka
Ruvan Weerasinghe
University of Colombo School of Computing
Computational Linguistics, Machine Translation, Machine Learning, Intelligent Systems, Bioinformatics
Y. H. P. P. Priyadarshana
Kyoto University of Advanced Science (KUAS), Kyoto, Japan