🤖 AI Summary
This study addresses the transliteration task from Romanized to native script for Sinhala, a low-resource language. We systematically compare rule-based approaches with Transformer-based sequence-to-sequence models. Our method employs a character-level encoder-decoder architecture augmented with self-supervised pretraining to enhance generalization in the absence of large-scale parallel corpora. A key contribution is the model’s ability to automatically learn non-standard romanization patterns—bypassing the coverage limitations inherent in hand-crafted rules. Experimental results show that the Transformer approach achieves an 18.7-point BLEU improvement over the rule-based baseline on our manually constructed test set, demonstrating substantially higher accuracy and robustness. To foster reproducibility and further research, we release all code and data publicly on GitHub.
📝 Abstract
Due to convenience and a lack of technical literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is highly prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, we focus on Romanized Sinhala transliteration. We propose two methods to address this problem: our baseline is a rule-based method, which is compared against a second method that treats transliteration as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We observed that the Transformer-based method captures many ad-hoc patterns within the Romanized scripts that the rule-based method cannot. The code base associated with this paper is available on GitHub - https://github.com/kasunw22/Sinhala-Transliterator/
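A rule-based baseline of the kind described above can be sketched as a greedy longest-match lookup over a romanization table. The mapping and function below are illustrative assumptions for a handful of Sinhala syllables, not the paper's actual rule set:

```python
# Minimal sketch of a rule-based transliterator using greedy longest-match.
# This tiny mapping is an illustrative assumption, not the paper's rules.
ROMAN_TO_SINHALA = {
    "ka": "ක",
    "ma": "ම",
    "la": "ල",
    "a": "අ",
}

def transliterate(romanized: str) -> str:
    out = []
    i = 0
    max_len = max(len(k) for k in ROMAN_TO_SINHALA)
    while i < len(romanized):
        # Try the longest candidate substring first.
        for length in range(max_len, 0, -1):
            chunk = romanized[i:i + length]
            if chunk in ROMAN_TO_SINHALA:
                out.append(ROMAN_TO_SINHALA[chunk])
                i += length
                break
        else:
            # Pass through characters the table does not cover.
            out.append(romanized[i])
            i += 1
    return "".join(out)

print(transliterate("kamala"))  # → කමල
```

A fixed table like this illustrates the coverage limitation the paper highlights: any non-standard romanization absent from the table falls through unmapped, whereas the character-level Transformer can learn such patterns from data.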