🤖 AI Summary
This study addresses the transliteration task from Romanized to native script for Sinhala, a low-resource language. We systematically compare rule-based approaches with Transformer-based sequence-to-sequence models. Our method employs a character-level encoder-decoder architecture augmented with self-supervised pretraining to enhance generalization in the absence of large-scale parallel corpora. A key contribution is the model’s ability to automatically learn non-standard romanization patterns—bypassing the coverage limitations inherent in hand-crafted rules. Experimental results show that the Transformer approach achieves an 18.7-point BLEU improvement over the rule-based baseline on our manually constructed test set, demonstrating substantially higher accuracy and robustness. To foster reproducibility and further research, we release all code and data publicly on GitHub.
📝 Abstract
Due to convenience and a lack of technical literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is highly prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, we focus on Romanized Sinhala transliteration. We propose two methods to address this problem: our baseline is a rule-based method, which is compared against a second method that treats transliteration as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We observed that the Transformer-based method captures many ad-hoc patterns within the Romanized scripts that the rule-based method cannot. The code base associated with this paper is available on GitHub - https://github.com/kasunw22/Sinhala-Transliterator/
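A rule-based baseline of the kind described above can be sketched as a greedy longest-match lookup over a romanization table. The mapping and function below are illustrative assumptions for a handful of Sinhala syllables, not the paper's actual rule set:

```python
# Minimal sketch of a rule-based transliterator using greedy longest-match.
# This tiny mapping is an illustrative assumption, not the paper's rules.
ROMAN_TO_SINHALA = {
    "ka": "ක",
    "ma": "ම",
    "la": "ල",
    "a": "අ",
}

def transliterate(romanized: str) -> str:
    out = []
    i = 0
    max_len = max(len(k) for k in ROMAN_TO_SINHALA)
    while i < len(romanized):
        # Try the longest candidate substring first.
        for length in range(max_len, 0, -1):
            chunk = romanized[i:i + length]
            if chunk in ROMAN_TO_SINHALA:
                out.append(ROMAN_TO_SINHALA[chunk])
                i += length
                break
        else:
            # Pass through characters the table does not cover.
            out.append(romanized[i])
            i += 1
    return "".join(out)

print(transliterate("kamala"))  # → කමල
```

A fixed table like this illustrates the coverage limitation the paper highlights: any non-standard romanization absent from the table falls through unmapped, whereas the character-level Transformer can learn such patterns from data.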