Dialectal and Low Resource Machine Translation for Aromanian

📅 2024-10-23
🏛️ arXiv.org
🤖 AI Summary
This paper addresses the low-resource challenges of translating Aromanian, an endangered Eastern Romance language: scarce parallel corpora, high orthographic variation, and a lack of standardization. It introduces the first English–Romanian–Aromanian neural machine translation (NMT) system. The authors (1) release the largest publicly available Aromanian–Romanian parallel corpus to date (79K sentence pairs); (2) propose a mining framework based on language-agnostic sentence embeddings, coupled with automatic orthographic normalization that includes a hybrid rule- and statistics-based diacritic conversion module; and (3) build a Transformer-based multilingual NMT model supported by cross-lingual sentence embeddings and an automated evaluation pipeline. All data, models, and tools are openly available on Hugging Face and at arotranslate.com. The reported experiments show improved translation quality, which supports scalable language documentation and community-driven applications.

📝 Abstract
This paper presents the process of building a neural machine translation system with support for English, Romanian, and Aromanian - an endangered Eastern Romance language. The primary contribution of this research is twofold: (1) the creation of the most extensive Aromanian-Romanian parallel corpus to date, consisting of 79,000 sentence pairs, and (2) the development and comparative analysis of several machine translation models optimized for Aromanian. To accomplish this, we introduce a suite of auxiliary tools, including a language-agnostic sentence embedding model for text mining and automated evaluation, complemented by a diacritics conversion system for different writing standards. This research brings contributions to both computational linguistics and language preservation efforts by establishing essential resources for a historically under-resourced language. All datasets, trained models, and associated tools are public: https://huggingface.co/aronlp and https://arotranslate.com
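
The abstract mentions using a language-agnostic sentence embedding model for text mining. As a rough illustration of the general idea only (not the authors' pipeline), the sketch below pairs sentences across two languages by cosine similarity of their embeddings and keeps mutual best matches above a cutoff. The function name `mine_pairs`, the toy 3-dimensional vectors standing in for real embeddings, the mutual-nearest-neighbour rule, and the 0.8 threshold are all assumptions made for this example.

```python
import numpy as np


def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual best matches.

    Hypothetical helper: the mutual-nearest-neighbour rule and the
    threshold value are illustrative assumptions, not the paper's method.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # shape: (num_src, num_tgt)

    best_tgt = sim.argmax(axis=1)  # best target for each source sentence
    best_src = sim.argmax(axis=0)  # best source for each target sentence

    pairs = []
    for i, j in enumerate(best_tgt):
        # Keep a pair only if each side picks the other and the score is high.
        if best_src[j] == i and sim[i, j] >= threshold:
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs


# Toy vectors standing in for sentence embeddings (assumed data).
src_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt_emb = np.array([[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]])

pairs = mine_pairs(src_emb, tgt_emb)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

In a real setting the arrays would come from a multilingual sentence encoder, and the threshold would need tuning on held-out data; the third toy sentence is left unpaired here because no target is similar enough to it.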
Problem

Research questions and friction points this paper is trying to address.

Machine Translation
Aromanian Dialects
Limited Resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aromanian-Romanian parallel corpus
Aromanian-optimized translation model
Multilingual sentence analysis tool
Alexandru-Iulius Jerpelea
Tudor Vianu National College of Computer Science, Bucharest
Alina-Ștefania Rădoi
West University of Timisoara
Sergiu Nisioi
Human Language Technologies Research Centre, University of Bucharest
translationese
second language acquisition
machine translation
deep learning