🤖 AI Summary
Translating Aromanian, an endangered Eastern Romance language, is a low-resource problem: parallel corpora are scarce, orthography varies widely, and no single writing standard prevails. To address this, the paper introduces the first English–Romanian–Aromanian neural machine translation (NMT) system. The authors (1) release the largest publicly available Aromanian–Romanian parallel corpus to date (79K sentence pairs); (2) propose a mining framework based on language-agnostic sentence embeddings, coupled with automatic orthographic normalization that includes a hybrid rule- and statistics-based diacritic conversion module; and (3) build a Transformer-based multilingual NMT model supported by cross-lingual sentence embeddings and an automated evaluation pipeline. All data, models, and tools are open-sourced on Hugging Face and arotranslate.com. The reported experiments show significant improvements in translation quality, which supports scalable language documentation and community-driven applications.
📝 Abstract
This paper presents the process of building a neural machine translation system with support for English, Romanian, and Aromanian, an endangered Eastern Romance language. The primary contribution of this research is twofold: (1) the creation of the most extensive Aromanian-Romanian parallel corpus to date, consisting of 79,000 sentence pairs, and (2) the development and comparative analysis of several machine translation models optimized for Aromanian. To accomplish this, we introduce a suite of auxiliary tools, including a language-agnostic sentence embedding model for text mining and automated evaluation, complemented by a diacritics conversion system that maps between different writing standards. This research contributes to both computational linguistics and language preservation efforts by establishing essential resources for a historically under-resourced language. All datasets, trained models, and associated tools are publicly available at https://huggingface.co/aronlp and https://arotranslate.com
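The abstract does not describe how the sentence embedding model is used for text mining, so the following is only a generic sketch of one common approach: pair sentences whose embeddings are mutual nearest neighbours under cosine similarity and keep pairs above a threshold. The function names, the threshold of 0.8, and the three-dimensional toy vectors are all assumptions made for illustration; they are not taken from the paper, and real embeddings would come from the authors' released model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_idx, tgt_idx, score) for mutual nearest neighbours.

    Hypothetical helper: a pair is kept only if each sentence is the
    other's best match and the similarity reaches `threshold`.
    """
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    n_src, n_tgt = len(src_vecs), len(tgt_vecs)
    # Best target for every source sentence, and best source for every target.
    best_tgt = [max(range(n_tgt), key=lambda j: sims[i][j]) for i in range(n_src)]
    best_src = [max(range(n_src), key=lambda i: sims[i][j]) for j in range(n_tgt)]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs


# Toy stand-ins for sentence embeddings (assumed values, not real model output).
aromanian_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.5]]
romanian_vecs = [[0.1, 0.9, 0.3], [0.9, 0.2, 0.0], [0.0, 0.0, 1.0]]

pairs = mine_pairs(aromanian_vecs, romanian_vecs)
for src_idx, tgt_idx, score in pairs:
    print(f"Aromanian #{src_idx} <-> Romanian #{tgt_idx} (cosine {score:.3f})")
```

The mutual-nearest-neighbour check is a simple way to reduce false matches: the third toy sentence finds a best candidate, but that candidate prefers a different source sentence, so no pair is emitted for it.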