🤖 AI Summary
Luxembourgish, a low-resource language, lacks high-quality instruction data, and conventional machine translation often introduces semantic distortions and cultural misalignments. To address this, we propose a cross-lingual instruction data construction method that avoids machine translation into Luxembourgish: leveraging aligned English, French, and German corpora, we integrate multilingual alignment mining, cross-lingual instruction mapping, and rigorous human validation to construct, for the first time, a culturally adapted, semantically consistent Luxembourgish instruction-tuning dataset. Our approach preserves language-specific nuances and avoids translation noise, offering a template for low-resource language data curation. Experimental results show that large language models fine-tuned on this dataset improve in Luxembourgish instruction following, generation quality, and cross-lingual representation alignment.
📝 Abstract
Instruction tuning has become a key technique for enhancing the performance of large language models, enabling them to better follow human prompts. However, low-resource languages such as Luxembourgish are severely limited by the lack of high-quality instruction datasets. The traditional workaround, machine translation, often introduces semantic misalignment and cultural inaccuracies. In this work, we address these challenges by creating a cross-lingual instruction tuning dataset for Luxembourgish without resorting to machine-generated translations into Luxembourgish. Instead, by leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances. We provide evidence that cross-lingual instruction tuning improves not only representational alignment across languages but also the model's generative capabilities in Luxembourgish. This highlights how cross-lingual data curation can avoid the common pitfalls of machine-translated data and directly benefit low-resource language development.