LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish

📅 2025-10-08
🤖 AI Summary
Luxembourgish, a low-resource language, suffers from a scarcity of high-quality instruction data, and conventional machine translation approaches often introduce semantic distortions and cultural misalignments. To address this, we propose a cross-lingual instruction data construction method that avoids machine translation into Luxembourgish: leveraging aligned English, French, and German data, we integrate multilingual alignment mining, cross-lingual instruction mapping, and rigorous human validation to construct, for the first time, a culturally adapted, semantically consistent Luxembourgish instruction-tuning dataset. Our approach preserves language-specific nuances and avoids the noise that machine translation introduces, offering a template for low-resource language data curation. Experimental results show that large language models fine-tuned on this dataset improve in Luxembourgish instruction following, generation quality, and cross-lingual representation alignment.

📝 Abstract
Instruction tuning has become a key technique for enhancing the performance of large language models, enabling them to better follow human prompts. However, low-resource languages such as Luxembourgish face severe limitations due to the lack of high-quality instruction datasets. Traditional reliance on machine translation often introduces semantic misalignment and cultural inaccuracies. In this work, we address these challenges by creating a cross-lingual instruction tuning dataset for Luxembourgish, without resorting to machine-generated translations into it. Instead, by leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances. We provide evidence that cross-lingual instruction tuning improves not only representational alignment across languages but also the model's generative capabilities in Luxembourgish. This highlights how cross-lingual data curation can avoid the common pitfalls of machine-translated data and directly benefit low-resource language development.

Problem

Research questions and friction points this paper is trying to address.

Creating a cross-lingual instruction dataset for Luxembourgish
Overcoming machine translation limitations for low-resource languages
Improving cross-lingual representational alignment and generative capabilities in Luxembourgish

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages aligned data from English, French, and German
Avoids machine translation into Luxembourgish
Preserves linguistic and cultural nuances
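
To make the idea of cross-lingual instruction pairs more concrete, here is a minimal Python sketch. It is not the authors' pipeline: the `AlignedRecord` type, its field names, and the `build_examples` helper are all hypothetical, and the sketch simply assumes that each aligned item pairs a prompt in English, French, or German with Luxembourgish text written by humans. The point it illustrates is that the Luxembourgish side is taken as-is and never produced by machine translation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class AlignedRecord:
    """Hypothetical aligned item (an assumption, not the paper's schema)."""
    source_lang: str  # "en", "fr", or "de"
    source_text: str  # prompt text in the source language
    lb_text: str      # human-written Luxembourgish text, used unchanged


# Source languages named in the abstract; "lb" is the ISO 639-1 code
# for Luxembourgish.
SOURCE_LANGS = {"en", "fr", "de"}
TARGET_LANG = "lb"


def build_examples(records: Iterable[AlignedRecord]) -> List[Dict[str, str]]:
    """Turn aligned records into instruction/response pairs.

    The instruction stays in its source language and the response is the
    original Luxembourgish text, so nothing is translated into Luxembourgish.
    """
    examples = []
    for rec in records:
        # Skip records from languages outside the assumed source set.
        if rec.source_lang not in SOURCE_LANGS:
            continue
        # Skip records where either side is empty or whitespace only.
        if not rec.source_text.strip() or not rec.lb_text.strip():
            continue
        examples.append({
            "instruction": rec.source_text.strip(),
            "instruction_lang": rec.source_lang,
            "response": rec.lb_text.strip(),
            "response_lang": TARGET_LANG,
        })
    return examples


if __name__ == "__main__":
    sample = [
        AlignedRecord("en", "Write a greeting.", "Moien!"),
        AlignedRecord("es", "Escribe un saludo.", "Moien!"),  # filtered out
    ]
    for example in build_examples(sample):
        print(example)
```

Keeping the source language as an explicit field is one possible design choice; it would let a later step balance or filter the data per language, though whether the actual dataset is organized this way is not stated here.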