M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models

📅 2024-06-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing instruction fine-tuning (IFT) datasets exhibit strong bias toward high-resource languages—especially English—and lack comprehensive multilingual and multi-turn task coverage, thereby hindering large language models’ (LLMs’) ability to achieve instruction alignment in low-resource languages. To address this, we introduce M2Lingual: the first fully synthetic, multilingual, multi-turn, task-oriented IFT dataset, spanning 70 languages, 17+ NLP tasks, and 182K high-quality samples. Our method introduces a novel two-step Evolution taxonomy to guide the generation of complex multi-turn instructions, integrating seed-guided evolution, multilingual rewriting, dialogue-structure modeling, and quality-controlled diversity sampling. Experiments demonstrate that M2Lingual substantially improves performance across LLMs of varying scales on multilingual and multi-turn understanding and generation tasks. The dataset and generation code are publicly released on Hugging Face and GitHub.

📝 Abstract
Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. While many effective IFT datasets have been introduced recently, they predominantly focus on high-resource languages like English. To better align LLMs across a broad spectrum of languages and tasks, we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual. It is constructed by first selecting a diverse set of seed examples and then utilizing the proposed Evol taxonomy to convert these seeds into complex and challenging multi-turn instructions. We demonstrate the effectiveness of M2Lingual by training LLMs of varying sizes and showcasing the enhanced performance across a diverse set of languages. We contribute the 2-step Evol taxonomy with the guided generation code: https://github.com/ServiceNow/M2Lingual, as well as the first fully synthetic, general and task-oriented, multi-turn, multilingual dataset built with Evol - M2Lingual: https://huggingface.co/datasets/ServiceNow-AI/M2Lingual - containing 182K total IFT pairs, covering 70 languages and 17+ NLP tasks.
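The two-step recipe in the abstract (select diverse seeds, then evolve them into complex multi-turn instructions) can be sketched as a small pipeline. This is an illustrative sketch only, not the authors' released code: `evolve`, `EVOL_OPS`, and `build_ift_example` are hypothetical names, and the LLM rewriting step is stubbed out with a placeholder string transform.

```python
# Sketch of an Evol-style IFT data pipeline, assuming:
#   step 1 - a seed instruction in some language is chosen,
#   step 2 - an LLM repeatedly rewrites it per an "evolution" operation,
#            accumulating user/assistant turns into one dialogue example.
from dataclasses import dataclass, field

@dataclass
class Dialogue:
    language: str
    turns: list = field(default_factory=list)  # alternating user/assistant messages

# Hypothetical evolution operations; the paper's actual taxonomy differs.
EVOL_OPS = ["add-constraints", "deepen-reasoning", "widen-context"]

def evolve(seed: str, op: str) -> str:
    # Placeholder for an LLM call that rewrites `seed` according to `op`.
    return f"[{op}] {seed}"

def build_ift_example(seed: str, language: str, n_turns: int = 2) -> Dialogue:
    """Evolve one seed into an n_turns multi-turn IFT example."""
    dialogue = Dialogue(language=language)
    prompt = seed
    for i in range(n_turns):
        prompt = evolve(prompt, EVOL_OPS[i % len(EVOL_OPS)])
        dialogue.turns.append({"role": "user", "content": prompt})
        # In the real pipeline an LLM would answer each evolved instruction.
        dialogue.turns.append({"role": "assistant", "content": "<model response>"})
    return dialogue

example = build_ift_example("Summarize this article.", language="sw")
```

Each call yields one dialogue with `2 * n_turns` messages; running this over a diversity-sampled seed pool across 70 languages would produce a dataset in the shape M2Lingual describes.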
Problem

Research questions and friction points this paper is trying to address.

Enhance multilingual instruction alignment in LLMs
Address lack of IFT datasets for low-resource languages
Improve multi-turn instruction handling across diverse tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic multilingual multi-turn instruction dataset
Evol taxonomy for complex instruction generation
Enhanced LLM performance across 70 languages