M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models

📅 2024-06-24

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing instruction fine-tuning (IFT) datasets are strongly biased toward high-resource languages, especially English, and lack comprehensive multilingual and multi-turn task coverage, which limits how well large language models (LLMs) can be aligned to follow instructions in low-resource languages. To address this, we introduce M2Lingual: the first fully synthetic, multilingual, multi-turn, task-oriented IFT dataset, spanning 70 languages, 17+ NLP tasks, and 182K samples. The method uses a two-step Evol taxonomy to guide the generation of complex multi-turn instructions, integrating seed-guided evolution, multilingual rewriting, dialogue-structure modeling, and quality-controlled diversity sampling. Experiments show that training on M2Lingual improves the performance of LLMs of varying sizes on multilingual and multi-turn understanding and generation tasks. The dataset and generation code are publicly released on Hugging Face and GitHub.

📝 Abstract

Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. While many effective IFT datasets have been introduced recently, they predominantly focus on high-resource languages like English. To better align LLMs across a broad spectrum of languages and tasks, we propose a fully synthetic Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual, whose generation is guided by a novel taxonomy (Evol). It is constructed by first selecting a diverse set of seed examples and then using the proposed Evol taxonomy to convert these seeds into complex and challenging multi-turn instructions. We demonstrate the effectiveness of M2Lingual by training LLMs of varying sizes and showing enhanced performance across a diverse set of languages. We contribute the two-step Evol taxonomy with the guided generation code (https://github.com/ServiceNow/M2Lingual), as well as the first fully synthetic, general and task-oriented, multi-turn, multilingual dataset built with Evol, M2Lingual (https://huggingface.co/datasets/ServiceNow-AI/M2Lingual), containing 182K total IFT pairs and covering 70 languages and 17+ NLP tasks.
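
The two-step construction described in the abstract (pick a seed, then evolve it into a harder, multi-turn instruction) can be pictured with a minimal sketch. This is not the authors' code: the real Evol taxonomy and prompts live in the GitHub repository, and every name below (Turn, Conversation, evolve_instruction, extend_to_multi_turn, build_example, echo_generator) as well as the prompt wording is a hypothetical stand-in chosen for illustration. The generator is a stub so the example runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to generated text.
Generator = Callable[[str], str]


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class Conversation:
    language: str
    task: str
    turns: List[Turn] = field(default_factory=list)


def evolve_instruction(seed: str, language: str, generate: Generator) -> str:
    """Step 1 (assumed shape): rewrite a seed instruction into a harder one."""
    prompt = f"Rewrite this instruction in {language} so that it is more complex:\n{seed}"
    return generate(prompt)


def extend_to_multi_turn(conv: Conversation, follow_ups: int, generate: Generator) -> Conversation:
    """Step 2 (assumed shape): append follow-up user turns and replies."""
    for _ in range(follow_ups):
        history = "\n".join(f"{t.role}: {t.content}" for t in conv.turns)
        user_msg = generate(f"Write a follow-up user request for this dialogue:\n{history}")
        conv.turns.append(Turn("user", user_msg))
        reply = generate(f"Answer the last user request:\n{history}\nuser: {user_msg}")
        conv.turns.append(Turn("assistant", reply))
    return conv


def build_example(seed: str, language: str, task: str,
                  generate: Generator, follow_ups: int = 1) -> Conversation:
    """Run both steps: evolve the seed, answer it, then add follow-up turns."""
    evolved = evolve_instruction(seed, language, generate)
    conv = Conversation(language=language, task=task)
    conv.turns.append(Turn("user", evolved))
    conv.turns.append(Turn("assistant", generate(f"Answer:\n{evolved}")))
    return extend_to_multi_turn(conv, follow_ups, generate)


def echo_generator(prompt: str) -> str:
    """Stand-in for an LLM call; returns a deterministic placeholder."""
    return f"<generated for {len(prompt)} chars>"
```

Calling build_example("Summarize the text.", "Swahili", "summarization", echo_generator, follow_ups=2) yields a Conversation with alternating user and assistant turns; swapping echo_generator for a real model call is the only change needed to produce actual text under these assumptions.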

Problem

Research questions and friction points this paper is trying to address.

Enhance multilingual instruction alignment in LLMs

Address lack of IFT datasets for low-resource languages

Improve multi-turn instruction handling across diverse tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic multilingual multi-turn instruction dataset

Evol taxonomy for complex instruction generation

Enhanced LLM performance across a diverse set of languages, using a dataset that covers 70 languages
