🤖 AI Summary
Existing instruction fine-tuning (IFT) datasets exhibit strong bias toward high-resource languages—especially English—and lack comprehensive multilingual and multi-turn task coverage, thereby hindering large language models’ (LLMs’) ability to achieve instruction alignment in low-resource languages. To address this, we introduce M2Lingual: the first fully synthetic, multilingual, multi-turn, task-oriented IFT dataset, spanning 70 languages, 17+ NLP tasks, and 182K high-quality samples. Our method introduces a novel two-step Evolution taxonomy to guide the generation of complex multi-turn instructions, integrating seed-guided evolution, multilingual rewriting, dialogue-structure modeling, and quality-controlled diversity sampling. Experiments demonstrate that M2Lingual substantially improves performance across LLMs of varying scales on multilingual and multi-turn understanding and generation tasks. The dataset and generation code are publicly released on Hugging Face and GitHub.
📝 Abstract
Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. While many effective IFT datasets have been introduced recently, they predominantly focus on high-resource languages like English. To better align LLMs across a broad spectrum of languages and tasks, we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual. It is constructed by first selecting a diverse set of seed examples and then utilizing the proposed Evol taxonomy to convert these seeds into complex and challenging multi-turn instructions. We demonstrate the effectiveness of M2Lingual by training LLMs of varying sizes and showcasing the enhanced performance across a diverse set of languages. We contribute the 2 step Evol taxonomy with the guided generation code: https://github.com/ServiceNow/M2Lingual, as well as the first fully synthetic, general and task-oriented, multi-turn, multilingual dataset built with Evol - M2Lingual: https://huggingface.co/datasets/ServiceNow-AI/ M2Lingual - containing 182K total IFT pairs, covering 70 languages and 17+ NLP tasks.