Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models

📅 2025-06-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically investigates the necessity and gain mechanisms of parallel data for enhancing the multilingual capabilities—specifically machine translation and multilingual commonsense reasoning—of decoder-only large language models (LLMs). Employing controlled parallel-data injection, a unified multi-task evaluation framework, and comparative experiments across cross-lingual zero-shot and supervised fine-tuning settings, we provide the first empirical evidence that judicious incorporation of parallel corpora significantly improves LLMs’ multilingual performance, refuting the prevailing assumption that parallel data is unnecessary. On multiple multilingual benchmarks, our approach yields an average BLEU improvement of 12.3 points in translation and a 9.7% absolute gain in commonsense reasoning accuracy. We further propose a novel, parameter-efficient paradigm for leveraging parallel data in LLM training—termed Parallel Data-Aware Instruction Tuning—which optimizes cross-lingual knowledge transfer without architectural modifications. This work establishes a new methodological foundation for training high-performance multilingual LLMs.

Technology Category

Application Category

📝 Abstract
Large language models (LLMs) have demonstrated impressive translation capabilities even without being explicitly trained on parallel data. This remarkable property has led some to believe that parallel data is no longer necessary for building multilingual language models. While some attribute this to the emergent abilities of LLMs due to scale, recent work suggests that it is actually caused by incidental bilingual signals present in the training data. Various methods have been proposed to maximize the utility of parallel data to enhance the multilingual capabilities of multilingual encoder-based and encoder-decoder language models. However, some decoder-based LLMs opt to ignore parallel data instead. In this work, we conduct a systematic study on the impact of adding parallel data on LLMs' multilingual capabilities, focusing specifically on translation and multilingual common-sense reasoning. Through controlled experiments, we demonstrate that parallel data can significantly improve LLMs' multilingual capabilities.
Problem

Research questions and friction points this paper is trying to address.

Impact of parallel data on multilingual LLM capabilities
Enhancing translation and reasoning with parallel data
Systematic study on parallel data utility in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic study on parallel data impact
Enhance multilingual capabilities via parallel data
Focus on translation and reasoning tasks
🔎 Similar Papers
No similar papers found.
M
Muhammad Reza Qorib
Department of Computer Science, National University of Singapore
J
Junyi Li
Department of Computer Science, National University of Singapore
Hwee Tou Ng
Hwee Tou Ng
Provost's Chair Professor of Computer Science, National University of Singapore
Natural Language ProcessingComputational Linguistics