Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies an implicit language bias in large reasoning models (LRMs): when given multilingual inputs, LRMs default to high-resource languages, particularly English, for their internal reasoning, regardless of the language of the input. To investigate this systematically, the authors design a multidimensional, controllable evaluation framework covering MMMLU, MATH-500, CulturalBench, and LMSYS-toxic, integrating reasoning-path tracing with cross-lingual attribution analysis. They demonstrate empirically that enforcing same-language reasoning reduces general reasoning capability (especially for low-resource languages) yet significantly improves cultural alignment and language-specific accuracy in safety evaluation. Crucially, the work is the first to empirically establish the mismatch between reasoning language and input language as a fundamental bottleneck for multilingual fairness, and its findings provide both theoretical grounding and methodological tools for developing language-neutral LRMs.
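
To make the reasoning-path tracing step concrete, here is a minimal sketch of how the language of a reasoning trace could be identified, assuming the model exposes its chain-of-thought inside `<think>...</think>` tags (as DeepSeek-R1-style LRMs do). The tag format and the use of `langdetect` are illustrative assumptions, not the paper's published tooling.

```python
import re
from langdetect import detect  # pip install langdetect

def trace_reasoning_language(model_output: str) -> dict:
    """Split an LRM response into its reasoning trace and final answer,
    then identify the language of each part."""
    match = re.search(r"<think>(.*?)</think>", model_output, re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = model_output[match.end():].strip() if match else model_output.strip()
    return {
        "reasoning_lang": detect(reasoning) if reasoning else None,  # e.g. 'en'
        "answer_lang": detect(answer) if answer else None,           # e.g. 'ja'
    }

# A Japanese question answered with English internal reasoning: exactly
# the reasoning-language/input-language mismatch the paper reports.
output = "<think>The question asks for the capital of Japan, which is Tokyo.</think>日本の首都は東京です。"
print(trace_reasoning_language(output))  # {'reasoning_lang': 'en', 'answer_lang': 'ja'}
```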

📝 Abstract
Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: *In which language do these models reason when solving problems presented in different languages?* Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of language choice varies by task type: input-language reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior. By exposing these linguistic biases in LRMs, our work highlights a critical step toward developing more equitable models that serve users across diverse linguistic backgrounds.
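
The abstract's "constrained to reason in the same language as the input" condition can be approximated with an explicit prompt instruction. The sketch below shows one plausible implementation against an OpenAI-compatible chat API; the constraint wording and the placeholder model name are assumptions, not the paper's exact protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LANG_NAMES = {"en": "English", "sw": "Swahili", "ja": "Japanese"}

def ask_with_reasoning_constraint(question: str, input_lang: str,
                                  model: str = "your-lrm-here") -> str:
    """Query an LRM while instructing it to keep its step-by-step reasoning
    in the input language instead of defaulting to English."""
    constraint = (
        f"Think through the problem step by step strictly in "
        f"{LANG_NAMES[input_lang]}, and give your final answer in "
        f"{LANG_NAMES[input_lang]} as well."
    )
    resp = client.chat.completions.create(
        model=model,  # placeholder: substitute the LRM under evaluation
        messages=[
            {"role": "system", "content": constraint},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```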
Problem

Research questions and friction points this paper is trying to address.

Investigates which language LRMs default to when reasoning over multilingual inputs
Examines the performance impact of constraining reasoning to the input language
Reveals that the effect of reasoning-language choice depends on task type (reasoning vs. cultural vs. safety)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controllable framework for tracing the language of reasoning paths under multilingual input
Empirical evidence that LRMs default to reasoning in high-resource languages, especially English
Characterization of task-dependent effects of reasoning-language choice across reasoning, cultural, and safety tasks