🤖 AI Summary
This work addresses the absence of process-level reward modeling for multilingual complex reasoning tasks. We introduce and evaluate the first multilingual Process Reward Model (PRM), covering seven languages. Methodologically, we propose a translation-aligned multilingual chain-of-thought annotation framework that integrates fine-grained step-level reward modeling with reinforcement learning feedback. Key contributions include: (1) the first process-level reward modeling approach supporting both multilingual and multi-step reasoning; (2) empirical identification of coupling effects among the number of training languages, the scale of English training data, the number of candidate responses, and the number of trainable model parameters; and (3) significant improvements in average accuracy across 11 languages on multilingual reasoning benchmarks (e.g., MGSM, the multilingual counterpart of GSM8K), alongside reduced early-stage reasoning error rates. Our code and multilingual PRMs are publicly released to advance trustworthy multilingual reasoning research.
📝 Abstract
Large language models (LLMs) are designed to perform a wide range of tasks. To improve their ability to solve complex problems that require multi-step reasoning, recent research leverages process reward modeling to provide fine-grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early-stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data, while also uncovering the benefits of more candidate responses and more trainable parameters. This work opens promising avenues for robust multilingual applications in complex, multi-step reasoning tasks. In addition, we release our code to foster research in this direction.
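To make the core idea concrete, here is a minimal conceptual sketch (not the paper's released code) of how a PRM is typically used at inference time: the PRM scores each step of every candidate reasoning chain, the step scores are aggregated into a chain-level score, and the highest-scoring candidate is selected (best-of-N reranking). The `aggregate` rule, the candidate names, and the numeric scores below are illustrative assumptions; a real system would obtain step scores from the trained PRM.

```python
# Conceptual sketch of PRM-based best-of-N selection (illustrative only).
# A process reward model (PRM) assigns a correctness score in [0, 1] to each
# reasoning step; here we aggregate a chain's step scores by taking the
# minimum, so a single bad step sinks the whole candidate.

def aggregate(step_scores: list[float]) -> float:
    """Score a candidate chain-of-thought by its weakest step."""
    return min(step_scores)

def best_of_n(candidates: dict[str, list[float]]) -> str:
    """Pick the candidate whose reasoning chain scores highest overall."""
    return max(candidates, key=lambda name: aggregate(candidates[name]))

# Three hypothetical candidates with made-up per-step PRM scores.
candidates = {
    "answer_a": [0.9, 0.8, 0.2],  # error late in the chain
    "answer_b": [0.7, 0.7, 0.7],  # consistently sound steps
    "answer_c": [0.3, 0.9, 0.9],  # early-stage error, which PRMs penalize
}
print(best_of_n(candidates))  # → answer_b
```

This step-level view is what distinguishes a PRM from an outcome reward model, which would score only the final answer: a chain with a confident early mistake can still end at a plausible-looking answer, and the min-aggregation above is one simple way such early errors are caught.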