Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multilingual settings, large language models (LLMs) struggle with multi-step reasoning—especially for non-English languages—because reasoning and execution are tightly coupled, which limits the effectiveness of chain-of-thought (CoT) prompting. Method: This work systematically evaluates the Program-of-Thought (PoT) paradigm, which decouples multilingual reasoning generation from executable code execution. We investigate (i) how instruction fine-tuning affects cross-lingual alignment between questions and reasoning steps, and (ii) how reasoning quality—quantified by functional correctness of generated code—determines final answer accuracy. Contribution/Results: We introduce, for the first time, PoT reasoning quality as a heuristic metric for test-time performance prediction and adaptive optimization. Experiments show that PoT-finetuned models significantly outperform CoT baselines on multilingual reasoning tasks; moreover, reasoning quality exhibits a strong positive correlation with answer accuracy, revealing the mechanism behind PoT's stronger generalization in multilingual contexts.
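The decoupling the summary describes can be sketched in a few lines of Python: a model-side step emits an executable program from the question, and a separate interpreter runs it. This is a minimal illustrative sketch, not the paper's pipeline; `generate_program` is a hypothetical stand-in for an LLM call, returning a canned program for one German example question.

```python
# Sketch of the Program-of-Thought (PoT) loop: the model generates an
# executable program from a (possibly non-English) question, and a
# separate interpreter executes it to obtain the final answer.

def generate_program(question: str) -> str:
    # Hypothetical stand-in for an LLM call. Here it returns a canned
    # program for the German question:
    # "Anna hat 3 Äpfel und kauft 5 weitere. Wie viele hat sie?"
    return (
        "apples_start = 3\n"
        "apples_bought = 5\n"
        "answer = apples_start + apples_bought\n"
    )

def execute_program(program: str):
    # Execution is decoupled from reasoning: any Python interpreter can
    # run the generated code, regardless of the question's language.
    namespace: dict = {}
    exec(program, namespace)
    return namespace.get("answer")

question = "Anna hat 3 Äpfel und kauft 5 weitere. Wie viele hat sie?"
print(execute_program(generate_program(question)))  # prints 8
```

Because only the generation step depends on the question's language, execution correctness can be checked independently of the language the question was asked in.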

📝 Abstract
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
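The test-time heuristic mentioned in the abstract's last sentence can be sketched as follows, under the assumption that "reasoning quality" is approximated by whether a candidate program executes cleanly and yields a numeric answer. The scoring rule and the candidate programs below are illustrative assumptions, not the paper's exact metric.

```python
# Hedged sketch of reasoning quality as a test-time heuristic: sample
# several candidate programs for one question, score each by a crude
# functional-correctness proxy, and keep the highest-scoring one.

def run_candidate(program: str):
    """Execute one candidate program; return its answer, or None on failure."""
    namespace: dict = {}
    try:
        exec(program, namespace)
    except Exception:
        return None
    return namespace.get("answer")

def quality(program: str) -> int:
    """Illustrative proxy: 1 if the program runs and produces a numeric
    `answer`, else 0. A real metric could be graded more finely."""
    result = run_candidate(program)
    return int(isinstance(result, (int, float)))

# Three hypothetical samples for the same question; the second is broken.
candidates = [
    "answer = 3 + 5",
    "answer = apples + 5",   # NameError: fails the quality check
    "answer = 3 * 5",
]
best = max(candidates, key=quality)
print(run_candidate(best))  # prints 8
```

Since code either executes correctly or it does not, this kind of check is available at test time without gold answers, which is what makes reasoning quality usable as a selection heuristic.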
Problem

Research questions and friction points this paper is trying to address.

Enhance multilingual reasoning in LLMs
Separate reasoning from code execution
Improve answer accuracy through reasoning quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Program-of-Thought prompting separates reasoning from execution
Fine-tuning enhances multilingual reasoning performance
Code quality correlates with answer accuracy