🤖 AI Summary
This work investigates attribution patterns in chain-of-thought (CoT) reasoning within multilingual large language models, focusing on cross-lingual disparities in reliability and interpretability. We propose a joint step-level and token-level attribution analysis framework and evaluate it systematically on the MGSM multilingual benchmark with the Qwen2.5-1.5B-Instruct model, using ContextCite for step-level and Inseq for token-level attribution. Results reveal that final reasoning steps are consistently over-attributed and that this attribution bias is exacerbated in low-resource languages. Structured CoT prompting significantly improves both accuracy and attribution consistency for high-resource Latin-script languages, whereas negation and distractor-sentence perturbations degrade both task performance and attribution stability. To our knowledge, this is the first study to uncover systematic asymmetries in multilingual CoT attribution, demonstrating that attribution behavior is non-uniform across languages and prompting formats. These findings offer theoretical insight into trustworthy multilingual reasoning and inform methodological refinements for robust, interpretable cross-lingual inference.
📝 Abstract
This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior work demonstrates that CoT prompting improves task performance, concerns remain about the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods (ContextCite for step-level attribution and Inseq for token-level attribution) to the Qwen2.5-1.5B-Instruct model on the MGSM benchmark. Our experiments yield three key findings: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce both model accuracy and attribution coherence. These results expose limitations of CoT prompting, particularly with respect to multilingual robustness and interpretive transparency.