🤖 AI Summary
This work investigates attribution patterns in chain-of-thought (CoT) reasoning within multilingual large language models, focusing on cross-lingual disparities in reliability and interpretability. We propose a joint step-level and token-level attribution analysis framework and evaluate it systematically on the MGSM multilingual benchmark with the Qwen2.5-1.5B-Instruct model, using ContextCite for step-level and Inseq for token-level attribution. Results reveal that final reasoning steps are consistently over-attributed and that this attribution bias is exacerbated in low-resource languages. Structured CoT prompting significantly improves both accuracy and attribution consistency for high-resource Latin-script languages, whereas negation and distractor-sentence perturbations degrade both task performance and attribution stability. To our knowledge, this is the first study to uncover systematic asymmetries in multilingual CoT attribution, demonstrating that attribution behavior is non-uniform across languages and prompting formats. These findings offer theoretical insight into trustworthy multilingual reasoning and inform methodological refinements for robust, interpretable cross-lingual inference.
📝 Abstract
This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior work demonstrates that CoT prompting improves task performance, concerns remain about the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods (ContextCite for step-level attribution and Inseq for token-level attribution) to the Qwen2.5-1.5B-Instruct model on the MGSM benchmark. Our experiments yield three key findings: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce both model accuracy and attribution coherence. These results expose limitations of CoT prompting, particularly with respect to multilingual robustness and interpretive transparency.