🤖 AI Summary
Existing chain-of-thought (CoT) evaluation methods rely on human annotations or focus solely on final answer correctness, rendering them incapable of precisely diagnosing intermediate reasoning errors and resulting in high false positive rates. This work proposes the first unsupervised, information-theoretic CoT evaluation framework, modeling each reasoning step as an information gain process and quantifying step-wise information flow via entropy and mutual information—without requiring any labeled data—to localize reasoning bottlenecks. Its core contribution is the formalization of information gain within CoT, enabling fine-grained, interpretable failure mode identification. On benchmarks including GSM-8K, the method significantly reduces false positive rates compared to existing outcome-oriented evaluators. Moreover, it provides task-level performance attribution, uncovering latent reasoning deficiencies in models. By grounding CoT assessment in information theory, this framework establishes a novel paradigm for analyzing and improving CoT reliability.
📝 Abstract
Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the "information gain" at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy data and GSM-8K, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks.
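The core idea of quantifying per-step "information gain" can be illustrated with a minimal sketch. Assuming the evaluator has access to the model's predictive distribution over final answers after each reasoning step (the paper's exact estimator is not reproduced here; the function names and the toy distributions below are hypothetical), each step's gain is the drop in answer entropy it produces, and a step with near-zero gain marks a candidate reasoning bottleneck:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def stepwise_information_gain(answer_dists):
    """Given predictive distributions over the final answer after each
    reasoning step (index 0 = prior, before any steps), return each
    step's information gain: the entropy reduction it causes."""
    entropies = [entropy(d) for d in answer_dists]
    return [entropies[t] - entropies[t + 1] for t in range(len(entropies) - 1)]

# Toy example: 3 candidate answers; each step sharpens the distribution.
dists = [
    [1/3, 1/3, 1/3],    # prior: maximal uncertainty
    [0.5, 0.25, 0.25],  # step 1 narrows the answer set slightly
    [0.9, 0.05, 0.05],  # step 2 nearly resolves the answer
]
gains = stepwise_information_gain(dists)
# Steps whose gain is near zero contribute little information and can be
# flagged as failure modes without any labeled CoT data.
```

Because the gains telescope, their sum equals the total entropy reduction from prior to final step, which is what lets this decomposition attribute overall performance to individual reasoning steps.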