🤖 AI Summary
This work investigates how to identify and leverage the uncertainty dynamics of chain-of-thought (CoT) reasoning for efficient and reliable inference control. It introduces sequential change-point detection into CoT analysis for the first time and reveals a two-phase entropy evolution: an initial high-entropy exploration stage followed by a low-entropy convergence stage. Building on this insight, the authors develop a training-free, real-time monitoring framework based on the CUSUM algorithm. For early exit, the framework achieves 63.06% accuracy while generating only 88.9% of the tokens (an 11.1% reduction), outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy, respectively, and yielding a better trade-off between computational cost and accuracy. For test-time scaling, a CUSUM-weighted voting strategy consistently outperforms self-consistency.
📝 Abstract
This paper investigates the entropy dynamics of Chain-of-Thought (CoT) reasoning and uncovers a consistent two-phase structure: an Uncertainty Region of exploration that transitions sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High Reliability -- answers in the Confidence Region become highly accurate and stable, and 2) High Redundancy -- models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1) Early Exit leverages reliability and redundancy to terminate computation safely when returns diminish, and 2) Test-Time Scaling uses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem; to our knowledge, this is the first application of classical change-point methods to monitoring CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show that our approach establishes a superior Pareto frontier for early exit: CUSUM achieves 63.06% accuracy with an 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy, respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.
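To make the change-point formulation concrete, the following is a minimal sketch of a textbook one-sided CUSUM applied to a stream of per-token entropies. It is an illustration under stated assumptions, not the paper's implementation: the function names, the fixed `baseline`, and the `drift` and `threshold` values are all hypothetical, and the abstract does not specify how the authors compute or calibrate their statistic.

```python
import math
from typing import Optional, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cusum_entropy_drop(
    entropies: Sequence[float],
    baseline: float,          # assumed entropy level of the Uncertainty Region
    drift: float = 0.25,      # assumed slack: drops smaller than this are ignored
    threshold: float = 3.0,   # assumed alarm level for the cumulative statistic
) -> Optional[int]:
    """Return the first index at which a one-sided CUSUM flags a downward
    shift in entropy, or None if no alarm is raised."""
    s = 0.0
    for t, h in enumerate(entropies):
        # Accumulate evidence that entropy sits below the baseline by more
        # than `drift`; clamp at zero so earlier high-entropy steps do not
        # build up a negative balance that would delay detection.
        s = max(0.0, s + (baseline - h - drift))
        if s > threshold:
            return t
    return None


# Synthetic trace: 10 high-entropy steps followed by 10 low-entropy steps.
trace = [2.0] * 10 + [0.2] * 10
alarm = cusum_entropy_drop(trace, baseline=2.0)
print(alarm)  # 11: the alarm fires one step after the shift at index 10
```

In an early-exit setting, generation could be stopped (or an answer elicited) at the alarm index; in a test-time scaling setting, whether and when a trajectory raises the alarm could serve as a voting weight. Both uses are only sketched here, and the choice of `threshold` governs the usual trade-off between detection delay and false alarms.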