🤖 AI Summary
Reasoning-oriented large language models (RLMs) exhibit counterintuitive behaviors, such as degraded performance under few-shot prompting, which points to a gap in our understanding of their internal reasoning mechanisms. To address this, we propose a unified graph-based analytical framework that models long chain-of-thought reasoning as a directed reasoning graph: semantic clustering compresses verbose reasoning output into coherent steps that serve as interpretable nodes, and structural metrics, including exploration density, branching and convergence ratios, and path coherence, are computed on the resulting graph. Experiments across models and prompting regimes show that these structural properties correlate strongly with task accuracy and support evaluation of reasoning quality beyond conventional metrics. The analysis also shows that prompting strategies substantially reshape the internal reasoning structure of RLMs, which directly affects task outcomes and gives practical guidance for prompt engineering. The framework thus offers a structural, graph-based basis for interpretable analysis of reasoning in LLMs.
📝 Abstract
Recent advances in test-time scaling have enabled Large Language Models (LLMs) to display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation. Despite their potential, these Reasoning LLMs (RLMs) often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting, that challenge our current understanding of RLMs. In this work, we introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs. Our method first clusters long, verbose CoT outputs into semantically coherent reasoning steps, then constructs directed reasoning graphs to capture contextual and logical dependencies among these steps. Through comprehensive analysis across models and prompting regimes, we reveal that structural properties, such as exploration density, branching, and convergence ratios, strongly correlate with reasoning accuracy. Our findings demonstrate how prompting strategies substantially reshape the internal reasoning structure of RLMs, directly affecting task outcomes. The proposed framework not only enables quantitative evaluation of reasoning quality beyond conventional metrics but also provides practical insights for prompt engineering and the cognitive analysis of LLMs. Code and resources will be released to facilitate future research in this direction.
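The pipeline described in the abstract (cluster the chain of thought into steps, connect the steps into a directed graph, then measure its structure) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give formulas for its metrics, so the definitions below are assumptions chosen for illustration. Exploration density is taken to be plain directed-graph density, the branching ratio is the share of nodes with more than one outgoing edge, and the convergence ratio is the share of nodes with more than one incoming edge. The function names `build_reasoning_graph` and `graph_metrics` are hypothetical, and the semantic clustering step is assumed to have already produced one cluster id per reasoning step.

```python
from collections import defaultdict


def build_reasoning_graph(step_clusters):
    """Build a directed graph from a sequence of cluster ids.

    Assumption: each element is the cluster id of one reasoning step,
    in generation order. An edge is added whenever the chain moves
    from one cluster to a different one; repeats within the same
    cluster are collapsed.
    """
    nodes = set(step_clusters)
    edges = set()
    for src, dst in zip(step_clusters, step_clusters[1:]):
        if src != dst:
            edges.add((src, dst))
    return nodes, edges


def graph_metrics(nodes, edges):
    """Compute illustrative structural metrics (assumed definitions)."""
    n = len(nodes)
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1

    max_edges = n * (n - 1)  # possible directed edges, no self-loops
    return {
        # Assumed: fraction of possible directed edges that are present.
        "exploration_density": len(edges) / max_edges if max_edges else 0.0,
        # Assumed: fraction of nodes that fan out to more than one successor.
        "branching_ratio": sum(1 for v in nodes if out_deg[v] > 1) / n if n else 0.0,
        # Assumed: fraction of nodes reached from more than one predecessor.
        "convergence_ratio": sum(1 for v in nodes if in_deg[v] > 1) / n if n else 0.0,
    }


if __name__ == "__main__":
    # Toy chain: the model visits cluster 1, detours to 2, returns to 1,
    # then proceeds through 3 to 4.
    nodes, edges = build_reasoning_graph([0, 1, 2, 1, 3, 3, 4])
    print(graph_metrics(nodes, edges))
```

In the toy chain, cluster 1 is both a branching node (it leads to clusters 2 and 3) and a convergence node (it is reached from clusters 0 and 2), which is the kind of revisiting pattern these metrics are meant to expose. A real analysis would obtain the cluster ids from sentence embeddings and a clustering algorithm, and might weight edges by transition frequency.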