Reasoning Structure of Large Language Models

📅 2026-06-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
Common evaluation metrics such as final-answer accuracy and token count can assign identical scores to models whose reasoning is structured very differently. This work introduces a scalable benchmark of logic puzzles and a pipeline that parses unstructured reasoning traces into verifiable graphs of claims and their dependencies, so that the reasoning process becomes a structured object whose topology can be measured. On top of these graphs, the authors define a “reasoning efficiency” metric that quantifies how concentrated the model's logical flow is. Their analysis of open-source reasoning models shows that these structural measurements separate behaviors that accuracy and token count conflate, which makes them useful for diagnosing failure modes and for comparing how reasoning scales with puzzle difficulty.
📝 Abstract
Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.
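To make the idea of a reasoning graph concrete, here is a minimal sketch, assuming a trace has already been parsed into claims with explicit dependencies. The abstract does not give the paper's definition of reasoning efficiency, so the function below is a hypothetical stand-in and not the authors' metric: it reports the share of claims that the final answer transitively depends on. The name `contributing_fraction`, the dictionary layout, and the claim ids are all illustrative assumptions.

```python
from collections import deque


def contributing_fraction(dependencies, final_claim):
    """Share of claims that the final claim transitively depends on.

    `dependencies` maps each claim id to the list of claim ids it relies on.
    This is an illustrative proxy, NOT the paper's reasoning efficiency metric.
    """
    # Collect every claim id that appears as a key or as a premise.
    all_claims = set(dependencies)
    for premises in dependencies.values():
        all_claims.update(premises)
    all_claims.add(final_claim)  # count the final claim even if it is isolated

    # Breadth-first walk backwards from the final claim through its premises.
    reached = {final_claim}
    queue = deque([final_claim])
    while queue:
        claim = queue.popleft()
        for premise in dependencies.get(claim, []):
            if premise not in reached:
                reached.add(premise)
                queue.append(premise)

    return len(reached) / len(all_claims)


# Hypothetical trace: c4 is the answer; c3 is derived but never used.
trace = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c1"],
    "c4": ["c2"],
}
print(contributing_fraction(trace, "c4"))  # 0.75
```

Under this proxy, two traces with the same answer and similar length can still score differently: one that wanders through many unused claims scores lower than one whose claims nearly all feed the answer. That is the kind of distinction the abstract says accuracy and token count conflate.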
Problem

Research questions and friction points this paper is trying to address.

reasoning structure
large language models
evaluation metrics
logic puzzles
reasoning efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning graph
structured reasoning
reasoning efficiency
logic puzzles benchmark
dependency analysis