ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of evaluating reasoning traces from large reasoning models (LRMs), which often exhibit non-linear structures such as backtracking and self-correction. The authors propose ReasoningFlow, a framework that represents the discourse structure of each trace as a fine-grained directed acyclic graph (DAG). They develop and validate the annotation schema on 31 manually annotated traces (2.1k steps), then scale to automatic annotation of 1,260 traces (247.7k steps) across math, science, and argumentation tasks and five models. Their analysis yields four findings: LRM traces are structurally similar across models; the graphs expose fine-grained reasoning behaviors such as local verification, self-reflection, and assumptions that can aid monitoring; most erroneous steps are not used to derive the final answer; and mechanistic causal dependencies between steps do not reflect the language-level discourse structure. The dataset and code are publicly released.

📝 Abstract

Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces as fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used to improve the monitorability of reasoning traces. (3) In LRMs, most erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code at https://github.com/jinulee-v/reasoningflow.
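To make finding (3) concrete, the following is a minimal sketch of how one might check whether an erroneous step feeds into the final answer once a trace is represented as a DAG. It is not the authors' implementation: the step IDs, the edge list, the `erroneous` set, and the `ancestors` helper are all hypothetical, and the actual ReasoningFlow schema (node and edge labels) is not reproduced here.

```python
from collections import defaultdict


def ancestors(edges, target):
    """Return every step that `target` transitively depends on.

    `edges` is a list of (premise_step, derived_step) pairs; this plain
    pair format is an assumption made for illustration only.
    """
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical toy trace: s3 is a faulty attempt that the model abandons.
edges = [
    ("s1", "s2"),  # s2 is derived from s1
    ("s2", "s3"),  # s3: faulty attempt
    ("s2", "s4"),  # s4: restart from s2 after backtracking
    ("s4", "s5"),  # s5: final answer
]
erroneous = {"s3"}

used = ancestors(edges, "s5")        # steps the final answer depends on
unused_errors = erroneous - used     # erroneous steps off the answer path
```

In this toy trace the final answer `s5` depends only on `s1`, `s2`, and `s4`, so the erroneous step `s3` lies on an abandoned branch and never reaches the answer.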

Problem

Research questions and friction points this paper is trying to address.

reasoning traces

non-linear structures

large reasoning models

discourse structures

reasoning monitorability

Innovation

Methods, ideas, or system contributions that make the work stand out.

ReasoningFlow

discourse structure

reasoning traces