Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models

πŸ“… 2026-03-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing methods struggle to evaluate the quality of diverse or degenerate reasoning trajectories generated by language models from a human cognitive perspective, and they generalize poorly. To address this, the work proposes MarODE, a framework that integrates Markov processes with ordinary differential equations (ODEs) to model the dynamic evolution of reasoning trajectories. MarODE establishes a theoretically grounded, human-aligned evaluation paradigm that captures the temporal and structural nuances of human judgment. Through human-centric perturbation testing and large-scale empirical validation, MarODE surpasses current baselines by over 250% in Somers' D correlation, improving both the accuracy and the generalizability of reasoning quality assessment.
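The summary does not spell out the model's equations. Purely as illustrative background (not the paper's actual formulation), the coupling of a Markov process with an ODE can be sketched via the Kolmogorov forward equation dp/dt = pQ, here over three hypothetical reasoning states:

```python
import numpy as np

# Toy illustration only (not MarODE itself): a continuous-time Markov
# chain over three hypothetical reasoning states, whose distribution p(t)
# evolves by the Kolmogorov forward ODE  dp/dt = p @ Q.
Q = np.array([
    [-0.5,  0.4,  0.1],   # rates out of "exploring"
    [ 0.0, -0.3,  0.3],   # rates out of "verifying"
    [ 0.0,  0.0,  0.0],   # "concluded" is absorbing
])

def evolve(p0, Q, t=10.0, steps=1000):
    """Integrate dp/dt = p Q with forward Euler steps."""
    p = np.array(p0, dtype=float)
    dt = t / steps
    for _ in range(steps):
        p = p + dt * (p @ Q)
    return p

# Start with all probability mass in the "exploring" state.
p_final = evolve([1.0, 0.0, 0.0], Q)
print(p_final)  # mass flows into the absorbing "concluded" state
```

Because each row of Q sums to zero, the Euler updates preserve total probability, so p(t) remains a valid distribution throughout the integration.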

πŸ“ Abstract
Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric: goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
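The reported gains are measured under Somers' D, an asymmetric rank correlation between a metric's scores and reference judgments. As background, a minimal stdlib sketch of Somers' D of y given x (pairs tied on x are excluded from the denominator; variable names are illustrative):

```python
from itertools import combinations

def somers_d(x, y):
    """Somers' D of y given x: (C - D) / (C + D + T_y),
    where C/D count concordant/discordant pairs, T_y counts pairs
    tied on y but not on x, and pairs tied on x are dropped."""
    c = d = ty = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if xi == xj:
            continue  # pair tied on x: excluded entirely
        if yi == yj:
            ty += 1   # tied on y only
        elif (xi - xj) * (yi - yj) > 0:
            c += 1    # concordant
        else:
            d += 1    # discordant
    return (c - d) / (c + d + ty)

print(somers_d([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0 (perfectly concordant)
print(somers_d([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0 (perfectly discordant)
```

In practice `scipy.stats.somersd` provides the same statistic; this hand-rolled version just makes the pair-counting definition explicit.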
Problem

Research questions and friction points this paper is trying to address.

reasoning traces
evaluation framework
reasoning quality
language models
human-centric evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Markovian ODE
reasoning trace evaluation
language model reasoning
offline evaluation framework
trace dynamics
Arghodeep Nandi
Department of Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, India
Ojasva Saxena
Department of Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, India
Tanmoy Chakraborty
Associate Professor, IIT Delhi, India
Natural Language Processing, Large Language Models, Social Computing