🤖 AI Summary
To address the absence of verifiable external reward signals in open-ended long-horizon reasoning tasks, this paper proposes DRO, a self-driven reinforcement learning framework. Its core innovation is the "Reasoning Reflection Reward" (R3), a fine-grained, human-label-free reward signal generated dynamically by the model itself during chain-of-thought (CoT) reasoning, enabling internal alignment between intermediate steps and final outcomes. DRO establishes a fully self-contained training paradigm that integrates R3-guided reinforcement learning, self-reflective reward modeling, and dynamic data filtering based on R3 confidence. Evaluated on ParaRev (paragraph revision) and FinQA (mathematical question answering), DRO significantly outperforms strong baselines while demonstrating robust generalization across both open-domain and structured reasoning tasks. To the authors' knowledge, DRO is the first unsupervised, fine-grained, process-aware reasoning optimization framework.
📝 Abstract
Recent advances in Large Language Models (LLMs) have showcased impressive reasoning abilities in structured tasks like mathematics and programming, largely driven by Reinforcement Learning with Verifiable Rewards (RLVR), which uses outcome-based signals that are scalable, effective, and robust against reward hacking. However, applying similar techniques to open-ended long-form reasoning tasks remains challenging due to the absence of generic, verifiable reward signals. To address this, we propose Direct Reasoning Optimization (DRO), a reinforcement learning framework for fine-tuning LLMs on open-ended, particularly long-form, reasoning tasks, guided by a new reward signal: the Reasoning Reflection Reward (R3). At its core, R3 selectively identifies and emphasizes key tokens in the reference outcome that reflect the influence of the model's preceding chain-of-thought reasoning, thereby capturing the consistency between reasoning and reference outcome at a fine-grained level. Crucially, R3 is computed internally using the same model being optimized, enabling a fully self-contained training setup. Additionally, we introduce a dynamic data filtering strategy based on R3 for open-ended reasoning tasks, reducing cost while improving downstream performance. We evaluate DRO on two diverse datasets -- ParaRev, a long-form paragraph revision task, and FinQA, a math-oriented QA benchmark -- and show that it consistently outperforms strong baselines while remaining broadly applicable across both open-ended and structured domains.
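The abstract describes R3 as scoring the reference outcome's tokens by how strongly they reflect the preceding chain-of-thought, using the same model that is being trained. A minimal sketch of that idea, under the assumption that R3 compares token-level log-probabilities of the reference outcome with and without the CoT in context (the `logprob_fn` scoring call and `top_frac` cutoff here are hypothetical stand-ins, not the paper's exact formulation):

```python
import math

def r3_reward(logprob_fn, prompt, cot, reference_tokens, top_frac=0.5):
    """Sketch of a Reasoning Reflection Reward: average log-prob over the
    reference tokens whose likelihood the CoT shifts the most.

    logprob_fn(context_tokens, target_tokens) -> per-token log-probs;
    in the paper this would be the policy model itself being optimized.
    """
    with_cot = logprob_fn(prompt + cot, reference_tokens)
    without_cot = logprob_fn(prompt, reference_tokens)
    # Tokens whose likelihood rises most when the CoT is present are the
    # "reasoning-reflective" tokens; emphasize only those.
    deltas = [w - wo for w, wo in zip(with_cot, without_cot)]
    k = max(1, int(len(reference_tokens) * top_frac))
    top = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)[:k]
    return sum(with_cot[i] for i in top) / k

# Toy stand-in for an LLM: a token is likely iff it appears in the context.
def toy_logprob_fn(context, tokens):
    return [math.log(0.9) if t in context else math.log(0.1) for t in tokens]

# "x" is explained by the CoT, "z" is not, so only "x" drives the reward.
reward = r3_reward(toy_logprob_fn, ["q"], ["x", "y"], ["x", "z"], top_frac=0.5)
```

Because the reward is computed from the model's own likelihoods rather than an external verifier, the setup stays fully self-contained, and the same per-token deltas can serve as the confidence signal for the R3-based data filtering the paper describes.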