Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL

πŸ“… 2025-09-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing reinforcement-learning reward functions focus solely on answer correctness and formatting, neglecting the causal contribution of chain-of-thought (CoT) reasoning to answer quality and offering no mechanism to control logical depth. Method: We propose the Dynamic Reasoning Efficiency Reward (DRER), the first RL reward that explicitly models the causal effect of CoT on answer correctness; it incorporates a dynamic length-advantage mechanism that makes logical depth a controllable optimization target. To support CoT quality evaluation and training, we introduce Logictree, a dynamically constructed, fine-grained deductive reasoning dataset. Contribution/Results: After DRER optimization, a 7B model reaches GPT-o3-mini-level performance on Logictree within 400 training steps, and the average confidence of CoT-augmented answers rises by 30%. The method generalizes across diverse logical and mathematical reasoning tasks.
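The core mechanism, rewarding a chain of thought by its measurable effect on answer likelihood, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the function name, the sigmoid squashing, and the zero reward for incorrect answers are all illustrative assumptions.

```python
import math

def reasoning_quality_reward(logp_answer_with_cot: float,
                             logp_answer_without_cot: float,
                             answer_correct: bool) -> float:
    """Illustrative DRER-style reasoning-quality reward (not the paper's code).

    Credit is given only insofar as conditioning on the CoT raises the
    model's likelihood of the correct answer a*, i.e. the uplift
    log p(a* | q, CoT) - log p(a* | q).
    """
    if not answer_correct:
        return 0.0  # assumption: no reasoning credit without a correct answer
    uplift = logp_answer_with_cot - logp_answer_without_cot
    # Squash the uplift into (0, 1) so it composes with a 0/1 correctness reward.
    return 1.0 / (1.0 + math.exp(-uplift))

# Example: a CoT that lifts log p(a* | q) from -2.3 to -0.7 earns ~0.83.
print(reasoning_quality_reward(-0.7, -2.3, True))
```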

πŸ“ Abstract
Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Furthermore, such task-specific training offers limited control over logical depth and therefore may fail to reveal a model's genuine reasoning capacity. We propose Dynamic Reasoning Efficiency Reward (DRER) -- a plug-and-play RL reward framework that reshapes both reward and advantage signals. (i) A Reasoning Quality Reward assigns fine-grained credit to those reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising trajectories with beneficial CoT tokens. (ii) A Dynamic Length Advantage decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release Logictree, a dynamically constructed deductive reasoning dataset that functions both as RL training data and as a comprehensive benchmark. Experiments confirm the effectiveness of DRER: our 7B model attains GPT-o3-mini-level performance on Logictree within 400 training steps, while the average confidence of CoT-augmented answers rises by 30%. The model further generalises across diverse logical-reasoning datasets and the mathematical benchmark AIME24. These results illuminate how RL shapes CoT behaviour and chart a practical path toward enhancing formal-reasoning skills in large language models. All code and data are available in the repository at https://github.com/Henryhe09/DRER.
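The second component, the Dynamic Length Advantage, scales a trajectory's advantage down as its length drifts from a validation-derived target. A minimal sketch follows; the exponential decay, the tolerance window, and the constants are assumptions for illustration, not the paper's exact schedule.

```python
import math

def dynamic_length_advantage(advantage: float,
                             response_len: int,
                             target_len: int,
                             tolerance: int = 64,
                             decay: float = 0.01) -> float:
    """Illustrative length-aware advantage shaping (not the paper's code).

    The raw advantage is decayed once the response length drifts beyond a
    window around the validation-derived target, discouraging both
    degenerate short answers and unbounded CoT growth.
    """
    deviation = max(0, abs(response_len - target_len) - tolerance)
    return advantage * math.exp(-decay * deviation)

# A response 300 tokens beyond the window keeps exp(-3) ≈ 5% of its advantage.
print(dynamic_length_advantage(1.0, response_len=876, target_len=512))
```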
Problem

Research questions and friction points this paper is trying to address.

Assessing reasoning quality beyond answer correctness in LLMs
Enhancing logical depth and control in Chain-of-Thought reasoning
Improving generalization across diverse logical reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL reward framework enhancing reasoning quality
Dynamic length advantage stabilizing training process
Dynamically constructed deductive-reasoning dataset for training and benchmarking
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Haoyang He
Beijing University of Posts and Telecommunications
Zihua Rong
Beijing University of Posts and Telecommunications
Kun Ji
Beijing University of Posts and Telecommunications
Chenyang Li
Beijing University of Posts and Telecommunications
Qing Huang
Chinese Academy of Sciences
Chong Xia
Beijing University of Posts and Telecommunications
Lan Yang
Edwin & Florence Skinner Professor, Electrical & Systems Engineering, Washington Univ. in St Louis
Honggang Zhang
Beijing University of Posts and Telecommunications