AI Summary
Existing reinforcement learning reward functions focus solely on answer correctness and formatting, neglecting the causal contribution of chain-of-thought (CoT) reasoning to answer quality and lacking mechanisms to control logical depth. Method: We propose Dynamic Reasoning Efficiency Reward (DRER), the first RL reward that explicitly models the causal effect of CoT on answer correctness; it incorporates a dynamic length advantage mechanism to enable controllable optimization of logical depth. To support CoT quality evaluation and training, we introduce Logictree, a fine-grained, human-annotated dataset. Contribution/Results: After DRER optimization, a 7B model reaches GPT-o3-mini-level performance within 400 training steps, with a 30% increase in CoT confidence. The method demonstrates strong generalization across diverse logical and mathematical reasoning tasks.
Abstract
Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Furthermore, such task-specific training offers limited control over logical depth and therefore may fail to reveal a model's genuine reasoning capacity. We propose Dynamic Reasoning Efficiency Reward (DRER) -- a plug-and-play RL reward framework that reshapes both reward and advantage signals. (i) A Reasoning Quality Reward assigns fine-grained credit to those reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising trajectories with beneficial CoT tokens. (ii) A Dynamic Length Advantage decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release Logictree, a dynamically constructed deductive reasoning dataset that serves both as RL training data and as a comprehensive benchmark. Experiments confirm the effectiveness of DRER: our 7B model attains GPT-o3-mini-level performance on Logictree within 400 training steps, while the average confidence of CoT-augmented answers rises by 30%. The model further exhibits generalisation across diverse logical-reasoning datasets and the mathematical benchmark AIME24. These results illuminate how RL shapes CoT behaviour and chart a practical path toward enhancing formal-reasoning skills in large language models. All code and data are available at https://github.com/Henryhe09/DRER.
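The two DRER components described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the max-with-zero clipping, the exponential decay shape, and the `tau` scale are assumptions, and the likelihood inputs are assumed to come from the model's answer logits with and without the CoT prefix.

```python
import math


def reasoning_quality_reward(p_correct_with_cot, p_correct_without_cot):
    """Credit the CoT only when it raises the likelihood of the correct
    answer (a positive causal contribution); otherwise give no credit.
    Clipping at zero is an assumption of this sketch."""
    gain = p_correct_with_cot - p_correct_without_cot
    return max(gain, 0.0)


def dynamic_length_advantage(advantage, response_len, threshold, tau=100.0):
    """Decay the advantage of responses whose length deviates from a
    validation-derived threshold. The exponential form and the scale
    `tau` are assumptions; the paper only specifies that the advantage
    decays with deviation from the threshold."""
    deviation = abs(response_len - threshold)
    return advantage * math.exp(-deviation / tau)
```

For example, a response at exactly the threshold length keeps its full advantage, while one 200 tokens away retains only `exp(-2)` of it; a CoT that lifts the correct-answer likelihood from 0.6 to 0.9 earns a reward of 0.3, and a CoT that lowers it earns nothing.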