🤖 AI Summary
Large language models (LLMs) often generate excessively verbose responses during mathematical reasoning, incurring high computational cost and latency without commensurate accuracy gains.
Method: This paper proposes a performance-aware adaptive reward mechanism for reinforcement learning—departing from fixed-length penalties, it dynamically modulates length penalty intensity based on real-time accuracy estimates, enabling automatic trade-offs between correctness and inference efficiency. The method integrates online performance monitoring, dynamic weight scheduling, and policy gradient optimization, requiring no manual hyperparameter tuning and adapting autonomously as model capabilities evolve.
Contribution/Results: Evaluated across multiple mathematical reasoning benchmarks, the approach reduces average inference length by up to 47% while preserving ≥98% of the original accuracy. This yields substantial improvements in “fast-and-accurate” reasoning capability, demonstrating efficiency gains while remaining robust against accuracy degradation.
📝 Abstract
Large language models (LLMs) have demonstrated strong reasoning abilities in mathematical tasks, often enhanced through reinforcement learning (RL). However, RL-trained models frequently produce unnecessarily long reasoning traces -- even for simple queries -- leading to increased inference costs and latency. While recent approaches attempt to control verbosity by adding length penalties to the reward function, these methods rely on fixed penalty terms that are hard to tune and cannot adapt as the model's reasoning capability evolves, limiting their effectiveness. In this work, we propose an adaptive reward-shaping method that enables LLMs to "think fast and right" -- producing concise outputs without sacrificing correctness. Our method dynamically adjusts the reward trade-off between accuracy and response length based on model performance: when accuracy is high, the length penalty increases to encourage faster length reduction; when accuracy drops, the penalty is relaxed to preserve correctness. This adaptive reward accelerates early-stage length reduction while avoiding over-compression in later stages. Experiments across multiple datasets show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy, offering a new direction for cost-efficient adaptive reasoning in large-scale language models.
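The adaptive trade-off described above can be sketched in a few lines. This is a minimal illustrative reconstruction, not the paper's actual implementation: the function names (`update_penalty_weight`, `shaped_reward`), the target-accuracy threshold, the additive weight schedule, and the linear length penalty are all assumptions for the sake of a concrete example.

```python
def update_penalty_weight(weight, batch_accuracy, target_acc=0.9,
                          step=0.05, w_min=0.0, w_max=1.0):
    """Adapt the length-penalty weight from recent batch accuracy.

    Hypothetical schedule: when accuracy meets the target, tighten the
    penalty to push for shorter outputs; when accuracy falls below it,
    relax the penalty to protect correctness.
    """
    if batch_accuracy >= target_acc:
        return min(w_max, weight + step)  # compress harder
    return max(w_min, weight - step)      # back off to preserve accuracy


def shaped_reward(correct, length, max_length, weight):
    """Accuracy reward minus an adaptively weighted length penalty.

    Assumes a binary correctness signal and a length penalty that grows
    linearly with response length, scaled by the current weight.
    """
    accuracy_reward = 1.0 if correct else 0.0
    length_penalty = weight * (length / max_length)
    return accuracy_reward - length_penalty
```

In an RL training loop, `update_penalty_weight` would be called once per batch from monitored accuracy, and `shaped_reward` would replace the fixed-penalty reward fed to the policy-gradient update; a fixed-penalty baseline corresponds to never calling the update.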