🤖 AI Summary
Redundant reasoning chains in large language models (LLMs) waste computation, impair readability, and exacerbate hallucination risks. To address this, we propose a hyperparameter-free, context-aware conciseness scoring mechanism that leverages an LLM as judge for automatic, fine-grained evaluation of reasoning-path quality—requiring no human annotations or fixed-length constraints, and enabling reasoning length that adapts to task difficulty. This score serves as a reward signal in Proximal Policy Optimization (PPO), jointly optimizing correctness and conciseness. On MATH, our method reduces token consumption by up to 31× while improving overall accuracy by 7%; on the hardest problems, accuracy increases by 7.5% while word count decreases by 3.6×. On TheoremQA, accuracy improves by 2.2% and token usage drops by 12.5×. Our core contribution is the first end-to-end framework for reasoning trajectory refinement, featuring dynamic conciseness modeling via judge-style LLMs.
📝 Abstract
Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often continue well past the point where a correct answer has been reached, wasting computation, reducing readability, and encouraging hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. The score is produced by a large language model acting as a judge, enabling dynamic, context-aware feedback that goes beyond simple token counts. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31× on simple problems while improving accuracy by 7%; on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6× fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5× fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length to problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/ConciseRL.
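To make the reward design concrete, here is a minimal sketch of how a judge-derived conciseness score could be combined with answer correctness into a single PPO reward. The exact composition used in the paper is not specified in the abstract, so the gating scheme, the `[0, 1]` score range, and the function names below are illustrative assumptions, not the authors' implementation.

```python
def combined_reward(is_correct: bool, conciseness_score: float) -> float:
    """Fold a judge-assigned conciseness score into a scalar PPO reward.

    Assumptions (not from the paper): the judge LLM is prompted to rate a
    reasoning trace's conciseness on a [0, 1] scale, and correctness gates
    the bonus so the policy is never rewarded for short but wrong answers.
    """
    if not 0.0 <= conciseness_score <= 1.0:
        raise ValueError("conciseness_score must lie in [0, 1]")
    correctness = 1.0 if is_correct else 0.0
    # Correct answers earn a base reward of 1.0 plus a conciseness bonus;
    # incorrect answers earn 0 regardless of how short they are.
    return correctness * (1.0 + conciseness_score)


# Example: a correct, fairly concise trace vs. an incorrect short one.
print(combined_reward(True, 0.8))   # correct and concise -> high reward
print(combined_reward(False, 1.0))  # wrong, brevity does not help
```

Gating on correctness is one simple way to avoid the degenerate policy of emitting near-empty traces; an additive scheme with a weighting coefficient would be another option, though that reintroduces a hyperparameter the paper's score is designed to avoid.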