🤖 AI Summary
Reasoning language models (e.g., o1, DeepSeek-R1) improve performance via lengthy chain-of-thought reasoning but suffer from output redundancy and low intelligence density per token. Method: We propose DLER, a reinforcement learning training recipe integrating batch-wise reward normalization, a higher PPO clipping ratio, dynamic sampling, and a truncation-based length penalty to mitigate advantage estimation bias, entropy collapse, and sparse reward signals. Furthermore, we introduce difficulty-aware adaptive truncation and update-selective model merging to jointly optimize inference efficiency and accuracy. Contribution/Results: Across multiple benchmarks, DLER-7B reduces output length by over 70% while surpassing baseline accuracy; in parallel test-time scaling, it generates multiple concise responses with 28% higher accuracy than DeepSeek-R1-7B at lower latency. This represents the first substantial Pareto improvement on the accuracy-efficiency frontier for reasoning LMs.
📝 Abstract
Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token (accuracy relative to response length) remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signals. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, a higher clipping ratio, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
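The two core ingredients named above, a truncation length penalty and batch-wise reward normalization, can be sketched as follows. This is a minimal illustration under assumed simplifications (a binary correctness reward zeroed out past a token budget, and normalization over the whole batch rather than per prompt group); the function names and the exact reward form are illustrative, not taken from the paper.

```python
import numpy as np

def truncation_reward(correct: bool, length: int, max_len: int) -> float:
    """Simple truncation length penalty (assumed form): a response that
    exceeds the token budget receives zero reward even if it is correct."""
    return float(correct) if length <= max_len else 0.0

def batch_normalized_advantages(rewards):
    """Batch-wise reward normalization: center and scale rewards using the
    mean and std of the entire batch, rather than within each prompt group,
    to reduce bias in the advantage estimates."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four sampled responses to prompts in one batch.
rewards = [
    truncation_reward(correct=True,  length=900,  max_len=1024),  # concise, correct
    truncation_reward(correct=True,  length=2000, max_len=1024),  # correct but truncated
    truncation_reward(correct=False, length=500,  max_len=1024),  # wrong
    truncation_reward(correct=True,  length=800,  max_len=1024),  # concise, correct
]
advantages = batch_normalized_advantages(rewards)
```

Under this scheme, long-but-correct responses receive the same zero reward as wrong ones, which is exactly the sparse-signal problem the abstract mentions; the batch-wise normalization is one of the optimization fixes DLER pairs with it.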