HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the high computational cost of chain-of-thought (CoT) reasoning in large language models. Existing compression methods rely on manually specified length budgets and complex multi-stage training procedures, and they scale poorly beyond small models. The authors propose HMPO, a single-stage reinforcement learning framework that combines an adaptive median-length budget, a cosine-decayed token reward, and a multiplicative reward formulation. The multiplicative form mitigates reward hacking by prioritizing answer correctness, and the adaptive budget removes the need to hand-tune a length limit. Evaluated on dense and mixture-of-experts (MoE) models ranging from 9B to 122B parameters, HMPO compresses CoT tokens by 19%–46% with negligible accuracy loss and reduces training cost relative to multi-stage baselines. Although trained only on mathematical data, it generalizes across math, code, science, and instruction-following tasks.

📝 Abstract

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%–46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
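
To make the three components named in the abstract concrete, here is a minimal Python sketch of how they might fit together for one group of rollouts. It is not the authors' implementation: the abstract gives no formulas, so the function names, the choice to start the decay at the budget, and the point where the reward reaches zero (twice the budget) are all assumptions made for illustration.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length over the correct rollouts of one prompt.

    Returns None when no rollout is correct (assumed handling).
    """
    ok = [n for n, c in zip(lengths, correct) if c]
    return median(ok) if ok else None


def cosine_length_reward(length, budget):
    """Length reward in [0, 1].

    Assumed schedule: 1.0 at or below the budget, then a half-cosine
    decay that reaches 0.0 at twice the budget.
    """
    if length <= budget:
        return 1.0
    if length >= 2 * budget:
        return 0.0
    progress = (length - budget) / budget  # strictly between 0 and 1
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def hybrid_reward(length, is_correct, budget):
    """Multiplicative reward: correctness gates the length term.

    An incorrect rollout scores 0.0 however short it is, so shortening
    the output cannot compensate for a wrong answer.
    """
    if not is_correct or budget is None:
        return 0.0
    return 1.0 * cosine_length_reward(length, budget)


# Hypothetical group of four rollouts for a single prompt.
lengths = [100, 200, 300, 50]
correct = [True, True, True, False]
budget = median_budget(lengths, correct)
rewards = [hybrid_reward(n, c, budget) for n, c in zip(lengths, correct)]
```

In this sketch the budget is recomputed per prompt from the current policy's own correct rollouts, so it shrinks as the policy learns shorter solutions and no fixed length limit has to be chosen by hand. The short but incorrect rollout receives zero reward, which is the property the abstract attributes to the multiplicative formulation.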

Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought compression

inference overhead

manual length budgets

multi-stage training

scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Compression

Reinforcement Learning

Adaptive Length Budget