Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Finetuning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
GRPO suffers from pervasive entropy collapse during reinforcement finetuning, leading to diminished exploration and premature policy convergence; existing entropy regularization methods provide limited mitigation while introducing bias and training instability. To address this, we propose Arbitrary Entropy Policy Optimization (AEPO), the first method enabling stable, precise control of policy entropy at arbitrary target levels in LLM reinforcement finetuning. Methodologically, AEPO abandons conventional entropy rewards and instead employs a user-specified target distribution (e.g., a temperature-scaled softmax) as the reference for REINFORCE gradient regularization, establishing a unified framework that jointly regularizes policy gradients, output distributions, and REINFORCE updates. Experiments demonstrate that AEPO eliminates entropy collapse, maintains target entropy with high fidelity, uncovers a non-monotonic relationship between entropy and task performance, and consistently outperforms GRPO and other baselines across diverse reasoning benchmarks.
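For illustration, a minimal sketch of the temperature-scaled softmax target mentioned above, built from the policy's own logits; the temperature value, function name, and tensor shapes are assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def temperature_scaled_target(logits: torch.Tensor, tau: float = 1.5) -> torch.Tensor:
    """Re-softmax the policy logits at temperature tau (assumed hyperparameter).
    tau > 1 flattens the distribution (raises entropy); tau < 1 sharpens it."""
    return F.softmax(logits / tau, dim=-1)

logits = torch.randn(2, 8, 32000)            # toy (batch, seq_len, vocab) logits
target = temperature_scaled_target(logits)   # reference distribution for regularization
```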

📝 Abstract
Reinforcement finetuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLMs), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
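To make the REINFORCE-as-regularization idea concrete, here is a hedged sketch of one way a regularizer toward a temperature-adjusted distribution could be added to a GRPO-style loss. The pseudo-reward construction, function names, and the weight `beta` are illustrative assumptions; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def reinforce_entropy_regularizer(logits, response_ids, response_mask, tau=1.5):
    """Sketch (assumed form): score each sampled token by how much more likely it is
    under the temperature-adjusted target than under the current policy, and use that
    score as a REINFORCE-style reward on the token's log-probability."""
    log_probs = F.log_softmax(logits, dim=-1)                      # log pi(token | context)
    with torch.no_grad():
        target_log_probs = F.log_softmax(logits / tau, dim=-1)     # log of target distribution
        tok_target = target_log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
        tok_policy = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
        pseudo_reward = tok_target - tok_policy                    # treated as a fixed return
    tok_logp = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    # REINFORCE-style surrogate: minimizing this maximizes reward-weighted log-probability.
    reg = -(pseudo_reward * tok_logp * response_mask).sum() / response_mask.sum().clamp(min=1)
    return reg

# Assumed usage alongside an existing GRPO loss:
# total_loss = grpo_loss + beta * reinforce_entropy_regularizer(logits, ids, mask, tau)
```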
Problem

Research questions and friction points this paper is trying to address.

Addresses entropy collapse in reinforcement finetuning of language models
Establishes connection between entropy control, exploration, and model performance
Enables arbitrary entropy stabilization without distorting optimization objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces entropy bonuses with a REINFORCE policy gradient computed on temperature-adjusted distributions
Stabilizes entropy at arbitrary target levels through temperature regulation (see the sketch after this list)
Enables precise entropy control without distorting the optimization objective
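The temperature-regulation point above implies picking a temperature that realizes a requested entropy level. Because the entropy of a tempered softmax increases monotonically with temperature, a simple bisection can map a target entropy to a temperature. The sketch below is an assumed procedure for doing so, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, tau: float) -> float:
    """Mean per-token entropy of softmax(logits / tau), in nats."""
    log_p = F.log_softmax(logits / tau, dim=-1)
    return float(-(log_p.exp() * log_p).sum(dim=-1).mean())

def temperature_for_target_entropy(logits, target_entropy, lo=0.05, hi=20.0, iters=40):
    """Bisection: entropy grows with temperature, so search for the tau whose
    tempered distribution matches the requested mean entropy (assumed bounds)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_token_entropy(logits, mid) < target_entropy:
            lo = mid            # distribution too peaked -> raise temperature
        else:
            hi = mid            # distribution too flat -> lower temperature
    return 0.5 * (lo + hi)

logits = torch.randn(2, 8, 32000)                                  # toy policy logits
tau = temperature_for_target_entropy(logits, target_entropy=2.0)   # e.g., 2 nats per token
```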
🔎 Similar Papers
2024-07-09 · Neural Information Processing Systems · Citations: 3

Authors
Chen Wang (College of Software, Nankai University)
Zhaochun Li (Zhongguancun Academy)
Jionghao Bai (School of Automation, Beijing Institute of Technology)
Yuzhi Zhang (College of Software, Nankai University)
Shisheng Cui (Professor, School of Automation, Beijing Institute of Technology)
Zhou Zhao (Zhejiang University)
Yue Wang (Zhongguancun Academy)