Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Finetuning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
GRPO suffers from pervasive entropy collapse during reinforcement finetuning, leading to diminished exploration and premature policy convergence; existing entropy regularization methods provide limited mitigation while introducing bias and training instability. To address this, we propose Arbitrary Entropy Policy Optimization (AEPO), the first method enabling stable, precise control of policy entropy at arbitrary target levels in LLM reinforcement finetuning. Methodologically, AEPO abandons conventional entropy rewards and instead employs a user-specified target distribution (e.g., a temperature-scaled softmax) as the reference for REINFORCE gradient regularization, establishing a unified framework that jointly regularizes policy gradients, output distributions, and REINFORCE updates. Experiments demonstrate that AEPO eliminates entropy collapse, maintains target entropy with high fidelity, uncovers a non-monotonic relationship between entropy and task performance, and consistently outperforms GRPO and other baselines across diverse reasoning benchmarks.
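For illustration, a minimal sketch of the temperature-scaled softmax target mentioned above, built from the policy's own logits; the temperature value, function name, and tensor shapes are assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def temperature_scaled_target(logits: torch.Tensor, tau: float = 1.5) -> torch.Tensor:
    """Re-softmax the policy logits at temperature tau (assumed hyperparameter).
    tau > 1 flattens the distribution (raises entropy); tau < 1 sharpens it."""
    return F.softmax(logits / tau, dim=-1)

logits = torch.randn(2, 8, 32000)            # toy (batch, seq_len, vocab) logits
target = temperature_scaled_target(logits)   # reference distribution for regularization
```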

📝 Abstract
Reinforcement finetuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLMs), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
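To make the REINFORCE-as-regularization idea concrete, here is a hedged sketch of one way a regularizer toward a temperature-adjusted distribution could be added to a GRPO-style loss. The pseudo-reward construction, function names, and the weight `beta` are illustrative assumptions; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def reinforce_entropy_regularizer(logits, response_ids, response_mask, tau=1.5):
    """Sketch (assumed form): score each sampled token by how much more likely it is
    under the temperature-adjusted target than under the current policy, and use that
    score as a REINFORCE-style reward on the token's log-probability."""
    log_probs = F.log_softmax(logits, dim=-1)                      # log pi(token | context)
    with torch.no_grad():
        target_log_probs = F.log_softmax(logits / tau, dim=-1)     # log of target distribution
        tok_target = target_log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
        tok_policy = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
        pseudo_reward = tok_target - tok_policy                    # treated as a fixed return
    tok_logp = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    # REINFORCE-style surrogate: minimizing this maximizes reward-weighted log-probability.
    reg = -(pseudo_reward * tok_logp * response_mask).sum() / response_mask.sum().clamp(min=1)
    return reg

# Assumed usage alongside an existing GRPO loss:
# total_loss = grpo_loss + beta * reinforce_entropy_regularizer(logits, ids, mask, tau)
```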
Problem

Research questions and friction points this paper is trying to address.

Addresses entropy collapse in reinforcement finetuning of language models
Establishes connection between entropy control, exploration, and model performance
Enables arbitrary entropy stabilization without distorting optimization objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces entropy bonuses with a REINFORCE policy gradient computed on temperature-adjusted distributions
Stabilizes entropy at arbitrary target levels through temperature regulation (see the sketch after this list)
Enables precise entropy control without distorting the optimization objective
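The temperature-regulation point above implies picking a temperature that realizes a requested entropy level. Because the entropy of a tempered softmax increases monotonically with temperature, a simple bisection can map a target entropy to a temperature. The sketch below is an assumed procedure for doing so, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, tau: float) -> float:
    """Mean per-token entropy of softmax(logits / tau), in nats."""
    log_p = F.log_softmax(logits / tau, dim=-1)
    return float(-(log_p.exp() * log_p).sum(dim=-1).mean())

def temperature_for_target_entropy(logits, target_entropy, lo=0.05, hi=20.0, iters=40):
    """Bisection: entropy grows with temperature, so search for the tau whose
    tempered distribution matches the requested mean entropy (assumed bounds)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_token_entropy(logits, mid) < target_entropy:
            lo = mid            # distribution too peaked -> raise temperature
        else:
            hi = mid            # distribution too flat -> lower temperature
    return 0.5 * (lo + hi)

logits = torch.randn(2, 8, 32000)                                  # toy policy logits
tau = temperature_for_target_entropy(logits, target_entropy=2.0)   # e.g., 2 nats per token
```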
🔎 Similar Papers
2024-07-09 · Neural Information Processing Systems · Citations: 3

Authors
Chen Wang (College of Software, Nankai University)
Zhaochun Li (Zhongguancun Academy)
Jionghao Bai (School of Automation, Beijing Institute of Technology)
Yuzhi Zhang (College of Software, Nankai University)
Shisheng Cui (Professor, School of Automation, Beijing Institute of Technology)
Zhou Zhao (Zhejiang University)
Yue Wang (Zhongguancun Academy)