Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

📅 2025-10-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In reinforcement learning with verifiable rewards (RLVR), large language models suffer from entropy collapse—a sharp decline in policy diversity—leading to an exploration-exploitation imbalance and degraded generalization. Existing entropy regularization methods operate through opaque mechanisms, indirectly modulating advantages or token probabilities, which limits their efficacy and makes them prone to failure. This work is the first to quantitatively characterize the root cause of entropy collapse through the lens of token-level entropy dynamics. The authors propose STEER, a reweighting framework that directly stabilizes entropy evolution: it employs entropy-change-aware, fine-grained loss reweighting and gradient adjustment to adaptively balance exploration and exploitation during training. Experiments demonstrate that STEER significantly mitigates entropy collapse, improves performance on downstream tasks—including mathematical reasoning—and enhances training stability, validating the effectiveness of directly regulating entropy dynamics.

📝 Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) can enhance LLM reasoning, its training process poses a critical risk: entropy collapse. This phenomenon is a rapid loss of policy diversity, stemming from an exploration-exploitation imbalance and leading to poor generalization. Recent entropy-intervention methods aim to prevent entropy collapse, yet their underlying mechanisms remain unclear. In this paper, we conduct a quantitative analysis to reveal token-level entropy changes and how existing entropy-intervention methods help avoid entropy collapse. Our findings point out a fundamental limitation of existing methods: they attempt to control entropy dynamics indirectly. By only affecting related factors, such as the advantage signal and generation probability, their effectiveness is inherently limited and can fail. To address this limitation, we introduce an entropy-change-aware reweighting scheme, namely Stabilizing Token-level Entropy-changE via Reweighting (STEER), that adaptively stabilizes entropy dynamics through fine-grained token-level adjustments. Our approach mitigates over-exploitation while fostering robust exploration. Extensive experiments demonstrate that STEER significantly mitigates entropy collapse, stabilizes entropy dynamics, and achieves stronger downstream performance across various mathematical reasoning benchmarks. (Our code is available at https://github.com/zz-haooo/STEER.)

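The quantity the abstract analyzes — token-level entropy — is just the Shannon entropy of the policy's next-token distribution at each position. A minimal sketch (this illustrates the measured quantity, not STEER itself):

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy of the next-token distribution at each position.

    logits: array of shape (seq_len, vocab_size).
    Returns per-token entropies of shape (seq_len,).
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # H = -sum_v p(v) log p(v); clipping avoids log(0).
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)

# Entropy collapse: a policy concentrating mass on one token drives H toward 0,
# while a uniform policy over V tokens has the maximum entropy log(V).
sharp = token_entropy(np.array([[10.0, 0.0, 0.0, 0.0]]))
flat = token_entropy(np.array([[0.0, 0.0, 0.0, 0.0]]))
```

Tracking how this per-token entropy evolves across training steps is what the paper calls token-level entropy dynamics.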
Problem

Research questions and friction points this paper is trying to address.

Addresses entropy collapse in the RLVR training process
Analyzes the limitations of indirect entropy-intervention methods
Proposes a direct token-level entropy-stabilization technique
Innovation

Methods, ideas, or system contributions that make the work stand out.

Directly stabilizes token-level entropy dynamics
Uses adaptive reweighting for fine-grained adjustments
Mitigates over-exploitation while fostering robust exploration
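The reweighting idea above can be sketched as follows. Note this is a hypothetical illustration of an entropy-change-aware weight map, not the paper's actual STEER rule; the smooth exponential form and the `tau` temperature are assumptions for the sketch:

```python
import numpy as np

def entropy_aware_weights(delta_entropy, tau=1.0):
    """Hypothetical entropy-change-aware token weights (not the paper's exact rule).

    delta_entropy: per-token estimate of how each token's gradient contribution
    changes policy entropy (negative = pushes entropy down, i.e. over-exploitation).
    Entropy-shrinking tokens are downweighted and exploratory tokens upweighted,
    so the reweighted per-token loss stabilizes entropy dynamics.
    """
    w = np.exp(np.clip(delta_entropy / tau, -5.0, 5.0))  # smooth monotone map
    return w / w.mean()  # normalize so the average loss scale is unchanged

dH = np.array([-0.3, -0.1, 0.0, 0.2])  # per-token entropy-change estimates
w = entropy_aware_weights(dH)
# The policy-gradient loss would then be aggregated as (w * token_loss).mean().
```

The key contrast with indirect methods is that the weight here is a direct function of the estimated entropy change itself, rather than of correlated signals such as advantages or token probabilities.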
Zhezheng Hao
Zhejiang University
Hong Wang
University of Science and Technology of China
Haoyang Liu
University of Science and Technology of China
Jian Luo
University of California San Diego
Materials Science, Ceramics, Grain Boundary
Jiarui Yu
USTC
Multimodal, Computer Vision
Hande Dong
Tencent
Machine Learning, Data Mining, NLP
Qiang Lin
University of Rochester
Nonlinear Photonics, Quantum Photonics, Mechanical Photonics
Can Wang
Zhejiang University
Jiawei Chen
Zhejiang University