🤖 AI Summary
GRPO suffers from Lazy Likelihood Displacement (LLD): over training, the model's likelihood of generating correct responses increases only marginally, or even decreases, because GRPO penalizes every token of an incorrect response with equal strength, producing harmful negative-gradient interference.
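For context, here is the group-relative objective that gives rise to the uniform penalty. This is a standard presentation of GRPO (following DeepSeekMath), not an equation from this paper; clipping and KL regularization are omitted for brevity:

```latex
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)
 = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}
   \sum_{t=1}^{|o_i|}\hat{A}_i \,\log \pi_\theta\!\left(o_{i,t}\mid q,\, o_{i,<t}\right)\right],
\qquad
\hat{A}_i = \frac{R_i - \operatorname{mean}\!\left(\{R_j\}_{j=1}^{G}\right)}
                 {\operatorname{std}\!\left(\{R_j\}_{j=1}^{G}\right)}.
\]
```

Because the scalar advantage \(\hat{A}_i\) is shared by all tokens of response \(o_i\), an incorrect response with \(\hat{A}_i < 0\) pushes down the likelihood of each of its tokens equally, which is the uniform penalization blamed here for LLD.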
Method: The work first uncovers the intra-group gradient mechanism underlying LLD, then proposes NTHR (Negative Token Handling via Re-weighting), a token-level adaptive penalty-reweighting method. Exploiting GRPO's group structure, NTHR uses correct responses as anchors and a gradient-sensitivity analysis to identify the LLD-inducing tokens of incorrect responses and attenuate their penalties, in contrast to prior DPO-style methods that apply a single global correction. A sketch of the idea follows.
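The paper's exact reweighting rule is not reproduced here; the PyTorch sketch below is one plausible instantiation. The function name `nthr_token_weights`, the cosine-similarity proxy for gradient sensitivity, and the threshold values are illustrative assumptions, not the published method:

```python
import torch
import torch.nn.functional as F

def nthr_token_weights(neg_feats, anchor_feats, tau=0.5, temp=0.05):
    """Illustrative NTHR-style penalty weights (not the paper's exact rule).

    neg_feats:    (T_neg, d) per-token gradient/feature vectors from an
                  incorrect response in the group.
    anchor_feats: (T_pos, d) per-token vectors pooled from the group's
                  correct responses, used as anchors.

    Tokens of the incorrect response whose gradients align with the
    anchors are those whose penalization also drags down the correct
    responses' likelihood (the LLD mechanism), so their penalty weight
    is shrunk toward zero.
    """
    neg = F.normalize(neg_feats, dim=-1)
    anchors = F.normalize(anchor_feats, dim=-1)
    influence = (neg @ anchors.T).amax(dim=-1)  # (T_neg,) max cosine similarity
    return 1.0 - torch.sigmoid((influence - tau) / temp)

# Usage inside a GRPO step, for one incorrect response (advantage adv < 0):
#   per_token_loss = -adv * token_logprobs                           # uniform penalty
#   per_token_loss = per_token_loss * nthr_token_weights(g_neg, g_pos)  # attenuated
#   loss = per_token_loss.mean()
```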
Contribution/Results: On mathematical reasoning benchmarks, NTHR mitigates LLD and delivers consistent gains across models from 0.5B to 3B parameters, supporting the efficacy and generality of token-level adaptive penalty reweighting.
📝 Abstract
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamics, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
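To make the group structure the abstract relies on concrete, here is the standard GRPO advantage computation in a few lines of PyTorch; the function name `group_advantages` and the toy reward vector are illustrative, not from the paper:

```python
import torch

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage as used by GRPO: standardize the scalar
    rewards of the G responses sampled for one prompt. Every token of
    response i then inherits the single scalar advantage A_hat_i."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# A group where only the first of four sampled responses is correct:
adv = group_advantages([1.0, 0.0, 0.0, 0.0])
# adv[0] > 0, while adv[1:] share one negative value, so every token of
# each incorrect response is penalized with identical strength.
```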