🤖 AI Summary
GRPO suffers from Lazy Likelihood Displacement (LLD): over training, the model's likelihood of generating correct responses increases only marginally, or even decreases, because GRPO penalizes every token of an incorrect response with equal strength, producing harmful negative-gradient interference.
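For context, here is the group-relative objective that gives rise to the uniform penalty. This is a standard presentation of GRPO (following DeepSeekMath), not an equation from this paper; clipping and KL regularization are omitted for brevity:

```latex
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)
 = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}
   \sum_{t=1}^{|o_i|}\hat{A}_i \,\log \pi_\theta\!\left(o_{i,t}\mid q,\, o_{i,<t}\right)\right],
\qquad
\hat{A}_i = \frac{R_i - \operatorname{mean}\!\left(\{R_j\}_{j=1}^{G}\right)}
                 {\operatorname{std}\!\left(\{R_j\}_{j=1}^{G}\right)}.
\]
```

Because the scalar advantage \(\hat{A}_i\) is shared by all tokens of response \(o_i\), an incorrect response with \(\hat{A}_i < 0\) pushes down the likelihood of each of its tokens equally, which is the uniform penalization blamed here for LLD.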
Method: The work first uncovers the intra-group gradient mechanism underlying LLD, then proposes NTHR (Negative Token Handling via Re-weighting), a token-level adaptive penalty-reweighting method. Exploiting GRPO's group structure, NTHR uses correct responses as anchors and a gradient-sensitivity analysis to identify the LLD-inducing tokens of incorrect responses and attenuate their penalties, in contrast to prior DPO-style methods that apply a single global correction. A sketch of the idea follows.
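The paper's exact reweighting rule is not reproduced here; the PyTorch sketch below is one plausible instantiation. The function name `nthr_token_weights`, the cosine-similarity proxy for gradient sensitivity, and the threshold values are illustrative assumptions, not the published method:

```python
import torch
import torch.nn.functional as F

def nthr_token_weights(neg_feats, anchor_feats, tau=0.5, temp=0.05):
    """Illustrative NTHR-style penalty weights (not the paper's exact rule).

    neg_feats:    (T_neg, d) per-token gradient/feature vectors from an
                  incorrect response in the group.
    anchor_feats: (T_pos, d) per-token vectors pooled from the group's
                  correct responses, used as anchors.

    Tokens of the incorrect response whose gradients align with the
    anchors are those whose penalization also drags down the correct
    responses' likelihood (the LLD mechanism), so their penalty weight
    is shrunk toward zero.
    """
    neg = F.normalize(neg_feats, dim=-1)
    anchors = F.normalize(anchor_feats, dim=-1)
    influence = (neg @ anchors.T).amax(dim=-1)  # (T_neg,) max cosine similarity
    return 1.0 - torch.sigmoid((influence - tau) / temp)

# Usage inside a GRPO step, for one incorrect response (advantage adv < 0):
#   per_token_loss = -adv * token_logprobs                           # uniform penalty
#   per_token_loss = per_token_loss * nthr_token_weights(g_neg, g_pos)  # attenuated
#   loss = per_token_loss.mean()
```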
Contribution/Results: On mathematical reasoning benchmarks, NTHR mitigates LLD and delivers consistent gains across models from 0.5B to 3B parameters, supporting the efficacy and generality of token-level adaptive penalty reweighting.
📝 Abstract
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamics, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
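To make the group structure the abstract relies on concrete, here is the standard GRPO advantage computation in a few lines of PyTorch; the function name `group_advantages` and the toy reward vector are illustrative, not from the paper:

```python
import torch

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage as used by GRPO: standardize the scalar
    rewards of the G responses sampled for one prompt. Every token of
    response i then inherits the single scalar advantage A_hat_i."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# A group where only the first of four sampled responses is correct:
adv = group_advantages([1.0, 0.0, 0.0, 0.0])
# adv[0] > 0, while adv[1:] share one negative value, so every token of
# each incorrect response is penalized with identical strength.
```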