GVPO: Group Variance Policy Optimization for Large Language Model Post-Training

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address training instability in large language model post-training that arises with relative reward scoring, this paper proposes Group Variance Policy Optimization (GVPO). Methodologically, GVPO integrates the analytical solution to KL-constrained reward maximization into a gradient-weighting scheme, unifying policy alignment and optimization stability. This solution is unique, hyperparameter-free, and inherently compatible with arbitrary sampling distributions, thereby overcoming the limitations of on-policy learning and importance sampling. The method further incorporates implicit reward modeling, a group-variance-aware gradient design, and relative reward normalization. Empirical results demonstrate that GVPO significantly outperforms state-of-the-art methods, including GRPO, across multi-task alignment benchmarks, achieving more robust convergence and enhanced training stability while maintaining high efficiency and scalability.

📝 Abstract
Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their practical adoption. To address this challenge, we present Group Variance Policy Optimization (GVPO). GVPO incorporates the analytical solution to KL-constrained reward maximization directly into its gradient weights, ensuring alignment with the optimal policy. The method provides an intuitive physical interpretation: its gradient mirrors the mean squared error between the central distances of the implicit rewards and those of the actual rewards. GVPO offers two key advantages: (1) it guarantees a unique optimal solution, which is exactly the optimum of the KL-constrained reward maximization objective, and (2) it supports flexible sampling distributions that avoid the limitations of on-policy learning and importance sampling. By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.
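For context, the KL-constrained reward maximization objective referenced above has a well-known closed-form optimum, as commonly derived in the RLHF/DPO literature; the sketch below states it in standard notation (β is the KL strength, π_ref the reference policy, Z(x) the partition function). This is the analytical solution that GVPO folds into its gradient weights.

```latex
% KL-constrained reward maximization and its closed-form optimum
\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ R(x, y) \right]
  \;-\; \beta\, \mathrm{KL}\!\left[ \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]
\quad\Longrightarrow\quad
\pi^{*}(y \mid x)
  \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\left( \frac{R(x, y)}{\beta} \right)
```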
Problem

Research questions and friction points this paper is trying to address.

Addresses training instability in the post-training of large language models
Ensures alignment with the optimal policy of KL-constrained reward maximization
Overcomes the limitations of on-policy learning and importance sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

GVPO integrates the analytical solution to KL-constrained reward maximization into its gradient weights
GVPO's gradient mirrors the mean squared error between centered implicit and actual rewards (see the sketch after this list)
GVPO enables flexible sampling distributions, free of on-policy and importance-sampling limitations
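The following is a minimal, hypothetical sketch of a GVPO-style loss for a single prompt with a group of sampled responses, based only on the description above: implicit rewards β·log(π_θ/π_ref), group-centered implicit and actual rewards, and a weighting whose gradient matches the mean squared error between the two centered quantities. The function name, the β default, and the stop-gradient on the weights are illustrative assumptions, not the authors' implementation.

```python
import torch

def gvpo_loss(logps: torch.Tensor, ref_logps: torch.Tensor,
              rewards: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Illustrative GVPO-style loss for one prompt and a group of k sampled responses.

    logps:     (k,) sequence log-probabilities under the current policy (requires grad)
    ref_logps: (k,) sequence log-probabilities under the frozen reference policy
    rewards:   (k,) scalar rewards for the k responses
    beta:      KL-constraint strength (illustrative default)
    """
    # Implicit reward from KL-constrained reward maximization: beta * log(pi_theta / pi_ref).
    implicit = beta * (logps - ref_logps)

    # Center both implicit and actual rewards within the group ("central distances").
    implicit_c = implicit - implicit.mean()
    rewards_c = rewards - rewards.mean()

    # Gradient weights: with the stop-gradient, the policy gradient of this loss matches
    # (up to the factor beta) the gradient of the MSE between centered implicit rewards
    # and centered actual rewards.
    weights = (rewards_c - implicit_c).detach()

    # Score-function form: weighted log-likelihood over the sampled group.
    return -(weights * logps).mean()
```

Because both reward vectors are centered within the group, the detached weights sum to zero, so the score-function gradient coincides (up to β) with the MSE gradient described in the abstract; this is only one consistent reading of that interpretation, not the paper's reference code.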
👥 Authors
Kaichen Zhang
Hong Kong University of Science and Technology (Guangzhou)
Yuzhong Hong
Zuoyebang Education Technology
Junwei Bao
zuoyebang.com // JD.com // MSRA
Hongfei Jiang
Zuoyebang Education Technology
Yang Song
Zuoyebang Education Technology
Dingqian Hong
Zuoyebang Education Technology
Hui Xiong
Senior Scientist, Candela Corporation