GAPO: Group Adaptive Policy Optimization for Real-World Code Edit

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In real-world code editing tasks, reinforcement learning post-training often suffers from skewed reward distributions and outlier interference, leading to distorted advantage estimation and unstable policy optimization. To address this, we propose GAPO, a robust advantage estimation method built on adaptive highest-density interval (HDI) sampling: it filters each group's trajectories to an outlier-free HDI and computes the Q-value as the median of that interval rather than the group mean. GAPO is critic-free, plug-and-play, and integrates three synergistic mechanisms: grouped relative advantage estimation, adaptive outlier filtering, and median-centered value aggregation. Evaluated on 51,844 real-world editing tasks across 10 programming languages, GAPO consistently improves accuracy across nine models ranging from 3B to 14B parameters, significantly outperforming both GRPO and its variant DAPO.

📝 Abstract
Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO. Code is publicly available.
Problem

Research questions and friction points this paper is trying to address.

Addresses skewed reward distributions in code editing
Handles unpredictable outliers in advantage computation
Improves reinforcement learning for real-world code tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptively finds outlier-free highest-density interval per prompt
Uses median of interval as adaptive Q for advantage calculation
Robustly handles skewed reward distributions in code editing
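The bullets above describe one computation: per prompt, find the shortest (highest-density) interval covering most of the group's rewards, drop samples outside it, and use the inliers' median as the adaptive Q baseline in the group-relative advantage. A minimal NumPy sketch of that idea, with assumed details (the function names, the 80% interval mass, and normalizing by the inlier standard deviation are illustrative choices, not taken from the paper):

```python
import numpy as np

def highest_density_interval(rewards, mass=0.8):
    """Shortest contiguous window of sorted rewards containing `mass` of the samples."""
    r = np.sort(np.asarray(rewards, dtype=float))
    n = len(r)
    k = max(1, int(np.ceil(mass * n)))          # samples the interval must cover
    widths = r[k - 1:] - r[: n - k + 1]         # width of every k-sample window
    i = int(np.argmin(widths))                  # narrowest window = highest density
    return r[i], r[i + k - 1]

def gapo_advantages(rewards, mass=0.8, eps=1e-8):
    """Group-relative advantages with an HDI-filtered, median-centered baseline."""
    r = np.asarray(rewards, dtype=float)
    lo, hi = highest_density_interval(r, mass)
    inliers = r[(r >= lo) & (r <= hi)]          # adaptive outlier filtering
    q = np.median(inliers)                      # median as adaptive Q, not group mean
    scale = inliers.std() + eps
    return (r - q) / scale
```

For a reward group like [0.0, 0.1, 0.1, 0.2, 5.0], the 5.0 outlier falls outside the HDI, so the baseline stays near the median 0.1 instead of being dragged toward the group mean of 1.08, which is the distortion GRPO-style mean-centering suffers under skew.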
Jianqing Zhang
Shanghai Jiao Tong University
Zhezheng Hao
Zhejiang University
Wei Xia
Tencent
Hande Dong
Tencent
machine learning, data mining, NLP
Hong Wang
Tencent
Chenxing Wei
Shenzhen University
NLP
Yuyan Zhou
Tencent
Yubin Qi
Peking University
Qiang Lin
University of Rochester
nonlinear photonics, quantum photonics, mechanical photonics
Jian Cao
Shanghai Jiao Tong University