Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning in LLMs

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the instability of coupled sampling and learning, importance sampling failure, and growing KL divergence caused by network latency in LLM reinforcement-learning post-training across heterogeneous distributed environments, this paper proposes HeteroRL, an asynchronous decentralized framework. HeteroRL decouples sampling from learning across geographically distributed nodes, making deployment robust to high latency. On top of it, the paper introduces Group Expectation Policy Optimization (GEPO), a theoretically grounded algorithm that reduces the variance of importance weights exponentially, mitigating the policy degradation induced by communication delays. Experiments show that HeteroRL incurs less than 3% performance degradation even under an extreme 1800-second inter-node communication latency, significantly outperforming baselines such as GRPO. To the authors' knowledge, this is the first work to empirically validate a scalable, stable, and efficient decentralized RL training paradigm in strongly heterogeneous network settings.
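A minimal sketch of the decoupling idea, assuming a simple queue-based sampler/learner split (illustrative only; the names and mechanics below are not taken from the paper):

```python
import queue
import threading
import time

# Sketch of the decoupling idea: the sampler keeps generating rollouts with
# possibly stale weights while the learner consumes them, so a slow link
# stalls neither side.
rollouts: queue.Queue = queue.Queue(maxsize=64)
weights = {"version": 0}  # stands in for model parameters

def sampler():
    while True:
        v = weights["version"]  # may lag the learner by many update steps
        rollouts.put({"behavior_version": v, "tokens": "..."})
        time.sleep(0.01)        # placeholder for generation cost

def learner():
    for step in range(100):
        batch = rollouts.get()  # trains on whatever has arrived
        # Importance correction between batch["behavior_version"] and the
        # current parameters would go here (this is where GEPO operates).
        weights["version"] += 1  # parameter broadcast is asynchronous too

threading.Thread(target=sampler, daemon=True).start()
learner()
```

Because the queue absorbs latency, a delayed parameter broadcast only widens the staleness gap between the behavior version and the learner's parameters rather than blocking generation; that gap is exactly what the importance-weight correction must bridge.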

📝 Abstract
As single-center computing approaches power constraints, decentralized training is becoming essential. Reinforcement Learning (RL) post-training enhances Large Language Models (LLMs) but faces challenges in heterogeneous distributed environments due to its tightly-coupled sampling-learning alternation. We propose HeteroRL, an asynchronous RL architecture that decouples rollout sampling from parameter learning, enabling robust deployment across geographically distributed nodes under network delays. We identify that latency-induced KL divergence causes importance sampling failure due to high variance. To address this, we propose Group Expectation Policy Optimization (GEPO), which reduces importance weight variance through a refined sampling mechanism. Theoretically, GEPO achieves exponential variance reduction. Experiments show it maintains superior stability over methods like GRPO, with less than 3% performance degradation under 1800-second delays, demonstrating strong potential for decentralized RL in heterogeneous networks.
Problem

Research questions and friction points this paper is trying to address.

Addresses unstable reinforcement learning in heterogeneous, distributed LLM training
Reduces the high importance-weight variance caused by latency-induced policy divergence (see the sketch after this list)
Enables stable decentralized RL under network delays with minimal performance degradation
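The variance point is easy to make concrete. For unit-variance Gaussians q = N(0, 1) and p = N(μ, 1), KL(q‖p) = μ²/2 while Var_q[p/q] = e^{μ²} − 1, so importance-weight variance grows exponentially in the divergence between sampler and learner. A quick Monte Carlo check (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_variance(shift: float, n: int = 200_000) -> float:
    """Variance of importance weights p/q when the behavior policy q = N(0,1)
    has drifted to the target p = N(shift, 1); closed form is exp(shift**2) - 1."""
    x = rng.normal(0.0, 1.0, size=n)       # samples from q
    log_w = (x**2 - (x - shift)**2) / 2.0  # log p(x) - log q(x)
    return np.exp(log_w).var()

for shift in (0.5, 1.0, 1.5, 2.0):
    kl = shift**2 / 2  # KL(q || p) for unit-variance Gaussians
    # For large shifts the Monte Carlo estimate is itself noisy --
    # the same heavy tail that breaks importance sampling at work.
    print(f"KL={kl:.2f}  Var[w]~{weight_variance(shift):.2f}  "
          f"exact={np.exp(shift**2) - 1:.2f}")
```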
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous RL architecture that decouples rollout sampling from parameter learning
Group Expectation Policy Optimization (GEPO) reduces importance-weight variance
Exponential variance reduction through a refined sampling mechanism (a hedged sketch follows this list)
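The paper's exact estimator is not reproduced here, but the name suggests the flavor of the fix: replace each rollout's own behavior likelihood in the denominator with an expectation over the sampled group, so that one vanishingly small π_old cannot blow up its own ratio. A hypothetical sketch (the function names and the specific formula below are assumptions, not the paper's definition):

```python
import math
import torch

def grpo_weights(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Standard per-sample importance ratios pi_new / pi_old (GRPO-style)."""
    return torch.exp(logp_new - logp_old)

def group_expectation_weights(logp_new: torch.Tensor,
                              logp_old: torch.Tensor) -> torch.Tensor:
    """Hypothetical group-expectation variant: all rollouts in a group share
    one denominator, the group-mean behavior likelihood, so a single tiny
    pi_old no longer inflates its own ratio."""
    g = logp_old.shape[-1]
    log_mean_old = torch.logsumexp(logp_old, dim=-1) - math.log(g)
    return torch.exp(logp_new - log_mean_old)

# Toy group of 4 rollouts for one prompt; the last rollout's behavior
# likelihood collapsed because the sampler's weights were badly stale.
logp_old = torch.tensor([-5.0, -5.2, -4.8, -12.0])
logp_new = torch.tensor([-5.1, -5.0, -4.9, -5.5])
print(grpo_weights(logp_new, logp_old))               # last ratio ~e^6.5, about 665
print(group_expectation_weights(logp_new, logp_old))  # all ratios stay O(1)
```

Whatever GEPO's precise form, the mechanical point stands: a shared, averaged denominator bounds individual ratios, which is consistent with the exponential variance reduction the paper claims to prove.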
Han Zhang
Peng Cheng Laboratory
Ruibin Zheng
Guangzhou University
Zexuan Yi
Peng Cheng Laboratory
Hanyang Peng
Peng Cheng Laboratory
Deep Learning · Optimization
Hui Wang
Peng Cheng Laboratory
Yue Yu
Peng Cheng Laboratory