Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning in LLMs

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the instability of coupled sampling and learning, importance sampling failure, and growing KL divergence caused by network latency in LLM reinforcement-learning post-training across heterogeneous distributed environments, this paper proposes HeteroRL, an asynchronous decentralized framework. HeteroRL decouples sampling from learning across geographically distributed nodes, making deployment robust to high latency. On top of it, the paper introduces Group Expectation Policy Optimization (GEPO), a theoretically grounded algorithm that reduces the variance of importance weights exponentially, mitigating the policy degradation induced by communication delays. Experiments show that HeteroRL incurs less than 3% performance degradation even under an extreme 1800-second inter-node communication latency, significantly outperforming baselines such as GRPO. To the authors' knowledge, this is the first work to empirically validate a scalable, stable, and efficient decentralized RL training paradigm in strongly heterogeneous network settings.
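A minimal sketch of the decoupling idea, assuming a simple queue-based sampler/learner split (illustrative only; the names and mechanics below are not taken from the paper):

```python
import queue
import threading
import time

# Sketch of the decoupling idea: the sampler keeps generating rollouts with
# possibly stale weights while the learner consumes them, so a slow link
# stalls neither side.
rollouts: queue.Queue = queue.Queue(maxsize=64)
weights = {"version": 0}  # stands in for model parameters

def sampler():
    while True:
        v = weights["version"]  # may lag the learner by many update steps
        rollouts.put({"behavior_version": v, "tokens": "..."})
        time.sleep(0.01)        # placeholder for generation cost

def learner():
    for step in range(100):
        batch = rollouts.get()  # trains on whatever has arrived
        # Importance correction between batch["behavior_version"] and the
        # current parameters would go here (this is where GEPO operates).
        weights["version"] += 1  # parameter broadcast is asynchronous too

threading.Thread(target=sampler, daemon=True).start()
learner()
```

Because the queue absorbs latency, a delayed parameter broadcast only widens the staleness gap between the behavior version and the learner's parameters rather than blocking generation; that gap is exactly what the importance-weight correction must bridge.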

📝 Abstract
As single-center computing approaches power constraints, decentralized training is becoming essential. Reinforcement Learning (RL) post-training enhances Large Language Models (LLMs) but faces challenges in heterogeneous distributed environments due to its tightly-coupled sampling-learning alternation. We propose HeteroRL, an asynchronous RL architecture that decouples rollout sampling from parameter learning, enabling robust deployment across geographically distributed nodes under network delays. We identify that latency-induced KL divergence causes importance sampling failure due to high variance. To address this, we propose Group Expectation Policy Optimization (GEPO), which reduces importance weight variance through a refined sampling mechanism. Theoretically, GEPO achieves exponential variance reduction. Experiments show it maintains superior stability over methods like GRPO, with less than 3% performance degradation under 1800-second delays, demonstrating strong potential for decentralized RL in heterogeneous networks.
Problem

Research questions and friction points this paper is trying to address.

Addresses unstable reinforcement learning in heterogeneous, distributed LLM training
Reduces the high importance-weight variance caused by latency-induced policy divergence (see the sketch after this list)
Enables stable decentralized RL under network delays with minimal performance degradation
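The variance point is easy to make concrete. For unit-variance Gaussians q = N(0, 1) and p = N(μ, 1), KL(q‖p) = μ²/2 while Var_q[p/q] = e^{μ²} − 1, so importance-weight variance grows exponentially in the divergence between sampler and learner. A quick Monte Carlo check (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_variance(shift: float, n: int = 200_000) -> float:
    """Variance of importance weights p/q when the behavior policy q = N(0,1)
    has drifted to the target p = N(shift, 1); closed form is exp(shift**2) - 1."""
    x = rng.normal(0.0, 1.0, size=n)       # samples from q
    log_w = (x**2 - (x - shift)**2) / 2.0  # log p(x) - log q(x)
    return np.exp(log_w).var()

for shift in (0.5, 1.0, 1.5, 2.0):
    kl = shift**2 / 2  # KL(q || p) for unit-variance Gaussians
    # For large shifts the Monte Carlo estimate is itself noisy --
    # the same heavy tail that breaks importance sampling at work.
    print(f"KL={kl:.2f}  Var[w]~{weight_variance(shift):.2f}  "
          f"exact={np.exp(shift**2) - 1:.2f}")
```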
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous RL architecture that decouples rollout sampling from parameter learning
Group Expectation Policy Optimization (GEPO) reduces importance-weight variance
Exponential variance reduction through a refined sampling mechanism (a hedged sketch follows this list)
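The paper's exact estimator is not reproduced here, but the name suggests the flavor of the fix: replace each rollout's own behavior likelihood in the denominator with an expectation over the sampled group, so that one vanishingly small π_old cannot blow up its own ratio. A hypothetical sketch (the function names and the specific formula below are assumptions, not the paper's definition):

```python
import math
import torch

def grpo_weights(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Standard per-sample importance ratios pi_new / pi_old (GRPO-style)."""
    return torch.exp(logp_new - logp_old)

def group_expectation_weights(logp_new: torch.Tensor,
                              logp_old: torch.Tensor) -> torch.Tensor:
    """Hypothetical group-expectation variant: all rollouts in a group share
    one denominator, the group-mean behavior likelihood, so a single tiny
    pi_old no longer inflates its own ratio."""
    g = logp_old.shape[-1]
    log_mean_old = torch.logsumexp(logp_old, dim=-1) - math.log(g)
    return torch.exp(logp_new - log_mean_old)

# Toy group of 4 rollouts for one prompt; the last rollout's behavior
# likelihood collapsed because the sampler's weights were badly stale.
logp_old = torch.tensor([-5.0, -5.2, -4.8, -12.0])
logp_new = torch.tensor([-5.1, -5.0, -4.9, -5.5])
print(grpo_weights(logp_new, logp_old))               # last ratio ~e^6.5, about 665
print(group_expectation_weights(logp_new, logp_old))  # all ratios stay O(1)
```

Whatever GEPO's precise form, the mechanical point stands: a shared, averaged denominator bounds individual ratios, which is consistent with the exponential variance reduction the paper claims to prove.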
Han Zhang
Peng Cheng Laboratory
Ruibin Zheng
Guangzhou University
Zexuan Yi
Peng Cheng Laboratory
Hanyang Peng
Peng Cheng Laboratory
Deep Learning · Optimization
Hui Wang
Peng Cheng Laboratory
Yue Yu
Peng Cheng Laboratory