🤖 AI Summary
In LLM reinforcement-learning post-training over heterogeneous distributed environments, network latency destabilizes the tight sampling-learning coupling, increases the KL divergence between the sampling and learning policies, and causes importance sampling to fail. This paper proposes HeteroRL, an asynchronous decentralized framework that decouples sampling and learning across geographically distributed nodes, enabling deployment that is robust to high latency. It further introduces Group Expectation Policy Optimization (GEPO), a theoretically grounded algorithm that exponentially reduces the variance of importance weights, thereby mitigating the policy degradation induced by communication delays. Experiments show that HeteroRL incurs less than 3% performance degradation even under an extreme 1800-second inter-node communication latency, significantly outperforming baselines such as GRPO. To the authors' knowledge, this work is the first to empirically validate a scalable, stable, and efficient decentralized RL training paradigm in strongly heterogeneous network settings.
📝 Abstract
As single-center computing approaches its power limits, decentralized training is becoming essential. Reinforcement Learning (RL) post-training enhances Large Language Models (LLMs), but its tightly coupled sampling-learning alternation makes it difficult to run in heterogeneous distributed environments. We propose HeteroRL, an asynchronous RL architecture that decouples rollout sampling from parameter learning, enabling robust deployment across geographically distributed nodes under network delays. We identify that latency-induced KL divergence between the sampling and learning policies inflates the variance of importance weights, causing importance sampling to fail. To address this, we propose Group Expectation Policy Optimization (GEPO), which reduces importance-weight variance through a refined sampling mechanism; theoretically, GEPO achieves exponential variance reduction. Experiments show it maintains superior stability over methods such as GRPO, with less than 3% performance degradation under 1800-second delays, demonstrating strong potential for decentralized RL in heterogeneous networks.
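The failure mode described above can be made concrete with a toy Gaussian example. This is an illustration of the variance blow-up, not the paper's GEPO algorithm: the stale sampling policy is modeled as q = N(0, 1), the current learner policy as p = N(delta, 1), and delta stands in for the policy gap that accumulates during communication delay. For this pair, Var(w) of the importance weight w = p/q is exactly exp(delta^2) - 1, i.e. exponential in the squared gap, which is the instability GEPO's exponential variance reduction is meant to counteract.

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_variance(delta, n=200_000):
    """Empirical variance of importance weights w = p(x)/q(x) for
    q = N(0, 1) (stale sampler) and p = N(delta, 1) (current learner).
    Analytically Var(w) = exp(delta**2) - 1, with KL(p || q) = delta**2 / 2,
    so the variance grows exponentially with the policy gap."""
    x = rng.normal(0.0, 1.0, n)                # rollouts drawn from the stale policy q
    w = np.exp(x * delta - delta**2 / 2.0)     # closed-form ratio p(x)/q(x)
    return w.var()

for d in (0.5, 1.0, 1.5):
    print(f"delta={d:.1f}  KL={d*d/2:.2f}  "
          f"Var(w) empirical={weight_variance(d):.2f}  "
          f"theory={np.exp(d*d) - 1:.2f}")
```

Even modest policy drift (delta = 1.5, KL ≈ 1.1) already pushes the weight variance above 8, so single-sample importance weights become dominated by rare outliers; this is why an aggregation scheme over groups of samples, as in GEPO, is needed once latency makes large KL divergence unavoidable.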