Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference

📅 2025-08-29
🤖 AI Summary
Coordinating coarse-grained parallelism (e.g., tensor/pipeline/model parallelism) with fine-grained operator-level sharding dimensions in distributed LLM inference remains challenging, especially for meeting stringent service-level objectives (SLOs). Method: This paper proposes the first reinforcement learning–based joint optimization framework. It employs an attention mechanism to model high-performing historical strategies and integrates elite experience replay to efficiently navigate an ultra-large configuration space—overcoming limitations of static, heuristic-based approaches. Contribution/Results: The framework enables fully automated parallelization policy generation for ultra-large models—including MoE architectures up to 1.6 trillion parameters—on H100 GPU clusters. Experiments demonstrate throughput improvements of 3.5× over meta-heuristic baselines and 1.06× over Megatron’s heuristic policies, significantly enhancing inference performance and SLO compliance under complex hardware topologies.

📝 Abstract
Distributed LLM inference requires careful coordination of parallelization strategies across hundreds to thousands of NPUs to meet production SLOs. Current systems like Megatron-LM rely on static heuristics that separately configure parallelism degrees and per-operator sharding dimensions, leaving significant performance on the table as models scale and hardware topologies diversify. We introduce Learn to Shard, to our knowledge, the first RL-based approach to co-optimize both coarse-grained parallelism degrees and fine-grained per-operator sharding dimensions for distributed LLM inference. Our method employs an attention-based policy over an elite history that learns from high-performing strategies to efficiently navigate the vast combinatorial search space. Evaluated on H100 clusters with MoE models up to 1.6T parameters, Learn to Shard achieves up to 3.5x throughput improvement over metaheuristic baselines and 1.06x over Megatron heuristics.
Problem

Research questions and friction points this paper is trying to address.

Co-optimizing parallelism degrees and sharding dimensions for distributed LLM inference
Overcoming limitations of static heuristics in large-scale model deployment
Navigating combinatorial search space for optimal NPU coordination strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL co-optimizes parallelism degrees and sharding dimensions
Attention-based policy learns from elite strategy history
Navigates combinatorial search space for distributed inference
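The elite-replay search described in the bullets above can be illustrated with a toy loop: keep a small buffer of the best-performing configurations found so far, and sample new candidates biased toward those elites. This is a minimal sketch only; the paper's actual system uses an attention-based policy network over the elite history and measured throughput on H100 clusters, while the cost model, function names, and fitness-weighted sampling here are invented for illustration.

```python
import random

# Candidate parallelism degrees (tensor-parallel, pipeline-parallel).
DEGREES = (1, 2, 4, 8)

def throughput(cfg):
    # Toy stand-in for measured throughput: rewards configurations whose
    # total degree matches an 8-NPU budget, with a mild pipeline penalty.
    tp, pp = cfg
    return 1.0 / (abs(tp * pp - 8) + 0.1 * pp + 1.0)

def mutate(cfg):
    # Perturb one parallelism degree at random.
    tp, pp = cfg
    if random.random() < 0.5:
        tp = random.choice(DEGREES)
    else:
        pp = random.choice(DEGREES)
    return (tp, pp)

def search(steps=200, elite_size=8, seed=0):
    random.seed(seed)
    start = (1, 1)
    elites = [(start, throughput(start))]
    for _ in range(steps):
        # Sample a parent from the elite buffer, biased toward
        # high-throughput strategies (a crude stand-in for the
        # attention-weighted policy in the paper).
        weights = [t for _, t in elites]
        parent = random.choices([c for c, _ in elites], weights=weights)[0]
        child = mutate(parent)
        elites.append((child, throughput(child)))
        # Keep only the top `elite_size` strategies.
        elites.sort(key=lambda e: e[1], reverse=True)
        del elites[elite_size:]
    return elites[0]

best_cfg, best_tp = search()
```

The key design point mirrored here is that exploration is driven by the elite set rather than a uniform random walk, which is what lets the search make progress in a combinatorial configuration space.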
Ruokai Yin
Yale University
Computer Architecture · Domain-specific Acceleration · Deep Learning · Neuromorphic Computing
Sattwik Deb Mishra
Microsoft Azure
Xuan Zuo
Microsoft Azure
Hokchhay Tann
Microsoft Azure
Preyas Shah
Microsoft Azure
Apala Guha
Microsoft Azure