Dynamic Hierarchical Birkhoff-von Neumann Decomposition for All-to-All GPU Communication

📅 2026-02-26
🤖 AI Summary
This work addresses the performance bottlenecks in all-to-all communication within two-tier GPU architectures—characterized by fast intra-server and slow inter-server links—caused by port bandwidth constraints and traffic skew. To mitigate these issues, the authors propose a dynamic hierarchical scheduling framework that first balances traffic within each server to alleviate micro-level skew, then leverages a hierarchical Birkhoff–von Neumann (BvN) decomposition to generate efficient matchings. A key innovation is the integration of a dynamic frame-length mechanism with hierarchical traffic shaping, which, for the first time, theoretically guarantees queue stability under such two-tier topologies. The framework features a low-complexity hierarchical BvN decomposition algorithm. Simulations demonstrate that under localized hotspot traffic, the average frame length is significantly reduced, and the system achieves stable support for Poisson arrival traffic.
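The intra-server balancing step can be illustrated with a small sketch. The page does not spell out the paper's local operations, so the code below makes a simplifying assumption: each server-to-server block is spread evenly (and fractionally) over its GPU pairs. The names `server_demand`, `balance_within_servers`, and `max_line_sum` are hypothetical helpers, and the largest row or column sum stands in for the per-port load that bounds the frame length.

```python
from fractions import Fraction


def server_demand(traffic, gpus_per_server):
    """Aggregate a GPU-level traffic matrix into a server-level matrix."""
    g = gpus_per_server
    s = len(traffic) // g
    return [
        [
            sum(traffic[a * g + i][b * g + j] for i in range(g) for j in range(g))
            for b in range(s)
        ]
        for a in range(s)
    ]


def balance_within_servers(traffic, gpus_per_server):
    """Idealized balancing (an assumption, not the paper's exact operations):
    spread each server-to-server block evenly over its g*g GPU pairs."""
    g = gpus_per_server
    demand = server_demand(traffic, g)
    n = len(traffic)
    return [
        [Fraction(demand[r // g][c // g], g * g) for c in range(n)]
        for r in range(n)
    ]


def max_line_sum(matrix):
    """Largest row or column sum, used here as a proxy for per-port load."""
    rows = [sum(row) for row in matrix]
    cols = [sum(col) for col in zip(*matrix)]
    return max(rows + cols)


# Two servers with two GPUs each; all inter-server traffic sits on two GPU pairs.
traffic = [
    [0, 0, 8, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
]
balanced = balance_within_servers(traffic, 2)

skewed_load = max_line_sum(traffic)      # load on the busiest port before balancing
balanced_load = max_line_sum(balanced)   # load on the busiest port after balancing
aggregate_after = server_demand(balanced, 2)
```

In this toy case the busiest port carries 8 units before balancing and 4 after, while the server-to-server totals are unchanged, which is the property the summary attributes to the balancing stage.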

📝 Abstract
All-to-all GPU communication is a critical bottleneck in large-scale training clusters, where completion time is constrained by per-port bandwidth and can be severely impacted by traffic skew across GPUs and network interface cards (NICs). This issue is amplified by the two-tier structure of modern GPU systems, which combine fast intra-server links with much slower inter-server networks. Motivated by recent system observations that highlight the importance of traffic reshaping and hierarchy awareness, we study all-to-all scheduling from an online switching and queueing-theoretic perspective. We propose a dynamic hierarchical Birkhoff–von Neumann (BvN) decomposition framework tailored to two-tier GPU fabrics. At each frame boundary, traffic is first balanced within each server using simple local operations to mitigate micro-level GPU/NIC skew while preserving aggregate server-to-server demand. A hierarchical BvN decomposition is then applied at the server level and refined into GPU-level matchings, significantly reducing decomposition complexity relative to a flat GPU-level approach. By integrating this construction with the dynamic frame sizing (DFS) principle, we obtain an online scheduler with provable stability under admissible Poisson arrivals. Simulations demonstrate substantial reductions in mean frame length, particularly under server-localized hotspot traffic.
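As background for the decomposition step, here is a minimal sketch of a plain (flat) BvN decomposition, not the hierarchical variant proposed in the paper. It assumes a nonnegative integer demand matrix whose rows and columns all sum to the same value, finds a perfect matching on the positive entries with a textbook augmenting-path search, and peels it off with the largest feasible weight. The function names `perfect_matching` and `bvn_decompose` are hypothetical.

```python
def perfect_matching(support):
    """Augmenting-path bipartite matching on a boolean n x n support matrix.
    Returns perm with perm[row] = column, or None if no perfect matching exists."""
    n = len(support)
    match_col = [-1] * n  # match_col[j] is the row currently matched to column j

    def try_row(i, seen):
        for j in range(n):
            if support[i][j] and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            return None
    perm = [0] * n
    for j, i in enumerate(match_col):
        perm[i] = j
    return perm


def bvn_decompose(matrix):
    """Split a nonnegative integer matrix with equal row and column sums
    into a list of (weight, permutation) terms."""
    n = len(matrix)
    residual = [row[:] for row in matrix]
    terms = []
    while any(any(row) for row in residual):
        perm = perfect_matching([[x > 0 for x in row] for row in residual])
        if perm is None:
            raise ValueError("row and column sums must all be equal")
        # The smallest entry on the matching is the longest this matching can run.
        weight = min(residual[i][perm[i]] for i in range(n))
        for i in range(n):
            residual[i][perm[i]] -= weight
        terms.append((weight, perm))
    return terms


# Toy server-level demand: every row and column sums to 3 time slots.
demand = [
    [2, 1, 0],
    [0, 2, 1],
    [1, 0, 2],
]
terms = bvn_decompose(demand)
frame_length = sum(weight for weight, _ in terms)
```

Each term is a matching held for `weight` slots, so the weights add up to the common row sum, which plays the role of the frame length. In the paper's setting this kind of step is described as running on the smaller server-level matrix before being refined into GPU-level matchings; the refinement is not shown here.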
Problem

Research questions and friction points this paper is trying to address.

all-to-all communication
traffic skew
two-tier GPU architecture
bandwidth bottleneck
large-scale training clusters
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical Birkhoff-von Neumann decomposition
all-to-all GPU communication
dynamic frame sizing
two-tier GPU fabric
online scheduling
Yen-Chieh Wu
Institute of Communications Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan, R.O.C.
Cheng-Shang Chang
Distinguished Chair Professor of Electrical Engineering, National Tsing Hua University
Network Science, Performance Evaluation, Networking, Wireless Networks
Duan-Shin Lee
Institute of Communications Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan, R.O.C.
H. Jonathan Chao
Professor of ECE, New York University
Networking