Dynamic Hierarchical Birkhoff-von Neumann Decomposition for All-to-All GPU Communication

📅 2026-02-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses performance bottlenecks in all-to-all communication on two-tier GPU architectures, which pair fast intra-server links with much slower inter-server links. The bottlenecks stem from per-port bandwidth limits and traffic skew across GPUs and NICs. The authors propose a dynamic hierarchical scheduling framework that first balances traffic within each server to reduce GPU/NIC-level skew while preserving aggregate server-to-server demand. It then applies a Birkhoff-von Neumann (BvN) decomposition at the server level and refines the result into GPU-level matchings, which lowers decomposition complexity relative to a flat GPU-level approach. Combining this construction with dynamic frame sizing (DFS) yields an online scheduler with provable queue stability under admissible Poisson arrivals. Simulations show substantial reductions in mean frame length, particularly under server-localized hotspot traffic.
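The balancing step can be illustrated with a minimal sketch. The summary does not specify the paper's local operations, so the code below assumes one simple choice: spread each server-to-server block of the GPU-level demand matrix uniformly over its GPU pairs. The function name `balance_within_servers` and the contiguous GPU indexing (GPU `i` belongs to server `i // gpus_per_server`) are hypothetical.

```python
def balance_within_servers(demand, gpus_per_server):
    """Spread every server-to-server block uniformly over its GPU pairs.

    demand[i][j] is the traffic from GPU i to GPU j. This is an assumed
    balancing rule for illustration, not the paper's exact operation.
    """
    n = len(demand)
    g = gpus_per_server
    if n % g != 0:
        raise ValueError("GPU count must be a multiple of gpus_per_server")
    servers = n // g
    balanced = [[0.0] * n for _ in range(n)]
    for a in range(servers):
        for b in range(servers):
            rows = range(a * g, (a + 1) * g)
            cols = range(b * g, (b + 1) * g)
            # Aggregate demand from server a to server b stays unchanged.
            total = sum(demand[i][j] for i in rows for j in cols)
            share = total / (g * g)
            for i in rows:
                for j in cols:
                    balanced[i][j] = share
    return balanced


# Two servers with two GPUs each; GPU 0 is a hotspot sender to server 1.
skewed = [
    [0, 0, 8, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [0, 0, 0, 0],
]
balanced = balance_within_servers(skewed, gpus_per_server=2)
```

In this sketch the 8 units sent from server 0 to server 1 end up split evenly across the four GPU pairs, so no single port carries the whole load, while the server-level total is the same as before.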

📝 Abstract

All-to-all GPU communication is a critical bottleneck in large-scale training clusters, where completion time is constrained by per-port bandwidth and can be severely impacted by traffic skew across GPUs and network interface cards (NICs). This issue is amplified by the two-tier structure of modern GPU systems, which combine fast intra-server links with much slower inter-server networks. Motivated by recent system observations that highlight the importance of traffic reshaping and hierarchy awareness, we study all-to-all scheduling from an online switching and queueing-theoretic perspective. We propose a dynamic hierarchical Birkhoff-von Neumann (BvN) decomposition framework tailored to two-tier GPU fabrics. At each frame boundary, traffic is first balanced within each server using simple local operations to mitigate micro-level GPU/NIC skew while preserving aggregate server-to-server demand. A hierarchical BvN decomposition is then applied at the server level and refined into GPU-level matchings, significantly reducing decomposition complexity relative to a flat GPU-level approach. By integrating this construction with the dynamic frame sizing (DFS) principle, we obtain an online scheduler with provable stability under admissible Poisson arrivals. Simulations demonstrate substantial reductions in mean frame length, particularly under server-localized hotspot traffic.
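To make the two-level idea concrete, here is a minimal sketch of a textbook BvN decomposition applied at the server level, followed by one possible refinement into GPU-level matchings. It is not the paper's algorithm. It assumes the server-level demand is a non-negative integer matrix already padded so that all row and column sums are equal, and it assumes traffic inside each server pair is uniform, so that cyclic shifts cover every GPU pair. The names `find_perfect_matching`, `bvn_decompose`, and `refine_to_gpu_level` are hypothetical.

```python
def find_perfect_matching(matrix):
    """Kuhn's augmenting-path matching on the positive entries of matrix."""
    n = len(matrix)
    row_of_col = [-1] * n

    def try_assign(row, seen):
        for col in range(n):
            if matrix[row][col] > 0 and not seen[col]:
                seen[col] = True
                if row_of_col[col] == -1 or try_assign(row_of_col[col], seen):
                    row_of_col[col] = row
                    return True
        return False

    for row in range(n):
        if not try_assign(row, [False] * n):
            return None
    perm = [0] * n
    for col, row in enumerate(row_of_col):
        perm[row] = col
    return perm


def bvn_decompose(matrix):
    """Return (weight, permutation) terms whose weighted sum equals matrix.

    Assumes non-negative integers with equal row and column sums.
    """
    n = len(matrix)
    residual = [row[:] for row in matrix]
    terms = []
    while any(value > 0 for row in residual for value in row):
        perm = find_perfect_matching(residual)
        if perm is None:
            raise ValueError("row and column sums must all be equal")
        # The smallest matched entry is served fully by this matching.
        weight = min(residual[i][perm[i]] for i in range(n))
        for i in range(n):
            residual[i][perm[i]] -= weight
        terms.append((weight, perm))
    return terms


def refine_to_gpu_level(server_perm, gpus_per_server):
    """Expand one server-level matching into GPU-level matchings.

    Assumed rule: GPU k of the source server talks to GPU (k + shift) % g
    of the destination server, one matching per shift.
    """
    g = gpus_per_server
    matchings = []
    for shift in range(g):
        gpu_perm = []
        for dst_server in server_perm:
            for k in range(g):
                gpu_perm.append(dst_server * g + (k + shift) % g)
        matchings.append(gpu_perm)
    return matchings


# Three servers; every row and column of the server-level demand sums to 4.
server_demand = [
    [0, 3, 1],
    [2, 0, 2],
    [2, 1, 1],
]
terms = bvn_decompose(server_demand)
gpu_matchings = refine_to_gpu_level(terms[0][1], gpus_per_server=2)
```

The matching search here runs on a matrix with one row per server rather than one row per GPU, which is where a hierarchical scheme saves work compared with decomposing the full GPU-level matrix directly.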

Problem

Research questions and friction points this paper is trying to address.

all-to-all communication

traffic skew

two-tier GPU architecture

bandwidth bottleneck

large-scale training clusters

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical Birkhoff-von Neumann decomposition

all-to-all GPU communication

dynamic frame sizing