Characterizing the Impact of Congestion in Modern HPC Interconnects

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the performance degradation of collective communication in modern high-performance computing (HPC) interconnects caused by network congestion from heterogeneous workloads. It presents a systematic evaluation of EDR, HDR, and NDR InfiniBand, Cray Slingshot, and emerging Ethernet interconnects under both steady-state and bursty congestion, across multiple system scales and a range of burst patterns. In controlled experiments, the authors emulate bursts of varying intensity, duration, and pause length, combined with collective communication benchmarks, to expose the scale-dependent nature of congestion behavior and its relationship to the bursty communication typical of AI workloads. The findings characterize how each interconnect technology degrades under congestion, providing empirical guidance for designing congestion-control and load-balancing strategies.

📝 Abstract
High-performance computing (HPC) systems increasingly support both scalable AI training and large-scale simulation workloads, and both rely heavily on collective communication operations. On modern supercomputers, however, network congestion has emerged as a major limitation, driven by heterogeneous traffic patterns resulting from diverse workload mixes. As system scale and the number of active users continue to grow, understanding how today's interconnect technologies respond to congestion is essential for establishing realistic performance expectations and informing future system design. This paper presents a comprehensive characterization of congestion behavior across major HPC fabrics: EDR InfiniBand, HDR InfiniBand, NDR InfiniBand, Cray Slingshot, and emerging Ethernet fabrics. These fabrics span high-performance proprietary interconnects as well as adaptive Ethernet-based designs aligned with emerging standards such as Ultra Ethernet. We evaluate their responses to both steady congestion and a wide range of bursty patterns that vary in duration, intensity, and pause length, capturing the bursty communication typical of AI workloads. Our study covers multiple scales, examining how congestion manifests differently as system size increases and identifying scale-dependent behaviors that influence collective performance. By analyzing the challenges that arise under these controlled stress conditions, we aim to provide a practical overview of congestion issues and possible optimizations. The insights derived from this evaluation can guide researchers and HPC architects in designing more effective congestion-control mechanisms and network load-balancing strategies.
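
The abstract describes bursty congestion in terms of three knobs: burst duration, burst intensity, and pause length. The following is a minimal Python sketch of how such a pattern could be parameterized, not the authors' tooling. The names `BurstPattern` and `burst_windows`, and the interpretation of intensity as a fraction of link bandwidth, are assumptions made for illustration. Steady congestion is treated as the special case with a pause length of zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstPattern:
    """Hypothetical description of one on/off congestion pattern."""
    duration_s: float  # length of each burst, in seconds
    pause_s: float     # idle gap between bursts; 0 models steady congestion
    intensity: float   # assumed: fraction of link bandwidth used during a burst

    def __post_init__(self) -> None:
        if self.duration_s <= 0:
            raise ValueError("duration_s must be positive")
        if self.pause_s < 0:
            raise ValueError("pause_s must be non-negative")
        if not 0 < self.intensity <= 1:
            raise ValueError("intensity must be in (0, 1]")

    def duty_cycle(self) -> float:
        """Fraction of time the aggressor traffic is active."""
        return self.duration_s / (self.duration_s + self.pause_s)

    def mean_load(self) -> float:
        """Time-averaged offered load as a fraction of link bandwidth."""
        return self.intensity * self.duty_cycle()


def burst_windows(pattern: BurstPattern, horizon_s: float) -> list[tuple[float, float]]:
    """Return (start, end) times of every burst that begins before horizon_s."""
    period = pattern.duration_s + pattern.pause_s
    windows = []
    k = 0
    while k * period < horizon_s:
        start = k * period
        # Clip the final burst so it does not run past the horizon.
        end = min(start + pattern.duration_s, horizon_s)
        windows.append((start, end))
        k += 1
    return windows


if __name__ == "__main__":
    bursty = BurstPattern(duration_s=0.25, pause_s=0.75, intensity=0.5)
    print(bursty.duty_cycle(), bursty.mean_load())
    print(burst_windows(bursty, horizon_s=2.0))
```

Two patterns with the same mean load can differ sharply in burst duration and pause length, which is why sweeping the three parameters independently says more than a single average-load figure.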
Problem

Research questions and friction points this paper is trying to address.

congestion
HPC interconnects
collective communication
bursty traffic
network performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

congestion characterization
HPC interconnects
bursty traffic
collective communication
multi-scale evaluation