🤖 AI Summary
Distributed ML at scale (thousands of GPUs) faces a critical bottleneck: tail latency in collective communication. Conventional RDMA transports (e.g., RoCE) rely heavily on retransmission and strict in-order delivery, resulting in high protocol complexity, tail-latency sensitivity, and pipeline stalls. This work proposes a domain-specific RDMA NIC architecture tailored for ML workloads, the first to explicitly exploit ML's intrinsic tolerance to packet loss and out-of-order delivery. It eliminates retransmission and rigid ordering, instead adopting adaptive-timeout-driven, best-effort, out-of-order transmission. Reliability recovery is delegated to the ML layer (via techniques such as Hadamard transforms and erasure coding), enabling a lightweight, resilient NIC design. The architecture remains fully compatible with standard congestion control mechanisms (DCQCN, EQDS, Swift). Evaluations on Hyperstack and CloudLab demonstrate a 2.0x speedup in training time-to-accuracy (TTA), 1.6x higher inference throughput, a 3.5x reduction in 99th-percentile latency, 2.7x lower BRAM utilization, and nearly doubled fault tolerance.
📝 Abstract
As distributed machine learning (ML) workloads scale to thousands of GPUs connected by high-speed interconnects, tail latency in collective communication has become a major bottleneck. Existing RDMA transports, such as RoCE, IRN, SRNIC, and Falcon, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While these approaches work well for general-purpose workloads, they introduce complexity and latency that scale poorly in ML, where even rare packet delays can stall entire model pipelines.
We present OptiNIC, a domain-specific RDMA transport that revisits traditional reliability guarantees in light of ML's tolerance for partial or missing data. OptiNIC eliminates retransmissions and in-order delivery from the NIC, enabling a best-effort, out-of-order transport model for RDMA. Unlike traditional RDMA, which signals completion only after complete data delivery, OptiNIC introduces adaptive timeouts to trigger forward progress when data may be lost or delayed. OptiNIC retains standard congestion control mechanisms (e.g., DCQCN, EQDS, or Swift) while shifting loss recovery to the ML pipeline itself (e.g., via Hadamard transforms and erasure coding).
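To make the transport model concrete, here is a minimal, hypothetical sketch (not OptiNIC's actual implementation, and all names are illustrative) of the two ideas the abstract describes: a receiver that accepts chunks out of order, signals completion once an adaptive timeout fires even if data is missing, and then recovers a single lost chunk with XOR parity, the simplest erasure code. The paper's Hadamard-transform path and real timeout-adaptation policy are not modeled here.

```python
# Illustrative sketch only: out-of-order receive with a timeout-triggered
# completion, plus XOR-parity erasure recovery of at most one lost chunk.
import time
from functools import reduce

CHUNKS = 8  # data chunks per message; the sender also sends one XOR parity chunk


def xor_parity(chunks):
    """Byte-wise XOR of equal-length chunks (the parity chunk)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)


def receive(arrivals, timeout_s=0.001):
    """Collect out-of-order chunks until all arrive or the timeout fires.

    `arrivals` yields (index, payload) pairs; index CHUNKS marks the parity
    chunk. Returns the reassembled message, reconstructing one lost chunk
    from parity if needed (no retransmission is ever requested).
    """
    got = {}
    deadline = time.monotonic() + timeout_s
    for idx, payload in arrivals:
        got[idx] = payload
        if all(i in got for i in range(CHUNKS)):
            break  # complete: no recovery needed
        if time.monotonic() > deadline:
            break  # timeout: stop waiting, hand off to erasure recovery
    missing = [i for i in range(CHUNKS) if i not in got]
    if len(missing) == 1 and CHUNKS in got:
        # XOR of the parity chunk with every received data chunk
        # reproduces the single missing chunk.
        got[missing[0]] = xor_parity(list(got.values()))
    return b"".join(got[i] for i in range(CHUNKS))


# Example: chunk 3 is "lost" in flight but recovered from parity.
data = [bytes([i]) * 4 for i in range(CHUNKS)]
pkts = [(i, c) for i, c in enumerate(data) if i != 3]
pkts.append((CHUNKS, xor_parity(data)))
assert receive(iter(pkts)) == b"".join(data)
```

The point of the sketch is the division of labor the abstract claims: the NIC-side logic never retransmits or reorders, and correctness for a bounded amount of loss is restored above it by coding, which is what keeps the hardware design lightweight.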
Our evaluation shows that OptiNIC improves time-to-accuracy (TTA) by 2x for training and increases throughput by 1.6x for inference across two public clouds (i.e., Hyperstack and CloudLab). OptiNIC also lowers 99th-percentile latency by 3.5x, cuts BRAM usage by 2.7x, and nearly doubles NIC fault resilience, delivering a tail-optimized RDMA transport purpose-built for distributed ML workloads.