🤖 AI Summary
This work addresses the underexplored performance impact of Reverse Address Translation (RAT) in multi-GPU systems, particularly for small-scale, latency-sensitive collective communications. By integrating a detailed Link MMU/TLB model with the Omnet++ network backend into an extended version of ASTRA-sim, this study quantifies translation-induced delays in All-to-All communication patterns across multi-node GPU clusters. The analysis reveals that cold TLB misses can increase communication latency for small messages by up to 1.4×. To mitigate this overhead, the paper proposes two novel optimizations: a pre-translation kernel and software-guided TLB prefetching. These techniques effectively hide translation latency, substantially improving throughput and scalability for inference workloads.
📝 Abstract
Distributed ML workloads rely heavily on collective communication across multi-GPU, multi-node systems. Emerging scale-up fabrics, such as NVLink and UALink, enable direct memory access across nodes but introduce a critical destination-side translation step: translating Network Physical Addresses (NPAs) to System Physical Addresses (SPAs), which we term Reverse Translation (Reverse Address Translation). Despite its importance, the performance impact of Reverse Address Translation remains poorly understood. In this work, we present the first systematic study of Reverse Address Translation in large-scale GPU clusters. Using an extended ASTRA-sim framework with Omnet++ as the network backend, we model Link MMUs and Link TLBs and evaluate their effect on All-to-All collective communication across varying input sizes and GPU counts. Our analysis shows that cold TLB misses dominate latency for small, latency-sensitive collectives, causing up to 1.4x performance degradation, while larger collectives benefit from warmed caches and experience diminishing returns from over sized TLBs. Based on these observations, we propose two avenues for optimization: fused pre-translation kernels that overlap Reverse Address Translation with computation and software-guided TLB prefetching to proactively populate likely-needed entries. These techniques aim to hide translation latency, particularly for small collectives, improving throughput and scalability for inference workloads. Our study establishes a foundation for designing efficient destination-side translation mechanisms in large-scale multi-GPU systems.