Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the underexplored performance impact of Reverse Address Translation (RAT) in multi-GPU systems, particularly for small, latency-sensitive collective communication. By integrating a detailed Link MMU/TLB model, with OMNeT++ as the network backend, into an extended ASTRA-sim, the study quantifies translation-induced delays in All-to-All communication patterns across multi-node GPU clusters. The analysis shows that cold TLB misses can increase communication latency for small messages by up to 1.4×. To mitigate this overhead, the paper proposes two optimizations: a pre-translation kernel and software-guided TLB prefetching. These techniques aim to hide translation latency, improving throughput and scalability for inference workloads.
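
To make the cold-miss effect concrete, below is a minimal C++ sketch of a direct-mapped Link TLB that charges a fixed penalty for each NPA-to-SPA fill. It is an illustration only, not the paper's simulator: the entry count, page size, and latency values are all assumptions.

```cpp
#include <cstdint>
#include <cstddef>
#include <iostream>
#include <vector>

struct LinkTlb {
    static constexpr uint64_t kPageShift   = 16;   // assumed 64 KiB fabric pages
    static constexpr long     kHitLatency  = 20;   // assumed hit cost, ns
    static constexpr long     kMissLatency = 600;  // assumed fill/walk cost, ns

    std::vector<uint64_t> tags;   // one tag per direct-mapped set
    std::vector<bool>     valid;

    explicit LinkTlb(std::size_t entries) : tags(entries, 0), valid(entries, false) {}

    // Latency (ns) to translate one Network Physical Address to a
    // System Physical Address, installing the mapping on a miss.
    long translate(uint64_t npa) {
        uint64_t vpn = npa >> kPageShift;
        std::size_t set = vpn % tags.size();
        if (valid[set] && tags[set] == vpn) return kHitLatency;
        tags[set]  = vpn;   // cold or conflict miss: fill the entry
        valid[set] = true;
        return kMissLatency;
    }
};

int main() {
    LinkTlb tlb(64);
    // A small All-to-All message touches only a few pages, so the first
    // pass is dominated by cold misses; the second pass hits a warm TLB.
    for (int pass = 0; pass < 2; ++pass) {
        long total = 0;
        for (uint64_t page = 0; page < 8; ++page)
            total += tlb.translate(page << LinkTlb::kPageShift);
        std::cout << "pass " << pass << ": " << total << " ns\n";
    }
    return 0;
}
```

Because the miss penalty is paid once per page, a one-shot small collective pays it on nearly every access, while a large or repeated collective amortizes it over a warmed TLB, mirroring the up-to-1.4× gap the study reports.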

📝 Abstract
Distributed ML workloads rely heavily on collective communication across multi-GPU, multi-node systems. Emerging scale-up fabrics, such as NVLink and UALink, enable direct memory access across nodes but introduce a critical destination-side translation step: translating Network Physical Addresses (NPAs) to System Physical Addresses (SPAs), which we term Reverse Address Translation (RAT). Despite its importance, the performance impact of RAT remains poorly understood. In this work, we present the first systematic study of RAT in large-scale GPU clusters. Using an extended ASTRA-sim framework with OMNeT++ as the network backend, we model Link MMUs and Link TLBs and evaluate their effect on All-to-All collective communication across varying input sizes and GPU counts. Our analysis shows that cold TLB misses dominate latency for small, latency-sensitive collectives, causing up to 1.4× performance degradation, while larger collectives benefit from warmed caches and experience diminishing returns from oversized TLBs. Based on these observations, we propose two avenues for optimization: fused pre-translation kernels that overlap RAT with computation, and software-guided TLB prefetching that proactively populates likely-needed entries. These techniques aim to hide translation latency, particularly for small collectives, improving throughput and scalability for inference workloads. Our study establishes a foundation for designing efficient destination-side translation mechanisms in large-scale multi-GPU systems.
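
Of the two proposed avenues, software-guided TLB prefetching is the easier to picture in host code. The sketch below is a hypothetical illustration under assumed names: `linkTlbPrefetch` is an invented runtime hook (the abstract names no NVLink/UALink API for this), and the 64 KiB fabric page size is likewise assumed.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical runtime hook: ask the destination node to translate (and
// cache) one NPA -> SPA mapping without moving any payload data.
void linkTlbPrefetch(int dstRank, uint64_t npa) {
    // Stub: a real runtime would issue a translation-only fabric request.
    std::printf("prefetch rank %d, npa 0x%llx\n",
                dstRank, static_cast<unsigned long long>(npa));
}

constexpr uint64_t kPageSize = 64 * 1024;  // assumed fabric page size

struct RemoteBuffer { int dstRank; uint64_t npa; uint64_t bytes; };

// Before launching the collective, touch every page each peer will write
// so translations complete while earlier compute is still running.
void prewarmLinkTlbs(const std::vector<RemoteBuffer>& bufs) {
    for (const RemoteBuffer& b : bufs)
        for (uint64_t off = 0; off < b.bytes; off += kPageSize)
            linkTlbPrefetch(b.dstRank, b.npa + off);
}

int main() {
    // Two small All-to-All destination buffers, one page each.
    prewarmLinkTlbs({{1, 0x10000, 4096}, {2, 0x20000, 4096}});
    return 0;
}
```

Issuing translation-only requests ahead of the collective lets the fills overlap with preceding computation, so the destination Link TLBs are already warm when the All-to-All traffic arrives, which is exactly the regime where cold misses otherwise dominate.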
Problem

Research questions and friction points this paper is trying to address.

Reverse Address Translation
Multi-GPU Systems
Collective Communication
Translation Overhead
Distributed ML
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse Address Translation
Link TLB
Multi-GPU Scale-Up
Collective Communication
TLB Prefetching
Amel Fatima
University of Virginia
Tuan Ta
AMD Research and Advanced Development
Computer Architecture
Bradford M. Beckmann
AMD Research