AI Summary
In multi-GPU computing on consumer-grade GPUs, inter-GPU communication bottlenecks severely limit computational efficiency; existing computation-communication overlap strategies fail to simultaneously achieve fine-grained concurrency, zero computational interference, and communication-primitive agnosticism. This paper proposes a lightweight tile-wise signaling mechanism that enables efficient computation-communication overlap solely via standard NCCL APIs, without modifying the underlying communication library or compute kernels. Leveraging dependency-aware scheduling and memory reordering, the approach achieves fine-grained, primitive-agnostic overlap while preserving original compute performance. To the best of the authors' knowledge, this is the first method to support arbitrary NCCL communication primitives with fine-grained overlap without compromising computational throughput. Experimental evaluation on consumer-grade GPUs demonstrates a speedup of up to 1.65×, outperforming existing overlap techniques in most cases.
Abstract
Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency is an effective technique for mitigating the communication overhead. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives. Nevertheless, current designs fail to simultaneously optimize for all of those features. To address the issue, we propose FlashOverlap, a lightweight design characterized by tile-wise overlapping, interference-free computation, and communication agnosticism. FlashOverlap utilizes a novel signaling mechanism to identify tile-wise data dependency without interrupting the computation process, and reorders data to contiguous addresses, enabling communication by simply calling NCCL APIs. Experiments show that such a lightweight design achieves up to 1.65x speedup, outperforming existing works in most cases.
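The signaling-plus-reordering idea can be illustrated with a small CPU-only simulation. This is a minimal sketch under stated assumptions, not FlashOverlap's implementation: the function `overlap_by_tile_groups`, its parameters, and the `communicate` callback (standing in for an NCCL call) are all hypothetical names introduced here. Tiles finish in an arbitrary order, each finished tile is appended to a contiguous staging buffer, a counter plays the role of the signal, and a communication call is issued on a contiguous slice once enough tiles are ready.

```python
from typing import Callable, Dict, List, Tuple


def overlap_by_tile_groups(
    tiles: List[List[float]],
    completion_order: List[int],
    group_size: int,
    communicate: Callable[[List[float]], List[float]],
) -> Tuple[List[List[float]], List[Tuple[int, int]]]:
    """Simulate tile-wise signaling with data reordering (illustrative only).

    tiles[i] holds the output of tile i; completion_order lists tile ids in
    the order a hypothetical compute kernel finishes them. communicate is a
    stand-in for a collective and must return a list of the same length.
    """
    tile_len = len(tiles[0])
    staging: List[float] = []           # contiguous buffer, tiles in completion order
    slot_of_tile: Dict[int, int] = {}   # tile id -> slot index in the staging buffer
    calls: List[Tuple[int, int]] = []   # (first_slot, end_slot) of each communication
    finished = 0                        # counter standing in for the tile signal
    group_start = 0                     # first slot not yet communicated

    for tile_id in completion_order:
        # Reordering: the tile lands in the next free contiguous slot.
        slot_of_tile[tile_id] = finished
        staging.extend(tiles[tile_id])
        finished += 1

        # Signal check: a full group (or the final partial group) is ready.
        if finished - group_start == group_size or finished == len(tiles):
            lo, hi = group_start * tile_len, finished * tile_len
            staging[lo:hi] = communicate(staging[lo:hi])
            calls.append((group_start, finished))
            group_start = finished

    # Undo the reordering so the caller sees tiles in their original order.
    restored = [
        staging[slot_of_tile[t] * tile_len:(slot_of_tile[t] + 1) * tile_len]
        for t in range(len(tiles))
    ]
    return restored, calls
```

Because every group occupies one contiguous slice, the communication step needs only a start offset and a length, which is why an unmodified collective call can be used on it. In a real system the loop body would run concurrently with the compute kernel; here it is sequential for clarity.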