RapidGNN: Communication Efficient Large-Scale Distributed Training of Graph Neural Networks

📅 2025-05-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high communication overhead and memory pressure of distributed training for large-scale Graph Neural Networks (GNNs), this paper proposes a co-optimization framework built on deterministic pre-sampling. It models mini-batch feature access patterns to enable precise remote feature caching and prefetching, and introduces, for the first time, a deterministic precomputation mechanism that preserves training stochasticity while significantly reducing the frequency and latency of remote data fetches. The method integrates deterministic graph sampling, feature access modeling, distributed cache scheduling, and prefetching into a lightweight, efficient training framework. Evaluated on Reddit and OGBN-Products, it improves end-to-end training throughput by up to 2.45× (2.10× on average), reduces remote feature accesses by more than 4×, and lowers energy consumption by up to 23%.


📝 Abstract
Graph Neural Networks (GNNs) have achieved state-of-the-art (SOTA) performance in diverse domains. However, training GNNs on large-scale graphs poses significant challenges due to high memory demands and substantial communication overhead in distributed settings. Traditional sampling-based approaches mitigate the computation load to some extent but often fail to address the communication inefficiencies inherent in distributed environments. This paper presents RapidGNN, which introduces a deterministic sampling strategy to precompute mini-batches. By leveraging this sampling strategy, RapidGNN accurately anticipates feature access patterns, enabling optimal cache construction and timely prefetching of remote features. This reduces the frequency and latency of remote data transfers without compromising the stochastic nature of training. Evaluations on the Reddit and OGBN-Products datasets demonstrate that RapidGNN achieves significant reductions in training time and remote feature fetches, outperforming existing models in both communication efficiency and throughput. Our findings highlight RapidGNN's potential for scalable, high-performance GNN training on large, real-world graph datasets, along with improved energy efficiency. RapidGNN improves end-to-end training throughput by 2.10x on average over the SOTA GraphSAGE-METIS baseline (up to 2.45x in some settings), while cutting remote feature fetches by over 4x. It also reduces energy consumption by up to 23%.
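The abstract's core idea, precomputing mini-batches with a seeded deterministic sampler so that feature accesses are known ahead of time while each epoch still draws a fresh random sample, can be sketched as follows. The function names and the seed-derivation scheme here are illustrative assumptions, not RapidGNN's actual API.

```python
import random

def precompute_batch_seeds(base_seed, num_epochs, batches_per_epoch):
    # Derive one seed per (epoch, batch) from a single base seed.
    # The whole schedule is fixed before training starts, so a
    # cache/prefetch planner can replay any batch's sampling in advance.
    rng = random.Random(base_seed)
    return [[rng.randrange(2**32) for _ in range(batches_per_epoch)]
            for _ in range(num_epochs)]

def sample_neighbors(adj, node, fanout, seed):
    # Deterministic given the seed: replaying with the same seed
    # (e.g. during prefetching) reproduces the exact neighbor set.
    rng = random.Random(seed)
    neighbors = adj[node]
    if len(neighbors) <= fanout:
        return sorted(neighbors)
    return sorted(rng.sample(neighbors, fanout))
```

Because seeds differ across batches and epochs, training still sees stochastic mini-batches; because the entire seed schedule is derivable from `base_seed`, the remote features a future batch will touch can be identified and fetched before that batch runs.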
Problem

Research questions and friction points this paper is trying to address.

Reducing communication overhead in distributed GNN training
Optimizing feature access patterns for efficient caching
Improving scalability and energy efficiency in large-scale GNNs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deterministic sampling for mini-batch precomputation
Optimal cache construction via feature access prediction
Reduced remote data transfers with timely prefetching
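Since the mini-batches are precomputed, the planner knows exactly how often each remote node's features will be requested, and can cache the hottest ones. A minimal sketch of such frequency-based cache construction follows; the names and the simple hit-count policy are assumptions for illustration, not the paper's exact scheduling algorithm.

```python
from collections import Counter

def build_cache_plan(precomputed_batches, local_nodes, cache_capacity):
    # Count how often each remote node's features will be fetched
    # across all precomputed mini-batches, then cache the hottest ones.
    counts = Counter()
    for batch in precomputed_batches:
        for node in batch:
            if node not in local_nodes:
                counts[node] += 1
    return {node for node, _ in counts.most_common(cache_capacity)}

def remote_fetches(precomputed_batches, local_nodes, cache):
    # Remote fetches remaining after caching: cached nodes are
    # served locally, everything else still crosses the network.
    return sum(1 for batch in precomputed_batches
               for node in batch
               if node not in local_nodes and node not in cache)
```

For example, if remote node 9 appears in most batches and node 8 only once, a capacity-1 cache holds node 9 and eliminates the bulk of remote traffic, mirroring the paper's reported 4x reduction in remote feature fetches.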
Arefin Niam
Tennessee Technological University
Artificial Intelligence · High Performance Computing · Systems+ML
M. S. Q. Z. Nine
Department of Computer Science, Tennessee Technological University