GraphGen+: Advancing Distributed Subgraph Generation and Graph Learning On Industrial Graphs

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training on trillion-edge industrial graphs faces two challenges: online distributed subgraph sampling is bottlenecked by single-machine performance, while offline precomputation incurs prohibitive storage and I/O overheads. This paper proposes the first co-scheduling architecture that jointly optimizes subgraph generation and in-memory graph learning. Built on a distributed in-memory computing framework, the architecture integrates topology-aware sampling, pipelined subgraph construction, and asynchronous gradient synchronization, enabling fully in-memory, distributed, real-time subgraph generation without external storage and eliminating precomputation entirely. Experiments show 27× higher subgraph-generation throughput than SQL-based methods and 1.3× that of GraphGen, support for per-iteration training on million-node graphs, and zero I/O overhead.
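The pipelined subgraph construction the summary mentions can be illustrated with a standard producer-consumer pattern: a sampler thread fills a bounded queue with subgraphs while the learner consumes them, so generation overlaps with training instead of blocking it. This is a minimal sketch of the general technique, not the paper's implementation; `generate_subgraph` and `train_step` are hypothetical stand-ins for the distributed sampler and learner.

```python
import queue
import threading

def pipelined_training(generate_subgraph, train_step, num_batches, depth=4):
    """Overlap subgraph generation with training via a bounded queue.

    A producer thread generates subgraphs ahead of time; the bounded
    queue (maxsize=depth) keeps it from running arbitrarily far ahead
    of the consumer, capping in-flight memory.
    """
    q = queue.Queue(maxsize=depth)

    def producer():
        for i in range(num_batches):
            q.put(generate_subgraph(i))  # blocks while the queue is full
        q.put(None)  # sentinel: no more batches

    t = threading.Thread(target=producer)
    t.start()

    losses = []
    while True:
        subgraph = q.get()  # blocks until the next subgraph is ready
        if subgraph is None:
            break
        losses.append(train_step(subgraph))
    t.join()
    return losses
```

In a real system the producer side would itself be distributed across workers; the queue here stands in for whatever in-memory channel connects samplers to trainers.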

📝 Abstract
Graph-based computations are crucial in a wide range of applications, where graphs can scale to trillions of edges. To enable efficient training on such large graphs, mini-batch subgraph sampling is commonly used, which allows training without loading the entire graph into memory. However, existing solutions face significant trade-offs: online subgraph generation, as seen in frameworks like DGL and PyG, is limited to a single machine, resulting in severe performance bottlenecks, while offline precomputed subgraphs, as in GraphGen, improve sampling efficiency but introduce large storage overhead and high I/O costs during training. To address these challenges, we propose **GraphGen+**, an integrated framework that synchronizes distributed subgraph generation with in-memory graph learning, eliminating the need for external storage while significantly improving efficiency. GraphGen+ achieves a **27×** speedup in subgraph generation compared to conventional SQL-like methods and a **1.3×** speedup over GraphGen, supporting training on 1 million nodes per iteration and removing the overhead associated with precomputed subgraphs, making it a scalable and practical solution for industry-scale graph learning.
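The mini-batch subgraph sampling the abstract refers to typically works by expanding a set of seed nodes hop by hop, drawing at most a fixed fanout of neighbors per node at each hop (as in DGL's and PyG's neighbor samplers). The sketch below shows the general technique in plain Python; it is illustrative only and not GraphGen+'s actual sampler, and the function and parameter names are assumptions.

```python
import random

def sample_subgraph(adj, seeds, fanouts, rng=None):
    """k-hop neighbor sampling for one mini-batch.

    adj:     dict mapping node -> list of neighbor nodes
    seeds:   nodes whose labels this mini-batch will train on
    fanouts: max neighbors drawn per node at each hop, e.g. [10, 5]
    Returns the sampled node set and edge list.
    """
    rng = rng or random.Random(0)
    nodes = set(seeds)
    edges = []
    frontier = list(seeds)
    for fanout in fanouts:
        next_frontier = []
        for u in frontier:
            nbrs = adj.get(u, [])
            # keep all neighbors if there are few, otherwise subsample
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            for v in picked:
                edges.append((u, v))
                if v not in nodes:
                    nodes.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return nodes, edges
```

Because only the sampled subgraph (not the full graph) is materialized per iteration, training memory scales with batch size and fanout rather than with total graph size, which is what makes trillion-edge training feasible at all.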
Problem

Research questions and friction points this paper is trying to address.

Efficient distributed subgraph generation for large-scale graphs
Reducing storage and I/O overhead in graph learning
Improving performance bottlenecks in existing graph frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed subgraph generation synchronization
In-memory graph learning integration
Eliminates external storage overhead