AcOrch: Accelerating Sampling-based GNN Training under CPU-NPU Heterogeneous Environments

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the low hardware utilization and performance bottlenecks in sampling-based graph neural network training on CPU–NPU heterogeneous systems, which arise from the divergent resource demands across multiple execution phases. To overcome these challenges, the authors propose a fine-grained task orchestration framework coupled with a two-level pipelined execution model. This approach jointly optimizes the subgraph sampling, feature gathering, and model training stages, enabling efficient task overlapping and precise mapping both between the CPU and NPU and within the NPU’s internal computing units—specifically the AI Cube and AI Vector units. Evaluated on the Ascend 910B platform, the proposed method achieves an average speedup of 2.31× over the state-of-the-art MindSporeGL system and, for the first time, realizes pipelined parallelism for graph learning tasks across distinct NPU compute units.

📝 Abstract

Graph Neural Networks (GNNs) have achieved remarkable success in various applications. Sampling-based GNN training, which conducts mini-batch training on sampled subgraphs, has become a promising solution for large-scale graphs. Because sampling-based GNN training is resource-intensive, Neural Processing Units (NPUs) such as the Ascend AI processor are an attractive alternative: their high throughput and energy efficiency make them well-suited to GNN workloads. However, sampling-based training is a multi-stage process involving subgraph sampling, feature gathering, and model training, and each stage has different resource requirements and computation volume. Fully utilizing the heterogeneous computation resources of CPUs and NPUs therefore requires careful coordination. In this work, we present AcOrch, a sampling-based GNN training system optimized for CPU-NPU heterogeneous platforms. AcOrch offers fine-grained task orchestration and adopts a two-level pipelined execution model to overlap sampling, gathering, and training. It analyzes the heterogeneous compute features of NPUs and maps tasks to AI Cube (AIC) units, AI Vector (AIV) units, and CPU cores accordingly. Moreover, the two-level pipeline enables overlapping execution not only between the CPU and NPU, but also among different types of compute units within the NPU (e.g., AIC and AIV units), thereby maximizing the utilization of available resources. Experiments on an Ascend 910B AI processor show that AcOrch achieves an average speedup of 2.31x over the state-of-the-art NPU-native graph learning system, MindSporeGL.
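
To make the idea of overlapping sampling, gathering, and training concrete, here is a minimal Python sketch of a staged pipeline. It is not AcOrch's implementation. Each stage runs in its own thread and passes mini-batches through a bounded queue, so one batch can be sampled while an earlier one is being trained on. The toy graph, the feature table, and the names `run_pipeline`, `sample`, `gather`, and `train` are all hypothetical, and nothing here touches real NPU hardware or the AIC/AIV units.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the mini-batch stream


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(seeds, stages, depth=2):
    """Chain stages with bounded queues so that they execute concurrently.

    depth bounds how many batches may wait between two stages, which
    keeps a fast stage from running arbitrarily far ahead of a slow one.
    """
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]

    def feed():
        for seed in seeds:
            queues[0].put(seed)
        queues[0].put(_DONE)

    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results


# Toy data (assumed for illustration): adjacency lists and scalar features.
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
FEATURES = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}


def sample(seed):
    """Stand-in for subgraph sampling: the seed plus its neighbors."""
    return [seed] + GRAPH[seed]


def gather(nodes):
    """Stand-in for feature gathering: look up each node's feature."""
    return [FEATURES[n] for n in nodes]


def train(feats):
    """Stand-in for a training step: here just the mean feature value."""
    return sum(feats) / len(feats)


if __name__ == "__main__":
    print(run_pipeline([0, 1, 2, 3], [sample, gather, train]))
```

Because each stage has a single worker and the queues are FIFO, results come out in seed order. The second pipeline level described in the abstract could be modeled the same way by splitting `train` into further stages, for example a matrix-multiply stage and an element-wise stage. How such stages are actually mapped onto AIC and AIV units is specific to the system and is not shown here.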

Problem

Research questions and friction points this paper is trying to address.

sampling-based GNN training

CPU-NPU heterogeneous environment

resource coordination

multi-stage computation

graph neural networks

Innovation

Methods, ideas, or system contributions that make the work stand out.

sampling-based GNN training

CPU-NPU heterogeneous system

task orchestration