AMPED: Accelerating MTTKRP for Billion-Scale Sparse Tensor Decomposition on Multiple GPUs

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the memory and throughput bottlenecks of the Matricized Tensor Times Khatri–Rao Product (MTTKRP), the computational core of large-scale sparse tensor decomposition, this paper proposes AMPED, a multi-GPU parallel algorithm targeting tensors with billions of nonzeros. AMPED combines fine-grained sparse tensor tiling and cross-GPU data partitioning with a dynamic load-balancing scheme to mitigate inter-GPU load imbalance and idle waiting. Evaluated on real-world billion-scale sparse tensors on a single node with 4 GPUs, AMPED achieves a 5.1× geometric mean speedup in total execution time over state-of-the-art GPU baselines, enabling efficient single-node decomposition of sparse tensors that exceed the memory and compute limits of a single GPU.

📝 Abstract
Matricized Tensor Times Khatri-Rao Product (MTTKRP) is the computational bottleneck in sparse tensor decomposition. As real-world sparse tensors grow to billions of nonzeros, they increasingly demand higher memory capacity and compute throughput from hardware accelerators. In this work, we present AMPED, a multi-GPU parallel algorithm designed to accelerate MTTKRP on billion-scale sparse tensors. AMPED scales beyond the limits of a single GPU, meeting both the memory and performance requirements of large-scale workloads. We introduce a partitioning strategy combined with a dynamic load balancing scheme to distribute computation and minimize GPU idle time. On real-world billion-scale tensors, AMPED achieves a 5.1x geometric mean speedup in total execution time over state-of-the-art GPU baselines using 4 GPUs on a single CPU node.
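For readers unfamiliar with the operation being accelerated, here is a minimal single-threaded Python sketch of mode-0 MTTKRP on a third-order sparse tensor in COO format. The function name and data layout are illustrative assumptions, not the paper's implementation; AMPED's contribution is distributing and balancing exactly this kind of nonzero-wise accumulation across GPUs.

```python
import numpy as np

def mttkrp_mode0(coords, vals, num_rows, B, C):
    """Mode-0 MTTKRP for a sparse I x J x K tensor in COO form.

    For each nonzero x[i, j, k], accumulate
        M[i, :] += x[i, j, k] * (B[j, :] * C[k, :])
    where B (J x R) and C (K x R) are the factor matrices and
    * is the elementwise (Hadamard) product of rows.
    """
    R = B.shape[1]
    M = np.zeros((num_rows, R))
    for (i, j, k), v in zip(coords, vals):
        M[i, :] += v * B[j, :] * C[k, :]
    return M
```

Every nonzero contributes one rank-R row update, which is why the cost scales with the nonzero count and why balancing nonzeros across GPUs is the key scheduling problem.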
Problem

Research questions and friction points this paper is trying to address.

Accelerating MTTKRP for billion-scale sparse tensors
Scaling beyond single GPU memory and performance limits
Reducing GPU idle time via dynamic load balancing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-GPU parallel algorithm for MTTKRP acceleration
Partitioning strategy with dynamic load balancing
Scales beyond single GPU memory limits
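The partitioning-plus-load-balancing idea can be illustrated with a generic greedy scheduler that balances nonzero counts across GPUs. This is a sketch of the general technique only; the abstract does not specify AMPED's actual partitioning or scheduling algorithm, so the tile granularity and greedy policy here are assumptions.

```python
import heapq

def greedy_partition(tile_nnz_counts, num_gpus):
    """Assign each tensor tile to a GPU, largest tiles first, always
    picking the currently least-loaded GPU (longest-processing-time
    greedy heuristic).  Generic sketch, not AMPED's actual scheme."""
    # Min-heap of (current_nnz_load, gpu_id).
    heap = [(0, g) for g in range(num_gpus)]
    heapq.heapify(heap)
    assignment = [None] * len(tile_nnz_counts)
    # Place heavy tiles first to keep final loads close together.
    order = sorted(range(len(tile_nnz_counts)),
                   key=lambda t: tile_nnz_counts[t], reverse=True)
    for t in order:
        load, g = heapq.heappop(heap)
        assignment[t] = g
        heapq.heappush(heap, (load + tile_nnz_counts[t], g))
    return assignment
```

A static assignment like this bounds the imbalance up front; a dynamic scheme, as AMPED uses, can additionally steal or reassign work at runtime to minimize GPU idle time.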