GNStor: Design of GPU-Native High-Performance Remote All-Flash Array

📅 2026-06-03

🤖 AI Summary
This work addresses a performance bottleneck in existing GPU-accelerated all-flash array (AFA) systems: their I/O path is CPU-centric, which incurs CPU-GPU interaction overhead and I/O traffic amplification and limits end-to-end I/O performance. The authors propose GNStor, a GPU-native remote AFA system that lets GPUs access a remote AFA directly, without CPU intervention in the I/O path. It has two components. GNoR is a GPU-centric NVMe over RDMA (NoR) software stack through which GPUs issue NoR I/O requests directly to the SSDs in the remote AFA; it uses atomic-operation-based I/O orchestration and follows the GPU's SIMT execution model. deEngine is a decentralized AFA engine that decomposes AFA-level tasks (e.g., access control and metadata persistence) and integrates them into each SSD's firmware. In the evaluation, GNStor achieves 3.2× higher I/O throughput and reduces application execution time by 31.1% compared to state-of-the-art AFA systems.
📝 Abstract
The GPU has become the leading computing device for a wide range of data-intensive applications. It collaborates tightly with remote all-flash arrays (AFAs) to accommodate ever-expanding datasets, facilitate multi-client data sharing, and guarantee fault tolerance. Although the GPU is the center of computation, all I/O processing in existing GPU-AFA systems is still CPU-centric: the CPU orchestrates remote I/O requests and executes a centralized AFA engine that takes charge of AFA-level functionalities (e.g., access control and metadata persistence). This mismatch incurs substantial CPU-GPU interaction overhead and I/O traffic amplification, compromising end-to-end I/O performance. In this work, we present GNStor, a GPU-native AFA system that enables the GPU to directly access a remote AFA without CPU intervention in the I/O path, thereby fully exploiting the performance of the AFA. Specifically, GNStor first proposes a GPU-centric NVMe over RDMA (NoR) software stack (named GNoR), paving a fast path for GPUs to directly initiate NoR I/O requests to SSDs within the remote AFA. GNoR employs an atomic-operation-based I/O orchestration design and follows the single-instruction-multiple-thread (SIMT) execution model of the GPU, fully exploiting the massive parallelism of GPU architectures. To support essential AFA functionalities in a CPU-bypass I/O path, GNStor further designs deEngine, a decentralized AFA engine that decomposes AFA-level tasks and integrates them into each SSD's firmware, thereby achieving efficient AFA access at low cost. Evaluation results show that GNStor achieves 3.2× higher I/O throughput and reduces application execution time by 31.1% compared to state-of-the-art AFA systems.
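The abstract does not describe how GNoR's atomic-operation-based orchestration works internally, so the following is only a generic illustration of the underlying idea, not GNoR's design. It is a minimal host-side C++ sketch in which many threads reserve distinct slots in a shared submission ring with a single atomic fetch-and-add and no lock. CPU threads stand in for GPU threads, and every name here (`IoRequest`, `SubmissionRing`, `run_demo`) is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical request descriptor; the fields are illustrative only.
struct IoRequest {
    std::uint64_t lba;     // logical block address
    std::uint32_t blocks;  // request length in blocks
    std::uint32_t owner;   // id of the submitting thread
};

// Fixed-size submission ring shared by many threads (assumed structure).
class SubmissionRing {
public:
    explicit SubmissionRing(std::size_t capacity) : slots_(capacity), tail_(0) {}

    // Reserve a slot with one atomic fetch-and-add, then fill it.
    // Each caller receives a unique index, so no lock is required.
    // Returns false once the ring is full (this sketch never wraps around).
    bool submit(const IoRequest& req) {
        std::size_t idx = tail_.fetch_add(1, std::memory_order_relaxed);
        if (idx >= slots_.size()) return false;
        slots_[idx] = req;
        return true;
    }

    // Number of slots that were successfully reserved.
    std::size_t reserved() const {
        std::size_t t = tail_.load();
        return t < slots_.size() ? t : slots_.size();
    }

    const IoRequest& at(std::size_t i) const { return slots_[i]; }

private:
    std::vector<IoRequest> slots_;
    std::atomic<std::size_t> tail_;
};

// Launch `threads` workers that each submit `per_thread` requests and
// return how many distinct requests ended up in the ring.
std::size_t run_demo(std::size_t threads, std::size_t per_thread) {
    const std::size_t total = threads * per_thread;
    SubmissionRing ring(total);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&ring, t, per_thread] {
            for (std::size_t i = 0; i < per_thread; ++i) {
                IoRequest req{static_cast<std::uint64_t>(t * per_thread + i), 1u,
                              static_cast<std::uint32_t>(t)};
                bool ok = ring.submit(req);
                assert(ok);
                (void)ok;  // silence unused-variable warnings under NDEBUG
            }
        });
    }
    for (auto& th : pool) th.join();  // join makes all slot writes visible

    // Every LBA was submitted exactly once, so each must appear exactly once.
    std::vector<bool> seen(total, false);
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < ring.reserved(); ++i) {
        std::uint64_t lba = ring.at(i).lba;
        if (lba < total && !seen[lba]) { seen[lba] = true; ++distinct; }
    }
    return distinct;
}
```

Calling `run_demo(4, 100)` should report 400 distinct requests, since the fetch-and-add hands every submitter its own slot. A real queue would additionally need a publish step (for example a release store or a doorbell write) so that the consumer never reads a slot that has been reserved but not yet filled; that step is omitted here.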
Problem

Research questions and friction points this paper is trying to address.

GPU-native
remote all-flash array
CPU-centric I/O
I/O performance
NVMe over RDMA
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-native storage
NVMe over RDMA
CPU-bypass I/O
decentralized AFA engine
SIMT-based I/O orchestration