HAVEN: High-Bandwidth Flash Augmented Vector Engine for Large-Scale Approximate Nearest-Neighbor Search Acceleration

πŸ“… 2026-03-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the bandwidth bottleneck in large-scale approximate nearest neighbor search (ANNS) under high-recall regimes, where full-precision vector re-ranking is essential but constrained by GPU memory capacity, often forcing data to reside in CPU memory or SSDs. To overcome this limitation, the authors propose the first integration of high-bandwidth 3D NAND flash (HBF) directly into the GPU package, enabling on-package residency of billion-scale full-precision vector databases. By co-designing a near-storage search unit with an optimized IVF-PQ pipeline, the architecture eliminates data migration overhead during re-ranking, effectively breaking the traditional memory wall. Compared to conventional GPU-DRAM and GPU-SSD systems, the proposed design achieves up to 20Γ— higher throughput and up to 40Γ— lower latency, simultaneously delivering high recall and high throughput.

πŸ“ Abstract
Retrieval-Augmented Generation (RAG) relies on large-scale Approximate Nearest Neighbor Search (ANNS) to retrieve semantically relevant context for large language models. Among ANNS methods, IVF-PQ offers an attractive balance between memory efficiency and search accuracy. However, achieving high recall requires a reranking stage that fetches full-precision vectors, and billion-scale vector databases must reside in CPU DRAM or on SSDs due to the limited capacity of GPU HBM. This off-GPU data movement introduces substantial latency and throughput degradation. We propose HAVEN, a GPU architecture augmented with High-Bandwidth Flash (HBF), a recently introduced die-stacked 3D NAND technology engineered to deliver terabyte-scale capacity and hundreds of GB/s of read bandwidth. By integrating HBF and a near-storage search unit as an on-package complement to HBM, HAVEN enables the full-precision vector database to reside entirely on-device, eliminating PCIe and DDR bottlenecks during reranking. Through detailed modeling of re-architected 3D NAND subarrays, power-constrained HBF bandwidth, and end-to-end IVF-PQ pipelines, we demonstrate that HAVEN improves reranking throughput by up to 20Γ— and reduces latency by up to 40Γ— across billion-scale datasets compared to GPU-DRAM and GPU-SSD systems. Our results show that HBF-augmented GPUs enable high-recall retrieval at throughput previously achievable only without reranking, offering a promising direction for memory-centric AI accelerators.
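The IVF-PQ search-then-rerank pipeline the abstract refers to can be illustrated with a minimal NumPy sketch. This is not HAVEN's implementation: the coarse centroids and PQ codebooks are sampled from the data rather than k-means-trained, the sizes are toy-scale, and all names (`search`, `nprobe`, `topk_candidates`) are illustrative. The point is the three stages whose last step, fetching full-precision vectors for exact reranking, is what HAVEN moves into on-package flash.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy database: N full-precision vectors of dimension D.
N, D = 2000, 32
db = rng.standard_normal((N, D)).astype(np.float32)

# --- IVF coarse quantizer (centroids sampled from data for simplicity) ---
nlist = 16
centroids = db[rng.choice(N, nlist, replace=False)]
assign = np.argmin(((db[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
inv_lists = [np.flatnonzero(assign == c) for c in range(nlist)]

# --- PQ: split D into M subspaces with ksub codewords each ---
M, ksub = 4, 16
dsub = D // M
codebooks = np.stack([db[rng.choice(N, ksub, replace=False), m * dsub:(m + 1) * dsub]
                      for m in range(M)])          # shape (M, ksub, dsub)
codes = np.empty((N, M), dtype=np.uint8)           # compressed database
for m in range(M):
    diff = db[:, None, m * dsub:(m + 1) * dsub] - codebooks[m][None]
    codes[:, m] = np.argmin((diff ** 2).sum(-1), axis=1)

def search(query, nprobe=4, topk_candidates=64):
    # 1) Probe the nprobe closest inverted lists.
    cdist = ((centroids - query) ** 2).sum(-1)
    probed = np.argsort(cdist)[:nprobe]
    cand = np.concatenate([inv_lists[c] for c in probed])
    # 2) Asymmetric distance computation (ADC) via per-subspace lookup tables.
    lut = np.stack([((codebooks[m] - query[m * dsub:(m + 1) * dsub]) ** 2).sum(-1)
                    for m in range(M)])            # shape (M, ksub)
    approx = lut[np.arange(M)[:, None], codes[cand].T].sum(0)
    shortlist = cand[np.argsort(approx)[:topk_candidates]]
    # 3) Rerank the shortlist with full-precision vectors (the bandwidth-bound
    #    step that forces off-GPU fetches in conventional systems).
    exact = ((db[shortlist] - query) ** 2).sum(-1)
    return shortlist[np.argmin(exact)]

query = rng.standard_normal(D).astype(np.float32)
nn = search(query)
# Brute-force ground truth: what reranking is trying to recover.
truth = int(np.argmin(((db - query) ** 2).sum(-1)))
```

In this sketch, step 2 touches only the compact PQ codes, while step 3 reads full-precision rows; it is that second access pattern whose volume grows with the candidate count and recall target, motivating terabyte-scale on-package storage.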
Problem

Research questions and friction points this paper is trying to address.

Approximate Nearest Neighbor Search
Reranking
GPU Memory Bottleneck
Large-Scale Vector Retrieval
Memory-Centric Acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-Bandwidth Flash
Near-Storage Processing
IVF-PQ
Reranking Acceleration
Memory-Centric AI