🤖 AI Summary
This work addresses key challenges in GPU-accelerated approximate nearest neighbor search (ANNS) in high-dimensional settings: memory bandwidth bottlenecks, inefficient batch updates, and limited compute-memory overlap caused by the data-dependent accesses of greedy search. To overcome these limitations, the authors propose Jasper, a GPU-native ANNS system built on the Vamana graph index. It features a lock-free, streaming CUDA algorithm for batch-parallel index construction, a GPU-efficient implementation of RaBitQ quantization that shrinks the memory footprint without costly random memory accesses, and an optimized greedy search kernel. The resulting system achieves high query throughput while supporting dynamic updates. Experiments on five datasets show up to 1.93× higher query throughput than CAGRA, index construction 2.4× faster on average, 19–131× speedups over BANG, and up to 80% of peak performance as measured by the roofline model.
📝 Abstract
Approximate nearest neighbor search (ANNS) is a core problem in machine learning and information retrieval applications. GPUs offer a promising path to high-performance ANNS: they provide massive parallelism for distance computations, are readily available, and can be co-located with downstream applications. Despite these advantages, current GPU-accelerated ANNS systems face three key limitations. First, real-world applications operate on evolving datasets that require fast batch updates, yet most GPU indices must be rebuilt from scratch when new data arrives. Second, high-dimensional vectors strain memory bandwidth, but current GPU systems lack efficient quantization techniques that reduce data movement without introducing costly random memory accesses. Third, the data-dependent memory accesses inherent to greedy search make it difficult to overlap computation with memory traffic, reducing performance. We present Jasper, a GPU-native ANNS system that offers both high query throughput and updatability. Jasper builds on the Vamana graph index and overcomes existing bottlenecks via three contributions: (1) a CUDA batch-parallel construction algorithm that enables lock-free streaming insertions, (2) a GPU-efficient implementation of RaBitQ quantization that reduces the memory footprint by up to 8x without random-access penalties, and (3) an optimized greedy search kernel that increases compute utilization, yielding better latency hiding and higher throughput. Our evaluation across five datasets shows that Jasper achieves up to 1.93x higher query throughput than CAGRA and reaches up to 80% of peak utilization as measured by the roofline model. Jasper's construction scales efficiently, building indices an average of 2.4x faster than CAGRA while providing the updatability that CAGRA lacks. Compared to BANG, the previously fastest GPU Vamana implementation, Jasper delivers 19-131x faster queries.
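To make the abstract's third limitation concrete, here is a minimal sketch of the greedy (best-first) beam search used by Vamana-style graph indices. This is an illustrative CPU-side Python sketch, not Jasper's CUDA kernel; the graph, vectors, and parameter names are assumptions for the example. Note how each hop dereferences a neighbor list chosen only after the previous round of distance comparisons finishes; this data dependence is what makes compute-memory overlap hard on GPUs.

```python
import heapq

def l2sq(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(graph, vectors, entry, query, beam_width, k, dist=l2sq):
    """Best-first beam search over a proximity graph (Vamana-style sketch).

    The next neighbor list to fetch depends on the distances just
    computed, so memory accesses cannot be issued far in advance.
    """
    d0 = dist(vectors[entry], query)
    visited = {entry}
    frontier = [(d0, entry)]   # min-heap of candidates to expand
    results = [(d0, entry)]    # best beam_width nodes seen so far
    while frontier:
        d, node = heapq.heappop(frontier)
        # Stop once the closest unexpanded candidate cannot improve the beam.
        if len(results) >= beam_width and d > results[-1][0]:
            break
        for nb in graph[node]:  # data-dependent fetch of a neighbor list
            if nb in visited:
                continue
            visited.add(nb)
            nd = dist(vectors[nb], query)
            heapq.heappush(frontier, (nd, nb))
            results.append((nd, nb))
        results.sort()
        results = results[:beam_width]
    return [node for _, node in results[:k]]

# Tiny usage example on a hypothetical 5-node graph.
vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 1.0)]
graph = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}
nearest = greedy_search(graph, vectors, entry=0, query=(2.1, 0.0),
                        beam_width=3, k=1)
# nearest == [2], the node closest to the query
```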