The Clustering Strikes Back: Building Cost-Effective and High-Performance ANNS at Scale with Helmsman

📅 2026-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high memory overhead and limited scalability of large-scale approximate nearest neighbor search (ANNS) systems built on in-memory graph indexes such as HNSW. The authors propose Helmsman, a clustering-based ANNS system that combines a userspace I/O stack, a learning-driven hierarchical pruning mechanism, and a GPU-accelerated index construction pipeline to serve billion-scale vector retrieval from all-flash servers. Helmsman targets the three bottlenecks the authors observed in clustering-based ANNS on flash: kernel I/O stack overhead, a fixed pruning strategy, and slow index construction. It reduces hardware costs by over 90% and rebuilds billion-scale indexes within hours. In a production deployment that has run stably for several months, 40 machines host ANNS workloads that previously required about 35,000 CPU cores and 0.35 PB of DRAM.
📝 Abstract
RedNote (a.k.a. Xiaohongshu, a global-scale social network platform) widely adopts approximate nearest neighbor search (ANNS) to power its search, recommendation, and advertising services. Due to demanding Service Level Agreements (SLAs), we have to rely on in-memory graph-based ANNS (i.e., HNSW) to provide high throughput and low latency. However, the ever-growing user base and content volume have led to an explosive increase in memory footprint and, consequently, huge CapEx and OpEx. After exploring various alternatives, we find that building a clustering-based ANNS on top of all-flash servers can be promising. Yet, we still experience severe overheads from the kernel I/O stack, a fixed pruning strategy, and slow index construction. We present HELMSMAN, a high-performance and cost-effective clustering-based ANNS system, which combines an ANNS-oriented userspace storage stack, a leveling-learned pruning module, and GPU-accelerated construction pipelines. HELMSMAN saves over 90% of hardware costs and enables billion-scale index (re)builds within hours. In the current production deployment, which has been operating stably for several months, 40 machines now host ANNS workloads that previously required about 35,000 cores and 0.35 PB of DRAM.
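For readers unfamiliar with the term, the sketch below illustrates what a generic clustering-based ANNS index with a fixed pruning budget looks like. It is not Helmsman's implementation: it is a toy, in-memory, pure-Python version of the textbook approach (k-means partitioning plus probing a fixed number of clusters), and every name in it (`build_index`, `search`, `nprobe`, the data sizes) is an assumption made for illustration. In a flash-based system the posting lists would live on SSD, which is where the I/O stack overhead mentioned in the abstract would come into play.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], v))


def build_index(vectors, n_clusters, n_iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm); returns centroids and posting lists."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(n_iters):
        members = [[] for _ in range(n_clusters)]
        for v in vectors:
            members[nearest(centroids, v)].append(v)
        for c, group in enumerate(members):
            if group:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(group) for col in zip(*group)]
    # Final assignment: posting list c holds the ids of the vectors in cluster c.
    postings = [[] for _ in range(n_clusters)]
    for i, v in enumerate(vectors):
        postings[nearest(centroids, v)].append(i)
    return centroids, postings


def search(query, vectors, centroids, postings, k, nprobe):
    """Scan only the nprobe closest clusters (a fixed pruning budget)."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(centroids[c], query))
    candidates = [i for c in order[:nprobe] for i in postings[c]]
    candidates.sort(key=lambda i: sq_dist(vectors[i], query))
    return candidates[:k]


def brute_force(query, vectors, k):
    """Exact top-k by full scan, used as ground truth."""
    return sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))[:k]


# Hypothetical toy data: 500 random 8-dimensional vectors, 16 clusters.
rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids, postings = build_index(data, n_clusters=16)
query = [rng.random() for _ in range(8)]

approx = search(query, data, centroids, postings, k=5, nprobe=4)   # pruned scan
full = search(query, data, centroids, postings, k=5, nprobe=16)    # scans everything
exact = brute_force(query, data, k=5)
recall = len(set(approx) & set(exact)) / len(exact)
```

Here `nprobe` is the same for every query, which is one way to read the "fixed pruning strategy" the abstract criticizes: easy queries scan more clusters than they need, while hard queries may miss true neighbors. A learned pruning module would presumably choose the scan budget per query instead, though the abstract does not describe how.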
Problem

Research questions and friction points this paper is trying to address.

approximate nearest neighbor search
memory footprint
hardware cost
index construction
I/O overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

clustering-based ANNS
userspace storage stack
learned pruning
GPU-accelerated indexing
cost-effective similarity search
🔎 Similar Papers
2024-09-01, arXiv.org, Citations: 4
2024-10-10, International Conference on Machine Learning, Citations: 3