DiskJoin: Large-scale Vector Similarity Join with SSD

📅 2025-08-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the disk I/O bottleneck of billion-scale vector similarity joins on a single machine and the high communication overhead of distributed solutions, this paper proposes the first NVMe SSD-aware single-machine algorithm. Methodologically, it jointly optimizes I/O and computational efficiency by restructuring data access patterns, designing a hotness-aware dynamic main-memory caching mechanism, and introducing a probabilistic pruning strategy. The key contribution lies in deeply integrating NVMe SSD characteristics, such as parallelism, low latency, and high bandwidth, into the algorithmic design, enabling high-throughput similarity joins without requiring a cluster. Experiments on real-world large-scale datasets demonstrate that the approach achieves 50x-1000x speedups over state-of-the-art methods, significantly pushing the performance boundary of single-machine vector similarity joins.

📝 Abstract
Similarity join, a widely used operation in data science, finds all pairs of items whose distance is smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes, but these methods require a cluster deployment, and efficiency suffers from expensive inter-machine communication. On the other hand, disk-based solutions are more cost-effective because they use a single machine and store the large dataset on high-performance external storage, such as NVMe SSDs, but in these methods the disk I/O time is a serious bottleneck. In this paper, we propose DiskJoin, the first disk-based similarity join algorithm that can process billion-scale vector datasets efficiently on a single machine. DiskJoin improves disk I/O by tailoring the data access patterns to avoid repetitive accesses and read amplification. It also uses main memory as a dynamic cache and carefully manages cache eviction to improve the cache hit rate and reduce disk retrieval time. For further acceleration, we adopt a probabilistic pruning technique that can effectively prune a large number of vector pairs from computation. Our evaluation on real-world, large-scale datasets shows that DiskJoin significantly outperforms alternatives, achieving speedups from 50x to 1000x.
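The abstract defines similarity join as finding all pairs of items within a distance threshold. A minimal in-memory baseline (a naive sketch for illustration, not DiskJoin's disk-aware algorithm) can be written as:

```python
import numpy as np

def similarity_join(vectors, threshold):
    """Return all index pairs (i, j), i < j, with Euclidean distance <= threshold."""
    n = len(vectors)
    pairs = []
    for i in range(n):
        # Vectorized distances from vector i to all later vectors.
        dists = np.linalg.norm(vectors[i + 1:] - vectors[i], axis=1)
        for offset in np.nonzero(dists <= threshold)[0]:
            pairs.append((i, i + 1 + int(offset)))
    return pairs

vecs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(similarity_join(vecs, 0.5))  # [(0, 1)]
```

This baseline is quadratic in the number of vectors and assumes everything fits in RAM, which is exactly what breaks at billion scale and motivates the disk-resident design the paper proposes.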
Problem

Research questions and friction points this paper is trying to address.

Efficiently scaling vector similarity joins on single machines
Reducing disk I/O bottlenecks in large-scale similarity computations
Pruning unnecessary vector pair computations for billion-scale datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

SSD-optimized access patterns reduce I/O
Dynamic memory caching improves hit rates
Probabilistic pruning minimizes unnecessary computations
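To illustrate the hotness-aware caching idea from the summary, here is a toy sketch under loose assumptions: a fixed-capacity cache that counts accesses per disk block and evicts the least "hot" block on overflow. The class name, interface, and the use of a raw access count as the hotness signal are all hypothetical; the paper's actual mechanism may differ substantially.

```python
class HotnessCache:
    """Toy cache: evict the cached block with the lowest access count."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}     # block id -> cached content
        self.hotness = {}  # block id -> access count (kept even after eviction)

    def get(self, block_id, load_from_disk):
        self.hotness[block_id] = self.hotness.get(block_id, 0) + 1
        if block_id not in self.data:
            if len(self.data) >= self.capacity:
                # Evict the coldest currently cached block.
                coldest = min(self.data, key=lambda b: self.hotness[b])
                del self.data[coldest]
            self.data[block_id] = load_from_disk(block_id)
        return self.data[block_id]

loads = []
cache = HotnessCache(capacity=2)
fetch = lambda b: loads.append(b) or f"block-{b}"
cache.get(1, fetch); cache.get(1, fetch); cache.get(2, fetch); cache.get(3, fetch)
# Block 1 (hotness 2) survives; block 2 (hotness 1) is evicted for block 3.
print(loads)  # [1, 2, 3]
```

The point of such a policy is that frequently re-read vector blocks stay in main memory, so repeated disk fetches, the dominant cost in a disk-based join, are avoided for hot data.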
Yanqi Chen
University of Massachusetts Amherst
Xiao Yan
Centre for Perceptual and Interactive Intelligence
Alexandra Meliou
University of Massachusetts, Amherst
data management, Reverse Data Management, data provenance, causality, explanations
Eric Lo
Chinese University of Hong Kong