🤖 AI Summary
To address the disk I/O bottleneck of billion-scale vector similarity joins on a single machine and the high communication overhead of distributed solutions, this paper proposes the first NVMe SSD-aware single-machine algorithm. Methodologically, it jointly optimizes I/O and computational efficiency by restructuring data access patterns, designing a hotness-aware dynamic main-memory caching mechanism, and introducing a probabilistic pruning strategy. The key contribution lies in integrating NVMe SSD characteristics (parallelism, low latency, and high bandwidth) into the algorithmic design, enabling high-throughput similarity joins without requiring a cluster. Experiments on real-world large-scale datasets show that the approach achieves 50×–1000× speedups over state-of-the-art methods, significantly pushing the performance boundary of single-machine vector similarity joins.
📝 Abstract
Similarity join, a widely used operation in data science, finds all pairs of items whose distance is smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes, but these methods require a cluster deployment, and their efficiency suffers from expensive inter-machine communication. Disk-based solutions, on the other hand, are more cost-effective because they use a single machine and store the large dataset on high-performance external storage, such as NVMe SSDs, but in these methods disk I/O time is a serious bottleneck. In this paper, we propose DiskJoin, the first disk-based similarity join algorithm that can process billion-scale vector datasets efficiently on a single machine. DiskJoin improves disk I/O by tailoring the data access patterns to avoid repetitive accesses and read amplification. It also uses main memory as a dynamic cache and carefully manages cache eviction to improve the cache hit rate and reduce disk retrieval time. For further acceleration, we adopt a probabilistic pruning technique that can effectively prune a large number of vector pairs from computation. Our evaluation on real-world, large-scale datasets shows that DiskJoin significantly outperforms alternatives, achieving speedups from 50x to 1000x.
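To make the problem definition concrete, the following is a minimal in-memory sketch of a similarity join that compares every pair of vectors. It is not DiskJoin and does not reflect its access-pattern, caching, or pruning techniques. The function name `naive_similarity_join` and the choice of Euclidean distance are assumptions made for illustration only; the abstract does not specify a distance function.

```python
import math
from itertools import combinations


def naive_similarity_join(vectors, threshold):
    """Return all index pairs (i, j), i < j, whose distance is below threshold.

    Assumption: Euclidean distance. This brute-force loop performs
    O(n^2) distance computations and keeps every vector in memory.
    """
    pairs = []
    for (i, u), (j, v) in combinations(enumerate(vectors), 2):
        if math.dist(u, v) < threshold:
            pairs.append((i, j))
    return pairs


# Toy data: two tight clusters that are far apart from each other.
vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 5.5)]
result = naive_similarity_join(vectors, threshold=1.5)
print(result)  # [(0, 1), (2, 3)]
```

The quadratic number of comparisons in this baseline is what makes billion-scale inputs impractical without the kind of I/O-aware design and pruning the abstract describes.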