AI Summary
This work addresses the performance degradation of existing approximate nearest neighbor (ANN) indexes under high-frequency streaming updates, which stems from costly global rebuilds, update congestion, or structural imbalance. To overcome these limitations, we propose UBIS, an updatable balanced index structure that schedules concurrent updates to avoid global reconstruction and internal skew. By doing so, UBIS maintains high index quality while ensuring stable and efficient similarity search. Experimental evaluation on real-world datasets demonstrates that UBIS achieves up to 77% higher search accuracy and 45% higher update throughput on average compared to state-of-the-art methods.
Abstract
As artificial intelligence grows in popularity, vectors have become one of the most widely used data structures for services such as information retrieval and recommendation. Approximate Nearest Neighbor Search (ANNS), which generally relies on indices optimized for fast search to organize large datasets, plays a core role in these services. As data shifts more frequently, it is crucial for indices to accommodate new data and support real-time updates. Existing research follows two approaches, each with drawbacks: 1) approaches that use additional buffers to temporarily store new data are resource-intensive and inefficient because of their global rebuilding processes; 2) approaches that update the internal index structure suffer from performance degradation because of update congestion and imbalanced distribution in streaming workloads. In this paper, we propose UBIS, an Updatable Balanced Index for stable streaming similarity Search, which resolves conflicts by scheduling concurrent updates and maintains good index quality by reducing imbalanced update cases as the update frequency grows. Experimental results on real-world datasets demonstrate that UBIS achieves up to 77% higher search accuracy and 45% higher update throughput on average compared to state-of-the-art indices in streaming workloads.
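The two existing update strategies named in the abstract can be illustrated with a toy sketch. This is not the UBIS algorithm, and it is not taken from the paper: the class names `BufferedIndex` and `InPlaceIndex`, the fixed centroids, the `buffer_limit` parameter, and the `imbalance` metric (largest partition size divided by mean partition size) are all hypothetical, chosen only to show why a buffer forces periodic global rebuilds and why in-place insertion can skew partition sizes under a skewed stream.

```python
import math


def nearest(centroids, v):
    """Return the position of the centroid closest to v (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], v))


class BufferedIndex:
    """Strategy 1 (toy): new vectors wait in a buffer until a global rebuild."""

    def __init__(self, centroids, buffer_limit):
        self.centroids = centroids
        self.partitions = [[] for _ in centroids]
        self.buffer = []
        self.buffer_limit = buffer_limit  # hypothetical rebuild threshold
        self.rebuilds = 0

    def insert(self, v):
        self.buffer.append(v)
        if len(self.buffer) >= self.buffer_limit:
            self.rebuild()

    def rebuild(self):
        # Global rebuild: every stored vector is reassigned, not only the new ones.
        points = [p for part in self.partitions for p in part] + self.buffer
        self.partitions = [[] for _ in self.centroids]
        for p in points:
            self.partitions[nearest(self.centroids, p)].append(p)
        self.buffer = []
        self.rebuilds += 1


class InPlaceIndex:
    """Strategy 2 (toy): each vector goes straight into its nearest partition."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.partitions = [[] for _ in centroids]

    def insert(self, v):
        self.partitions[nearest(self.centroids, v)].append(v)

    def imbalance(self):
        # Largest partition size divided by the mean partition size.
        sizes = [len(p) for p in self.partitions]
        return max(sizes) / (sum(sizes) / len(sizes))


centroids = [(0.0, 0.0), (10.0, 10.0)]
# A skewed stream: every new vector lands near the first centroid.
skewed_stream = [(0.1 * i, 0.0) for i in range(8)]

buffered = BufferedIndex(centroids, buffer_limit=4)
in_place = InPlaceIndex(centroids)
for v in skewed_stream:
    buffered.insert(v)
    in_place.insert(v)

print("global rebuilds:", buffered.rebuilds)
print("partition imbalance:", in_place.imbalance())
```

In this toy setup the buffered index reassigns all stored vectors on every rebuild, so its cost grows with the size of the whole dataset, while the in-place index stays cheap per insert but lets one partition absorb the entire stream. These are the two failure modes the abstract attributes to existing approaches.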