🤖 AI Summary
This work addresses the scalability limitations of traditional k-nearest neighbor algorithms on data streams, where neighborhood search is computationally expensive and standard K-d trees support neither dynamic updates nor non-Minkowski distances such as the Canberra distance. To overcome these issues, the authors propose an online K-d tree structure that, for the first time, extends K-d trees to support the Canberra distance in dynamic environments. The approach enables approximate nearest neighbor search through dynamic node insertion and deletion, maintenance of structural invariants, and an efficient subtree pruning strategy. Experimental results show that the proposed method processes substantially more instances per second than existing baselines while incurring only a minor loss in average accuracy.
📝 Abstract
The k-Nearest Neighbors (kNN) algorithm has long been widely used in Machine Learning (ML) applications. Its main drawback is the computational cost of neighborhood search, which can make it infeasible for large-scale applications. Index structures such as the K-d tree become an option in such scenarios. Under data streams, however, nodes must be inserted and deleted on the fly, which makes it difficult to maintain the tree's balance and invariants. Additionally, traditional K-d trees were originally designed for Minkowski-based distance functions. In this work, we describe an Online K-d tree and its adaptation to the Canberra distance that supports dynamic updates over data streams while preserving the structural invariants required for efficient traversal. Experimental analysis demonstrates that the Online K-d tree algorithm achieves faster processing time under data streams, and that the adaptation to the Canberra distance enables effective subtree pruning, as evidenced by a minor loss in average accuracy and a substantial gain in instances processed per second. Our implementation can be found in our GitHub repository.
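To make the pruning idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the usual definition of the Canberra distance, d(p, q) = Σ |p_i − q_i| / (|p_i| + |q_i|) with a 0/0 term counted as 0, and uses one possible pruning bound: along the split axis, the per-coordinate Canberra term does not decrease as a coordinate moves farther from the query, so the term evaluated at the split value is a lower bound for every point on the far side. The names `Node`, `insert`, `nearest`, and `canberra_term` are hypothetical, and the sketch omits deletion and rebalancing, which the paper's structure handles.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


def canberra_term(a: float, b: float) -> float:
    """Contribution of one coordinate; the 0/0 case is defined as 0."""
    denom = abs(a) + abs(b)
    return abs(a - b) / denom if denom else 0.0


def canberra(p: Sequence[float], q: Sequence[float]) -> float:
    """Canberra distance: sum of the per-coordinate terms."""
    return sum(canberra_term(a, b) for a, b in zip(p, q))


@dataclass
class Node:
    point: Tuple[float, ...]
    axis: int                      # coordinate this node splits on
    left: Optional["Node"] = None  # points with point[axis] <  split
    right: Optional["Node"] = None # points with point[axis] >= split


def insert(root: Optional[Node], point: Tuple[float, ...], depth: int = 0) -> Node:
    """Insert one streamed point without rebuilding (no rebalancing here)."""
    if root is None:
        return Node(point, depth % len(point))
    if point[root.axis] < root.point[root.axis]:
        root.left = insert(root.left, point, depth + 1)
    else:
        root.right = insert(root.right, point, depth + 1)
    return root


def nearest(root: Optional[Node], query: Tuple[float, ...], best=None):
    """Return (distance, point) for the Canberra-nearest stored point."""
    if root is None:
        return best
    d = canberra(root.point, query)
    if best is None or d < best[0]:
        best = (d, root.point)
    split = root.point[root.axis]
    if query[root.axis] < split:
        near, far = root.left, root.right
    else:
        near, far = root.right, root.left
    best = nearest(near, query, best)
    # Assumed pruning rule: every point on the far side has a split-axis
    # term at least as large as the term at the split value, so the far
    # subtree can be skipped when that bound cannot beat the current best.
    if canberra_term(query[root.axis], split) < best[0]:
        best = nearest(far, query, best)
    return best
```

Building the tree is a loop of `root = insert(root, p)` over incoming points, starting from `root = None`, and `nearest(root, q)` then returns the closest stored point together with its distance. With this bound the search is exact; the accuracy loss reported in the abstract suggests the authors' pruning is more aggressive, which this sketch does not attempt to reproduce.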