🤖 AI Summary
To address the computational bottleneck of constructing minimum spanning trees (MSTs) on large-scale, high-dimensional data, this paper proposes a three-stage approximation algorithm: (1) constructing an approximate nearest neighbor graph, (2) adding edges that connect the graph's disconnected components, and (3) iteratively refining the edge set. The method integrates approximate nearest neighbor search, graph connectivity analysis, and edge-set optimization. It achieves a time complexity of $O(dn \log n)$ and a space complexity of $O(dn + kn)$, where $n$ is the number of points, $d$ the dimensionality, and $k$ the neighborhood size. Empirical evaluation demonstrates low approximation error and speedups of up to 1000× over exact MST algorithms, which makes MST-based analysis practical for datasets with millions of points and thousands of dimensions.
📝 Abstract
We present Fast Approximate Minimum Spanning Tree (FAMST), a novel algorithm that addresses the computational challenges of constructing Minimum Spanning Trees (MSTs) for large-scale and high-dimensional datasets. FAMST utilizes a three-phase approach: Approximate Nearest Neighbor (ANN) graph construction, ANN inter-component connection, and iterative edge refinement. For a dataset of $n$ points in a $d$-dimensional space, FAMST achieves $\mathcal{O}(dn \log n)$ time complexity and $\mathcal{O}(dn + kn)$ space complexity when $k$ nearest neighbors are considered, which is a significant improvement over the $\mathcal{O}(n^2)$ time and space complexity of traditional methods.
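To give a rough sense of scale for the space bounds above, the following sketch counts stored values for a dense pairwise-distance matrix ($n^2$) against the data plus a $k$-neighbor graph ($dn + kn$). It ignores constant factors, and the specific values of `n`, `d`, `k`, and the 4-byte float size are illustrative assumptions, not figures taken from the paper.

```python
def dense_entries(n: int) -> int:
    """Values stored by a full n x n distance matrix."""
    return n * n


def sparse_entries(n: int, d: int, k: int) -> int:
    """Values stored by the data (d*n) plus a k-neighbor graph (k*n)."""
    return d * n + k * n


# Illustrative sizes only (assumed, not reported in the paper).
n, d, k = 1_000_000, 1_000, 20
BYTES_PER_VALUE = 4  # assuming 32-bit floats

dense_gb = dense_entries(n) * BYTES_PER_VALUE / 1e9
sparse_gb = sparse_entries(n, d, k) * BYTES_PER_VALUE / 1e9

print(f"dense distance matrix: {dense_gb:,.0f} GB")
print(f"data + k-NN graph:     {sparse_gb:,.2f} GB")
```

Under these assumed sizes the dense matrix needs on the order of terabytes, while the data and neighbor graph together need a few gigabytes.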
Experiments across diverse datasets demonstrate that FAMST achieves remarkably low approximation errors while providing speedups of up to 1000$\times$ compared to exact MST algorithms. We analyze how the key hyperparameters, $k$ (neighborhood size) and $\lambda$ (inter-component edges), affect performance, providing practical guidelines for hyperparameter selection. FAMST enables MST-based analysis on datasets with millions of points and thousands of dimensions, extending the applicability of MST techniques to problem scales previously considered infeasible.
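The general idea of building a spanning tree from a sparse neighbor graph can be illustrated with a toy sketch. This is not the authors' implementation: it substitutes brute-force exact $k$-nearest neighbors for ANN search (so it uses a full distance matrix and does not have the stated complexity), joins components by their single closest cross-component pair instead of using $\lambda$ candidate edges, and omits the iterative refinement phase, which the abstract does not specify. All function names are hypothetical.

```python
import numpy as np


class UnionFind:
    """Disjoint-set structure used for components and for Kruskal."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def knn_edges(X, k):
    """Phase 1 stand-in: exact brute-force k-NN edges (assumption: replaces ANN)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    nbrs = np.argsort(D, axis=1)[:, :k]
    edges = {}
    for i in range(len(X)):
        for j in nbrs[i].tolist():
            edges[(min(i, j), max(i, j))] = float(D[i, j])
    return edges, D


def connect_components(edges, D):
    """Phase 2 stand-in: merge components via their closest cross-component pair."""
    n = D.shape[0]
    uf = UnionFind(n)
    for i, j in edges:
        uf.union(i, j)
    labels = np.array([uf.find(i) for i in range(n)])
    roots = np.unique(labels)
    while len(roots) > 1:
        inside = labels == roots[0]
        sub = D[np.ix_(inside, ~inside)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i = int(np.flatnonzero(inside)[a])
        j = int(np.flatnonzero(~inside)[b])
        edges[(min(i, j), max(i, j))] = float(D[i, j])
        labels[labels == labels[j]] = roots[0]  # absorb j's component
        roots = np.unique(labels)
    return edges


def kruskal(n, edges):
    """Exact MST of the sparse candidate graph (Kruskal's algorithm)."""
    uf = UnionFind(n)
    tree = []
    for (i, j), w in sorted(edges.items(), key=lambda e: e[1]):
        if uf.union(i, j):
            tree.append((i, j, w))
    return tree


def approx_mst(X, k):
    """Spanning tree restricted to k-NN edges plus connecting edges."""
    edges, D = knn_edges(X, k)
    edges = connect_components(edges, D)
    return kruskal(len(X), edges)
```

Calling `approx_mst(X, k=3)` on an `(n, d)` NumPy array returns `n - 1` weighted edges. Because the tree is restricted to a subset of candidate edges, its total weight can only be equal to or greater than that of the exact MST, and this gap is what a refinement phase would aim to reduce.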