AI Summary
Estimating the intrinsic dimension (ID) of high-dimensional data is highly sensitive to noise and to the choice of scale: at fine scales noise inflates the estimate, while at coarser scales the estimate stabilizes at lower, scale-invariant values. To address this, the authors propose eDCF, a multi-scale ID estimation method grounded in local connectivity, which introduces the Connectivity Factor (CF) as a robust statistical measure. The approach combines sliding-window analysis with parallelization for scalable computation, balancing noise robustness, scale adaptivity, and computational efficiency. On synthetic benchmarks with noisy samples, the method attains mean absolute error (MAE) comparable to leading estimators and the highest exact-match rate, up to 25.0%, compared with 16.7% for the Maximum Likelihood Estimator (MLE) and 12.5% for TWO-NN. It also accurately detects fractal structure in decision boundaries, supporting its use on realistic, structured data.
Abstract
Modern datasets often contain high-dimensional features exhibiting complex dependencies. To effectively analyze such data, dimensionality reduction methods rely on estimating the dataset's intrinsic dimension (ID) as a measure of its underlying complexity. However, estimating ID is challenging due to its dependence on scale: at very fine scales, noise inflates ID estimates, while at coarser scales, estimates stabilize to lower, scale-invariant values. This paper introduces a novel, scalable, and parallelizable method called eDCF, based on the Connectivity Factor (CF), a local connectivity-based metric, to robustly estimate intrinsic dimension across varying scales. Our method consistently matches leading estimators, achieving comparable values of mean absolute error (MAE) on synthetic benchmarks with noisy samples. It also attains higher exact intrinsic dimension match rates, reaching up to 25.0% compared to 16.7% for MLE and 12.5% for TWO-NN, and performs particularly well under medium to high noise levels and on large datasets. Further, we showcase our method's ability to accurately detect fractal geometries in decision boundaries, confirming its utility for analyzing realistic, structured data.
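The noise sensitivity described above can be illustrated with one of the baselines the abstract names. The following is a minimal sketch, not eDCF and not the authors' code: it uses the maximum-likelihood form of the TWO-NN estimate, which divides the number of points by the sum of log ratios between each point's second and first nearest-neighbor distances. The helper name `two_nn_id`, the sample size, the embedding, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np


def two_nn_id(points):
    """Maximum-likelihood form of the TWO-NN estimate: N / sum(log(r2 / r1)).

    Hypothetical helper for illustration; assumes no duplicate points.
    """
    # Brute-force pairwise Euclidean distances (fine for ~1000 points).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude each point's distance to itself

    # After partitioning at index 1, columns 0 and 1 hold the smallest and
    # second-smallest distance in every row.
    nearest = np.partition(dists, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]

    return len(points) / np.sum(np.log(r2 / r1))


rng = np.random.default_rng(0)
n_points, ambient_dim = 1000, 5

# A flat 2-dimensional sheet embedded in 5 dimensions (true ID = 2).
clean = np.zeros((n_points, ambient_dim))
clean[:, :2] = rng.uniform(0.0, 1.0, size=(n_points, 2))

# Isotropic noise larger than the typical nearest-neighbor spacing, so the
# data look higher-dimensional at the scale TWO-NN probes.
noisy = clean + rng.normal(0.0, 0.05, size=clean.shape)

d_clean = two_nn_id(clean)
d_noisy = two_nn_id(noisy)
print(f"clean estimate: {d_clean:.2f}")
print(f"noisy estimate: {d_noisy:.2f}")
```

Under these assumptions the clean estimate should land near 2, while the noisy estimate is pulled toward the ambient dimension, because TWO-NN only ever looks at the finest available scale. This is the behavior a multi-scale estimator is meant to avoid.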