🤖 AI Summary
To address the high computational complexity and memory overhead of energy statistics—such as distance variance, covariance, correlation, and energy distance—in high-dimensional, large-scale data, this paper proposes a lightweight, efficient algorithmic framework. The method leverages vectorized distance compression, block-wise iteration, and memory reuse to avoid explicitly constructing and storing the full pairwise distance matrix. It tightly integrates Rfast's C++ kernels within R, achieving sub-quadratic time complexity and linear space complexity. For the first time, it enables real-time computation of demanding statistics—including partial distance correlation and multivariate energy distance—while preserving strict statistical consistency. On datasets with 10,000 samples and 1,000 dimensions, the implementation achieves over a 100× speedup and a 90% memory reduction compared to state-of-the-art methods, with numerical results identical to classical implementations. The framework has been successfully deployed in distribution homogeneity testing and nonlinear independence detection.
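The block-wise idea described above can be illustrated with a minimal sketch. This is an assumed, simplified Python rendition (not the paper's Rfast/C++ code): pairwise-distance sums are accumulated one block at a time, so memory stays proportional to the block size rather than O(n²), and the result feeds Székely's energy-distance formula. The function names and the `block` parameter are illustrative choices.

```python
# Hedged sketch (not the paper's implementation): accumulate pairwise
# |u_i - v_j| sums block by block, so the full distance matrix is
# never materialized in memory.

def mean_pairwise_dist(u, v, block=256):
    """Mean absolute distance over all pairs (u_i, v_j), computed in blocks."""
    total = 0.0
    for i in range(0, len(u), block):
        ui = u[i:i + block]
        for j in range(0, len(v), block):
            vj = v[j:j + block]
            # Only a block x block tile of distances exists at a time.
            total += sum(abs(a - b) for a in ui for b in vj)
    return total / (len(u) * len(v))

def energy_distance(x, y, block=256):
    """Energy distance between univariate samples x and y:
    2*E|X-Y| - E|X-X'| - E|Y-Y'| (sample version)."""
    return (2.0 * mean_pairwise_dist(x, y, block)
            - mean_pairwise_dist(x, x, block)
            - mean_pairwise_dist(y, y, block))
```

For identically distributed samples the statistic is near zero; well-separated samples yield a large positive value, which is what the homogeneity test mentioned below exploits.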
📝 Abstract
Energy statistics, also known as $\mathcal{E}$-statistics, are functions of distances between statistical observations. This class of functions has enabled the development of non-linear statistical concepts, such as distance variance, distance covariance, and distance correlation. However, the computational burden associated with $\mathcal{E}$-statistics is substantial, particularly when the data reside in multivariate space. To address this challenge, we have developed a method to significantly reduce memory requirements and accelerate computations, thereby facilitating the analysis of large data sets. The following cases are demonstrated: univariate and multivariate distance variance, distance covariance, partial distance correlation, energy distance, and hypothesis testing for the equality of univariate distributions.
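For reference, the classical definition of the sample distance covariance that the paper accelerates can be written out directly. The sketch below is an illustrative, unoptimized O(n²)-time, O(n²)-memory Python version following Székely and Rizzo's double-centering definition; it is exactly the kind of baseline whose distance-matrix storage the proposed method avoids. All names here are illustrative, not from the paper.

```python
# Illustrative baseline (not the paper's method): squared sample
# distance covariance via double-centered distance matrices.

def dcov2(x, y):
    """Squared sample distance covariance of paired univariate samples."""
    n = len(x)
    # Full n x n distance matrices -- the O(n^2) memory cost the
    # paper's block-wise approach is designed to eliminate.
    a = [[abs(x[j] - x[k]) for k in range(n)] for j in range(n)]
    b = [[abs(y[j] - y[k]) for k in range(n)] for j in range(n)]

    def center(m):
        # Subtract row and column means, add back the grand mean.
        row = [sum(r) / n for r in m]
        grand = sum(row) / n
        return [[m[j][k] - row[j] - row[k] + grand
                 for k in range(n)] for j in range(n)]

    A, B = center(a), center(b)
    return sum(A[j][k] * B[j][k]
               for j in range(n) for k in range(n)) / n**2
```

Distance correlation then follows as `dcov2(x, y) / sqrt(dcov2(x, x) * dcov2(y, y))`, and it vanishes if and only if the samples are independent in the population version, which underlies the nonlinear independence tests listed above.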