🤖 AI Summary
This paper addresses the challenge of measuring dependence between a numerical variable and a categorical variable using the Categorical Gini Correlation (CGC) of Dang et al. (2020), a theoretically consistent and computationally efficient dependence measure. CGC is built on the Gini mean difference, which gives it a clear statistical interpretation; its estimator is asymptotically normal, and it is sensitive to nonlinear and non-monotonic relationships. The paper describes procedures for constructing confidence intervals and conducting asymptotic independence tests, together with a vectorized, multi-process parallel implementation that accelerates computation on large datasets. Empirical evaluations indicate that CGC outperforms MIC, Distance Correlation, and Kendall’s τ-b in both robustness and discriminative power for feature selection. An open-source Python package is released, supporting end-to-end statistical inference and high-dimensional feature selection.
📝 Abstract
Categorical Gini Correlation (CGC), introduced by Dang et al. (2020), is a dependence measure designed to quantify the association between a numerical variable and a categorical variable. It has appealing properties compared to existing dependence measures; in particular, the correlation is zero if and only if the two variables are independent. It has also shown superior performance over existing methods when applied to feature screening for classification. This article presents a Python implementation for computing CGC, constructing confidence intervals, and performing independence tests based on it. All procedures are implemented with efficient algorithms that use vectorization and parallelization to reduce computation time.
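To make the quantity concrete, here is a minimal sketch of a sample CGC for a univariate numerical variable. It is not the API of the released package: the function names `gini_mean_difference` and `categorical_gini_correlation` are hypothetical, and the estimator is assumed to take the form (Δ − Σ p_k Δ_k) / Δ, where Δ is the Gini mean difference E|X₁ − X₂| of the whole sample, Δ_k is the same quantity within category k, and p_k is the proportion of observations in category k. Each Gini mean difference is computed from the sorted sample in O(n log n) rather than from all pairs.

```python
import numpy as np


def gini_mean_difference(x):
    """U-statistic estimate of E|X1 - X2| (hypothetical helper)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 2:
        # A single observation has no pairs; treat its spread as zero.
        return 0.0
    # Sorted-sample identity: sum_{i<j} (x_(j) - x_(i)) = sum_i (2i - n - 1) x_(i)
    weights = 2.0 * np.arange(1, n + 1) - n - 1
    return 2.0 * np.dot(weights, x) / (n * (n - 1))


def categorical_gini_correlation(x, y):
    """Sample CGC between numerical x and categorical y (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    n = x.size
    overall = gini_mean_difference(x)
    if overall == 0.0:
        raise ValueError("x is constant, so the correlation is undefined")
    labels, counts = np.unique(y, return_counts=True)
    # Weighted average of within-category Gini mean differences.
    within = sum(
        (n_k / n) * gini_mean_difference(x[y == label])
        for label, n_k in zip(labels, counts)
    )
    return (overall - within) / overall


# Perfect separation: all variation in x lies between the two categories.
rho = categorical_gini_correlation([0, 0, 1, 1], ["a", "a", "b", "b"])
print(rho)  # 1.0
```

With U-statistic estimates the sample value can fall slightly below zero when the variables are unrelated, even though the population quantity is non-negative; that behaviour is a property of this sketch's estimator choice, not a claim about the package.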