🤖 AI Summary
Performance metrics for binary classification rules (e.g., accuracy, precision, recall, F measures, Jaccard index) are statistically unreliable when estimated from small samples. Method: This paper constructs analytical confidence intervals based on asymptotic normal approximations, supporting both individual and joint confidence regions over multiple rules and multiple metrics—without resampling. Contribution/Results: A "blurring correction" on the variance, which generalizes the plus-four method from the binomial proportion to general performance measures, improves finite-sample coverage. The resulting intervals are fast to compute, avoiding bootstrap resampling entirely, and the framework allows multiple performance measures of multiple classification rules to be inferred simultaneously for comparison.
📝 Abstract
In data mining, when binary prediction rules are used to predict a binary outcome, many performance measures are used across a vast literature for evaluation and comparison. Examples include classification accuracy, precision, recall, F measures, and the Jaccard index. Typically, these performance measures are only approximately estimated from a finite dataset, which may lead to findings that are not statistically significant. To properly quantify such statistical uncertainty, it is important to provide confidence intervals for these estimated performance measures. We consider statistical inference about general performance measures used in data mining, with both individual and joint confidence intervals. These confidence intervals are based on asymptotic normal approximations and can be computed quickly, without the need for bootstrap resampling. We study the finite-sample coverage probabilities of these confidence intervals and also propose a 'blurring correction' on the variance to improve finite-sample performance. This 'blurring correction' generalizes the plus-four method from the binomial proportion to general performance measures used in data mining. Our framework allows multiple performance measures of multiple classification rules to be inferred simultaneously for comparison.
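As background for the generalization the abstract describes, the classical plus-four interval for a single binomial proportion (the special case the paper builds on) adds two successes and two failures before applying the usual normal approximation. A minimal sketch, with the function name and the 9-of-10 example chosen here for illustration only:

```python
import math

def plus_four_interval(successes, n, z=1.96):
    """Plus-four confidence interval for a binomial proportion.

    Adds 2 pseudo-successes and 2 pseudo-failures (n + 4 trials total)
    before applying the normal approximation; z = 1.96 gives a
    nominal 95% interval. The endpoints are clipped to [0, 1].
    """
    p_tilde = (successes + 2) / (n + 4)
    half_width = z * math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return max(0.0, p_tilde - half_width), min(1.0, p_tilde + half_width)

# Small-sample example: 9 correct predictions out of 10.
lo, hi = plus_four_interval(9, 10)
```

With so few observations the plain normal interval would be badly miscalibrated; the pseudo-counts pull the center toward 1/2 and widen the interval, which is the small-sample effect the paper's 'blurring correction' extends to general performance measures.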