🤖 AI Summary
This work proposes a general stratification framework that augments existing subsampling designs to address their inefficiency in large-scale data settings. By incorporating, for the first time, a maximum-variance-reduction objective into the stratification strategy, the method selects both the stratification variable and the interval boundaries to minimize the asymptotic variance of the resulting estimator, yielding a theoretically grounded, efficient selection criterion applicable to any subsampling design. Guided by an asymptotic normality analysis, the proposed algorithm runs in linear time and is compatible with both uniform and non-uniform subsampling schemes. Experiments on simulated and real-world datasets demonstrate that the approach substantially reduces estimation variance and improves accuracy at only a linear additional computational cost.
📝 Abstract
Subsampling is a widely used and effective approach for addressing the computational challenges posed by massive datasets. Substantial progress has been made in developing non-uniform, probability-based subsampling schemes that prioritize more informative observations. We propose a novel stratification mechanism that can be combined with existing subsampling designs to further improve estimation efficiency. We establish the estimator's asymptotic normality and quantify the resulting efficiency gain, enabling a principled procedure for selecting stratification variables and interval boundaries that targets reductions in asymptotic variance. The resulting algorithm, Maximum-Variance-Reduction Stratification (MVRS), achieves significant improvements in estimation efficiency while incurring only linear additional computational cost, and is applicable to both non-uniform and uniform subsampling methods. Experiments on simulated and real datasets confirm that MVRS markedly reduces estimator variance and improves accuracy compared with existing subsampling methods.
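To make the workflow concrete, here is a minimal sketch of stratified subsampling with a variance-reduction boundary search. The paper derives its criterion from the asymptotic variance of the subsampling estimator; the sketch substitutes a simple between-stratum variance proxy, quantile-based candidate boundaries, and proportional allocation with uniform within-stratum sampling, so it illustrates the general idea rather than the authors' MVRS objective. All names and parameters below (`choose_boundaries`, `stratified_subsample`, `n_candidates`, etc.) are hypothetical.

```python
# Hedged sketch: stratified subsampling of a scalar statistic (the mean),
# with interval boundaries on a stratification variable z chosen to
# maximize a between-stratum variance proxy. Larger between-stratum
# variance implies a larger variance reduction for the stratified
# estimator relative to plain uniform subsampling.
import numpy as np

def choose_boundaries(z, y, n_strata, n_candidates=20):
    """Search shifted quantile grids on z; keep the boundary set whose
    strata separate y the most (between-stratum variance proxy)."""
    best_edges, best_gain = None, -np.inf
    qs = np.linspace(0, 1, n_strata + 1)
    for shift in np.linspace(-0.5, 0.5, n_candidates) / n_strata:
        q = np.clip(qs[1:-1] + shift, 0.01, 0.99)
        edges = np.quantile(z, q)
        labels = np.searchsorted(edges, z)
        gain = sum(
            (labels == h).mean() * (y[labels == h].mean() - y.mean()) ** 2
            for h in range(n_strata) if (labels == h).any()
        )
        if gain > best_gain:
            best_gain, best_edges = gain, edges
    return best_edges

def stratified_subsample(y, z, edges, n_sub, rng):
    """Draw about n_sub points with proportional allocation across strata,
    uniformly within each stratum; return the stratified mean estimate."""
    labels = np.searchsorted(edges, z)
    estimate = 0.0
    for h in np.unique(labels):
        idx = np.flatnonzero(labels == h)
        n_h = max(1, round(n_sub * len(idx) / len(z)))
        take = rng.choice(idx, size=min(n_h, len(idx)), replace=False)
        estimate += (len(idx) / len(z)) * y[take].mean()
    return estimate

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)
y = 2.0 * z + rng.normal(size=z.size)  # y strongly tied to z, so stratifying on z helps
edges = choose_boundaries(z, y, n_strata=5)
est = stratified_subsample(y, z, edges, n_sub=1_000, rng=rng)
print(f"stratified subsample mean: {est:.4f} (full-data mean {y.mean():.4f})")
```

Note that each candidate boundary set costs one pass over the data, so the search adds only linear overhead, consistent with the linear additional cost claimed for MVRS.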