Maximum-Variance-Reduction Stratification for Improved Subsampling

📅 2026-01-26
🤖 AI Summary
This work proposes a general stratification framework that augments existing subsampling designs to address the inefficiency of traditional subsampling in large-scale data settings. It incorporates, for the first time, a maximum-variance-reduction objective into the stratification strategy: the stratification variable and the interval boundaries are chosen to reduce the asymptotic variance of the estimator. The selection criterion is derived from the estimator's asymptotic normality, and the resulting algorithm adds only linear computational cost and is compatible with both uniform and non-uniform subsampling schemes. Experiments on simulated and real-world datasets show that the approach substantially reduces estimation variance and improves accuracy.

📝 Abstract
Subsampling is a widely used and effective approach for addressing the computational challenges posed by massive datasets. Substantial progress has been made in developing non-uniform, probability-based subsampling schemes that prioritize more informative observations. We propose a novel stratification mechanism that can be combined with existing subsampling designs to further improve estimation efficiency. We establish the estimator's asymptotic normality and quantify the resulting efficiency gains, which enables a principled procedure for selecting stratification variables and interval boundaries that target reductions in asymptotic variance. The resulting algorithm, Maximum-Variance-Reduction Stratification (MVRS), achieves significant improvements in estimation efficiency while incurring only linear additional computational cost. MVRS is applicable to both non-uniform and uniform subsampling methods. Experiments on simulated and real datasets confirm that MVRS markedly reduces estimator variance and improves accuracy compared with existing subsampling methods.
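To make the variance-reduction idea concrete, the following is a minimal sketch of why stratifying on a well-chosen variable can shrink the variance of a subsample estimate. It is not the MVRS algorithm from the paper: it assumes uniform subsampling, estimates a simple mean, fixes the interval boundaries at the quartiles of a hypothetical stratification variable `z`, and uses proportional allocation. All function names, the simulated data, and the parameter values are illustrative assumptions.

```python
import numpy as np


def srs_mean(y, n, rng):
    """Mean of a uniform subsample of size n drawn without replacement."""
    idx = rng.choice(y.size, size=n, replace=False)
    return y[idx].mean()


def stratified_mean(y, z, boundaries, n, rng):
    """Stratified subsample mean with proportional allocation.

    Strata are the intervals of z defined by `boundaries` (assumed sorted).
    """
    strata = np.digitize(z, boundaries)  # stratum label for every observation
    total = y.size
    estimate = 0.0
    for k in np.unique(strata):
        members = np.flatnonzero(strata == k)
        # Allocate the budget in proportion to stratum size (at least 1 draw).
        n_k = min(members.size, max(1, round(n * members.size / total)))
        pick = rng.choice(members, size=n_k, replace=False)
        # Weight each stratum mean by its share of the full data.
        estimate += members.size / total * y[pick].mean()
    return estimate


def compare(total=20000, n=200, reps=300, seed=0):
    """Empirical variance of both estimators over repeated subsampling."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=total)                       # hypothetical stratification variable
    y = 2.0 * z + rng.normal(scale=0.5, size=total)  # response correlated with z
    boundaries = np.quantile(z, [0.25, 0.5, 0.75])   # illustrative choice, not optimized
    srs = [srs_mean(y, n, rng) for _ in range(reps)]
    strat = [stratified_mean(y, z, boundaries, n, rng) for _ in range(reps)]
    return np.var(srs), np.var(strat)


if __name__ == "__main__":
    var_srs, var_strat = compare()
    print(f"uniform subsampling variance:    {var_srs:.5f}")
    print(f"stratified subsampling variance: {var_strat:.5f}")
```

In this toy setting the stratified estimator only removes the between-stratum part of the variance, so the gain depends on how strongly `z` tracks the response and on where the boundaries fall. According to the abstract, MVRS selects both the variable and the boundaries to target reductions in asymptotic variance, rather than fixing them in advance as this sketch does.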
Problem

Research questions and friction points this paper is trying to address.

subsampling
estimation efficiency
asymptotic variance
stratification
massive datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximum-Variance-Reduction Stratification
subsampling
asymptotic variance reduction
stratification
estimation efficiency
Dingyi Wang
State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
Haiying Wang
Ulster University
Machine Learning, data mining, data integration, robotics and computational biology
Qingpei Hu
Professor of Chinese Academy of Sciences
Industrial Statistics, System Reliability