Optimal two-phase sampling designs for generalized raking estimators with multiple parameters of interest

📅 2025-07-22
🤖 AI Summary
Measurement error and misclassification in electronic health record (EHR) data bias statistical inference unless properly adjusted for. Method: We study adaptive, multiwave two-phase sampling designs paired with generalized raking (GR) estimation, treating inverse probability weighted (IPW) and GR estimators side by side. Contribution/Results: We derive optimal adaptive sampling designs for IPW and GR estimators when multiple parameters are of interest, and find that the GR optimum can differ substantially from the IPW optimum. We further introduce an integer-valued, A-optimal allocation strategy that typically outperforms optimizing the allocation independently for each parameter. Simulation studies and an analysis of the Vanderbilt Comprehensive Care Clinic HIV Study show that the approach substantially improves estimation efficiency over conventional case–control sampling, achieving relative efficiency gains of 32%–68%. This work offers practical guidance for designing two-phase studies that use error-prone data.

📝 Abstract
Large observational datasets compiled from electronic health records are a valuable resource for medical research but are often affected by measurement error and misclassification. Valid statistical inference requires proper adjustment for these errors. Two-phase sampling with generalized raking (GR) estimation is an efficient solution to this problem that is robust to complex error structures. In this approach, error-prone variables are observed in a large phase 1 cohort, and a subset is selected in phase 2 for validation with error-free measurements. Previous research has studied optimal phase 2 sampling designs for inverse probability weighted (IPW) estimators in non-adaptive, multi-parameter settings, and for GR estimators in single-parameter settings. In this work, we extend these results by deriving optimal adaptive, multiwave sampling designs for IPW and GR estimators when multiple parameters are of interest. We propose several practical allocation strategies and evaluate their performance through extensive simulations and a data example from the Vanderbilt Comprehensive Care Clinic HIV Study. Our results show that independently optimizing allocation for each parameter improves efficiency over traditional case-control sampling. We also derive an integer-valued, A-optimal allocation method that typically outperforms independent optimization. Notably, we find that optimal designs for GR can differ substantially from those for IPW, and that this distinction can meaningfully affect estimator efficiency in the multiple-parameter setting. These findings offer practical guidance for future two-phase studies using error-prone data.
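To make the raking step concrete, the following is a minimal sketch of exponential (raking) calibration, assuming a simple random phase 2 sample and an auxiliary matrix made of an intercept plus one error-prone covariate. The function name `rake_weights` and the simulated data are hypothetical illustrations, not the authors' implementation; in two-phase practice the auxiliary variables are often influence functions from a model fitted to the error-prone phase 1 data.

```python
import numpy as np

def rake_weights(design_w, aux, totals, n_iter=50, tol=1e-10):
    """Calibrate design weights so weighted phase 2 totals of `aux`
    match the known phase 1 totals, using w_i = d_i * exp(aux_i @ lam)."""
    d = np.asarray(design_w, dtype=float)
    x = np.asarray(aux, dtype=float)
    t = np.asarray(totals, dtype=float)
    lam = np.zeros(x.shape[1])
    for _ in range(n_iter):
        w = d * np.exp(x @ lam)
        resid = t - x.T @ w              # calibration equations still unmet
        if np.max(np.abs(resid)) < tol:
            break
        jac = x.T @ (w[:, None] * x)     # Jacobian of weighted totals w.r.t. lam
        lam += np.linalg.solve(jac, resid)  # Newton step
    return d * np.exp(x @ lam)

# Hypothetical data: 1000 phase 1 records, 200 validated in phase 2.
rng = np.random.default_rng(0)
phase1_aux = np.column_stack([np.ones(1000), rng.normal(size=1000)])
phase1_totals = phase1_aux.sum(axis=0)
phase2_idx = rng.choice(1000, size=200, replace=False)
design_w = np.full(200, 1000 / 200)      # inverse sampling probabilities
raked = rake_weights(design_w, phase1_aux[phase2_idx], phase1_totals)
```

The raked weights replace the IPW design weights in the phase 2 estimating equations; efficiency improves to the extent that the auxiliary variables are correlated with the quantities being estimated.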
Problem

Research questions and friction points this paper is trying to address.

Optimize two-phase sampling for multiple parameters
Adjust for measurement error and misclassification in health data analysis
Compare optimal designs for GR and IPW estimators and their effect on efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal adaptive multiwave sampling designs
Generalized raking estimators for multiple parameters of interest
Integer-valued, A-optimal allocation that typically outperforms independent per-parameter optimization
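
As an illustration of what an integer-valued, A-optimal allocation can look like, here is a minimal sketch under assumptions that are not taken from the paper: phase 2 units are drawn by simple random sampling within strata, and the per-stratum standard deviations of each parameter's influence function (`sd_hj`) are treated as known. A-optimality then means minimizing the sum of the parameter variances, which up to a constant is the sum over strata of N_h^2 * sum_j sd_hj^2 / n_h. The function name and inputs are hypothetical.

```python
import heapq
import numpy as np

def integer_a_optimal_allocation(N_h, sd_hj, n_total, n_min=2):
    """Greedy integer allocation minimizing sum_h c_h / n_h,
    where c_h = N_h^2 * sum_j sd_hj[h, j]^2 (trace criterion)."""
    N_h = np.asarray(N_h, dtype=int)
    sd_hj = np.asarray(sd_hj, dtype=float)      # shape (strata, parameters)
    c = N_h.astype(float) ** 2 * (sd_hj ** 2).sum(axis=1)
    n = np.full(len(N_h), n_min, dtype=int)
    if n.sum() > n_total or np.any(n > N_h):
        raise ValueError("n_min is infeasible for these strata")

    def gain(h):
        # Variance reduction from one more unit: c/n - c/(n+1) = c/(n(n+1)).
        return c[h] / (n[h] * (n[h] + 1))

    # Max-heap (negated gains) over strata that can still take a unit.
    heap = [(-gain(h), h) for h in range(len(N_h)) if n[h] < N_h[h]]
    heapq.heapify(heap)
    for _ in range(int(n_total - n.sum())):
        if not heap:
            break                                # every stratum is exhausted
        _, h = heapq.heappop(heap)
        n[h] += 1
        if n[h] < N_h[h]:
            heapq.heappush(heap, (-gain(h), h))
    return n
```

Because the objective is separable and convex in each n_h, assigning units one at a time to the stratum with the largest marginal gain reaches the integer optimum for this simplified criterion. In an adaptive, multiwave design the standard deviations would be re-estimated from earlier waves before each new allocation.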