🤖 AI Summary
In regression analysis, directly using AI/ML-generated variables (such as imputed labels, nonlinear dimensionality reduction scores, or synthetic indices) as covariates biases the coefficient estimates and invalidates standard errors, compromising statistical inference. We show this both theoretically and empirically, and propose two remedies: (1) an explicit bias correction with bias-corrected confidence intervals, which adjusts for the bias introduced by the ML-generated variables; and (2) joint estimation of the latent variables and the regression parameters, which embeds the ML modeling directly in the inferential workflow. The methods apply to common settings including label imputation, nonlinear dimensionality reduction, and index construction. Empirical results indicate that the proposed approaches restore valid standard errors and achieve nominal coverage of confidence intervals, making regression inference with generated variables more reliable.
📝 Abstract
Researchers now routinely use AI or other machine learning methods to estimate latent variables of economic interest, then plug in the estimates as covariates in a regression. We show both theoretically and empirically that naively treating AI/ML-generated variables as "data" leads to biased estimates and invalid inference. To restore valid inference, we propose two methods: (1) an explicit bias correction with bias-corrected confidence intervals, and (2) joint estimation of the regression parameters and latent variables. We illustrate these ideas through applications involving label imputation, dimensionality reduction, and index construction via classification and aggregation.
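
The plug-in problem described in the abstract can be illustrated with a small simulation. This is a minimal sketch and not the paper's method or data: it assumes the ML-generated variable equals the true latent variable plus independent Gaussian noise (classical measurement error), which real ML prediction errors need not satisfy. The function name `simulate_plugin_bias` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np


def ols_slope_and_se(x, y):
    """OLS of y on a constant and x; returns the slope and its conventional standard error."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # homoskedastic error variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1], np.sqrt(cov[1, 1])


def simulate_plugin_bias(n=100_000, beta=1.0, noise_sd=1.0, seed=0):
    """Compare a regression on the true latent variable with one on a noisy proxy.

    Assumption (for illustration only): the ML-generated variable is the
    latent variable plus independent Gaussian noise with std `noise_sd`.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n)                               # latent variable, unobserved in practice
    theta_hat = theta + rng.normal(scale=noise_sd, size=n)   # stand-in for the ML-generated estimate
    y = beta * theta + rng.normal(size=n)                    # outcome depends on the true latent variable

    oracle_slope, _ = ols_slope_and_se(theta, y)             # infeasible benchmark
    plugin_slope, plugin_se = ols_slope_and_se(theta_hat, y)  # naive plug-in regression

    # Naive 95% confidence interval that treats theta_hat as if it were data.
    lo, hi = plugin_slope - 1.96 * plugin_se, plugin_slope + 1.96 * plugin_se
    return {
        "oracle_slope": oracle_slope,
        "plugin_slope": plugin_slope,
        "plugin_ci": (lo, hi),
        "ci_covers_beta": lo <= beta <= hi,
    }


if __name__ == "__main__":
    print(simulate_plugin_bias())
```

Under these assumptions the plug-in slope is pulled toward `beta / (1 + noise_sd**2)` rather than `beta`, and because the naive interval is centered on the biased estimate, it tends to miss the true coefficient in large samples. The bias correction and joint estimation proposed in the paper address this kind of failure; neither is implemented in this sketch.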