Inference for Regression with Variables Generated by AI or Machine Learning

📅 2024-02-23

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

In regression analysis, using AI/ML-generated variables (such as imputed labels, nonlinear dimensionality reduction scores, or constructed indices) directly as covariates biases the coefficient estimates and invalidates standard errors, which compromises statistical inference. The paper characterizes this failure both theoretically and empirically and proposes two remedies: (1) an explicit bias correction, with bias-corrected confidence intervals that adjust for the bias introduced by the ML-generated variables; and (2) joint estimation of the latent variables and the regression parameters, which brings the ML step inside the inferential procedure instead of treating its output as data. The methods are illustrated in applications involving label imputation, dimensionality reduction, and index construction via classification and aggregation.

📝 Abstract

Researchers now routinely use AI or other machine learning methods to estimate latent variables of economic interest, then plug in the estimates as covariates in a regression. We show both theoretically and empirically that naively treating AI/ML-generated variables as "data" leads to biased estimates and invalid inference. To restore valid inference, we propose two methods: (1) an explicit bias correction with bias-corrected confidence intervals, and (2) joint estimation of the regression parameters and latent variables. We illustrate these ideas through applications involving label imputation, dimensionality reduction, and index construction via classification and aggregation.
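
The plug-in bias described in the abstract can be seen in a small simulation. The sketch below is a simplified illustration, not the paper's estimator: it assumes a binary latent label, a classifier that flips the label with a known, symmetric error rate independently of the outcome, and a textbook attenuation correction. All names (`simulate`, `ols_slope`, `err`) and parameter values are hypothetical choices for the example.

```python
import random


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate(n=50_000, beta=1.0, p=0.5, err=0.1, seed=0):
    """Return (naive, corrected) slope estimates.

    Assumptions made for this sketch only:
      * theta is a true binary label with P(theta = 1) = p;
      * theta_hat is an ML-style label that flips theta with probability err;
      * err is known to the analyst (rarely true in practice).
    """
    rng = random.Random(seed)
    theta = [1 if rng.random() < p else 0 for _ in range(n)]
    theta_hat = [1 - t if rng.random() < err else t for t in theta]
    y = [0.5 + beta * t + rng.gauss(0.0, 1.0) for t in theta]

    # Naive plug-in: treat the generated label as if it were data.
    naive = ols_slope(theta_hat, y)

    # Attenuation factor cov(theta, theta_hat) / var(theta_hat),
    # recovered from the observed label share q and the known error rate.
    q = sum(theta_hat) / n
    p_hat = (q - err) / (1 - 2 * err)
    attenuation = p_hat * (1 - p_hat) * (1 - 2 * err) / (q * (1 - q))
    corrected = naive / attenuation
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = simulate()
    print(f"naive slope:     {naive:.3f}")
    print(f"corrected slope: {corrected:.3f}")
```

Under these settings the attenuation factor is 1 - 2 * err = 0.8, so the naive slope settles near 0.8 rather than the true value of 1.0, while the corrected slope is close to 1.0. The sketch corrects only the point estimate; valid confidence intervals also need standard errors that account for the generated variable, which is what the bias-corrected intervals and joint estimation in the abstract address.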

Problem

Research questions and friction points this paper is trying to address.

Bias in regression using AI-generated variables as data

Invalid inference from naive treatment of ML estimates

Need methods to correct bias in latent variable regression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bias correction for AI-generated variables

Joint estimation of regression and latent variables

Valid inference via corrected confidence intervals
