🤖 AI Summary
This study addresses the challenge of learning optimal policies from observational data under nonlinear social welfare criteria that prioritize specific subpopulations. To model such nonlinear welfare, the authors propose a utility function grounded in potential outcomes and intermediate parameters. Because the propensity score is estimated with machine learning, they develop a novel reweighting-based debiasing approach as an alternative to conventional orthogonalization. The method integrates sieve approximation with K-fold cross-validation to enable fully automated policy learning. Theoretically, it establishes the first oracle inequality for nonlinear welfare in infinite-dimensional policy spaces, achieving minimax-optimal convergence rates in both welfare regret and average welfare regret.
📝 Abstract
This paper explores policy learning from observational data, focusing on a nonlinear welfare criterion in a binary treatment setting. The nonlinear criterion is inspired by scenarios where policymakers prioritize specific population segments. We model this criterion using a utility function that encompasses potential outcomes and intermediate parameters, with the latter capturing higher moments of the outcome distributions. When formulated in the context of observational data, both the intermediate parameters and the welfare criterion depend on the propensity score, which we estimate using machine-learning techniques. To address bias in machine learning estimates, we introduce a novel reweighting-based debiasing approach that offers a promising alternative to traditional orthogonality-based methods. To tackle the complexities of infinite-dimensional policy spaces, we employ sieve approximations and $K$-fold cross-validation for model selection, thereby fully automating the policy-learning process. Despite these complexities, we demonstrate that both the welfare regret and the average welfare regret of our proposed policy learning method satisfy an oracle inequality, thereby providing theoretical guarantees on the performance of the estimated policy relative to the best possible policy. This finding extends the existing results from linear to nonlinear welfare criteria, from finite-dimensional to infinite-dimensional policy spaces, and from a known propensity score to a machine-learned one.
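For orientation, the baseline setting that the abstract says is being extended (linear welfare, known propensity score) can be sketched in a few lines. This is only an illustrative sketch of standard inverse-propensity weighting, not the paper's nonlinear criterion or its reweighting-based debiasing; the function names `ipw_welfare` and `simulate`, the simulated data-generating process, and the clipping threshold are all assumptions made for the example.

```python
import random


def ipw_welfare(outcomes, treatments, propensities, policy_actions, clip=0.01):
    """Inverse-propensity-weighted estimate of the mean outcome under a policy.

    Assumes the propensity score e(x) = P(D = 1 | X = x) is known; the
    abstract's setting replaces it with a machine-learned estimate.
    """
    total = 0.0
    for y, d, e, a in zip(outcomes, treatments, propensities, policy_actions):
        e = min(max(e, clip), 1.0 - clip)  # guard against extreme weights
        p_obs = e if d == 1 else 1.0 - e   # probability of the observed treatment
        if d == a:                         # unit counts only if it followed the policy
            total += y / p_obs
    return total / len(outcomes)


def simulate(n, seed=0):
    """Hypothetical observational data: treatment helps only when x > 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        e = 0.7 if x > 0 else 0.3          # confounded treatment assignment
        d = 1 if rng.random() < e else 0
        y = 1.0 + d * x + rng.gauss(0.0, 0.5)
        rows.append((x, e, d, y))
    return rows


rows = simulate(20000)
xs = [r[0] for r in rows]
es = [r[1] for r in rows]
ds = [r[2] for r in rows]
ys = [r[3] for r in rows]

treat_if_positive = [1 if x > 0 else 0 for x in xs]
treat_all = [1] * len(xs)

w_rule = ipw_welfare(ys, ds, es, treat_if_positive)
w_all = ipw_welfare(ys, ds, es, treat_all)

# Welfare regret of "treat everyone" relative to the better rule in this toy class.
regret_of_treat_all = w_rule - w_all
```

In this simulation the rule "treat if x > 0" has true mean outcome 1.25 and "treat everyone" has 1.0, so the estimated regret should be close to 0.25. The regret here is measured against the better of two hand-picked rules, whereas the abstract's guarantee concerns regret relative to the best policy in an infinite-dimensional class.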