Statistical Inference for Misspecified Contextual Bandits

📅 2025-09-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Contextual bandit algorithms (e.g., LinUCB) can fail to converge—and thus invalidate statistical inference—under reward model misspecification, a pervasive issue in real-world adaptive experiments (e.g., approximating complex dynamical systems with linear models). To address this, the paper proposes an adaptive algorithmic framework with provable convergence guarantees under general misspecification, overcoming the non-convergence limitation of existing methods. Building on this guarantee, it constructs an inverse-probability-weighting-based Z-estimator (IPW-Z), develops its asymptotic theory, and provides a consistent variance estimator, enabling asymptotically normal inference under general model misspecification. Simulation studies show that the method yields robust, data-efficient confidence intervals in both online and offline settings, outperforming prior approaches that apply only in the special case of offline policy evaluation.

📝 Abstract
Contextual bandit algorithms have transformed modern experimentation by enabling real-time adaptation for personalized treatment and efficient use of data. Yet these advantages create challenges for statistical inference due to adaptivity. A fundamental property that supports valid inference is policy convergence, meaning that action-selection probabilities converge in probability given the context. Convergence ensures replicability of adaptive experiments and stability of online algorithms. In this paper, we highlight a previously overlooked issue: widely used algorithms such as LinUCB may fail to converge when the reward model is misspecified, and such non-convergence creates fundamental obstacles for statistical inference. This issue is practically important, as misspecified models -- such as linear approximations of complex dynamic systems -- are often employed in real-world adaptive experiments to balance bias and variance. Motivated by this insight, we propose and analyze a broad class of algorithms that are guaranteed to converge even under model misspecification. Building on this guarantee, we develop a general inference framework based on an inverse-probability-weighted Z-estimator (IPW-Z) and establish its asymptotic normality with a consistent variance estimator. Simulation studies confirm that the proposed method provides robust and data-efficient confidence intervals, and can outperform existing approaches that apply only in the special case of offline policy evaluation. Taken together, our results underscore the importance of designing adaptive algorithms with built-in convergence guarantees to enable stable experimentation and valid statistical inference in practice.
Problem

Research questions and friction points this paper is trying to address.

Addressing non-convergence in contextual bandits under model misspecification
Ensuring valid statistical inference despite adaptive data collection
Providing robust confidence intervals for misspecified reward models
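To make the friction point concrete, here is a minimal sketch (not the paper's algorithm class) of why a fixed exploration floor keeps inference well-behaved: an epsilon-greedy rule with floor `EPS` pins every assignment probability inside `[EPS/2, 1 - EPS/2]`, so inverse-probability weights stay bounded even when the fitted linear model is misspecified and the greedy action keeps flipping. The quadratic true reward, the per-action least-squares updates, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
EPS = 0.2  # fixed exploration floor; an illustrative choice, not the paper's schedule

def propensity_a1(x, w_hat):
    """Epsilon-greedy P(action 1 | x) under a linear score model.
    With a fixed floor, the propensity is always EPS/2 or 1 - EPS/2,
    so IPW weights are bounded even if w_hat never stabilizes."""
    greedy_is_1 = w_hat[1] * x > w_hat[0] * x
    return (1.0 - EPS) * float(greedy_is_1) + EPS / 2.0

def true_reward(a, x):
    # Misspecified setting: action 1's reward is quadratic in x,
    # but the agent only fits per-action *linear* coefficients.
    return (x ** 2 if a == 1 else 0.5 * x) + rng.normal(scale=0.1)

w_hat = np.zeros(2)                    # per-action linear coefficients
sums = np.zeros(2)
sq = np.full(2, 1e-8)                  # running least-squares statistics
props = []
for _ in range(2000):
    x = rng.uniform(-1, 1)
    p1 = propensity_a1(x, w_hat)
    props.append(p1)
    a = int(rng.random() < p1)
    y = true_reward(a, x)
    sums[a] += x * y
    sq[a] += x * x
    w_hat[a] = sums[a] / sq[a]         # 1-D least squares per action

lo, hi = min(props), max(props)        # every propensity lies in [0.1, 0.9]
```

The point of the sketch is only the bound on `props`: regardless of how the misspecified fit `w_hat` evolves, no assignment probability ever leaves `[EPS/2, 1 - EPS/2]`, which is what keeps the IPW weights used for inference bounded.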
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convergent algorithms under model misspecification
Inverse-probability-weighted Z-estimator framework
Asymptotic normality with consistent variance estimation
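As a hedged illustration of the second and third bullets, the sketch below computes a Hajek-form root of an IPW-weighted Z-equation for a policy's value, together with a plug-in sandwich-style standard error. The simulated logging design, the target policy (always play action 1), and all names are assumptions for illustration; this is not the paper's estimator as specified, only the generic IPW-Z idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged bandit data (illustrative, not the paper's simulation):
# contexts X, logged propensities p1 = P(A=1 | X) bounded away from 0 and 1,
# binary actions A, and rewards Y.
T = 5000
X = rng.normal(size=T)
p1 = np.clip(0.5 + 0.3 * np.tanh(X), 0.1, 0.9)
A = rng.binomial(1, p1)
Y = 1.0 + 0.5 * A * X + rng.normal(scale=0.5, size=T)

def ipw_z_value(target, A, Y, p1):
    """Hajek-form root of the IPW Z-equation
        sum_t w_t (Y_t - theta) = 0,  w_t = 1{A_t = target_t} / P(A_t | X_t),
    with a plug-in standard error based on the estimating-function values."""
    p_obs = np.where(A == 1, p1, 1.0 - p1)   # propensity of the observed action
    w = (A == target) / p_obs
    theta = np.sum(w * Y) / np.sum(w)        # solves the Z-equation exactly
    psi = w * (Y - theta)                    # estimating-function values
    se = np.sqrt(np.sum(psi ** 2)) / np.sum(w)
    return theta, se

# Value of the always-play-action-1 policy; in this toy design the truth
# is E[1 + 0.5 X] = 1.
theta, se = ipw_z_value(np.ones(T, dtype=int), A, Y, p1)
ci = (theta - 1.96 * se, theta + 1.96 * se)
```

Note the role of the convergence guarantee from the first bullet: the standard error above is only trustworthy when the logged propensities are bounded away from 0 and stabilize, which is exactly what a convergent algorithm provides.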