🤖 AI Summary
This work addresses domain adaptation for contextual bandits across heterogeneous domains: transferring a policy from a source domain (e.g., animal experiments) to a target domain (e.g., human clinical trials) under distribution shift. We propose the first general framework for domain-adaptive contextual bandits and establish a sub-linear regret bound under cross-domain transfer, relaxing the conventional single-domain assumption. Methodologically, we combine neural representation learning with adversarial domain alignment to jointly learn the policy and domain-invariant features, using feedback collected in the source domain while respecting feedback constraints in the target domain. Experiments on multiple real-world datasets show improvements over state-of-the-art contextual bandit methods, indicating strong cross-domain generalization.
📝 Abstract
Contextual bandit algorithms are essential for solving real-world decision-making problems. In practice, collecting a contextual bandit's feedback from different domains may involve different costs; for example, measuring drug reactions in mice (a source domain) versus in humans (a target domain). Unfortunately, adapting a contextual bandit algorithm from a source domain to a target domain under distribution shift remains a major challenge and is largely unexplored. In this paper, we introduce the first general domain adaptation method for contextual bandits. Our approach learns a bandit model for the target domain by collecting feedback from the source domain. Our theoretical analysis shows that our algorithm maintains a sub-linear regret bound even when adapting across domains. Empirical results show that our approach outperforms state-of-the-art contextual bandit algorithms on real-world datasets.
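For readers unfamiliar with the term, the standard notion of sub-linear regret can be written as follows. This is a generic textbook-style sketch: the symbols $R_T$, $r_t$, $a_t$, and $a_t^{*}$ are assumed here for illustration and are not taken from the paper, whose exact regret definition for the target domain may differ.

```latex
% Illustrative notation only (an assumption, not the paper's):
%   r_t(a)  : expected reward of action a for the context observed at round t
%   a_t     : action chosen by the algorithm at round t
%   a_t^{*} : best action for that context
R_T = \sum_{t=1}^{T} \bigl( r_t(a_t^{*}) - r_t(a_t) \bigr),
\qquad
\lim_{T \to \infty} \frac{R_T}{T} = 0 .
```

Regret is sub-linear when the average per-round loss $R_T / T$ vanishes as $T$ grows. A bound of order $\sqrt{T}$ up to logarithmic factors is a typical example; the abstract does not state the specific rate proved.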