🤖 AI Summary
This paper studies the contextual combinatorial bandit problem in which the set of base arms evolves over time, with the goal of maximizing cumulative reward. To address the dual challenges of time-varying feasible action sets and context-dependent rewards, we introduce Gaussian process (GP) modeling into this framework for the first time, proposing the O'CLOK-UCB algorithm and a sparse-GP-accelerated variant. Our method integrates kernelized UCB indices, combinatorial feasibility constraints, and a Lipschitz continuity analysis. We establish a sublinear high-probability regret bound of Õ(√(λ∗(K)KTγ_T)), where λ∗(K) is the maximum eigenvalue of the covariance matrices of selected actions and γ_T is the maximum information gain, revealing their coupled impact on regret. Empirical evaluation on real-world datasets demonstrates significant improvements over existing UCB-based approaches, confirming both theoretical rigor and practical efficacy.
📝 Abstract
We consider a contextual bandit problem with a combinatorial action set and time-varying base arm availability. At the beginning of each round, the agent observes the set of available base arms and their contexts, and then selects an action that is a feasible subset of the set of available base arms to maximize its cumulative reward in the long run. We assume that the mean outcomes of base arms are samples from a Gaussian Process (GP) indexed by the context set X, and that the expected reward is Lipschitz continuous in the expected base arm outcomes. For this setup, we propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB) and prove that it incurs Õ(√(λ∗(K)KTγ_T)) regret with high probability, where γ_T is the maximum information gain associated with the set of base arm contexts that appeared in the first T rounds, K is the maximum cardinality of any feasible action over all rounds, and λ∗(K) is the maximum eigenvalue of all covariance matrices of selected actions up to time T, which is a function of K. To dramatically speed up the algorithm, we also propose a variant of O'CLOK-UCB that uses sparse GPs. Finally, we experimentally show that both algorithms exploit inter-base-arm outcome correlation and vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.
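The core mechanism the abstract describes, computing a GP-based upper confidence bound for each available base arm's context and then choosing a feasible subset, can be illustrated with a minimal sketch. This is not the authors' implementation: the squared-exponential kernel, the `beta` exploration weight, the noise level, and the simple top-K feasibility constraint are all illustrative assumptions standing in for the paper's general setup.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel between context rows of A and B (an assumed kernel choice).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def gp_ucb_indices(X_hist, y_hist, X_avail, beta=2.0, noise=0.1):
    """UCB index (posterior mean + scaled posterior std) for each available base arm context."""
    K = rbf_kernel(X_hist, X_hist) + noise ** 2 * np.eye(len(X_hist))
    k_star = rbf_kernel(X_hist, X_avail)               # shape (n_hist, n_avail)
    alpha = np.linalg.solve(K, y_hist)                 # K^{-1} y
    mu = k_star.T @ alpha                              # posterior means
    v = np.linalg.solve(K, k_star)                     # K^{-1} k_star
    var = np.clip(1.0 - np.einsum('ij,ij->j', k_star, v), 0.0, None)
    return mu + np.sqrt(beta) * np.sqrt(var)

def select_action(indices, K_max):
    # Toy feasibility constraint: pick the K_max base arms with the highest UCB indices.
    return np.argsort(indices)[::-1][:K_max]
```

In each round one would compute `gp_ucb_indices` over the currently available contexts, select a feasible subset optimistically, observe the outcomes of the chosen base arms, and append them to the history; the sparse-GP variant mentioned in the abstract would replace the exact `np.linalg.solve` posterior with an inducing-point approximation to cut the cubic cost in the history size.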