🤖 AI Summary
This paper studies the contextual multinomial logit (MNL) bandit problem: minimizing cumulative regret in dynamic assortment selection under large item sets and variable reward structures (uniform or non-uniform). We propose OFU-MNL+, the first algorithm achieving both theoretical optimality and computational efficiency. Built upon the optimism-in-the-face-of-uncertainty (OFU) framework, it integrates MNL modeling, contextual linear parameter estimation, and refined regret decomposition. OFU-MNL+ attains, for the first time in this setting, both instance-dependent and minimax-optimal regret bounds: $\tilde{O}(d\sqrt{T/K})$ for uniform rewards and $\tilde{O}(d\sqrt{T})$ for non-uniform rewards, both matching information-theoretic lower bounds. Crucially, its per-step decision complexity is constant. Extensive experiments validate both its theoretical guarantees and practical effectiveness.
📄 Abstract
In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.