Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

📅 2024-05-16
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 1
🤖 AI Summary
This paper studies the contextual multinomial logit (MNL) bandit problem: minimizing cumulative regret in dynamic assortment selection under large item sets and variable reward structures (uniform or non-uniform). The authors propose OFU-MNL+, the first algorithm achieving both theoretical optimality and computational efficiency. Built on the optimism-in-the-face-of-uncertainty (OFU) framework, it integrates MNL modeling, contextual linear parameter estimation, and a refined regret decomposition. OFU-MNL+ attains, for the first time in this setting, both instance-dependent and minimax-optimal regret bounds: $\tilde{O}(d\sqrt{T/K})$ for uniform rewards and $\tilde{O}(d\sqrt{T})$ for non-uniform rewards, both matching information-theoretic lower bounds. Crucially, its per-round decision complexity is constant. Extensive experiments validate both its theoretical guarantees and practical effectiveness.

πŸ“ Abstract
In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either the uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
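To make the feedback model concrete, the MNL choice model referenced in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `mnl_choice_probs` and the random contexts are assumptions; the standard MNL form with a unit-weight outside option is $P(i \mid S) = e^{x_i^\top \theta} / (1 + \sum_{j \in S} e^{x_j^\top \theta})$.

```python
import numpy as np

def mnl_choice_probs(features, theta):
    """Choice probabilities under the multinomial logit (MNL) model.

    Given an assortment with per-item feature vectors x_i (rows of
    `features`) and parameter theta, item i is chosen with probability
    exp(x_i^T theta) / (1 + sum_j exp(x_j^T theta)); the "1" in the
    denominator is the outside option (no purchase), returned last.
    """
    utilities = features @ theta          # x_i^T theta for each item in S
    weights = np.exp(utilities)
    denom = 1.0 + weights.sum()           # 1 accounts for the outside option
    return np.append(weights / denom, 1.0 / denom)

# Hypothetical example: d = 5 features, assortment of K = 4 items.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
theta = rng.normal(size=5)
probs = mnl_choice_probs(X, theta)
print(probs, probs.sum())                 # probabilities sum to 1
```

In the bandit setting, the learner picks the assortment (the rows of `X`) each round, observes a single choice drawn from these probabilities, and updates its estimate of the unknown `theta`.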
Problem

Research questions and friction points this paper is trying to address.

- contextual information
- reinforcement learning
- algorithm design
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Optimal Difficulty Metric
- OFU-MNL+ Algorithm
- Contextual Multi-Armed Bandits
Joongkyu Lee
Graduate School of Data Science, Seoul National University

Min-hwan Oh
Seoul National University
Reinforcement Learning · Bandit Algorithms · Machine Learning