🤖 AI Summary
This work addresses multi-objective bandits in which user preferences and rewards are both unknown and entangled, which makes learning from utility feedback alone inefficient. To improve sample efficiency, the paper introduces proactive conversational queries into this setting for the first time, using structured user preferences (such as “a cheap and clean hotel”) to guide exploration. Because preference queries are shift-invariant and noisy, query-only learning is not enough, so the authors propose MO-PQUCB, a hybrid algorithm that combines query-based preference anchoring with bandit feedback. The method employs a robust estimator based on the Plackett–Luce subset choice model, incorporates shift-invariant regularization, and adopts a dual-exploration UCB strategy. Theoretical analysis shows accelerated preference estimation and improved regret scaling, while experiments on both synthetic and real-world scenarios show significant performance gains over existing preference-aware multi-objective MAB algorithms.
📝 Abstract
Personalized decision-making in multi-objective bandits requires learning user-specific trade-offs among competing objectives. Since arm utility depends on both unknown rewards and unknown preferences, existing methods infer preferences only from utility feedback, entangling preference learning with reward exploration. In practice, however, users often reveal their priorities through proactive conversational queries (e.g., "cheap and clean hotel"), yet this structured signal is not leveraged. We formalize a proactive query-based framework in which user queries provide structured preference signals. Modeling these signals via a Plackett-Luce subset choice model, we show that query-only learning is insufficient due to a fundamental shift-invariance barrier. To resolve this, we introduce MO-PQUCB, a hybrid algorithm that integrates query-based preference anchoring with bandit feedback through shift-invariant regularization and dual-exploration UCB. We prove that proactive queries accelerate preference estimation and yield improved regret scaling over prior preference-aware MO-MAB methods. Under corrupted queries, we further characterize statistical limits and design a robust estimator achieving near-optimal performance when the corruption is sparse. Experiments validate both theoretical and practical gains.
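The shift-invariance barrier mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard Plackett–Luce model in which the user picks objectives one at a time with probability proportional to the exponential of a latent score. The paper's exact subset choice model is not given here, so the function `pl_subset_prob`, the score vectors, and the shift value are all hypothetical and serve only as an illustration.

```python
import math


def pl_subset_prob(scores, chosen):
    """Probability of picking the objectives in `chosen`, in that order,
    under a standard Plackett-Luce model (an assumed form, not
    necessarily the paper's exact subset choice model)."""
    remaining = list(range(len(scores)))
    prob = 1.0
    for item in chosen:
        # Each pick is a softmax over the objectives not yet chosen.
        denom = sum(math.exp(scores[j]) for j in remaining)
        prob *= math.exp(scores[item]) / denom
        remaining.remove(item)
    return prob


# Hypothetical latent preference scores for three objectives.
base = [0.2, 1.0, -0.5]
# Adding the same constant to every score gives a different score vector.
shifted = [s + 3.0 for s in base]

# A query that names objective 1 first and objective 0 second.
p_base = pl_subset_prob(base, [1, 0])
p_shifted = pl_subset_prob(shifted, [1, 0])

# The two probabilities agree up to floating-point error.
print(math.isclose(p_base, p_shifted, rel_tol=1e-9))
```

Because the common factor cancels in every softmax ratio, query data alone cannot distinguish `base` from `shifted`. Under this assumed model, that is one way to see why queries would need to be anchored with bandit feedback to pin down the preference scale.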