Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes TG-ITE (Tree-Guided Identify-Then-Exploit), a unified framework for stochastic dueling bandits with $N$ arms under the Condorcet-winner assumption. It addresses three objectives: best-arm identification (BAI), weak regret, and strong regret. A shared tree-guided identification stage finds a high-confidence incumbent within $O(N)$ comparisons, and objective-specific exploitation strategies then build on this warm start. Without stronger assumptions, TG-ITE achieves $O(N)$ sample complexity for BAI, yields the first winner-stays-style algorithm with $O(N)$ weak regret, and matches the $O(N \log T)$ strong-regret guarantee of specialized approaches. When BAI and weak regret are optimized jointly, it guarantees $O(N)$ for both, eliminating the $O(\log N)$ sub-optimality gap of the existing approach.

📝 Abstract

We study $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), to our knowledge the first unified framework to tackle all of these objectives. Without requiring stronger assumptions, we introduce a shared tree-guided identification approach that finds a high-confidence incumbent within $O(N)$ comparisons. We further design varied exploitation strategies that build on this warm-start stage to optimize the specific objective at hand. This methodology enables our approach to (1) achieve $O(N)$ sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve $O(N)$ weak regret; (3) enjoy the same $O(N \log T)$ guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with $O(N)$ guarantees for both, eliminating the sub-optimal gap of $O(\log N)$ in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.
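
To make the three objectives concrete, the following is a minimal sketch of the problem setup only, not of TG-ITE itself, whose tree-guided procedure is not detailed on this page. It assumes a preference matrix `P` where `P[i][j]` is the probability that arm `i` beats arm `j`, and it uses one common convention for per-duel regret: weak regret is the smaller of the two arms' gaps to the Condorcet winner, and strong regret is their average. The function names, the example matrix, and the choice of normalization are illustrative assumptions; the paper's exact definitions may differ.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def condorcet_winner(P: Matrix) -> Optional[int]:
    """Return the arm that beats every other arm with probability > 0.5, if any."""
    n = len(P)
    for i in range(n):
        if all(P[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


def duel_regret(P: Matrix, winner: int, a: int, b: int) -> Tuple[float, float]:
    """Per-duel (weak, strong) regret under an assumed, commonly used convention.

    The gap of an arm is how much the Condorcet winner is preferred over it.
    Weak regret vanishes if either dueling arm is the winner; strong regret
    vanishes only if both are.
    """
    gap_a = P[winner][a] - 0.5
    gap_b = P[winner][b] - 0.5
    weak = min(gap_a, gap_b)
    strong = (gap_a + gap_b) / 2
    return weak, strong


# Hypothetical 3-arm instance: arm 0 beats arms 1 and 2 more often than not.
P_example: Matrix = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.8],
    [0.3, 0.2, 0.5],
]

best = condorcet_winner(P_example)
weak_with_winner, strong_with_winner = duel_regret(P_example, best, 0, 2)
weak_without, strong_without = duel_regret(P_example, best, 1, 2)
```

In this instance, a duel that includes arm 0 incurs no weak regret but still incurs strong regret unless arm 0 is played against itself. This is why an identify-then-exploit scheme can hope for weak regret that does not grow with the horizon $T$, while strong regret retains a $\log T$ dependence in the stated guarantee.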

Problem

Research questions and friction points this paper is trying to address.

dueling bandits

best arm identification

regret minimization

Condorcet winner

unified framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

dueling bandits

best arm identification

regret minimization