🤖 AI Summary
This paper addresses the sequential decision-making problem inherent in large language model (LLM) decoding and alignment, formalizing it as the tokenized linear bandit (TLB) and tokenized multi-armed bandit (TMAB): bandit variants in which, given a query, a decision maker generates tokens sequentially and irrevocably, receiving stochastic utility feedback, dependent on the full generated sequence, only upon completion. The authors prove that learning is impossible without structure on the sequence function and introduce the diminishing distance with more commons (DDMC) structural assumption. Under DDMC, they establish the near-optimality of greedy decoding for LLMs and characterize the learnability boundary for decoding-time alignment. Methodologically, the approach integrates linear bandits, stochastic multi-armed bandits, and sequence-function modeling, achieving regret bounds of $\tilde{O}(L\sqrt{T})$ and $\tilde{O}(L T^{1/3})$. Experiments on synthetic and real-world datasets validate both algorithmic efficacy and the empirical plausibility of the DDMC assumption.
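To make the DDMC assumption concrete, here is a toy check of one plausible reading of "diminishing distance with more commons": the utilities of two sequences grow closer as they share more common tokens. The sequence function `f` below (geometric discounting of later tokens) is purely an illustrative assumption, not the paper's definition.

```python
def f(sequence):
    # Toy sequence function: geometric discounting, so later tokens matter
    # less and sequences sharing a longer common prefix have closer utilities.
    # This is an assumed example, not the paper's sequence function.
    return sum(0.5 ** i * tok for i, tok in enumerate(sequence))

s1 = [1, 0, 1, 1]
s2 = [1, 0, 0, 0]   # shares a 2-token prefix with s1
s3 = [0, 1, 1, 1]   # shares no prefix with s1

gap_long_common = abs(f(s1) - f(s2))  # 0.375: more in common, smaller gap
gap_no_common = abs(f(s1) - f(s3))    # 0.5: nothing in common, larger gap
```

Under this toy function, the utility gap shrinks as the shared prefix grows, which is the qualitative behavior the assumption's name suggests.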
📄 Abstract
We introduce the tokenized linear bandit (TLB) and multi-armed bandit (TMAB), variants of linear and stochastic multi-armed bandit problems inspired by LLM decoding and alignment. In these problems, at each round $t \in [T]$, a user submits a query (context), and the decision maker (DM) sequentially and irrevocably selects tokens from a token set. Once the sequence is complete, the DM observes a random utility from the user, whose expectation is represented by a sequence function mapping the chosen token sequence to a nonnegative real value that depends on the query. In both problems, we first show that learning is impossible without any structure on the sequence function. We introduce a natural assumption, diminishing distance with more commons (DDMC), and propose algorithms with regret $\tilde{O}(L\sqrt{T})$ and $\tilde{O}(L\sqrt{T^{2/3}})$ for TLB and TMAB, respectively. As a byproduct, we obtain the (almost) optimality of greedy decoding for LLM decoding under DDMC, which justifies the unreasonable effectiveness of greedy decoding in several tasks. This also has an immediate application to decoding-time LLM alignment when the misaligned utility can be represented as the frozen LLM's utility combined with a linearly realizable latent function. We finally validate our algorithms' performance empirically, as well as verify our assumptions, using synthetic and real-world datasets.
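The interaction protocol in the abstract (per-round query, irrevocable token-by-token selection, and a single noisy utility observed only once the sequence is complete) can be sketched as follows. All names here (`run_round`, `expected_utility`, the token set, the greedy policy with oracle scores) are illustrative assumptions for exposition, not the paper's algorithms.

```python
import random

TOKENS = ["a", "b", "c", "<eos>"]  # hypothetical token set
L = 5                              # hypothetical maximum sequence length

def expected_utility(query, sequence):
    """Stand-in for the unknown nonnegative sequence function f(query, seq).
    Toy choice: count tokens that appear in the query string."""
    return sum(1.0 for tok in sequence if tok in query)

def run_round(query, policy):
    """One round t: the DM picks tokens sequentially and irrevocably,
    then observes a stochastic utility for the completed sequence whose
    mean is the sequence function's value."""
    sequence = []
    while len(sequence) < L:
        token = policy(query, sequence)  # irrevocable choice
        if token == "<eos>":
            break
        sequence.append(token)
    noisy_utility = expected_utility(query, sequence) + random.gauss(0.0, 0.1)
    return sequence, noisy_utility

def greedy_policy(query, prefix):
    """Greedy decoding: extend the prefix with the single-step best token
    (here scored by the toy utility itself, as an oracle for illustration)."""
    return max(TOKENS, key=lambda tok: expected_utility(query, prefix + [tok]))

seq, u = run_round("abba", greedy_policy)
```

Note that feedback arrives only after the whole sequence is emitted, which is what separates this setting from a standard contextual bandit with per-step rewards.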