Canonical Autoregressive Generation

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
During autoregressive text generation, large language models frequently deviate from the canonical tokenization used during training, leading to inconsistent output and degraded performance. This work shows that a model can generate a canonical token sequence only if every partial sequence produced during autoregressive generation is itself canonical. Building on this result, it proposes *canonical sampling*, a theoretically grounded decoding method that exploits the deterministic string-to-tokens mapping of standard tokenizers to enforce lightweight, tokenizer-aware constraints during sampling, requiring no model architecture modifications or retraining. Formal guarantees show that the induced distribution of token sequences is provably closer to the true distribution of token sequences used during training than that of standard sampling. Empirical evaluation across multiple models and benchmarks demonstrates significant improvements in tokenization canonicality and substantial mitigation of decoding inconsistency.

📝 Abstract
State of the art large language models are trained using large amounts of tokens derived from raw text using what is called a tokenizer. Crucially, the tokenizer determines the (token) vocabulary a model will use during inference as well as, in principle, the (token) language. This is because, while the token vocabulary may allow for different tokenizations of a string, the tokenizer always maps the string to only one of these tokenizations--the canonical tokenization. However, multiple lines of empirical evidence suggest that large language models do not always generate canonical token sequences, and this comes with several negative consequences. In this work, we first show that, to generate a canonical token sequence, a model needs to generate (partial) canonical token sequences at each step of the autoregressive generation process underpinning its functioning. Building upon this theoretical result, we introduce canonical sampling, a simple and efficient sampling method that precludes a given model from generating non-canonical token sequences. Further, we also show that, in comparison with standard sampling, the distribution of token sequences generated using canonical sampling is provably closer to the true distribution of token sequences used during training.
Problem

Research questions and friction points this paper is trying to address.

Large language models often generate non-canonical token sequences.
Non-canonical tokenization leads to negative consequences in model output.
Standard sampling deviates from the true distribution of token sequences used during training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces canonical sampling, a simple and efficient decoding method
Guarantees that generated token sequences are canonical
Provably improves alignment with the training token-sequence distribution
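The core idea can be sketched with a toy example. Because a standard tokenizer maps each string to exactly one (canonical) token sequence, a candidate sequence is canonical iff re-tokenizing its decoded string reproduces it, and a sampler can keep only the next tokens that preserve this property at each step. The greedy longest-match tokenizer and `canonical_filter` helper below are hypothetical illustrative stand-ins, not the paper's actual tokenizer or implementation:

```python
# Toy illustration of canonicality-constrained sampling.
# VOCAB, tokenize, and canonical_filter are invented for this sketch.
VOCAB = ["a", "b", "ab", "ba"]

def tokenize(text):
    """Canonical tokenization: greedy longest match from the left."""
    tokens = []
    i = 0
    while i < len(text):
        match = max((v for v in VOCAB if text.startswith(v, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

def is_canonical(tokens):
    """A sequence is canonical iff re-tokenizing its decoding gives it back."""
    return tokenize("".join(tokens)) == tokens

def canonical_filter(prefix, candidates):
    """Keep only next tokens whose extended partial sequence stays canonical."""
    return [t for t in candidates if is_canonical(prefix + [t])]

# "ab" has two valid tokenizations, ["ab"] and ["a", "b"], but only one canonical:
print(is_canonical(["ab"]))              # True
print(is_canonical(["a", "b"]))          # False
print(canonical_filter(["a"], VOCAB))    # ['a', 'ab'] -- "b" and "ba" are masked
```

In a real decoder, the filter would be applied to the model's logits before sampling (masking disallowed tokens to negative infinity), which is what makes the constraint lightweight: no retraining, only a per-step check against the tokenizer.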