AI Summary
Existing image tokenizers rely on bidirectional context encoding, which mismatches the unidirectional, autoregressive nature of generative models and leads to generation misalignment. To address this, we propose AliTok, the first causal decoder-driven, alignment-aware tokenizer. AliTok enforces causal token encoding via a prefix-token mechanism and a two-stage training paradigm, enabling end-to-end alignment between the tokenizer and autoregressive sequence modeling. Evaluated on ImageNet-256, our 177M-parameter model achieves gFID = 1.50 and IS = 305.9; the 662M-parameter variant attains gFID = 1.35 and samples 10x faster than state-of-the-art diffusion models. This work establishes a co-design paradigm for tokenizers and generative models, bridging architectural and modeling objectives across the full generative pipeline.
Abstract
Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during compression, which hinders effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which uses a causal decoder to establish unidirectional dependencies among the encoded tokens, thereby aligning the token modeling approach of the tokenizer with that of the autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves strong reconstruction performance while remaining generation-friendly. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling. The code and weights are available at https://github.com/ali-vilab/alitok.
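The central structural idea, unidirectional (causal) dependencies among encoded tokens, can be illustrated with a minimal causal self-attention sketch. This is a plain-NumPy illustration of the masking that gives a causal decoder its left-to-right dependency structure, not the actual AliTok implementation; all names here are hypothetical.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a lower-triangular (causal) mask,
    so token i can only attend to tokens 0..i — the unidirectional
    dependency structure that next-token prediction assumes."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions j > i
    scores[mask] = -np.inf                            # forbid looking ahead
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy check of causality on a random token sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = causal_attention(x, x, x)

# Perturbing a later token must not change earlier outputs.
x2 = x.copy()
x2[3] += 1.0
out2 = causal_attention(x2, x2, x2)
assert np.allclose(out[:3], out2[:3])
```

A bidirectional encoder (no mask on `scores`) would fail the final assertion: every output token would depend on every input token, which is the mismatch with autoregressive generation that a causal tokenizer is designed to remove.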