AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model

📅 2025-06-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing image tokenizers rely on bidirectional context encoding, which mismatches the unidirectional, autoregressive nature of generative models and leads to generation misalignment. To address this, we propose AliTok, the first causal decoder-driven, alignment-aware tokenizer. AliTok enforces causal token encoding via a prefix-token mechanism and a two-stage training paradigm, enabling end-to-end alignment between the tokenizer and autoregressive sequence modeling. Evaluated on ImageNet-256, our 177M-parameter model achieves gFID = 1.50 and IS = 305.9; the 662M-parameter variant attains gFID = 1.35 and samples 10× faster than state-of-the-art diffusion models. This work establishes a novel co-design paradigm for tokenizers and generative models, bridging architectural and modeling objectives across the full generative pipeline.

πŸ“ Abstract
Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during the compression process, which hinders effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which utilizes a causal decoder to establish unidirectional dependencies among encoded tokens, thereby aligning the token modeling approach between the tokenizer and the autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves strong reconstruction performance while remaining generation-friendly. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method while sampling 10x faster. The code and weights are available at https://github.com/ali-vilab/alitok.
Problem

Research questions and friction points this paper is trying to address.

Align tokenizer and autoregressive model for image generation
Resolve bidirectional dependency mismatch in image tokenizers
Improve reconstruction and generation performance with causal dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal decoder ensures unidirectional token dependencies
Prefix tokens enhance reconstruction consistency
Two-stage training yields a generation-friendly tokenizer
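The causal-dependency idea in these bullets can be illustrated with a minimal sketch of an attention visibility mask. The function name, the mask layout, and the choice to make prefix tokens globally visible while all remaining tokens attend strictly causally are illustrative assumptions, not details taken from the paper.

```python
def build_alignment_mask(num_prefix: int, num_tokens: int):
    """Build an attention visibility mask: mask[i][j] is True when
    position j is visible to position i.

    Assumed scheme: prefix tokens are visible to every position, while
    the remaining tokens see only themselves and earlier positions,
    matching the unidirectional order a decoder-only autoregressive
    generator uses at sampling time.
    """
    n = num_prefix + num_tokens
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < num_prefix:
                # Prefix tokens: globally visible context.
                mask[i][j] = True
            elif j <= i:
                # Causal rule: earlier or same position only.
                mask[i][j] = True
    return mask
```

With such a mask, the tokenizer's decoder and the downstream autoregressive model share the same left-to-right dependency structure, which is the alignment the paper targets.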
Authors
Pingyu Wu (University of Science and Technology of China)
Kai Zhu (University of Science and Technology of China)
Yu Liu (Tongyi Lab)
Longxiang Tang (Tsinghua University)
Jian Yang (University of Science and Technology of China)
Yansong Peng (University of Science and Technology of China)
Wei Zhai (University of Science and Technology of China)
Yang Cao (University of Science and Technology of China)
Zheng-Jun Zha (University of Science and Technology of China)