AI Summary
Existing image tokenizers rely on bidirectional context encoding, which mismatches the unidirectional, autoregressive nature of generative models and leads to generation misalignment. To address this, we propose AliTok, the first causal decoder-driven, alignment-aware tokenizer. AliTok enforces causal token encoding via a prefix-token mechanism and a two-stage training paradigm, enabling end-to-end alignment between the tokenizer and autoregressive sequence modeling. Evaluated on ImageNet-256, our 177M-parameter model achieves gFID = 1.50 and IS = 305.9; the 662M-parameter variant attains gFID = 1.35 and samples 10x faster than state-of-the-art diffusion models. This work establishes a co-design paradigm for tokenizers and generative models, bridging architectural and modeling objectives across the full generative pipeline.
Abstract
Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during compression, which hinders effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which uses a causal decoder to establish unidirectional dependencies among the encoded tokens, thereby aligning the token modeling approach of the tokenizer with that of the autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves strong reconstruction performance while remaining generation-friendly. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling. The code and weights are available at https://github.com/ali-vilab/alitok.
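The central structural idea, unidirectional (causal) dependencies among encoded tokens, can be illustrated with a minimal causal self-attention sketch. This is a plain-NumPy illustration of the masking that gives a causal decoder its left-to-right dependency structure, not the actual AliTok implementation; all names here are hypothetical.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a lower-triangular (causal) mask,
    so token i can only attend to tokens 0..i — the unidirectional
    dependency structure that next-token prediction assumes."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions j > i
    scores[mask] = -np.inf                            # forbid looking ahead
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy check of causality on a random token sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = causal_attention(x, x, x)

# Perturbing a later token must not change earlier outputs.
x2 = x.copy()
x2[3] += 1.0
out2 = causal_attention(x2, x2, x2)
assert np.allclose(out[:3], out2[:3])
```

A bidirectional encoder (no mask on `scores`) would fail the final assertion: every output token would depend on every input token, which is the mismatch with autoregressive generation that a causal tokenizer is designed to remove.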