ChannelTok: Efficient Flexible-Length Vision Tokenization

📅 2026-06-03
🤖 AI Summary
Existing flexible-length visual tokenizers reach high quality only with parameter-heavy backbones and slow, multi-step generative decoders. This work proposes a lightweight channel-wise tokenizer that treats each latent channel as a visual token, built on a parameter-efficient CNN-Transformer hybrid backbone. Stochastic tail-dropping during training pushes channels to order themselves by semantic importance, so inference can compress adaptively by keeping only the first k channels, which also enables variable-length autoregressive image generation. On ImageNet, the model reaches state-of-the-art perceptual quality (rFID 2.92) while decoding 8.6× faster and being 2.1× smaller (159M parameters) than the next-best alternative.
📝 Abstract
Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: https://channeltok.github.io
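The tail-dropping idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `tail_drop` and `keep_first_k`, the uniform sampling of the cut point, and the choice to zero out dropped channels are all assumptions made for illustration, and each channel is represented as a flat list of floats rather than a real feature map.

```python
import random


def tail_drop(channels, rng, min_keep=1):
    """Training-time sketch: keep a random-length prefix of channels.

    `channels` is a list of per-channel value lists. Zeroing the tail and
    sampling the cut point uniformly are assumptions, not details from
    the paper.
    """
    num_channels = len(channels)
    # randint is inclusive on both ends, so k ranges over [min_keep, C].
    k = rng.randint(min_keep, num_channels)
    kept = [list(ch) for ch in channels[:k]]
    dropped = [[0.0] * len(ch) for ch in channels[k:]]
    return kept + dropped, k


def keep_first_k(channels, k):
    """Inference-time sketch: truncate to the first k channel tokens."""
    return [list(ch) for ch in channels[:k]]


rng = random.Random(0)
latent = [[float(c)] * 4 for c in range(1, 9)]  # 8 channels, 4 values each
masked, k = tail_drop(latent, rng)   # training: random tail removed
tokens = keep_first_k(latent, 3)     # inference: fixed budget of 3 tokens
```

Because later channels are dropped more often than earlier ones during training, the earliest channels are the ones the decoder can always rely on, which is what makes truncating to a prefix at inference reasonable.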
Problem

Research questions and friction points this paper is trying to address.

flexible vision tokenization
efficiency
parameter-heavy backbones
slow generative decoders
visual representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

channel-wise tokenization
flexible-length vision tokenizer
CNN-Transformer hybrid
stochastic tail-dropping
efficient visual representation
🔎 Similar Papers