MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the trade-off between continuous (VAE) and discrete (VQ) visual tokenizers: VAEs offer high-fidelity reconstruction but limited semantic control, whereas VQ-based approaches enable autoregressive generation yet suffer from training instability and codebook collapse. To bridge this gap, we propose MergeTok, which introduces a semantic-aware token merging mechanism into an encoder-decoder architecture, where it acts as a shared source of supervision for the continuous and discrete representations. The merged-token structure regularizes the VAE latent space toward disentangled representations, while group-wise constraints that promote intra-group diversity and inter-group exclusivity stabilize VQ training, yielding both high-fidelity reconstruction and generation-friendly discreteness within a unified framework. Experiments on ImageNet-256 show that, under matched token budgets, MergeTok achieves substantially lower rFID than strong VAE and VQ baselines and yields semantically organized token representations compatible with both autoregressive and diffusion generators.

📝 Abstract

Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within an encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.
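
To make "clustering similar tokens during encoding" concrete, the following is a minimal sketch of one common token-merging scheme: bipartite matching on cosine similarity, followed by averaging. The abstract does not specify which merging scheme MergeTok uses, so the function name `merge_tokens`, the even/odd split, and the merge count `r` are all illustrative assumptions rather than the authors' method. The returned group assignment is the kind of structure from which group-wise constraints could be derived.

```python
import numpy as np


def merge_tokens(tokens: np.ndarray, r: int):
    """Merge up to r tokens into their most similar partner (hypothetical sketch).

    tokens: (N, D) array of token embeddings.
    Returns (merged, assignment): merged has N - r rows, and assignment[i]
    is the row of `merged` that input token i ended up in.
    """
    tokens = np.asarray(tokens, dtype=float)
    n = tokens.shape[0]
    if n < 2 or r <= 0:
        return tokens.copy(), np.arange(n)

    # Assumption: split tokens into two sets by alternating position.
    a_idx = np.arange(0, n, 2)
    b_idx = np.arange(1, n, 2)
    r = min(r, len(a_idx))

    # Cosine similarity between every A token and every B token.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit[a_idx] @ unit[b_idx].T
    best_b = sim.argmax(axis=1)   # most similar B token for each A token
    best_sim = sim.max(axis=1)

    # The r A tokens with the highest similarity are merged; the rest are kept.
    order = np.argsort(-best_sim)
    merged_a = order[:r]
    kept_a = np.sort(order[r:])

    # Average each B token with the A tokens merged into it.
    sums = tokens[b_idx].copy()
    counts = np.ones(len(b_idx))
    np.add.at(sums, best_b[merged_a], tokens[a_idx[merged_a]])
    np.add.at(counts, best_b[merged_a], 1)
    b_merged = sums / counts[:, None]

    merged = np.concatenate([tokens[a_idx[kept_a]], b_merged], axis=0)

    # Record which merged row (group) each input token belongs to.
    assignment = np.empty(n, dtype=int)
    assignment[a_idx[kept_a]] = np.arange(len(kept_a))
    assignment[b_idx] = len(kept_a) + np.arange(len(b_idx))
    assignment[a_idx[merged_a]] = len(kept_a) + best_b[merged_a]
    return merged, assignment
```

Calling `merge_tokens(x, r)` on an `(N, D)` array reduces it to `N - r` tokens and reports the group each original token joined. Averaging is only one possible merge rule; a weighted mean by group size is a common alternative.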

Problem

Research questions and friction points this paper is trying to address.

visual tokenization

continuous VAE

discrete VQ

semantic control

codebook collapse

Innovation

Methods, ideas, or system contributions that make the work stand out.

token merging

unified tokenization

semantic disentanglement