🤖 AI Summary
This work addresses the complementary limitations of continuous (VAE) and discrete (VQ) visual tokenizers: VAEs offer high-fidelity reconstruction but produce dense, entangled latents that limit semantic control, while VQ-based approaches enable autoregressive generation yet suffer from gradient sparsity, training instability, and codebook collapse. To bridge this gap, the authors propose MergeTok, a unified tokenizer that jointly optimizes a VAE branch and a VQ branch within one encoder-decoder architecture and uses token merging as a semantic bridge between them. Clustering similar tokens during encoding yields two supervision signals: merged-token semantic alignment, which regularizes the VAE latent space toward disentangled representations, and group-wise constraints promoting intra-group diversity and inter-group exclusivity, which stabilize VQ training. On ImageNet-256, under matched token budgets, MergeTok reports substantially lower rFID than strong VAE and VQ baselines and yields semantically organized token representations compatible with both autoregressive and diffusion generators.
📝 Abstract
Visual tokenizers for image generation largely fall into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within a single encoder-decoder architecture, using token merging as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides two supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints that promote intra-group diversity and inter-group exclusivity, which stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. These results indicate that a single architecture can endow visual tokenizers with both robust semantic organization and generator-friendly discreteness.
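To make the idea of "clustering similar tokens during encoding" concrete, the following is a minimal sketch of one possible token-merging step. It is an illustration only: the abstract does not specify the merging algorithm, so the cosine k-means grouping, the function name `merge_tokens`, and the parameters `num_groups` and `num_iters` are all assumptions rather than the MergeTok procedure.

```python
import numpy as np


def merge_tokens(tokens, num_groups, num_iters=10):
    """Group similar tokens by cosine similarity and average each group.

    Hypothetical helper for illustration; NOT the MergeTok algorithm.
    tokens: array of shape (N, D). Returns (merged, assignment), where
    merged has shape (num_groups, D) and assignment has shape (N,).
    """
    tokens = np.asarray(tokens, dtype=float)
    # Normalize rows so that dot products equal cosine similarity.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)

    # Assumed deterministic init: evenly spaced tokens act as initial centroids.
    init_idx = np.linspace(0, len(tokens) - 1, num_groups).astype(int)
    centroids = unit[init_idx]  # fancy indexing returns a copy

    for _ in range(num_iters):
        # Assign every token to its most similar centroid.
        assignment = np.argmax(unit @ centroids.T, axis=1)
        # Recompute each centroid as the normalized mean of its members.
        for g in range(num_groups):
            members = unit[assignment == g]
            if len(members) > 0:
                mean = members.mean(axis=0)
                centroids[g] = mean / (np.linalg.norm(mean) + 1e-8)

    # Final assignment against the updated centroids.
    assignment = np.argmax(unit @ centroids.T, axis=1)

    # A merged token is the mean of the original tokens in its group;
    # empty groups (possible in this simple sketch) are left as zeros.
    merged = np.zeros((num_groups, tokens.shape[1]))
    for g in range(num_groups):
        mask = assignment == g
        if mask.any():
            merged[g] = tokens[mask].mean(axis=0)
    return merged, assignment
```

In a setup like the one the abstract describes, the group assignment would be the structural prior: the merged tokens could serve as alignment targets for the continuous branch, and the group membership could define which codes should differ within a group and which should not be shared across groups. How MergeTok actually formulates those losses is not stated here.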