AI Summary
This work addresses the challenge of establishing a unified representation framework for multimodal understanding, generation, and reconstruction. To this end, we propose a vector-quantized representation autoencoder (VQRAE) architecture that jointly models continuous semantic features and discrete generative tokens within a single tokenizer. Our key contributions are threefold: (1) the first unified tokenizer that simultaneously enables semantic understanding and high-fidelity visual generation; (2) a high-dimensional semantic codebook achieving 100% codebook utilization; and (3) a symmetric Vision Transformer (ViT) decoder coupled with a two-stage training strategy: first freezing the encoder to learn the codebook, then jointly optimizing the full model via self-distillation. Leveraging pretrained vision foundation models, our approach achieves state-of-the-art performance across diverse understanding, generation, and reconstruction benchmarks, while balancing fine-grained reconstruction fidelity, generation efficiency, and model scalability.
Abstract
Unifying representations for multimodal understanding, generation, and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research predominantly addresses this with a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations against low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration of a unified representation that produces Continuous semantic features for image understanding and Discrete tokens for visual generation within a single tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, we freeze the encoder and learn a high-dimensional semantic VQ codebook with a pixel reconstruction objective; then we jointly optimize the encoder under self-distillation constraints. This design preserves semantic information with negligible loss, maintaining multimodal understanding ability while yielding discrete tokens compatible with generation and fine-grained reconstruction. Moreover, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE delivers competitive performance on several visual understanding, generation, and reconstruction benchmarks, and its discrete tokens give it promising scaling properties in the autoregressive paradigm.
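To make the quantization step concrete, the following is a minimal NumPy sketch of nearest-neighbor vector quantization with a high-dimensional codebook, as described in the abstract: continuous encoder features are mapped to their closest codebook entry, yielding discrete token indices for generation and quantized vectors for reconstruction. The codebook size (8192) and token count (256) are illustrative assumptions; only the dimension of 1536 comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8192 codebook entries (assumed) at the paper's
# stated dimension of 1536; 256 continuous patch features (assumed).
codebook = rng.normal(size=(8192, 1536))
features = rng.normal(size=(256, 1536))

# Squared Euclidean distance ||z - e||^2 = ||z||^2 - 2 z.e + ||e||^2,
# computed for every (feature, codebook entry) pair via broadcasting.
d = (
    (features ** 2).sum(axis=1, keepdims=True)  # (256, 1)
    - 2.0 * features @ codebook.T               # (256, 8192)
    + (codebook ** 2).sum(axis=1)               # (8192,)
)

indices = d.argmin(axis=1)      # discrete tokens, usable autoregressively
quantized = codebook[indices]   # continuous vectors fed to the decoder

print(indices.shape, quantized.shape)  # (256,) (256, 1536)
```

In a trained tokenizer the features would come from the frozen semantic encoder and the codebook would be learned; this sketch only shows the lookup that turns continuous features into discrete tokens.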