DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

📅 2025-03-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the semantic-perceptual representation conflict in autoregressive multimodal large language models (MLLMs), where a shared single-codebook tokenizer compromises both visual understanding and generation, this paper proposes DualToken, a dual-codebook visual tokenizer. DualToken decouples low-level perceptual representations (optimized for reconstruction) from high-level semantic representations (driven by contrastive learning) through feature-level separation and joint distillation, so the two objectives are optimized synergistically rather than in competition. Unlike conventional single-codebook approaches that force a trade-off between detail fidelity and semantic abstraction, DualToken improves understanding and generation simultaneously within a unified autoregressive framework. Experiments show state-of-the-art results on both visual reconstruction metrics (e.g., LPIPS, FID) and semantic understanding tasks (e.g., VQA, image captioning), significantly outperforming naive two-encoder concatenation baselines.
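
To make the dual-codebook design concrete, here is a minimal PyTorch sketch of what such a tokenizer interface could look like. It assumes an encoder that exposes a shallow (perceptual) and a deep (semantic) feature map; all class names, codebook sizes, and dimensions are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbor lookup into a learnable codebook (VQ-VAE style)."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (B, N, D) continuous features -> (token ids, quantized features)
        w = self.codebook.weight                 # (K, D)
        dist = (
            z.pow(2).sum(-1, keepdim=True)       # (B, N, 1)
            - 2 * z @ w.t()                      # (B, N, K)
            + w.pow(2).sum(-1)                   # (K,) broadcast
        )
        ids = dist.argmin(dim=-1)                # (B, N) discrete token ids
        z_q = self.codebook(ids)                 # (B, N, D)
        # Straight-through estimator: gradients bypass the argmin.
        z_q = z + (z_q - z).detach()
        return ids, z_q


class DualCodebookTokenizer(nn.Module):
    """Quantizes shallow (perceptual) and deep (semantic) features separately."""

    def __init__(self, encoder: nn.Module, dim: int = 768,
                 pixel_codes: int = 16384, semantic_codes: int = 8192):
        super().__init__()
        # `encoder` is assumed to return (shallow, deep) feature maps,
        # e.g. intermediate and final layers of a ViT.
        self.encoder = encoder
        self.vq_pixel = VectorQuantizer(pixel_codes, dim)        # low-level branch
        self.vq_semantic = VectorQuantizer(semantic_codes, dim)  # high-level branch

    def forward(self, images: torch.Tensor):
        shallow, deep = self.encoder(images)     # each (B, N, D)
        return self.vq_pixel(shallow), self.vq_semantic(deep)
```

The key design choice this sketch illustrates is that each feature level gets its own codebook, so reconstruction pressure on the shallow codes never competes with contrastive alignment pressure on the deep codes.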

📝 Abstract
The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation, but it lacks the high-level semantic representations needed for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. Directly integrating reconstruction and semantic objectives in a single tokenizer, however, creates conflicts, degrading both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high-level and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.
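
Under the same assumed interface as the sketch above, the two branches can be trained with separate objectives: a pixel reconstruction loss on the low-level codes and a distillation loss against a frozen, contrastively trained teacher on the high-level codes. The function below is a hedged sketch of one plausible combination; the teacher, loss forms, and weights are assumptions, and a full tokenizer would also need VQ commitment/codebook terms and, typically, perceptual and adversarial losses for reconstruction quality.

```python
import torch
import torch.nn.functional as F


def dual_token_loss(images, tokenizer, pixel_decoder, semantic_teacher,
                    w_rec: float = 1.0, w_sem: float = 1.0):
    """Per-branch objectives: reconstruction for the low-level codebook,
    teacher distillation for the high-level codebook (weights assumed)."""
    (_, pix_q), (_, sem_q) = tokenizer(images)

    # Perceptual branch: decode the low-level codes back to pixels.
    recon = pixel_decoder(pix_q)
    loss_rec = F.l1_loss(recon, images)

    # Semantic branch: match features of a frozen CLIP-style teacher.
    with torch.no_grad():
        target = semantic_teacher(images)  # (B, N, D), same layout as sem_q
    loss_sem = 1.0 - F.cosine_similarity(sem_q, target, dim=-1).mean()

    return w_rec * loss_rec + w_sem * loss_sem
```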
Problem

Research questions and friction points this paper is trying to address.

Unify visual understanding and generation tasks
Resolve conflicts between reconstruction and semantic objectives
Enhance performance in multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

DualToken unifies visual understanding and generation
Separate codebooks for high and low-level features
Achieves state-of-the-art in reconstruction and semantics
Wei Song
Baichuan Inc., Westlake University, Zhejiang University
Yuran Wang
Baichuan Inc., Wuhan University
Zijia Song
Westlake University
Yadong Li
Baichuan Inc.
Haoze Sun
Tsinghua University
Low-level image processing, Image super-resolution, Diffusion generation model
Weipeng Chen
Baichuan Inc.
Zenan Zhou
Baichuan Inc.
Jianhua Xu
University of Electronic Science and Technology of China
Multi-Agent, Evolutionary Games, LLM-Agents
Jiaqi Wang
Shanghai AI Laboratory, Shanghai Innovation Institute
Kaicheng Yu
Assistant Professor, Westlake University, PI of Autonomous Intelligence Lab
computer vision, 3D understanding, autonomous perception, automatic machine learning