AI Summary
To address the conflict between semantic and perceptual representations in autoregressive multimodal large language models (MLLMs), where a single shared codebook for tokenization compromises both visual understanding and generation, this paper proposes DualToken, a dual-codebook visual tokenizer. DualToken decouples low-level perceptual representations (optimized for reconstruction) from high-level semantic representations (driven by contrastive learning) via feature-level separation and joint distillation, turning their interference into synergistic optimization. Unlike conventional single-codebook approaches that trade detail fidelity against semantic abstraction, DualToken enables simultaneous improvement of both understanding and generation within a unified autoregressive framework. Experiments demonstrate that DualToken achieves state-of-the-art performance on visual reconstruction metrics (e.g., LPIPS, FID) and semantic understanding tasks (e.g., VQA, image captioning), significantly outperforming baselines that naively concatenate two separate encoders.
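To make the dual-codebook idea concrete, below is a minimal PyTorch sketch of a tokenizer with two vector-quantization codebooks: one quantizing shallow (perceptual) features and one quantizing deep (semantic) features. All names, layer counts, and the layer at which the branches split (`split`) are illustrative assumptions rather than the paper's actual configuration; the pixel decoder and VQ commitment losses are omitted for brevity.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Plain VQ layer: snap each feature vector to its nearest codebook entry.
    Commitment/codebook losses are omitted to keep the sketch short."""
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                # z: (B, N, D)
        flat = z.flatten(0, 1)                           # (B*N, D)
        dists = torch.cdist(flat, self.codebook.weight)  # (B*N, K)
        ids = dists.argmin(dim=-1).view(z.shape[:-1])    # (B, N) discrete token ids
        zq = self.codebook(ids)                          # quantized features
        zq = z + (zq - z).detach()                       # straight-through gradient
        return zq, ids

class DualCodebookTokenizer(nn.Module):
    """Hypothetical sketch: shallow features feed a perceptual codebook
    (for pixel decoding), deep features feed a semantic codebook
    (for language alignment)."""
    def __init__(self, dim=512, depth=12, split=6,
                 low_codes=8192, high_codes=8192):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth))
        self.split = split                               # where the two branches separate
        self.vq_low = VectorQuantizer(low_codes, dim)    # perceptual codebook
        self.vq_high = VectorQuantizer(high_codes, dim)  # semantic codebook

    def forward(self, images):                           # images: (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        z_low = ids_low = None
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i + 1 == self.split:
                # low-level branch: quantize shallow features for reconstruction
                z_low, ids_low = self.vq_low(x)
        # high-level branch: quantize deep features for semantic alignment
        z_high, ids_high = self.vq_high(x)
        return (z_low, ids_low), (z_high, ids_high)
```

In a downstream MLLM, the high-level token ids would serve understanding tasks (aligned with language), while the low-level ids would be the natural targets for autoregressive image generation; this split is what lets one tokenizer serve both roles.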
Abstract
The differing representation spaces required for visual understanding and generation pose a challenge to unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well suited for visual generation, but it lacks the high-level semantic representations needed for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, degrading both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high-level and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.
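As a hedged illustration of how the two objectives can coexist once the codebooks are separated, the sketch below combines a pixel-reconstruction loss on the low-level branch with a distillation-style alignment loss on the high-level branch. The loss weights, the MSE and cosine forms, and the CLIP-style teacher are assumptions made for illustration; the paper's exact training objectives may differ.

```python
import torch.nn.functional as F

def dual_token_losses(pixels, recon, z_high, teacher_feats,
                      w_rec=1.0, w_sem=1.0):
    """Hypothetical joint objective: reconstruction trains the low-level
    branch, while distillation toward a pretrained semantic teacher
    (e.g. a CLIP-style encoder) trains the high-level branch.
    Weights and loss forms are illustrative assumptions."""
    loss_rec = F.mse_loss(recon, pixels)  # pixel fidelity for generation
    # cosine distillation: pull semantic tokens toward teacher features
    loss_sem = 1.0 - F.cosine_similarity(z_high, teacher_feats, dim=-1).mean()
    return w_rec * loss_rec + w_sem * loss_sem
```

Because each loss acts on its own codebook, the gradients no longer compete for the same discrete vocabulary, which is the mechanism by which the abstract's "inherent conflict" becomes a synergistic relationship.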