MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 3
✨ Influential: 0
๐Ÿ“„ PDF
🤖 AI Summary
Current unified vision-language models suffer from high training complexity, heavy data dependency, and inferior comprehension performance compared to specialized models, primarily because conventional visual tokenizers (e.g., VQGAN) capture only low-level visual features, hindering semantic alignment with language tokens. To address this, we propose Semantic Discrete Encoding (SDE), a novel visual tokenization framework that explicitly incorporates semantic constraints into visual discretization, enabling the first *semantic-level* alignment between visual and linguistic tokens. Integrated with joint vision-language tokenization, a unified multimodal Transformer architecture, and LLM-guided cross-modal joint training, our approach significantly reduces data requirements and improves joint modeling efficiency. Experiments demonstrate that our model achieves a 4.8% comprehension gain over Emu3 and outperforms LLaVA-NeXT 34B by 3.7%; it also surpasses existing unified models in generative capability.

๐Ÿ“ Abstract
We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with language tokens. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces the amount of training data and improves the performance of the unified model. With the same LLM size, our method improved the understanding performance by 4.8% compared to the previous SOTA Emu3 and surpassed the dedicated understanding model LLaVA-NeXT 34B by 3.7%. Our model also surpasses the existing unified models on visual generation benchmarks.
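The abstract's core idea is adding semantic constraints to an otherwise standard VQGAN-style discrete tokenizer so that visual codes align with language tokens. A minimal sketch of how such a constraint could combine with nearest-neighbour vector quantization; the loss form, weighting `alpha`, and all function names here are illustrative assumptions, not taken from the paper:

```python
import math

def quantize(features, codebook):
    """Nearest-neighbour vector quantization: map each continuous visual
    feature vector to the index of its closest codebook entry."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist2(f, codebook[k]))
            for f in features]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def sde_loss(features, codebook, text_embeds, alpha=0.5):
    """Hypothetical SDE-style objective: a standard VQ reconstruction term
    plus a semantic term pulling each quantized code toward a paired text
    embedding (a sketch of the idea, not the paper's exact objective)."""
    idx = quantize(features, codebook)
    quantized = [codebook[i] for i in idx]
    # VQ term: mean squared distance between features and their codes
    vq_term = sum(sum((x - y) ** 2 for x, y in zip(f, q))
                  for f, q in zip(features, quantized)) / len(features)
    # semantic term: mean cosine distance to the paired text embedding
    sem_term = sum(1.0 - cosine(q, t)
                   for q, t in zip(quantized, text_embeds)) / len(features)
    return vq_term + alpha * sem_term
```

The semantic term is what distinguishes this from a plain VQGAN objective: codes are rewarded not just for reconstructing pixels but for landing near language-space embeddings.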
Problem

Research questions and friction points this paper is trying to address.

Existing vision tokenizers (e.g., VQGAN) capture only low-level features, making alignment with language tokens difficult
Unified models incur high training complexity and require large amounts of training data
Unified models' understanding performance still lags behind dedicated understanding models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Discrete Encoding (SDE) adds semantic constraints to the visual tokenizer, aligning visual and language tokens
Greatly reduces required training data while improving unified-model performance
Outperforms Emu3 by 4.8% and LLaVA-NeXT 34B by 3.7% on understanding, and surpasses existing unified models on generation benchmarks
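The "unified model" claim rests on text tokens and discrete visual tokens sharing one autoregressive vocabulary, so a single LLM can both understand and generate images. A hedged sketch of how discrete image codes might be merged into a text token stream; the vocabulary layout and begin/end-of-image marker ids are illustrative assumptions:

```python
def build_multimodal_sequence(text_ids, image_token_ids, text_vocab_size,
                              boi_id, eoi_id):
    """Merge text tokens and discrete image tokens into one sequence for a
    single autoregressive Transformer: image codes are offset past the text
    vocabulary and wrapped in begin/end-of-image markers (ids illustrative)."""
    image_ids = [text_vocab_size + t for t in image_token_ids]
    return text_ids + [boi_id] + image_ids + [eoi_id]
```

With this layout, next-token prediction over the combined sequence trains understanding (text after image) and generation (image after text) with one objective.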