🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from coarse-grained text-image alignment, weak generalization across diverse multimodal tasks, and insufficient coherence in text-image interleaved generation. To address these limitations, we propose M²Chat, a unified framework featuring three core innovations: (1) the M³Adapter, a learnable gated adapter enabling fine-grained visual-semantic alignment; (2) a two-stage M³FT fine-tuning strategy that decouples model parameters to dynamically balance creative generation and semantic fidelity; and (3) a multimodal prompt fusion mechanism for holistic cross-modal representation integration. Evaluated on three challenging benchmarks—interleaved text-image generation, multi-turn multimodal dialogue, and visual storytelling—M²Chat achieves state-of-the-art performance across all tasks. It significantly improves cross-task robustness, high-fidelity alignment accuracy, and long-horizon generation coherence, demonstrating strong generalization in complex multimodal reasoning and generation scenarios.
📝 Abstract
While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generation, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose **$M^{2}Chat$**, a novel unified multimodal LLM framework for generating interleaved text-image conversations across various scenarios. Specifically, we propose an $M^{3}Adapter$ that efficiently integrates granular low-level visual information and high-level semantic features from multi-modality prompts. On top of the well-aligned fused features, the $M^{3}Adapter$ tailors a learnable gating strategy to adaptively balance model creativity and consistency across various tasks. Moreover, to further enhance the effectiveness of the $M^{3}Adapter$ while preserving the coherence of semantic context comprehension, we introduce a two-stage $M^{3}FT$ fine-tuning strategy. This strategy optimizes disjoint groups of parameters for image-text alignment and visual-instruction tuning, respectively. Extensive experiments demonstrate that $M^{2}Chat$ surpasses state-of-the-art counterparts across diverse benchmarks, showcasing its prowess in interleaved generation, storytelling, and multimodal dialogue systems. The demo and code are available at
https://mattie-e.github.io/M2Chat.github.io.
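To make the gating idea concrete, here is a minimal, hypothetical sketch of a learnable elementwise gate that blends low-level visual features with high-level semantic features. All names, shapes, and the sigmoid-gate formulation are illustrative assumptions for exposition only, not the paper's actual $M^{3}Adapter$ architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Learnable gate parameters (randomly initialised here; in practice
# these would be trained, e.g. during the alignment fine-tuning stage).
W = rng.standard_normal((2 * dim, dim)) * 0.1
b = np.zeros(dim)

def gated_fuse(low_level, semantic):
    """Blend two feature streams with an elementwise learned gate.

    g lies in (0, 1), so each output element is a convex combination
    of the corresponding low-level and semantic feature values.
    """
    z = np.concatenate([low_level, semantic], axis=-1) @ W + b
    g = 1.0 / (1.0 + np.exp(-z))          # sigmoid gate
    return g * low_level + (1.0 - g) * semantic

low = rng.standard_normal((4, dim))   # e.g. patch-level visual features
sem = rng.standard_normal((4, dim))   # e.g. LLM semantic features
fused = gated_fuse(low, sem)
```

Because the gate is a sigmoid, each fused value stays between the two input values, which is one simple way such an adapter could trade off consistency (leaning on semantic features) against creativity (leaning on low-level visual detail).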