🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from coarse-grained text-image alignment, weak generalization across diverse multimodal tasks, and insufficient coherence in text-image interleaved generation. To address these limitations, we propose M²Chat, a unified framework featuring three core innovations: (1) the M³Adapter, a learnable gated adapter enabling fine-grained visual-semantic alignment; (2) a two-stage M³FT fine-tuning strategy that decouples model parameters to dynamically balance creative generation and semantic fidelity; and (3) a multimodal prompt fusion mechanism for holistic cross-modal representation integration. Evaluated on three challenging benchmarks—interleaved text-image generation, multi-turn multimodal dialogue, and visual storytelling—M²Chat achieves state-of-the-art performance across all tasks. It significantly improves cross-task robustness, high-fidelity alignment accuracy, and long-horizon generation coherence, demonstrating strong generalization in complex multimodal reasoning and generation scenarios.
📝 Abstract
While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generation, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose **$M^{2}Chat$**, a novel unified multimodal LLM framework for generating interleaved text-image conversations across various scenarios. Specifically, we propose an $M^{3}Adapter$ that efficiently integrates granular low-level visual information and high-level semantic features from multi-modality prompts. On top of the well-aligned fused features, the $M^{3}Adapter$ tailors a learnable gating strategy to adaptively balance model creativity and consistency across various tasks. Moreover, to further enhance the effectiveness of the $M^{3}Adapter$ while preserving the coherence of semantic context comprehension, we introduce a two-stage $M^{3}FT$ fine-tuning strategy. This strategy optimizes disjoint groups of parameters for image-text alignment and visual-instruction tuning, respectively. Extensive experiments demonstrate that $M^{2}Chat$ surpasses state-of-the-art counterparts across diverse benchmarks, showcasing its prowess in interleaved generation, storytelling, and multimodal dialogue systems. The demo and code are available at
https://mattie-e.github.io/M2Chat.github.io.
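To make the gating idea concrete, here is a minimal, hypothetical sketch of a learnable elementwise gate that blends low-level visual features with high-level semantic features. All names, shapes, and the sigmoid-gate formulation are illustrative assumptions for exposition only, not the paper's actual $M^{3}Adapter$ architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Learnable gate parameters (randomly initialised here; in practice
# these would be trained, e.g. during the alignment fine-tuning stage).
W = rng.standard_normal((2 * dim, dim)) * 0.1
b = np.zeros(dim)

def gated_fuse(low_level, semantic):
    """Blend two feature streams with an elementwise learned gate.

    g lies in (0, 1), so each output element is a convex combination
    of the corresponding low-level and semantic feature values.
    """
    z = np.concatenate([low_level, semantic], axis=-1) @ W + b
    g = 1.0 / (1.0 + np.exp(-z))          # sigmoid gate
    return g * low_level + (1.0 - g) * semantic

low = rng.standard_normal((4, dim))   # e.g. patch-level visual features
sem = rng.standard_normal((4, dim))   # e.g. LLM semantic features
fused = gated_fuse(low, sem)
```

Because the gate is a sigmoid, each fused value stays between the two input values, which is one simple way such an adapter could trade off consistency (leaning on semantic features) against creativity (leaning on low-level visual detail).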