IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

📅 2026-06-08
🤖 AI Summary
Existing benchmarks are largely limited to single-turn or static settings, so they cannot evaluate how unified multimodal models (UMMs) understand and generate content in dynamic, multi-turn interleaved image-text dialogues, and they typically overlook exposure bias in multi-turn generation. This work proposes IMUG-Bench, a comprehensive benchmark for this setting that covers three task classes: Static Spatial, Temporal Causal, and Hybrid. It comprises 3,113 samples and 12,034 interaction turns, jointly evaluates understanding and generation, and includes dynamic understanding questions to better reflect real-world multi-turn interaction. Large-scale experiments evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes and uncovering pronounced exposure bias on the generation side. Test-time scaling strategies (Chain-of-Thought, Self-Verification, and Best-of-N Sampling) are reported to improve generation accuracy and mitigate exposure bias.
📝 Abstract
In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.
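As a rough illustration of one of the test-time scaling strategies named in the abstract, the sketch below shows Best-of-N Sampling applied turn by turn in a multi-turn dialogue. It is a minimal sketch under stated assumptions, not the paper's implementation: `generate`, `score`, `best_of_n`, and `run_dialogue` are hypothetical names, and the abstract does not specify how candidates are scored or how history is formatted.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   generate(prompt, rng) -> one candidate output for the current turn
#   score(prompt, candidate) -> higher means the candidate is judged better
Generate = Callable[[str, random.Random], str]
Score = Callable[[str, str], float]


def best_of_n(generate: Generate, score: Score, prompt: str,
              n: int = 4, seed: int = 0) -> Tuple[str, float]:
    """Draw n candidates for one turn and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def run_dialogue(generate: Generate, score: Score, turns: List[str],
                 n: int = 4, seed: int = 0) -> List[str]:
    """Multi-turn loop in which each turn conditions on the model's own
    previously selected outputs, so an early error can propagate."""
    history = ""
    outputs = []
    for i, turn in enumerate(turns):
        prompt = history + turn
        best, _ = best_of_n(generate, score, prompt, n=n, seed=seed + i)
        outputs.append(best)
        # Assumed history format; the selected output feeds the next turn.
        history = prompt + " -> " + best + "\n"
    return outputs
```

Because each turn is conditioned on the previously selected output rather than on a reference answer, picking a better candidate at turn t also changes what the model sees at turn t+1. This is one plausible reading of why such a strategy could reduce error accumulation across turns.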
Problem

Research questions and friction points this paper is trying to address.

unified multimodal models
interleaved image-text dialogue
multi-turn interaction
exposure bias
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Multimodal Models
Interleaved Image-Text Dialogue
Exposure Bias
Multi-turn Interaction
Test-time Scaling