MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses hallucination snowballing in multimodal large language models during multi-turn dialogue, where an initial hallucination cascades across turns until the model neglects visual evidence and the conversation loses coherence. The authors introduce MM-Snowball, which they describe as the first benchmark for fine-grained diagnosis of this problem, and propose Conflict-Aware Visual Rectification (CAVR), a training-free method that suppresses error propagation by refreshing visual representations at the feature level and correcting output distributions at the logit level. The authors report that CAVR outperforms existing mitigation approaches on the new benchmark and achieves state-of-the-art performance, improving both reliability and visual grounding in long-horizon, multi-turn interactions.

📝 Abstract

Multimodal large language models (MLLMs) demonstrate remarkable visual understanding, yet their reliability in interactive settings is severely undermined by hallucination snowballing: a phenomenon in which initial errors amplify across conversational turns, leading to a collapse in coherence. This failure reveals a fundamental vulnerability: models progressively neglect visual grounding and over-rely on polluted textual history. Existing benchmarks are predominantly confined to single-turn VQA and therefore fail to capture the complex dynamics of error propagation in long-horizon interactions. To address this, we introduce MM-Snowball, the first benchmark for fine-grained diagnosis of hallucination snowballing within dialogues. Extensive evaluation shows that our benchmark poses a significant challenge even to advanced MLLMs and reveals the inefficacy of existing mitigation methods designed for single-turn VQA. To counteract this degradation, we propose Conflict-Aware Visual Rectification (CAVR). This training-free method mitigates snowballing through a dual mechanism that refreshes visual grounding at the representation level and rectifies output distributions at the logit level, effectively re-anchoring the model to visual facts. Experiments demonstrate that CAVR achieves state-of-the-art performance, offering a promising path toward more reliable interactive AI. Data and code are available at: https://frenkie-chiang.github.io/MM-Snowball
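
The abstract does not spell out how CAVR rectifies output distributions, so the following is only a rough illustration of what a logit-level, conflict-gated correction could look like. It swaps in a generic contrastive-decoding-style adjustment, not the authors' formulation. The names `rectify_logits`, `visual_logits`, `history_logits`, `alpha`, and `threshold` are all hypothetical: `visual_logits` stands for next-token scores from a pass that sees the image, and `history_logits` for scores from a pass that sees only the (possibly polluted) dialogue history.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions, in nats."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mix = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def rectify_logits(visual_logits, history_logits, alpha=1.0, threshold=0.05):
    """Hypothetical conflict-gated correction (an assumption, not CAVR itself).

    If the image-conditioned and history-only predictions roughly agree,
    the visual logits are returned unchanged. If they conflict, the
    history-only prior is subtracted so that tokens supported mainly by
    the textual history are down-weighted.
    """
    conflict = js_divergence(softmax(visual_logits), softmax(history_logits))
    if conflict < threshold:
        return list(visual_logits)
    return [(1 + alpha) * v - alpha * h
            for v, h in zip(visual_logits, history_logits)]


# Toy vocabulary of three tokens: the history-only pass strongly prefers
# token 1, while the image-conditioned pass prefers token 0.
adjusted = rectify_logits([2.0, 1.0, 0.0], [0.0, 3.0, 0.0])
print(adjusted)
```

In this toy case the two passes disagree, so the adjustment widens the margin of the visually supported token over the one favored by the history. The gate (`threshold`) and the strength (`alpha`) are illustrative knobs with arbitrary default values.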

Problem

Research questions and friction points this paper is trying to address.

hallucination snowballing

multimodal dialogue

visual grounding

error propagation

multimodal large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination snowballing

multimodal dialogue

visual grounding