🤖 AI Summary
This work addresses hallucination snowballing in multimodal large language models (MLLMs) during multi-turn dialogues, where initial hallucinations cascade across turns, causing the model to neglect visual evidence and lose conversational coherence. The authors introduce MM-Snowball, the first benchmark for fine-grained diagnosis of this problem, and propose Conflict-Aware Visual Rectification (CAVR), a training-free method that suppresses error propagation by refreshing visual grounding at the representation level and rectifying output distributions at the logit level. Experiments show that CAVR achieves state-of-the-art performance on the new benchmark, improving the reliability and visual grounding of models in long-horizon, multi-turn interactions.
📝 Abstract
Multimodal large language models (MLLMs) demonstrate remarkable visual understanding, yet their reliability in interactive settings is severely undermined by hallucination snowballing: a phenomenon in which initial errors amplify across conversational turns, leading to a collapse in coherence. This failure reveals a fundamental vulnerability: models progressively neglect visual grounding and over-rely on polluted textual history. Existing benchmarks are predominantly confined to single-turn VQA and therefore fail to capture the complex dynamics of error propagation in long-horizon interactions. To address this, we introduce MM-Snowball, the first benchmark for fine-grained diagnosis of hallucination snowballing within dialogues. Extensive evaluation shows that our benchmark poses a significant challenge even to advanced MLLMs and reveals that existing mitigation methods designed for single-turn VQA are ineffective in this setting. To counteract this degradation, we propose Conflict-Aware Visual Rectification (CAVR). This training-free method mitigates snowballing through a dual mechanism that refreshes visual grounding at the representation level and rectifies output distributions at the logit level, effectively re-anchoring the model to visual facts. Experiments demonstrate that CAVR achieves state-of-the-art performance, offering a promising path toward more reliable interactive AI. Data and code are available at: https://frenkie-chiang.github.io/MM-Snowball
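The abstract does not spell out how the logit-level rectification works, so the following is only an illustrative sketch of the general idea, not the authors' CAVR implementation. It swaps in a generic contrastive-decoding-style adjustment: next-token logits from a pass that sees the image are contrasted against logits from a pass that sees only the dialogue history, and the adjustment is applied only when the two distributions disagree. The function names, the Jensen-Shannon conflict test, and the `alpha` and `threshold` parameters are all assumptions introduced here for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def rectify_logits(visual_logits, history_logits, alpha=1.0, threshold=0.05):
    """Hypothetical conflict-gated contrastive adjustment (an assumption,
    not the method from the paper).

    visual_logits:  next-token logits conditioned on image + dialogue history
    history_logits: next-token logits conditioned on dialogue history only
    alpha:          contrast strength (illustrative parameter)
    threshold:      minimum divergence treated as a conflict (illustrative)
    """
    conflict = js_divergence(softmax(visual_logits), softmax(history_logits))
    if conflict < threshold:
        # The two passes agree, so the logits are left unchanged.
        return list(visual_logits)
    # Amplify what the image contributes relative to the text-only prior.
    return [(1 + alpha) * v - alpha * h
            for v, h in zip(visual_logits, history_logits)]


if __name__ == "__main__":
    with_image = [1.0, 1.2, 0.0]    # toy 3-token vocabulary
    history_only = [3.0, 0.0, 0.0]  # text history strongly favors token 0
    print(rectify_logits(with_image, history_only))
```

In this toy call the history-only pass strongly prefers token 0 while the image-conditioned pass does not, so the gate fires and the adjustment pushes probability mass away from the token favored by the text history alone. Gating on divergence is one possible reading of "conflict-aware"; the released code at the URL above is the authoritative source for the actual mechanism.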