AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio-visual large language models (AV-LLMs) suffer from hallucinations induced by unimodal mismatches and suboptimal cross-modal interactions. Method: AVCD is a training-free contrastive decoding framework that combines a modality-aware dynamic attention masking mechanism, which reformulates contrastive decoding for joint audio-video-text inputs, with an entropy-guided adaptive skip-step decoding strategy. Results: On the AVHBench benchmark, AVCD improves accuracy by 6% for VideoLLaMA2 and 11% for video-SALMONN, outperforming existing decoding methods while demonstrating strong robustness and cross-model generalizability. The work extends contrastive decoding to the audio-visual multimodal setting, establishing an efficient, lightweight, plug-and-play approach to mitigating hallucinations in AV-LLMs.
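The core contrastive step can be illustrated with the standard CD formulation, where logits from the full input are contrasted against logits from a perturbed (modality-masked) input. This is a minimal sketch under that assumption; AVCD's exact trimodal reformulation and hyperparameters may differ.

```python
import numpy as np

def contrastive_decode(logits_full, logits_perturbed, alpha=1.0):
    """Sketch of contrastive decoding: amplify the gap between logits
    from the full input and logits from a perturbed input, so tokens
    favored only by the corrupted evidence are suppressed.
    The (1 + alpha) * original - alpha * perturbed form is the common
    CD formulation; AVCD's trimodal variant may differ in detail."""
    contrasted = (1 + alpha) * logits_full - alpha * logits_perturbed
    # Softmax over the contrasted logits to get next-token probabilities.
    exp = np.exp(contrasted - contrasted.max())
    return exp / exp.sum()
```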

📝 Abstract
Hallucination remains a major challenge in multimodal large language models (MLLMs). To address this, various contrastive decoding (CD) methods have been proposed that contrast original logits with hallucinated logits generated from perturbed inputs. While CD has shown promise in vision-language models (VLMs), it is not well-suited for AV-LLMs, where hallucinations often emerge from both unimodal and cross-modal combinations involving audio, video, and language. These intricate interactions call for a more adaptive and modality-aware decoding strategy. In this paper, we propose Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to model trimodal interactions and suppress modality-induced hallucinations in AV-LLMs. Unlike previous CD methods in VLMs that corrupt a fixed modality, AVCD leverages attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits. To support CD in a trimodal setting, we also reformulate the original CD framework to jointly handle audio, visual, and textual inputs. Finally, to improve efficiency, we introduce entropy-guided adaptive decoding, which selectively skips unnecessary decoding steps based on the model's confidence in its predictions. Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. In particular, on the AVHBench dataset, it improves accuracy by 6% for VideoLLaMA2 and 11% for video-SALMONN, demonstrating strong robustness and generalizability.
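The entropy-guided adaptive decoding idea can be sketched as a simple confidence gate: when predictive entropy is low, the extra contrastive forward pass is skipped. The function name and threshold below are illustrative, not the paper's actual implementation.

```python
import numpy as np

def should_skip_contrastive_pass(logits, entropy_threshold=0.5):
    """Sketch of entropy-guided adaptive decoding (threshold is
    illustrative): when the model is already confident in its next
    token (low predictive entropy), skip the costly contrastive
    decoding step and use the original logits directly."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return entropy < entropy_threshold
```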
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucinations in audio-visual large language models
Adaptive decoding for trimodal audio-visual-text interactions
Improving efficiency with entropy-guided adaptive decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic attentive masking for perturbed logits
Trimodal CD framework for AV-LLMs
Entropy-guided adaptive decoding for efficiency
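The dynamic attentive masking idea, identifying a less dominant modality from attention distributions before perturbing it, can be sketched at the token-span level. The span layout and selection rule here are assumptions for illustration; in AVCD the masking operates on attention maps inside the model.

```python
import numpy as np

def pick_modality_to_mask(attn_weights, spans):
    """Illustrative sketch of modality-aware dynamic masking: sum the
    attention mass received by each modality's token span and mark the
    least-attended (less dominant) modality for perturbation.
    `spans` maps modality name -> (start, end) token indices; both the
    span encoding and the min rule are simplifying assumptions."""
    mass = {m: attn_weights[s:e].sum() for m, (s, e) in spans.items()}
    return min(mass, key=mass.get)
```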