VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work targets temporal hallucination, a pervasive issue in multimodal large language models (MLLMs) for video understanding, by defining and quantifying three hallucination types: action, temporal sequence, and scene transition. To evaluate the problem systematically, the authors introduce VidHalluc, the largest benchmark for video hallucination assessment, comprising 5,002 videos paired to highlight cases prone to hallucination; testing on it exposes reliability deficiencies across state-of-the-art MLLMs. They also propose DINO-HEAL, a training-free mitigation method that uses spatial saliency from DINOv2 to reweight visual features during inference and thereby suppress hallucinatory outputs. On VidHalluc, DINO-HEAL yields an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the benchmark and the DINO-HEAL code are publicly released.

📝 Abstract
Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VidHalluc benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
Problem

Research questions and friction points this paper is trying to address.

Evaluating hallucinations in MLLMs for video understanding
Assessing model inaccuracies in action, sequence, and scene transitions
Proposing a method to reduce hallucinations in video MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

VidHalluc, the largest benchmark for video hallucination evaluation
DINO-HEAL, a training-free method that effectively reduces hallucinations
Uses DINOv2 spatial saliency to reweight visual features at inference
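The core idea behind DINO-HEAL, as described above, is to reweight a frame's visual features by a spatial saliency map before they reach the language model. The sketch below illustrates only that reweighting step; it is a minimal toy version, not the paper's implementation. The norm-based saliency is a hypothetical stand-in for DINOv2's actual attention maps, and the function name and tensor shapes are assumptions for illustration.

```python
import numpy as np

def saliency_reweight(features, saliency, temperature=1.0):
    """Reweight patch features by a spatial saliency map.

    features: (N, D) array of N patch embeddings.
    saliency: (N,) array of raw per-patch saliency scores.
    Returns features scaled by softmax-normalized saliency weights.
    """
    s = saliency / temperature
    s = s - s.max()                  # subtract max for numerical stability
    w = np.exp(s) / np.exp(s).sum()  # softmax over spatial patches
    w = w * len(w)                   # rescale so the mean weight is 1,
                                     # preserving overall feature magnitude
    return features * w[:, None]

# Toy example: 16 patches with 8-dim embeddings. The feature-norm
# saliency here is a placeholder for DINOv2 attention.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
sal = np.linalg.norm(feats, axis=1)
out = saliency_reweight(feats, sal)
print(out.shape)  # (16, 8)
```

Because the weights are normalized to mean 1, salient patches are amplified and background patches are damped without changing the overall scale of the feature map, which is what lets a scheme like this run training-free at inference time.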