VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work targets temporal hallucination, a pervasive issue in multimodal large language models (MLLMs) for video understanding, by defining and quantifying three hallucination types: action, temporal sequence, and scene transition. To evaluate the problem systematically, the authors introduce VidHalluc, the largest benchmark for video hallucination assessment, comprising 5,002 videos paired to highlight cases prone to hallucination; testing on it exposes reliability deficiencies across state-of-the-art MLLMs. They also propose DINO-HEAL, a training-free mitigation method that uses spatial saliency from DINOv2 to reweight visual features during inference and thereby suppress hallucinatory outputs. On VidHalluc, DINO-HEAL yields an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the benchmark and the DINO-HEAL code are publicly released.

📝 Abstract
Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VidHalluc benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
Problem

Research questions and friction points this paper is trying to address.

Evaluating hallucinations in MLLMs for video understanding
Assessing model inaccuracies in action, sequence, and scene transitions
Proposing a method to reduce hallucinations in video MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

VidHalluc, the largest benchmark for video hallucination evaluation
DINO-HEAL, a training-free method that effectively reduces hallucinations
Uses DINOv2 spatial saliency to reweight visual features at inference
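The core idea behind DINO-HEAL, as described above, is to reweight a frame's visual features by a spatial saliency map before they reach the language model. The sketch below illustrates only that reweighting step; it is a minimal toy version, not the paper's implementation. The norm-based saliency is a hypothetical stand-in for DINOv2's actual attention maps, and the function name and tensor shapes are assumptions for illustration.

```python
import numpy as np

def saliency_reweight(features, saliency, temperature=1.0):
    """Reweight patch features by a spatial saliency map.

    features: (N, D) array of N patch embeddings.
    saliency: (N,) array of raw per-patch saliency scores.
    Returns features scaled by softmax-normalized saliency weights.
    """
    s = saliency / temperature
    s = s - s.max()                  # subtract max for numerical stability
    w = np.exp(s) / np.exp(s).sum()  # softmax over spatial patches
    w = w * len(w)                   # rescale so the mean weight is 1,
                                     # preserving overall feature magnitude
    return features * w[:, None]

# Toy example: 16 patches with 8-dim embeddings. The feature-norm
# saliency here is a placeholder for DINOv2 attention.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
sal = np.linalg.norm(feats, axis=1)
out = saliency_reweight(feats, sal)
print(out.shape)  # (16, 8)
```

Because the weights are normalized to mean 1, salient patches are amplified and background patches are damped without changing the overall scale of the feature map, which is what lets a scheme like this run training-free at inference time.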