SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse

📅 2025-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video large language models (Video-LLMs) suffer from pervasive perceptual hallucinations, severely compromising their safety and practical utility; existing mitigation strategies often degrade comprehension capabilities. This paper proposes a training-free intervention framework that generates multiple candidate responses and leverages model self-reflection to identify low-hallucination outputs. Our key contributions are: (1) the first training-agnostic hallucination mitigation paradigm; (2) Temporal Attention Collapse Degree (TAC), a novel metric quantifying hallucination severity via attention dynamics; and (3) identification of the visual attention vanishing point, enabling early stopping and efficient evaluation. Evaluated on the VRIPT-HAL benchmark, our method reduces the hallucination rate of Qwen2.5-VL-7B by 10.59% while improving its VideoMMMU video understanding performance by 8.86%, demonstrating simultaneous hallucination suppression and capability preservation.

📝 Abstract
Although Video Large Language Models have advanced rapidly in recent years, perceptual hallucinations pose a substantial safety risk that severely restricts their real-world applicability. While several methods for hallucination mitigation have been proposed, they often compromise the model's capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step toward addressing this issue in a training-free manner by leveraging the model's own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucination outputs that are often obscured by standard greedy decoding. It assesses the hallucination level of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, leading to a substantial reduction in decoding cost. Experiments show that SmartSight lowers the hallucination rate of Qwen2.5-VL-7B by 10.59% on VRIPT-HAL while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight's effectiveness in improving the reliability of open-source Video-LLMs.
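The abstract outlines a generate-then-select loop: sample several candidate responses, score each by how sharply its attention collapses onto a few temporal regions of the video, and keep the least-collapsed candidate. The paper's exact Temporal Attention Collapse formula is not reproduced on this page; the sketch below uses the entropy of per-frame attention mass as an illustrative stand-in, and all function names and toy numbers are assumptions rather than the authors' implementation.

```python
import math
from typing import List, Tuple

def temporal_attention_collapse(frame_attn: List[float]) -> float:
    """Illustrative collapse score: entropy deficit of attention over video
    frames. 0 means attention is spread uniformly across frames; values near
    1 mean attention has collapsed onto a few frames (used here as a proxy
    for hallucination risk)."""
    total = sum(frame_attn)
    probs = [a / total for a in frame_attn]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))
    return 1.0 - entropy / max_entropy

def select_response(candidates: List[Tuple[str, List[float]]]) -> str:
    """Pick the candidate whose attention profile collapses the least."""
    return min(candidates, key=lambda c: temporal_attention_collapse(c[1]))[0]

# Toy candidates: (response text, per-frame attention mass while decoding).
cands = [
    ("response A", [0.90, 0.05, 0.03, 0.02]),  # collapsed onto frame 0
    ("response B", [0.25, 0.25, 0.25, 0.25]),  # uniform attention
    ("response C", [0.40, 0.30, 0.20, 0.10]),  # moderately spread
]
print(select_response(cands))  # response B: most uniform attention
```

An entropy-based score is only one plausible way to operationalize "over-focusing on trivial temporal regions"; the paper may weight regions, layers, or decoding steps differently.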
Problem

Research questions and friction points this paper is trying to address.

Mitigates hallucinations in Video-LLMs without harming video understanding
Uses temporal attention collapse to detect hallucinated responses efficiently
Improves reliability and safety of Video-LLMs in real-world applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multiple candidate responses to uncover low-hallucination outputs
Uses Temporal Attention Collapse score to assess hallucination in responses
Identifies Visual Attention Vanishing point for efficient early termination
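The early-termination idea in the last bullet can be sketched as scanning a response's per-token visual-attention trace for the step after which attention to the video stays negligible. The threshold, patience window, and function name below are illustrative assumptions; the paper's actual Visual Attention Vanishing criterion may differ.

```python
from typing import List

def find_vanishing_point(visual_attn: List[float],
                         threshold: float = 0.05,
                         patience: int = 3) -> int:
    """Hypothetical detector: return the decoding step from which visual
    attention mass stays below `threshold` for `patience` consecutive tokens.
    Past this point, further tokens add little visual grounding, so scoring
    a candidate (or decoding a clearly hallucinated one) can stop early."""
    below = 0
    for i, attn in enumerate(visual_attn):
        below = below + 1 if attn < threshold else 0
        if below >= patience:
            return i - patience + 1  # first token of the vanished run
    return len(visual_attn)  # attention never vanished

# Per-token visual attention mass for one generated response (toy numbers).
trace = [0.30, 0.20, 0.04, 0.03, 0.02, 0.01]
print(find_vanishing_point(trace))  # 2: attention vanishes from token 2 on
```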
Yiming Sun
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Mi Zhang
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Feifei Li
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Geng Hong
Fudan University
Security, Cybercrime, LLM Security and Safety
Min Yang
Bytedance
Vision Language Model, Computer Vision, Video Understanding