SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse

📅 2025-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video large language models (Video-LLMs) suffer from pervasive perceptual hallucinations, severely compromising their safety and practical utility; existing mitigation strategies often degrade comprehension capabilities. This paper proposes a training-free intervention framework that generates multiple candidate responses and leverages model self-reflection to identify low-hallucination outputs. Our key contributions are: (1) the first training-agnostic hallucination mitigation paradigm; (2) Temporal Attention Collapse Degree (TAC), a novel metric quantifying hallucination severity via attention dynamics; and (3) identification of the visual attention vanishing point, enabling early stopping and efficient evaluation. Evaluated on the VRIPT-HAL benchmark, our method reduces the hallucination rate of Qwen2.5-VL-7B by 10.59% while improving its VideoMMMU video understanding performance by 8.86%, demonstrating simultaneous hallucination suppression and capability preservation.

📝 Abstract
Although Video Large Language Models have advanced rapidly in recent years, perceptual hallucinations pose a substantial safety risk that severely restricts their real-world applicability. While several methods for hallucination mitigation have been proposed, they often compromise the model's capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step toward addressing this issue in a training-free manner by leveraging the model's own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucination outputs that are often obscured by standard greedy decoding. It assesses the hallucination level of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, leading to a substantial reduction in decoding cost. Experiments show that SmartSight lowers the hallucination rate of Qwen2.5-VL-7B by 10.59% on VRIPT-HAL while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight's effectiveness in improving the reliability of open-source Video-LLMs.
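The abstract outlines a generate-then-select loop: sample several candidate responses, score each by how sharply its attention collapses onto a few temporal regions of the video, and keep the least-collapsed candidate. The paper's exact Temporal Attention Collapse formula is not reproduced on this page; the sketch below uses the entropy of per-frame attention mass as an illustrative stand-in, and all function names and toy numbers are assumptions rather than the authors' implementation.

```python
import math
from typing import List, Tuple

def temporal_attention_collapse(frame_attn: List[float]) -> float:
    """Illustrative collapse score: entropy deficit of attention over video
    frames. 0 means attention is spread uniformly across frames; values near
    1 mean attention has collapsed onto a few frames (used here as a proxy
    for hallucination risk)."""
    total = sum(frame_attn)
    probs = [a / total for a in frame_attn]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))
    return 1.0 - entropy / max_entropy

def select_response(candidates: List[Tuple[str, List[float]]]) -> str:
    """Pick the candidate whose attention profile collapses the least."""
    return min(candidates, key=lambda c: temporal_attention_collapse(c[1]))[0]

# Toy candidates: (response text, per-frame attention mass while decoding).
cands = [
    ("response A", [0.90, 0.05, 0.03, 0.02]),  # collapsed onto frame 0
    ("response B", [0.25, 0.25, 0.25, 0.25]),  # uniform attention
    ("response C", [0.40, 0.30, 0.20, 0.10]),  # moderately spread
]
print(select_response(cands))  # response B: most uniform attention
```

An entropy-based score is only one plausible way to operationalize "over-focusing on trivial temporal regions"; the paper may weight regions, layers, or decoding steps differently.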
Problem

Research questions and friction points this paper is trying to address.

Mitigates hallucinations in Video-LLMs without harming video understanding
Uses temporal attention collapse to detect hallucinated responses efficiently
Improves reliability and safety of Video-LLMs in real-world applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multiple candidate responses to uncover low-hallucination outputs
Uses Temporal Attention Collapse score to assess hallucination in responses
Identifies Visual Attention Vanishing point for efficient early termination
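The early-termination idea in the last bullet can be sketched as scanning a response's per-token visual-attention trace for the step after which attention to the video stays negligible. The threshold, patience window, and function name below are illustrative assumptions; the paper's actual Visual Attention Vanishing criterion may differ.

```python
from typing import List

def find_vanishing_point(visual_attn: List[float],
                         threshold: float = 0.05,
                         patience: int = 3) -> int:
    """Hypothetical detector: return the decoding step from which visual
    attention mass stays below `threshold` for `patience` consecutive tokens.
    Past this point, further tokens add little visual grounding, so scoring
    a candidate (or decoding a clearly hallucinated one) can stop early."""
    below = 0
    for i, attn in enumerate(visual_attn):
        below = below + 1 if attn < threshold else 0
        if below >= patience:
            return i - patience + 1  # first token of the vanished run
    return len(visual_attn)  # attention never vanished

# Per-token visual attention mass for one generated response (toy numbers).
trace = [0.30, 0.20, 0.04, 0.03, 0.02, 0.01]
print(find_vanishing_point(trace))  # 2: attention vanishes from token 2 on
```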
Yiming Sun
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Mi Zhang
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Feifei Li
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Geng Hong
Fudan University
Security, Cybercrime, LLM Security and Safety
Min Yang
Bytedance
Vision Language Model, Computer Vision, Video Understanding