🤖 AI Summary
Video large language models (VideoLLMs) suffer from temporal hallucination: they generate descriptions that are factually inconsistent with the video content. To address this, we propose the first activation engineering framework explicitly designed for video temporal dynamics, requiring no model fine-tuning. Our method identifies time-sensitive neural modules via data-driven neuron activation analysis, then applies module-level dynamic activation scaling or masking guided by temporal variability quantification. Crucially, we empirically establish that temporal hallucination stems primarily from insufficient sensitivity to temporal dynamics rather than from task-specific factors, which enables targeted intervention. Evaluated across diverse VideoLLM architectures and standard benchmarks, our approach significantly reduces hallucination rates, improves factual consistency and temporal reasoning reliability, and preserves original model performance on non-hallucination metrics. This work offers an interpretable, architecture-agnostic way to mitigate temporal hallucinations through activation modulation.
📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, hallucination, where the model generates plausible yet incorrect outputs, persists as a significant and under-addressed challenge in the video domain. Among existing solutions, activation engineering has proven successful in mitigating hallucinations in LLMs and ImageLLMs, yet its applicability to VideoLLMs remains largely unexplored. In this work, we are the first to systematically investigate the effectiveness and underlying mechanisms of activation engineering for mitigating hallucinations in VideoLLMs. We first investigate the key factors affecting the performance of activation engineering and find that a model's sensitivity to hallucination depends on $\textbf{temporal variation}$ rather than task type. Moreover, selecting appropriate internal modules and datasets for activation engineering is critical for reducing hallucination. Guided by these findings, we propose a temporal-aware activation engineering framework for VideoLLMs, which adaptively identifies and manipulates hallucination-sensitive modules based on the temporal variation characteristic, substantially mitigating hallucinations without additional LLM fine-tuning. Experiments across multiple models and benchmarks demonstrate that our method markedly reduces hallucination in VideoLLMs, thereby validating the robustness of our findings.
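
For readers unfamiliar with activation engineering, the general idea can be illustrated with a minimal sketch. This is not the paper's framework: it shows one common generic variant (a difference-of-means steering direction added to a module's activation at inference time), and it does not model the temporal-variation-based module selection, whose details the abstract does not give. The function names, the toy activation values, and the strength parameter `alpha` are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(truthful: List[Vector], hallucinated: List[Vector]) -> Vector:
    """Difference of mean activations: truthful minus hallucinated."""
    mu_t = mean_vector(truthful)
    mu_h = mean_vector(hallucinated)
    return [t - h for t, h in zip(mu_t, mu_h)]


def steer(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift one module's activation along the steering direction.

    No weights are updated; only the forward-pass activation changes.
    """
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations (made-up numbers) recorded at a single hypothetical module.
truthful = [[1.0, 0.0], [3.0, 2.0]]
hallucinated = [[0.0, 1.0], [2.0, 1.0]]

direction = steering_direction(truthful, hallucinated)
steered = steer([0.5, 0.5], direction, alpha=2.0)
```

In a real model, the same shift would typically be applied inside the forward pass of the selected modules, and the choice of which modules to steer and which dataset to derive the direction from is exactly what the abstract identifies as critical.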