🤖 AI Summary
Conventional video anomaly detection (VAD) methods rely either on large-scale labeled data or on computationally intensive modeling; meanwhile, existing tuning-free multimodal large language model (MLLM)-based approaches are constrained by their textual outputs, leading to loss of anomaly cues, normalcy bias, and prompt sensitivity.
Method: We propose the first head-level probing framework that directly identifies robust, anomaly-sensitive attention heads within a frozen MLLM, bypassing text generation entirely, to enable tuning-free, real-time, and interpretable VAD. Our approach introduces a multi-criteria (saliency + stability) robust-head identification module, coupled with a lightweight anomaly scorer and temporal locator.
Contribution/Results: On the UCF-Crime and XD-Violence benchmarks, our method achieves state-of-the-art performance among tuning-free methods with efficient inference. These results empirically validate the effectiveness and practicality of mining discriminative attention heads for real-world VAD.
📝 Abstract
Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.
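To make the head-selection idea concrete, here is a minimal sketch of how a saliency-plus-stability criterion could pick out prompt-robust heads. This is an illustration only, not the paper's actual implementation: the per-head feature definition (one anomaly-evidence score per head, per prompt, per clip), the rank-combination rule, and the small labeled probe set are all assumptions.

```python
import numpy as np

def select_robust_heads(feats, labels, k=8):
    """Pick k heads that are both salient and stable across prompts.

    feats:  (num_prompts, num_heads, num_clips) array of per-head anomaly
            evidence (e.g. how strongly a head attends to video tokens).
    labels: (num_clips,) binary anomaly labels for a small probe set.
    """
    pos, neg = labels == 1, labels == 0
    # Per-prompt separation between anomalous and normal clips: (P, H)
    sep = feats[:, :, pos].mean(-1) - feats[:, :, neg].mean(-1)
    saliency = sep.mean(axis=0)    # large average margin over prompts
    stability = -sep.std(axis=0)   # low variance of the margin across prompts
    # Combine the two criteria by summing their rank positions
    rank = saliency.argsort().argsort() + stability.argsort().argsort()
    return np.argsort(rank)[-k:]
```

Features from the selected heads would then feed a lightweight scorer; the key point the sketch captures is that a head must discriminate anomalies consistently under every prompt, not just under one, to be retained.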