AI Summary
In weakly supervised video anomaly detection (WS-VAD), the inherent label ambiguity of multi-instance learning (MIL) severely hinders discriminative feature learning. To address this, we propose a prototype interaction modeling and pseudo-instance discriminative enhancement framework. First, we design a novel Prototype Interaction Layer (PIL) to enable controllable normality modeling. Second, we introduce an extremum-guided Pseudo-Instance Discriminative Enhancement (PIDE) loss, which performs contrastive optimization exclusively on high-confidence normal and abnormal instances, thereby improving robustness and discriminability. Our method is built upon a lightweight neural network with only 0.4M parameters, over 800× smaller than ViT-based counterparts. Evaluated on ShanghaiTech and UCF-Crime, it achieves AUC scores of 97.98% and 87.12%, respectively, significantly outperforming existing weakly supervised approaches.
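The summary does not give implementation details for the Prototype Interaction Layer. As a rough illustration only, an interaction between snippet features and a small learnable prototype bank could be sketched as an attention-style read-out; the function name, the use of a dot-product softmax, and all shapes below are assumptions, not taken from the paper:

```python
import numpy as np

def prototype_interaction(features, prototypes):
    """Hypothetical sketch of a PIL-style read-out.

    features:   (N, D) snippet features
    prototypes: (K, D) small bank of learnable normality prototypes
    Returns a (N, D) mixture of prototypes, weighted by similarity,
    i.e. each snippet is re-expressed in terms of learned normality.
    """
    sim = features @ prototypes.T                 # (N, K) similarities
    sim = sim - sim.max(axis=1, keepdims=True)    # numerically stable softmax
    w = np.exp(sim)
    w = w / w.sum(axis=1, keepdims=True)          # (N, K) attention weights
    return w @ prototypes                         # (N, D) normality read-out
```

Keeping K small (a handful of prototypes) matches the stated goal of controlled normality modeling: the read-out cannot memorize the dominant normal data, only summarize it.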
Abstract
Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP, demonstrating exceptional efficiency alongside state-of-the-art performance. Code is available at https://github.com/modadundun/ProDisc-VAD.
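The abstract describes the PIDE loss as contrastive learning applied only to the most reliable extreme-scoring instances. A minimal sketch of that selection step plus a margin-based contrastive term is given below; the function names, the choice of k, and the exact loss form are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def select_extreme_instances(scores, k=2):
    """Pick the k lowest- and k highest-scoring snippet indices.

    Low scores serve as high-confidence pseudo-normal instances,
    high scores as high-confidence pseudo-abnormal instances.
    """
    order = np.argsort(scores)
    return order[:k], order[-k:]

def pide_style_loss(features, scores, k=2, margin=1.0):
    """Hypothetical margin-based contrastive term on extreme instances.

    Pulls pseudo-normal features toward their centroid and pushes
    pseudo-abnormal features at least `margin` away from it; all
    other (ambiguous) instances are ignored entirely.
    """
    low, high = select_extreme_instances(scores, k)
    normal, abnormal = features[low], features[high]
    center = normal.mean(axis=0)
    pull = np.mean(np.sum((normal - center) ** 2, axis=1))
    dist = np.sqrt(np.sum((abnormal - center) ** 2, axis=1))
    push = np.mean(np.maximum(0.0, margin - dist) ** 2)
    return pull + push
```

Restricting the loss to the score extremes is what makes the pseudo-labels trustworthy under MIL's label ambiguity: mid-scoring snippets, whose normal/abnormal status is uncertain, contribute nothing to this term.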