🤖 AI Summary
Video anomaly detection (VAD) faces a fundamental trade-off between spatiotemporal modeling capability and computational cost, which hinders real-time deployment. To address this, we propose a Mamba-based multi-scale spatiotemporal learning framework that jointly enhances appearance and motion representation through three innovations: hierarchical spatial encoding, multi-temporal-scale dynamic modeling, and task-oriented feature decomposition. This design strengthens modeling expressiveness while reducing computational overhead. Experiments demonstrate state-of-the-art frame-level AUC scores of 98.5%, 92.1%, and 77.9% on UCSD Ped2, CUHK Avenue, and ShanghaiTech, respectively, with only 20.1 G FLOPs and real-time inference at 45 FPS. The method achieves a strong balance between accuracy and efficiency, establishing a new paradigm for lightweight VAD.
📝 Abstract
Video anomaly detection (VAD) is an essential task in image processing with broad prospects in video surveillance, yet it faces fundamental challenges in balancing detection accuracy with computational efficiency. As video content grows more complex, with diverse behavioral patterns and contextual scenarios, traditional VAD approaches struggle to provide robust anomaly assessment for modern surveillance systems: existing methods either lack comprehensive spatial-temporal modeling or require excessive computational resources for real-time applications. To address this, we present a Mamba-based multi-scale spatial-temporal learning (M2S2L) framework. The proposed method employs hierarchical spatial encoders operating at multiple granularities and multi-temporal encoders capturing motion dynamics across different time scales. We also introduce a feature decomposition mechanism that enables task-specific optimization for appearance and motion reconstruction, facilitating more nuanced behavioral modeling and quality-aware anomaly assessment. Experiments on three benchmark datasets demonstrate that the M2S2L framework achieves frame-level AUCs of 98.5%, 92.1%, and 77.9% on UCSD Ped2, CUHK Avenue, and ShanghaiTech, respectively, while remaining efficient at 20.1 G FLOPs and 45 FPS inference, making it suitable for practical surveillance deployment.
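The abstract names three components: hierarchical spatial encoding at multiple granularities, multi-temporal encoding at several time scales, and a decomposition into appearance and motion streams. As a rough illustrative sketch only (this is not the authors' implementation: all function names are invented here, and a toy linear recurrence stands in for a real Mamba/selective-state-space block), the data flow could look like:

```python
import numpy as np

def ssm_block(x, decay=0.9):
    """Toy linear state-space recurrence (stand-in for a Mamba block):
    h_t = decay * h_{t-1} + x_t, scanned over the time axis."""
    h = np.zeros_like(x[0])
    out = []
    for xt in x:                          # x: (T, D)
        h = decay * h + xt
        out.append(h.copy())
    return np.stack(out)

def spatial_pyramid(frames, scales=(1, 2, 4)):
    """Hierarchical spatial encoding: mean-pool each frame into an s x s grid
    for every scale s, then flatten and concatenate the per-scale features."""
    T, H, W = frames.shape
    feats = []
    for s in scales:
        ph, pw = H // s, W // s
        pooled = frames[:, :ph * s, :pw * s].reshape(T, s, ph, s, pw).mean(axis=(2, 4))
        feats.append(pooled.reshape(T, -1))
    return np.concatenate(feats, axis=1)  # (T, sum of s*s over scales)

def multi_temporal(feats, strides=(1, 2, 4)):
    """Multi-temporal-scale dynamics: frame differences at several strides,
    each scanned by the toy state-space block."""
    outs = []
    for k in strides:
        diff = np.zeros_like(feats)
        diff[k:] = feats[k:] - feats[:-k]
        outs.append(ssm_block(diff))
    return np.concatenate(outs, axis=1)

def m2s2l_features(frames):
    """Feature decomposition: an appearance (spatial) stream and a
    motion (temporal) stream, to be reconstructed by separate decoders."""
    spatial = spatial_pyramid(frames)
    appearance = ssm_block(spatial)
    motion = multi_temporal(spatial)
    return appearance, motion

frames = np.random.rand(8, 32, 32)        # T=8 grayscale frames
appearance, motion = m2s2l_features(frames)
print(appearance.shape, motion.shape)     # (8, 21) (8, 63)
```

In an actual VAD pipeline of this kind, the two streams would feed task-specific reconstruction decoders, and the per-frame reconstruction error would serve as the anomaly score; the pooling scales and temporal strides here are placeholder choices.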