🤖 AI Summary
Video anomaly detection (VAD) faces a fundamental trade-off between spatiotemporal modeling capability and computational cost, which hinders real-time deployment. To address this, we propose a Mamba-based multi-scale spatiotemporal learning framework that jointly enhances appearance and motion representation through three innovations: hierarchical spatial encoding, multi-temporal-scale dynamic modeling, and task-oriented feature decomposition. This design strengthens modeling expressiveness while reducing computational overhead. Experiments demonstrate state-of-the-art frame-level AUC scores of 98.5%, 92.1%, and 77.9% on UCSD Ped2, CUHK Avenue, and ShanghaiTech, respectively, with only 20.1 G FLOPs and real-time inference at 45 FPS. The method achieves a strong balance between accuracy and efficiency, establishing a new paradigm for lightweight VAD.
📝 Abstract
Video anomaly detection (VAD) is an essential task in image processing with broad prospects in video surveillance, yet it faces fundamental challenges in balancing detection accuracy with computational efficiency. As video content grows more complex, with diverse behavioral patterns and contextual scenarios, traditional VAD approaches struggle to provide robust anomaly assessment for modern surveillance systems: existing methods either lack comprehensive spatial-temporal modeling or require excessive computational resources for real-time applications. To address this, we present a Mamba-based multi-scale spatial-temporal learning (M2S2L) framework. The proposed method employs hierarchical spatial encoders operating at multiple granularities and multi-temporal encoders capturing motion dynamics across different time scales. We also introduce a feature decomposition mechanism that enables task-specific optimization for appearance and motion reconstruction, facilitating more nuanced behavioral modeling and quality-aware anomaly assessment. Experiments on three benchmark datasets demonstrate that the M2S2L framework achieves frame-level AUCs of 98.5%, 92.1%, and 77.9% on UCSD Ped2, CUHK Avenue, and ShanghaiTech, respectively, while remaining efficient at 20.1 G FLOPs and 45 FPS inference, making it suitable for practical surveillance deployment.
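The abstract names three components: hierarchical spatial encoding at multiple granularities, multi-temporal encoding at several time scales, and a decomposition into appearance and motion streams. As a rough illustrative sketch only (this is not the authors' implementation: all function names are invented here, and a toy linear recurrence stands in for a real Mamba/selective-state-space block), the data flow could look like:

```python
import numpy as np

def ssm_block(x, decay=0.9):
    """Toy linear state-space recurrence (stand-in for a Mamba block):
    h_t = decay * h_{t-1} + x_t, scanned over the time axis."""
    h = np.zeros_like(x[0])
    out = []
    for xt in x:                          # x: (T, D)
        h = decay * h + xt
        out.append(h.copy())
    return np.stack(out)

def spatial_pyramid(frames, scales=(1, 2, 4)):
    """Hierarchical spatial encoding: mean-pool each frame into an s x s grid
    for every scale s, then flatten and concatenate the per-scale features."""
    T, H, W = frames.shape
    feats = []
    for s in scales:
        ph, pw = H // s, W // s
        pooled = frames[:, :ph * s, :pw * s].reshape(T, s, ph, s, pw).mean(axis=(2, 4))
        feats.append(pooled.reshape(T, -1))
    return np.concatenate(feats, axis=1)  # (T, sum of s*s over scales)

def multi_temporal(feats, strides=(1, 2, 4)):
    """Multi-temporal-scale dynamics: frame differences at several strides,
    each scanned by the toy state-space block."""
    outs = []
    for k in strides:
        diff = np.zeros_like(feats)
        diff[k:] = feats[k:] - feats[:-k]
        outs.append(ssm_block(diff))
    return np.concatenate(outs, axis=1)

def m2s2l_features(frames):
    """Feature decomposition: an appearance (spatial) stream and a
    motion (temporal) stream, to be reconstructed by separate decoders."""
    spatial = spatial_pyramid(frames)
    appearance = ssm_block(spatial)
    motion = multi_temporal(spatial)
    return appearance, motion

frames = np.random.rand(8, 32, 32)        # T=8 grayscale frames
appearance, motion = m2s2l_features(frames)
print(appearance.shape, motion.shape)     # (8, 21) (8, 63)
```

In an actual VAD pipeline of this kind, the two streams would feed task-specific reconstruction decoders, and the per-frame reconstruction error would serve as the anomaly score; the pooling scales and temporal strides here are placeholder choices.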