DAMS: Dual-Branch Adaptive Multiscale Spatiotemporal Framework for Video Anomaly Detection

2025-07-28
AI Summary
Video anomaly detection faces three key challenges: modeling multi-scale temporal dependencies, bridging strong visual-semantic heterogeneity, and coping with severe label scarcity. To address these, the authors propose a dual-branch adaptive multi-scale spatiotemporal framework. The first branch is an Adaptive Temporal Pyramid Network that captures low-level spatiotemporal patterns via hierarchical pooling, temporal context enhancement, and CBAM attention. The second branch is a CLIP-driven cross-modal semantic branch that uses semantic alignment and multi-scale instance selection to provide high-level semantic guidance. The two branches are jointly optimized in a mutually complementary architecture, enabling hierarchical reasoning from motion-level features to semantic-level concepts. Extensive experiments on UCF-Crime and XD-Violence demonstrate significant improvements in localization accuracy and robustness, validating the effectiveness of the decoupled multi-scale fusion strategy and cross-modal semantic guidance.
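
The CBAM attention mentioned above can be illustrated with a minimal sketch of its channel-attention half only. This is not the authors' implementation: the feature layout (a list of time steps, each a list of channel values), the tiny hand-set weights `w_down` and `w_up`, and all function names are assumptions made for illustration, and the spatial-attention half of CBAM is omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def shared_mlp(vec, w_down, w_up):
    # Two-layer bottleneck MLP with a ReLU, shared by both pooled descriptors.
    hidden = [max(0.0, h) for h in matvec(w_down, vec)]
    return matvec(w_up, hidden)

def channel_attention(features, w_down, w_up):
    """features: list of T time steps, each a list of C channel values."""
    channels = list(zip(*features))                  # C tuples of length T
    avg_desc = [sum(c) / len(c) for c in channels]   # average pooling over time
    max_desc = [max(c) for c in channels]            # max pooling over time
    logits = [a + m for a, m in zip(shared_mlp(avg_desc, w_down, w_up),
                                    shared_mlp(max_desc, w_down, w_up))]
    gates = [sigmoid(z) for z in logits]             # one gate per channel
    reweighted = [[g * x for g, x in zip(gates, step)] for step in features]
    return gates, reweighted

# Toy input (assumed values): T=2 time steps, C=2 channels.
features = [[1.0, 0.0], [3.0, 0.0]]
w_down = [[1.0, 1.0]]        # C=2 -> hidden size 1 (hypothetical weights)
w_up = [[1.0], [-1.0]]       # hidden size 1 -> C=2 (hypothetical weights)
gates, out = channel_attention(features, w_down, w_up)
```

With these toy weights the first channel is kept almost unchanged and the second is suppressed. In a trained model the weights would be learned rather than set by hand.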

Abstract
The goal of video anomaly detection is to localize abnormal events in a video in both space and time. The multiscale temporal dependencies, visual-semantic heterogeneity, and scarcity of labeled data that characterize video anomalies together make this a challenging problem in computer vision. This study proposes a dual-path architecture, the Dual-Branch Adaptive Multiscale Spatiotemporal Framework (DAMS), which is built on multilevel feature decoupling and fusion and models anomalies efficiently by integrating hierarchical feature learning with complementary information. The main processing path integrates the Adaptive Multiscale Time Pyramid Network (AMTPN) with the Convolutional Block Attention Mechanism (CBAM). AMTPN produces multigrained representations and dynamically weighted reconstructions of temporal features through a three-level cascade: time pyramid pooling, adaptive feature fusion, and temporal context enhancement. CBAM maximizes the entropy distribution of feature channels and spatial dimensions through dual attention mapping. In parallel, a CLIP-driven path introduces the contrastive language-visual pre-training paradigm: cross-modal semantic alignment and a multiscale instance selection mechanism provide high-order semantic guidance for the spatio-temporal features. Together these form a complete inference chain from low-level spatio-temporal features to high-level semantic concepts. The orthogonal complementarity of the two paths, combined with the information fusion mechanism, yields a comprehensive representation of anomalous events and the ability to identify them. Extensive experiments on the UCF-Crime and XD-Violence benchmarks establish the effectiveness of the DAMS framework.
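
The first two stages of the cascade described above (time pyramid pooling followed by adaptive feature fusion) can be sketched as follows. This is a rough illustration under stated assumptions, not the AMTPN itself: each time step carries a single scalar feature, the window sizes `(1, 2, 4)` are arbitrary, the fusion weights are a softmax over per-scale logits that would normally be learned, and every function name here is hypothetical. Temporal context enhancement is not modeled.

```python
import math

def pool_and_upsample(seq, window):
    """Average-pool seq in non-overlapping windows, then repeat each pooled
    value so the output has the same length as the input."""
    out = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        mean = sum(chunk) / len(chunk)
        out.extend([mean] * len(chunk))
    return out

def softmax(logits):
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pyramid_fuse(seq, windows=(1, 2, 4), scale_logits=(0.0, 0.0, 0.0)):
    """Fuse multi-scale views of seq with softmax-normalised scale weights."""
    weights = softmax(list(scale_logits))        # assumed learnable in practice
    views = [pool_and_upsample(seq, w) for w in windows]
    return [sum(wt * view[t] for wt, view in zip(weights, views))
            for t in range(len(seq))]

# Toy sequence (assumed values): a single spike at the third time step.
features = [0.0, 0.0, 4.0, 0.0]
fused = pyramid_fuse(features)
```

With equal scale weights, the spike stays the largest value while part of its magnitude spreads to neighbouring time steps through the coarser scales. Raising the logit of one scale shifts the fused output toward that temporal granularity.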
Problem

Research questions and friction points this paper is trying to address.

Detect abnormal events in videos spatio-temporally
Address multiscale temporal dependencies and visual-semantic heterogeneity
Overcome scarcity of labeled data for video anomalies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-path architecture with multilevel feature decoupling
Adaptive Multiscale Time Pyramid Network for temporal features
CLIP-driven cross-modal semantic alignment for high-order guidance
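
The cross-modal alignment and instance selection ideas listed above can be sketched in a few lines. This is an assumption-laden illustration rather than the paper's method: the embeddings are toy 2-D vectors standing in for CLIP visual and text embeddings, alignment is plain cosine similarity, and "multiscale instance selection" is approximated by averaging top-k means over several values of k, a common pattern in weakly supervised anomaly detection. All function names are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors; 0.0 if either has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def snippet_scores(snippet_embs, text_emb):
    """Score each video snippet by its similarity to a text prompt embedding."""
    return [cosine(s, text_emb) for s in snippet_embs]

def topk_mean(scores, k):
    # Mean of the k highest-scoring snippets.
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)

def multiscale_video_score(scores, ks=(1, 2)):
    """Video-level score: average of top-k means over several k (assumed)."""
    return sum(topk_mean(scores, k) for k in ks) / len(ks)

# Toy embeddings (assumed values), standing in for real CLIP outputs.
text_emb = [1.0, 0.0]
snippets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = snippet_scores(snippets, text_emb)
video_score = multiscale_video_score(scores)
```

Because only video-level labels are needed to supervise `video_score`, this kind of selection is one way to cope with the label scarcity named in the problem statement.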
Dezhi An
School of Cyberspace Security, Gansu University of Political Science and Law, No. 6 Anning West Road, Lanzhou, 730070, Gansu, China
Wenqiang Liu
Senior Manager/Senior Staff Researcher, Tencent
Deep Learning, Machine Learning, NLP, Multilingual, LLM
Kefan Wang
School of Cyberspace Security, Gansu University of Political Science and Law, No. 6 Anning West Road, Lanzhou, 730070, Gansu, China
Zening Chen
School of Cyberspace Security, Gansu University of Political Science and Law, No. 6 Anning West Road, Lanzhou, 730070, Gansu, China
Jun Lu
School of Cyberspace Security, Gansu University of Political Science and Law, No. 6 Anning West Road, Lanzhou, 730070, Gansu, China
Shengcai Zhang
School of Cyberspace Security, Gansu University of Political Science and Law, No. 6 Anning West Road, Lanzhou, 730070, Gansu, China