Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection

📅 2025-09-13
🤖 AI Summary
Zero-shot video anomaly detection (ZS-VAD) faces two challenges: no training data is available from the target domain, and models generalize poorly to the diverse, unseen normal and abnormal behaviors found in new surveillance scenes. To address them, this paper proposes a skeleton-based framework that jointly models semantic typicality and contextual uniqueness. Skeleton snippets are projected into an action-semantic space through language-guided embedding, and knowledge about typical normal and abnormal behaviors is distilled from an LLM during training. At test time, a spatio-temporal difference analysis between skeleton snippets yields scene-adaptive anomaly boundaries. The approach therefore does not depend on a fixed, domain-specific normality boundary, which is what limits the cross-scene generalization of earlier methods. Without any target-domain training samples, it achieves state-of-the-art results among skeleton-based methods on four benchmarks (ShanghaiTech, UBnormal, NWPU, and UCF-Crime), which together feature over 100 unseen surveillance scenes.

📝 Abstract
Zero-Shot Video Anomaly Detection (ZS-VAD) requires temporally localizing anomalies without target-domain training data, a crucial task given practical concerns such as data privacy and new surveillance deployments. Skeleton-based approaches are inherently well suited to ZS-VAD because they eliminate domain disparities in both background and human appearance. However, existing methods learn only low-level skeleton representations and rely on a domain-limited normality boundary, so they do not generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel zero-shot video anomaly detection framework that unlocks the potential of skeleton data via action typicality and uniqueness learning. First, we introduce a language-guided semantic typicality modeling module that projects skeleton snippets into an action semantic space and distills an LLM's knowledge of typical normal and abnormal behaviors during training. Second, we propose a test-time context uniqueness analysis module that finely analyzes the spatio-temporal differences between skeleton snippets and then derives scene-adaptive boundaries. Without using any training samples from the target domain, our method achieves state-of-the-art results against skeleton-based methods on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime, featuring over 100 unseen surveillance scenes.
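
To make the idea of semantic typicality concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a skeleton snippet and a set of text descriptions of typical normal and abnormal behaviors have already been embedded into a shared space by encoders that are not shown here. The function names, the `temperature` parameter, and the toy two-dimensional embeddings are all hypothetical. The score is the softmax probability mass that the snippet places on the abnormal descriptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def typicality_score(snippet_emb, normal_embs, abnormal_embs, temperature=0.1):
    """Hypothetical scoring rule: probability mass on abnormal descriptions.

    snippet_emb   -- embedding of one skeleton snippet (assumed given)
    normal_embs   -- text embeddings of typical normal behaviors (assumed given)
    abnormal_embs -- text embeddings of typical abnormal behaviors (assumed given)
    """
    sims = [cosine(snippet_emb, t) for t in normal_embs + abnormal_embs]
    # Numerically stable softmax over all behavior descriptions.
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    abnormal_mass = sum(exps[len(normal_embs):])
    return abnormal_mass / sum(exps)


# Toy embeddings, invented purely for illustration.
normal = [[1.0, 0.0], [0.9, 0.1]]   # e.g. "walking", "standing"
abnormal = [[0.0, 1.0]]             # e.g. "fighting"
print(typicality_score([0.1, 1.0], normal, abnormal))  # close to the abnormal prompt
print(typicality_score([1.0, 0.1], normal, abnormal))  # close to the normal prompts
```

A score near 1 means the snippet resembles a typically abnormal behavior, and a score near 0 means it resembles a typically normal one. How the paper combines this signal with the test-time analysis is not described in the abstract, so the sketch stops here.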
Problem

Research questions and friction points this paper is trying to address.

Zero-shot anomaly detection without target domain training data
Generalizing skeleton-based methods to unseen surveillance scenes
Overcoming domain disparities in normal and abnormal behavior patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-guided semantic typicality modeling
Test-time context uniqueness analysis
Scene-adaptive boundaries derivation
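
The last two items can be illustrated with a rough sketch under strong assumptions; the abstract does not specify the actual module. Here "uniqueness" is approximated by the mean distance from a snippet embedding to its `k` nearest neighbors within the same test scene, and the "scene-adaptive boundary" by a z-score threshold computed from that scene's own statistics. The function names, `k`, and `z_thresh` are hypothetical choices made only for illustration.

```python
import math


def uniqueness_scores(embs, k=2):
    """Mean distance from each snippet to its k nearest neighbors in the scene."""
    if len(embs) < 2:
        raise ValueError("need at least two snippets to compare")
    scores = []
    for i, emb in enumerate(embs):
        dists = sorted(math.dist(emb, other)
                       for j, other in enumerate(embs) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


def scene_adaptive_flags(scores, z_thresh=1.5):
    """Flag snippets whose score is far above this scene's own mean.

    The boundary comes from the test scene's statistics instead of a
    threshold fixed on a training domain (an assumed stand-in for the
    paper's boundary derivation).
    """
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    if std == 0:
        return [False] * len(scores)
    return [(s - mean) / std > z_thresh for s in scores]


# Toy scene: four similar snippets and one that differs from its context.
scene = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
flags = scene_adaptive_flags(uniqueness_scores(scene))
print(flags)
```

Because the threshold is relative to each scene, the same snippet could be flagged in one scene and not in another, which is the intuition behind deriving boundaries at test time.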
Canhui Tang
Xi'an Jiaotong University
Computer Vision
Sanping Zhou
Xi'an Jiaotong University
Computer Vision, Machine Learning
Haoyue Shi
Xi'an Jiaotong University
Anomaly Detection, Image Generation
Le Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Shaanxi 710049, China