PoseWatch: A Transformer-based Architecture for Human-centric Video Anomaly Detection Using Spatio-temporal Pose Tokenization

📅 2024-08-27
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the modeling challenges of human-centric video anomaly detection, which stem from behavioral diversity, data bias, and privacy sensitivity, this paper proposes a privacy-preserving anomaly detection paradigm that takes only human pose sequences as input. Methodologically, the authors introduce Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization and design a Unified Encoder Twin Decoders (UETD) transformer architecture, integrating NLP-inspired sequence modeling into pose-based temporal anomaly discrimination. The framework combines keypoint spatio-temporal tokenization, dual supervision via reconstruction and future prediction, and multi-scale temporal attention. Evaluated on multiple benchmark datasets, it achieves state-of-the-art performance, improving both anomaly localization accuracy and cross-scenario generalization. The experimental results validate the effectiveness and robustness of a purely pose-driven paradigm for behavioral anomaly detection.
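The summary's "keypoint spatio-temporal tokenization" can be pictured as turning each frame's skeleton into one token that carries both the normalized pose and its frame-to-frame motion. The sketch below is an illustrative assumption, not the paper's exact ST-PRP formulation: the normalization scheme, the displacement-based relative pose, and the per-frame flattening are all choices made here for clarity.

```python
import numpy as np

def st_prp_tokenize(poses):
    """Sketch of spatio-temporal pose + relative-pose tokenization.

    poses: array of shape (T, J, 2) -- T frames, J keypoints, (x, y).
    Returns one token per frame combining the bounding-box-normalized
    pose and the frame-to-frame displacement, shape (T, J * 4).
    """
    poses = np.asarray(poses, dtype=np.float64)
    T, J, _ = poses.shape

    # Normalize each frame's pose to its own bounding box, giving
    # translation/scale invariance; epsilon guards degenerate poses.
    mins = poses.min(axis=1, keepdims=True)
    spans = poses.max(axis=1, keepdims=True) - mins
    norm = (poses - mins) / (spans + 1e-8)

    # Relative pose: displacement of each keypoint since the previous
    # frame; the first frame has no predecessor, so its motion is zero.
    rel = np.zeros_like(poses)
    rel[1:] = poses[1:] - poses[:-1]

    # Concatenate absolute and relative parts, then flatten each frame
    # into a single token vector for the transformer encoder.
    return np.concatenate([norm, rel], axis=-1).reshape(T, J * 4)
```

With 17 COCO-style keypoints over an 8-frame window, this yields an `(8, 68)` token sequence, one row per frame, ready for positional encoding.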

📝 Abstract
Video Anomaly Detection (VAD) presents a significant challenge in computer vision, particularly due to the unpredictable and infrequent nature of anomalous events, coupled with the diverse and dynamic environments in which they occur. Human-centric VAD, a specialized area within this domain, faces additional complexities, including variations in human behavior, potential biases in data, and substantial privacy concerns related to human subjects. These issues complicate the development of models that are both robust and generalizable. To address these challenges, recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce PoseWatch, a novel transformer-based architecture designed specifically for human-centric pose-based VAD. PoseWatch features an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method that enhances the representation of human motion over time, which is also beneficial for broader human behavior analysis tasks. The architecture's core, a Unified Encoder Twin Decoders (UETD) transformer, significantly improves the detection of anomalous behaviors in video data. Extensive evaluations across multiple benchmark datasets demonstrate that PoseWatch consistently outperforms existing methods, establishing a new state-of-the-art in pose-based VAD. This work not only demonstrates the efficacy of PoseWatch but also highlights the potential of integrating Natural Language Processing techniques with computer vision to advance human behavior analysis.
Problem

Research questions and friction points this paper is trying to address.

Detects human-centric video anomalies using pose-based features.
Addresses privacy and bias issues in anomaly detection models.
Improves detection of subtle deviations in human motion patterns.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatio-Temporal Pose Tokenization for motion representation
Transformer-based architecture with Unified Encoder Twin Decoders
Relative Pose emphasis for detecting subtle anomalies
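The Unified Encoder Twin Decoders design pairs one shared encoder with a reconstruction decoder and a future-prediction decoder; at inference, a window is anomalous when either stream fails to model it. The scoring sketch below shows one way the two error streams could be fused; the mean-of-errors fusion is an assumption for illustration, not necessarily the paper's exact rule.

```python
import numpy as np

def anomaly_score(window, recon, pred_next, true_next):
    """Fuse twin-decoder outputs into one anomaly score (a sketch).

    window:    observed token sequence, shape (T, D)
    recon:     reconstruction-decoder output for that window, (T, D)
    pred_next: prediction-decoder output for the next frames, (K, D)
    true_next: ground-truth tokens for those next frames, (K, D)

    Normal motion is both reconstructable and predictable, so both
    errors stay low; an anomaly inflates at least one of them.
    """
    recon_err = np.mean((np.asarray(window) - np.asarray(recon)) ** 2)
    pred_err = np.mean((np.asarray(true_next) - np.asarray(pred_next)) ** 2)
    # Assumed fusion: equal-weight average of the two error streams.
    return 0.5 * (recon_err + pred_err)
```

Sliding this scorer over per-person pose windows and taking the maximum over people in a frame yields a frame-level anomaly curve comparable to standard VAD evaluation protocols.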