From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection

📅 2026-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of existing video anomaly detection methods: frame-level evaluation metrics fail to capture whether a model can detect continuous anomalous events as they unfold in real-world scenarios, and thereby inflate performance estimates. To remedy this, the authors propose an event-centric paradigm: they first analyze the event structure of mainstream datasets and establish the first event-level evaluation benchmark. They then introduce two event localization strategies: a post-processing pipeline based on hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end dual-branch network. For evaluation, they adapt temporal action localization techniques such as tIoU matching and multi-threshold F1 scoring. Experiments reveal that while current state-of-the-art models achieve over 52% frame-level AUC-ROC on datasets such as NWPUC, their event-level localization precision (at tIoU=0.2) falls below 10%, with an average event-level F1 of merely 0.11, underscoring the necessity of shifting to event-level evaluation and highlighting the contributions of this study.
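The score-refinement pipeline described above (hierarchical Gaussian smoothing followed by adaptive binarization) can be sketched roughly as follows. The kernel widths, the mean-plus-std threshold rule, and the function names are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def smooth_scores(scores, sigmas=(1.0, 3.0, 9.0)):
    """Hierarchically smooth per-frame anomaly scores by applying
    Gaussian kernels of increasing width in sequence.
    The sigma schedule here is an assumption, not the paper's."""
    out = np.asarray(scores, dtype=float)
    for sigma in sigmas:
        radius = int(3 * sigma)
        x = np.arange(-radius, radius + 1)
        kernel = np.exp(-0.5 * (x / sigma) ** 2)
        kernel /= kernel.sum()
        out = np.convolve(out, kernel, mode="same")
    return out

def binarize_adaptive(scores, k=1.0):
    """Adaptive binarization: threshold at mean + k * std of the
    smoothed scores (one plausible data-driven choice)."""
    thr = scores.mean() + k * scores.std()
    return scores > thr

def events_from_mask(mask):
    """Group contiguous anomalous frames into (start, end) events,
    end index exclusive."""
    events, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(mask)))
    return events
```

Chaining these three steps turns a noisy per-frame score curve into a list of discrete event proposals that can then be matched against ground-truth event annotations.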

📝 Abstract
Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at https://github.com/TeCSAR-UNCC/EventCentric-VAD.
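The event-level protocol the abstract adapts from Temporal Action Localization, tIoU-based event matching with F1 averaged over multiple thresholds, can be illustrated with a minimal matcher. The greedy one-to-one matching order and the threshold grid below are assumptions for illustration:

```python
def tiou(p, g):
    """Temporal IoU of two (start, end) intervals, end exclusive."""
    inter = max(0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def event_f1(preds, gts, thr):
    """F1 under greedy one-to-one matching: a predicted event is a
    true positive if it matches an unmatched ground-truth event
    with tIoU >= thr."""
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, thr
        for j, g in enumerate(gts):
            if j not in matched and tiou(p, g) >= best_iou:
                best, best_iou = j, tiou(p, g)
        if best is not None:
            matched.add(best)
            tp += 1
    if tp == 0:
        return 0.0
    prec, rec = tp / len(preds), tp / len(gts)
    return 2 * prec * rec / (prec + rec)

def multi_threshold_f1(preds, gts, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Average event-level F1 over a grid of tIoU thresholds
    (the grid is an assumed choice)."""
    return sum(event_f1(preds, gts, t) for t in thresholds) / len(thresholds)
```

For example, a prediction that covers a ground-truth event with tIoU 0.67 counts as a hit at tIoU=0.5 but would miss at stricter thresholds, which is exactly the sensitivity frame-level AUC-ROC cannot express.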
Problem

Research questions and friction points this paper is trying to address.

Video Anomaly Detection
Event-level Evaluation
Temporal Localization
Frame-level Metrics
Human-Centric Surveillance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Event-centric Evaluation
Temporal Event Localization
Pose-based VAD
Hierarchical Gaussian Smoothing
Dual-Branch Model