🤖 AI Summary
Existing evaluation metrics for time series anomaly detection suffer from several critical limitations: point-level coverage bias, insensitivity to near-miss detections, insufficient penalization of false positives, and inconsistency arising from threshold dependence. Together, these weaknesses hinder reliable and interpretable assessment. To address them, this work proposes DQE, a novel evaluation metric grounded in detection semantics. DQE partitions the local temporal neighborhood of each anomaly event into three functional subregions and applies a fine-grained scoring mechanism tailored to each subregion's distinct role. It further incorporates an all-threshold spectrum aggregation strategy to eliminate the bias introduced by arbitrary threshold selection. Extensive experiments on both synthetic and real-world datasets demonstrate that DQE significantly outperforms ten widely used metrics in stability, discriminability, and robustness, yielding more reliable and interpretable evaluation outcomes.
📝 Abstract
Time series anomaly detection has achieved remarkable progress in recent years. However, evaluation practices have received comparatively little attention, despite their critical importance. Existing metrics exhibit several limitations: (1) bias toward point-level coverage, (2) insensitivity or inconsistency when handling near-miss detections, (3) inadequate penalization of false alarms, and (4) inconsistency caused by threshold or threshold-interval selection. These limitations can produce unreliable or counterintuitive results, hindering objective progress. In this work, we revisit the evaluation of time series anomaly detection from the perspective of detection semantics and propose a novel metric for more comprehensive assessment. We first introduce a partitioning strategy grounded in detection semantics, which decomposes the local temporal region of each anomaly into three functionally distinct subregions. Using this partitioning, we evaluate overall detection behavior across events and design finer-grained scoring mechanisms for each subregion, enabling more reliable and interpretable assessment. Through a systematic study of existing metrics, we identify an evaluation bias associated with threshold-interval selection and adopt an approach that aggregates detection quality across the full threshold spectrum, thereby eliminating evaluation inconsistency. Extensive experiments on synthetic and real-world data demonstrate that our metric provides stable, discriminative, and interpretable evaluation, and remains more robust than ten widely used metrics.
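The abstract does not give DQE's exact formulation, but the two core ideas it names — partitioning each anomaly's neighborhood into three subregions with region-specific scoring, and averaging detection quality over every threshold instead of picking one — can be illustrated with a minimal sketch. Everything below (the function names, the region `margin`, and the 1.0 / 0.5 / 0.0 credit weights) is a hypothetical toy construction, not the paper's actual metric:

```python
import numpy as np

def partition_event(start, end, margin, n):
    """Hypothetical three-subregion split around one anomaly event:
    a pre-anomaly region (early detections), the core anomaly interval,
    and a post-anomaly region (delayed detections)."""
    pre = (max(0, start - margin), start)
    core = (start, end)
    post = (end, min(n, end + margin))
    return pre, core, post

def event_quality(pred, start, end, margin):
    """Toy region-aware scoring: full credit for a hit inside the core
    interval, partial credit for a near-miss hit in the pre/post
    regions, zero otherwise (weights are illustrative)."""
    pre, core, post = partition_event(start, end, margin, len(pred))
    if pred[core[0]:core[1]].any():
        return 1.0
    if pred[pre[0]:pre[1]].any() or pred[post[0]:post[1]].any():
        return 0.5  # near-miss earns partial, not zero, credit
    return 0.0

def all_threshold_quality(scores, events, margin=3):
    """Average event quality over every distinct threshold of the
    anomaly scores, removing dependence on one threshold choice."""
    qualities = []
    for t in np.unique(scores):
        pred = scores >= t  # binarize at this threshold
        qualities.append(np.mean([event_quality(pred, s, e, margin)
                                  for s, e in events]))
    return float(np.mean(qualities))
```

For example, a detector that fires one step before an anomaly at index 3 scores 0.5 at the strict threshold and 1.0 at the permissive one, averaging to 0.75 — a point-level metric would give it zero at the strict threshold with no partial credit.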