🤖 AI Summary
Multivariate time-series anomaly detection (MTS-AD) is difficult because of complex inter-variable dependencies, strong temporal dynamics, and severe label scarcity. The field has also long lacked a standardized benchmark, which hinders fair method evaluation and model selection. To address this, we introduce mTSBench, the largest publicly available benchmark to date for MTS-AD and unsupervised model selection. It covers 19 datasets, 344 labeled time series, and 12 application domains, and evaluates 24 detectors, including the first systematic evaluation of LLM-based methods. It also provides a unified preprocessing pipeline, an LLM-driven anomaly scoring mechanism, a standardized evaluation protocol, and multi-dimensional metrics (F1, AUC, latency). The key findings are that no single detector performs best across all datasets, and that the best model selection strategy achieves only 63.2% of oracle performance on average. All data, code, and tools are open-sourced to support reproducible, equitable research.
📝 Abstract
Multivariate time series anomaly detection (MTS-AD) is critical in domains like healthcare, cybersecurity, and industrial monitoring, yet remains challenging due to complex inter-variable dependencies, temporal dynamics, and sparse anomaly labels. We introduce mTSBench, the largest benchmark to date for MTS-AD and unsupervised model selection, spanning 344 labeled time series across 19 datasets and 12 diverse application domains. mTSBench evaluates 24 anomaly detection methods, including large language model (LLM)-based detectors for multivariate time series, and systematically benchmarks unsupervised model selection techniques under standardized conditions. Consistent with prior findings, our results confirm that no single detector excels across datasets, underscoring the importance of model selection. However, even state-of-the-art selection methods remain far from optimal, revealing critical gaps. mTSBench provides a unified evaluation suite to enable rigorous, reproducible comparisons and catalyze future advances in adaptive anomaly detection and robust model selection.
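To make the gap between a selection method and the optimum concrete, the sketch below shows one plausible way to compare a selector against an oracle that always picks the best detector for each series. Everything here is an assumption for illustration: the detector names, the scores, the selector's choices, and the ratio-averaging definition of "fraction of oracle" are hypothetical and are not taken from mTSBench.

```python
from statistics import mean

# Hypothetical per-series scores (any "higher is better" metric in (0, 1]) for
# three made-up detectors. Illustrative numbers only, not benchmark results.
scores = {
    "series_a": {"det_1": 0.80, "det_2": 0.40, "det_3": 0.60},
    "series_b": {"det_1": 0.30, "det_2": 0.90, "det_3": 0.60},
    "series_c": {"det_1": 0.50, "det_2": 0.20, "det_3": 0.70},
}

# Hypothetical choices made by some unsupervised model selector.
selected = {"series_a": "det_3", "series_b": "det_2", "series_c": "det_1"}


def oracle_choice(per_detector):
    """Return the detector with the highest score on one series."""
    return max(per_detector, key=per_detector.get)


def fraction_of_oracle(scores, selected):
    """Average, over series, of selected score divided by the oracle score.

    Assumes every series has a strictly positive best score.
    """
    ratios = []
    for series, per_detector in scores.items():
        best = per_detector[oracle_choice(per_detector)]
        ratios.append(per_detector[selected[series]] / best)
    return mean(ratios)


if __name__ == "__main__":
    for series, per_detector in scores.items():
        print(series, "oracle picks", oracle_choice(per_detector))
    print(f"fraction of oracle: {fraction_of_oracle(scores, selected):.3f}")
```

In this toy setup a different detector wins on each series, so any fixed choice loses to the oracle somewhere, which is the situation that motivates per-series model selection.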