🤖 AI Summary
This work addresses semi-supervised anomaly detection (AD), focusing on the challenge of effectively combining a limited set of labeled genuine anomalies with synthetically generated anomalies to enhance model performance. We propose the first formal mathematical definition of this setting and develop a unified training framework that jointly models known genuine anomalies and controllable synthetic anomalies within a classifier-based AD paradigm. Theoretically, we prove that synthetic anomalies improve the modeling of low-density regions and establish, for the first time, optimal convergence guarantees for neural network classifiers in this context. Our method unifies classification-based AD, controllable anomaly generation, and generalization error analysis. Empirical evaluation on five benchmark datasets demonstrates consistent and significant performance gains. Moreover, the proposed synthetic-anomaly mechanism is broadly applicable and transfers to other classification-based AD methods.
📝 Abstract
Anomaly detection (AD) is a critical task in domains such as cybersecurity and healthcare. In the unsupervised setting, an effective and theoretically grounded principle is to train classifiers to distinguish normal data from (synthetic) anomalies. We extend this principle to semi-supervised AD, where the training data also include a small labeled subset of anomalies of the kind that may appear at test time. We propose a theoretically grounded and empirically effective framework for semi-supervised AD that combines known and synthetic anomalies during training. To analyze this setting, we introduce the first mathematical formulation of semi-supervised AD, which generalizes the unsupervised case. Within this formulation, we show that synthetic anomalies enable (i) better anomaly modeling in low-density regions and (ii) optimal convergence guarantees for neural network classifiers -- the first such theoretical result for semi-supervised AD. We empirically validate our framework on five diverse benchmarks, observing consistent performance gains. These improvements also extend beyond our framework to other classification-based AD methods, confirming the generalizability of the synthetic-anomaly principle in AD.
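The training principle described in the abstract -- a binary classifier separating normal data from a mix of known genuine anomalies and synthetic anomalies sampled in low-density regions -- can be sketched with a toy example. This is an illustrative assumption-laden sketch, not the paper's method: the Gaussian "normal" data, the uniform synthetic-anomaly sampler over a bounding box, and the quadratic logistic-regression classifier are all stand-ins for the framework's data and neural network classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "normal" training data: a tight Gaussian cluster (assumption).
X_norm = rng.normal(0.0, 1.0, size=(500, 2))

# A few labeled genuine anomalies, as in the semi-supervised setting.
X_anom = rng.normal(6.0, 0.5, size=(10, 2))

# Synthetic anomalies: uniform samples over an enlarged bounding box of
# the normal data, so that low-density regions receive anomaly labels.
lo, hi = X_norm.min(axis=0) - 3.0, X_norm.max(axis=0) + 3.0
X_syn = rng.uniform(lo, hi, size=(500, 2))

# Binary labels: 0 = normal, 1 = anomaly (genuine or synthetic).
X = np.vstack([X_norm, X_anom, X_syn])
y = np.concatenate([np.zeros(len(X_norm)), np.ones(len(X_anom) + len(X_syn))])

# Quadratic feature map so a linear classifier can carve out the
# (roughly radial) normal region; standardized for stable training.
def raw_features(X):
    return np.hstack([X, X**2])

F = raw_features(X)
mu, sd = F.mean(axis=0), F.std(axis=0)

def features(X):
    Z = (raw_features(np.atleast_2d(X)) - mu) / sd
    return np.hstack([Z, np.ones((len(Z), 1))])  # append a bias column

# Plain logistic regression via full-batch gradient descent (a simple
# stand-in for the neural network classifier analyzed in the paper).
Phi = features(X)
w = np.zeros(Phi.shape[1])
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-Phi @ w))
    w -= 0.2 * Phi.T @ (p - y) / len(y)

def anomaly_score(x):
    """Higher score = more anomalous (predicted probability of class 1)."""
    return float(1.0 / (1.0 + np.exp(-(features(x) @ w)[0])))

print(anomaly_score(np.array([0.0, 0.0])))  # near the normal cluster: low
print(anomaly_score(np.array([6.0, 6.0])))  # deep in a low-density region: high
```

The synthetic anomalies are what let the classifier assign high scores throughout the low-density region, even far from the ten labeled anomalies -- the intuition behind the paper's low-density-region result.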