A Framework for Evaluating and Benchmarking Concept Drift Detection Methods

📅 2026-06-05
🤖 AI Summary
This study addresses the lack of a unified, fair evaluation benchmark for concept drift detection methods, which hinders meaningful cross-method comparison. The authors propose a systematic evaluation framework that uses Monte Carlo trials to inject controlled drift of four types (class prior changes, label swaps, feature permutation, and feature filtering) into seven real-world datasets. The framework introduces timing-aware metrics, including F1 detection score and normalized detection time, and adopts a leave-one-dataset-out hyperparameter optimization protocol to promote configurations that are robust across heterogeneous streams. Evaluating 14 widely used methods under this framework, the work establishes baseline performance figures for both abrupt and gradual drift and highlights the strengths and weaknesses of existing approaches.
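To make the drift-injection idea concrete, here is a minimal sketch of one drift type, a label swap, applied to a label stream with either an abrupt or a gradual transition. It is not the authors' implementation: the function names, the linear ramp used for the gradual case, and the choice to draw drift positions from the middle half of the stream are all assumptions made for illustration.

```python
import random


def inject_label_swap(labels, drift_start, width=0, a=0, b=1, seed=0):
    """Return a copy of `labels` with classes `a` and `b` swapped after drift.

    width == 0 gives an abrupt change at `drift_start`. width > 0 gives a
    gradual change: the swap probability ramps linearly from 0 to 1 over
    `width` samples (the linear ramp is an assumption of this sketch).
    """
    rng = random.Random(seed)
    swap = {a: b, b: a}
    out = []
    for t, y in enumerate(labels):
        if t < drift_start:
            p = 0.0  # old concept only
        elif width == 0 or t >= drift_start + width:
            p = 1.0  # new concept only
        else:
            p = (t - drift_start) / width  # inside the transition window
        drifted = rng.random() < p
        out.append(swap.get(y, y) if drifted else y)
    return out


def monte_carlo_trials(labels, n_trials, width=0, seed=0):
    """Yield (drift_position, drifted_stream) pairs, one per trial.

    Each trial draws its own drift position, so the ground-truth drift
    point is known and detections can be scored against it.
    """
    rng = random.Random(seed)
    n = len(labels)
    for trial in range(n_trials):
        pos = rng.randrange(n // 4, 3 * n // 4)  # hypothetical placement rule
        yield pos, inject_label_swap(labels, pos, width, seed=seed + trial)
```

Because every trial records where the drift was injected, a detector's alarms can be compared with a known ground truth even though the underlying data are real.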
📝 Abstract
Data stream mining is fundamentally challenged by concept drift, where distributional changes can degrade model performance. Despite the proliferation of drift detection methods, progress in the field is hindered by inconsistent evaluation practices: studies rely on oversimplified synthetic data generators, adopt incompatible metrics, and lack transparency in hyperparameter selection, making fair comparisons difficult. We address this gap with a novel benchmarking framework comprising three contributions: (1) a drift simulation method that injects controlled distributional changes into real-world datasets via Monte Carlo trials, enabling supervised evaluation while preserving real-world data complexity; (2) an evaluation protocol for drift detection with timing-aware criteria, including the derivation of new metrics (e.g., F1 detection score, normalized detection time) that are comparable across streams; and (3) a leave-one-dataset-out hyperparameter optimization protocol for drift detection methods that promotes configuration robustness across heterogeneous stream dynamics. We benchmark 14 widely used drift detection methods on 7 real-world datasets across 4 drift types (class prior, label swap, feature permutation, feature filtering), each under both abrupt and gradual transitions. Our experimental results provide insights into the strengths and weaknesses of current drift detection approaches while establishing baseline performance metrics for future research in this area. All code and experiments are publicly available.
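The leave-one-dataset-out protocol in contribution (3) can be sketched as follows. This is an illustrative reading, not the paper's code: the `scores` layout, the use of a mean over the remaining datasets as the selection criterion, and the configuration and dataset names are assumptions.

```python
def leave_one_dataset_out(scores):
    """Select a configuration per held-out dataset.

    `scores[config][dataset]` is a score where higher is better (for
    example, an F1 detection score). For each held-out dataset, the
    configuration with the best mean score on the remaining datasets is
    chosen, and its score on the held-out dataset is reported.
    """
    datasets = sorted({d for per_ds in scores.values() for d in per_ds})
    results = {}
    for held_out in datasets:
        train = [d for d in datasets if d != held_out]
        best = max(
            scores,
            key=lambda c: sum(scores[c][d] for d in train) / len(train),
        )
        results[held_out] = (best, scores[best][held_out])
    return results


# Hypothetical scores for two configurations on three datasets.
example = {
    "cfg_a": {"d1": 0.9, "d2": 0.5, "d3": 0.5},
    "cfg_b": {"d1": 0.6, "d2": 0.7, "d3": 0.7},
}
selected = leave_one_dataset_out(example)
```

In this toy example `cfg_a` looks best on `d1`, but it is never selected for `d1`, because the choice is made only from the other datasets. That separation is what keeps a configuration from being tuned on the stream it is later evaluated on.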
Problem

Research questions and friction points this paper is trying to address.

concept drift
evaluation
benchmarking
data stream mining
drift detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

concept drift detection
benchmarking framework
Monte Carlo simulation
timing-aware evaluation
hyperparameter robustness