Cluster Analysis and Concept Drift Detection in Malware

📅 2025-02-19
🤖 AI Summary
To address the degradation of malware detection models caused by time-varying data distributions, this paper proposes an unsupervised method for detecting and responding to concept drift. The method combines Mini-Batch K-Means clustering with dynamically thresholded silhouette coefficients, introducing what the summary describes as the first adaptive silhouette coefficient mechanism in the malware domain, to automatically identify the points in time where the data has drifted. A drift-aware retraining strategy then keeps classification accuracy generally within 1% of periodic retraining while substantially reducing computational overhead. Evaluated on a subset of the KronoDroid dataset with four classifiers (MLP, SVM, Random Forest, and XGBoost), the approach is far more accurate than static models and far more efficient than periodic retraining. Overall, it offers a lightweight, automated way for malware detection systems to adapt online.

📝 Abstract
Concept drift refers to gradual or sudden changes in the properties of data that affect the accuracy of machine learning models. In this paper, we address the problem of concept drift detection in the malware domain. Specifically, we propose and analyze a clustering-based approach to detecting concept drift. Using a subset of the KronoDroid dataset, malware samples are partitioned into temporal batches and analyzed using MiniBatch $K$-Means clustering. The silhouette coefficient is used as a metric to identify points in time where concept drift has likely occurred. To verify our drift detection results, we train learning models under three realistic scenarios, which we refer to as static training, periodic retraining, and drift-aware retraining. In each scenario, we consider four supervised classifiers, namely, Multilayer Perceptron (MLP), Support Vector Machine (SVM), Random Forest, and XGBoost. Experimental results demonstrate that drift-aware retraining guided by silhouette coefficient thresholding achieves classification accuracy far superior to static models, and generally within 1% of periodic retraining, while also being far more efficient than periodic retraining. These results provide strong evidence that our clustering-based approach is effective at detecting concept drift, while also illustrating a highly practical and efficient fully automated approach to improved malware classification via concept drift detection.
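The detection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the clusters are fit on the first temporal batch and that each later batch is scored against those clusters, and it uses a fixed threshold. The helper names (`make_batch`, `batch_silhouettes`, `detect_drift`), the synthetic two-dimensional data, the number of clusters, and the threshold value are all assumptions chosen for the example; only `MiniBatchKMeans` and `silhouette_score` are real scikit-learn calls.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score


def make_batch(rng, centers, spread, n=100):
    """Hypothetical stand-in for one temporal batch of feature vectors."""
    return np.vstack([rng.normal(c, spread, size=(n, 2)) for c in centers])


def batch_silhouettes(batches, n_clusters=2, seed=0):
    """Fit MiniBatch K-Means on the first batch, then score every batch.

    Assumption: each batch is assigned to the clusters learned from the
    first batch, and the silhouette coefficient of that assignment is kept.
    """
    model = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=3)
    model.fit(batches[0])
    scores = []
    for X in batches:
        labels = model.predict(X)
        if len(set(labels)) < 2:
            # silhouette_score is undefined with a single label;
            # treat such a batch as maximally degraded.
            scores.append(-1.0)
        else:
            scores.append(float(silhouette_score(X, labels)))
    return scores


def detect_drift(scores, threshold):
    """Return indices of batches whose silhouette falls below the threshold."""
    return [i for i, s in enumerate(scores) if s < threshold]


rng = np.random.default_rng(0)
stable = [(0.0, 0.0), (10.0, 10.0)]           # two well-separated groups
batches = [make_batch(rng, stable, 0.5) for _ in range(3)]
# Simulated drift: samples no longer follow the original two-group structure.
batches.append(make_batch(rng, [(5.0, 5.0)], 3.0, n=200))

scores = batch_silhouettes(batches)
drifted = detect_drift(scores, threshold=0.6)  # threshold is illustrative
print([round(s, 2) for s in scores], drifted)
```

In this toy setup the first three batches keep the cluster structure learned at the start, so their silhouette coefficients stay high, while the final batch no longer fits those clusters and its score drops. A real deployment would need a principled way to choose the threshold and the number of clusters.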
Problem

Research questions and friction points this paper is trying to address.

Detect concept drift in malware data
Clustering-based approach for drift detection
Improve malware classification accuracy efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clustering-based concept drift detection
Silhouette coefficient for drift identification
Drift-aware retraining enhances malware classification
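
The three training scenarios named in the abstract differ mainly in when retraining happens. The sketch below makes that difference concrete under stated assumptions: batch 0 is used for the initial training in every scenario, periodic retraining fires every `period` batches, and drift-aware retraining fires only when a batch's silhouette coefficient falls below a threshold. The function name `retraining_schedule`, its parameters, and the example scores are hypothetical and are not taken from the paper.

```python
def retraining_schedule(scores, strategy, threshold=0.5, period=1):
    """Return the batch indices at which a classifier would be retrained.

    scores    -- one silhouette coefficient per temporal batch
    strategy  -- "static", "periodic", or "drift_aware"
    threshold -- illustrative cut-off for the drift-aware strategy
    period    -- retraining interval (in batches) for the periodic strategy
    """
    later = range(1, len(scores))  # batch 0 is the initial training batch
    if strategy == "static":
        return []
    if strategy == "periodic":
        return [i for i in later if i % period == 0]
    if strategy == "drift_aware":
        return [i for i in later if scores[i] < threshold]
    raise ValueError(f"unknown strategy: {strategy!r}")


# Made-up silhouette coefficients for six temporal batches.
example_scores = [0.71, 0.69, 0.42, 0.70, 0.68, 0.39]

schedules = {
    name: retraining_schedule(example_scores, name)
    for name in ("static", "periodic", "drift_aware")
}
for name, batches_to_retrain in schedules.items():
    print(f"{name:12s} retrains at {batches_to_retrain}")
```

With these made-up scores, the drift-aware schedule retrains far less often than the periodic one, which is the source of the efficiency gain the abstract reports; whether accuracy stays close to periodic retraining depends on how well the threshold tracks real drift.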