🤖 AI Summary
To address the high computational cost of malware detection on high-dimensional binary features (2,381 dimensions), this paper applies two dimensionality reduction techniques: XGBoost-based feature selection and Principal Component Analysis (PCA). Evaluated on a unified multi-source dataset (EMBER-2018, ERMDS, BODMAS), the study compares lightweight configurations of 128, 256, and 384 dimensions across four tree-based models. The best configuration pairs XGBoost-selected features with a LightGBM classifier: it reaches 97.52% accuracy using only 384 dimensions (16.1% of the original), with a training time of 61 minutes, 30 GB of RAM, and 19.5 GB of disk space. On the unseen TRITIUM and INFERNO datasets it achieves 95.31% and 93.98% accuracy, respectively, indicating that the reduced feature set generalizes across domains.
📝 Abstract
Malware detection using machine learning requires feature extraction from binary files, as models cannot process raw binaries directly. A common approach uses LIEF for raw feature extraction and the EMBER vectorizer to generate 2381-dimensional feature vectors. However, the high dimensionality of these features introduces significant computational challenges. This study addresses these challenges by applying two dimensionality reduction techniques: XGBoost-based feature selection and Principal Component Analysis (PCA). We evaluate three reduced feature dimensions (128, 256, and 384), which correspond to approximately 5.4%, 10.8%, and 16.1% of the original 2381 features, across four models (XGBoost, LightGBM, Extra Trees, and Random Forest) using a unified training, validation, and testing split formed from the EMBER-2018, ERMDS, and BODMAS datasets. Pooling the three sources is intended to improve generalization and reduce dataset bias. Experimental results show that LightGBM trained on the 384-dimensional feature set after XGBoost feature selection achieves the highest accuracy of 97.52% on the unified dataset, providing the best balance between computational efficiency and detection performance among the tested configurations. The best model, trained in 61 minutes using 30 GB of RAM and 19.5 GB of disk space, generalizes effectively to completely unseen datasets, maintaining 95.31% accuracy on TRITIUM and 93.98% accuracy on INFERNO. These findings present a scalable, compute-efficient approach for malware detection without compromising accuracy.
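The two reduction paths described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the data is synthetic, the `importances` vector is a random placeholder for scores that would in practice come from a trained XGBoost model, and the helper names `select_top_k` and `pca_reduce` are hypothetical. It shows only how each path maps 2381 features down to 128, 256, or 384, and how the quoted percentages follow from those sizes.

```python
import numpy as np

ORIGINAL_DIM = 2381  # EMBER feature vector length, per the abstract


def select_top_k(X, importances, k):
    """Path 1 (feature selection): keep the k highest-scoring columns."""
    # argsort is ascending, so the last k indices are the top-k features;
    # sorting them preserves the original column order.
    keep = np.sort(np.argsort(importances)[-k:])
    return X[:, keep], keep


def pca_reduce(X, k):
    """Path 2 (PCA): project centered data onto its top-k principal axes."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


rng = np.random.default_rng(0)
X = rng.normal(size=(500, ORIGINAL_DIM))  # synthetic stand-in, not real EMBER data
importances = rng.random(ORIGINAL_DIM)    # placeholder for model-derived scores

for k in (128, 256, 384):
    X_sel, kept = select_top_k(X, importances, k)
    X_pca = pca_reduce(X, k)
    share = 100 * k / ORIGINAL_DIM
    print(f"k={k}: selected {X_sel.shape}, PCA {X_pca.shape}, {share:.1f}% of original")
```

Feature selection keeps a subset of the original columns, so the retained features stay interpretable. PCA instead produces new axes that mix all 2381 inputs, so the full feature vector must still be extracted before projection.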