Machine Learning Transferability for Malware Detection

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited cross-dataset generalization of existing malware detection methods, which stems from incompatible feature representations across public datasets and from susceptibility to distribution shift. To mitigate this, the authors build a preprocessing pipeline that unifies datasets into the EMBERv2 feature space and train paired models on multi-source PE files under two setups: EMBER + BODMAS, and EMBER + BODMAS + ERMDS, where ERMDS is an additional dataset rather than a training technique. Experimental results indicate that the approach improves transfer detection performance across multiple heterogeneous external test sets, including TRITIUM, INFERNO, and SOREL-20M, with better generalization and robustness than baseline methods.

📝 Abstract
Malware continues to be a predominant operational risk for organizations, especially when obfuscation techniques are used to evade detection. Despite ongoing efforts to develop Machine Learning (ML) detection approaches, public datasets still lack feature compatibility. This limits generalization under distribution shift, as well as transferability to different datasets. This study evaluates the suitability of different data preprocessing approaches for detecting malware in Portable Executable (PE) files with ML models. The preprocessing pipeline unifies the datasets into the EMBERv2 (2,381-dimensional) feature space and trains paired models under two training setups: EMBER + BODMAS and EMBER + BODMAS + ERMDS. For evaluation, both the EMBER + BODMAS and the EMBER + BODMAS + ERMDS models are tested against TRITIUM, INFERNO, and SOREL-20M. ERMDS is also used as a test set for the EMBER + BODMAS setup.
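The two training setups and their test sets can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes every dataset has already been vectorized into 2,381-dimensional EMBERv2 feature rows, and the names `Source`, `check_unified`, `merge_sources`, `build_setups`, and `evaluation_plan` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass

EMBER_V2_DIM = 2381  # EMBERv2 feature vector length stated in the abstract


@dataclass
class Source:
    """One already-vectorized dataset (hypothetical container for this sketch)."""
    name: str
    features: list  # one feature vector (list of floats) per sample
    labels: list    # assumed encoding: 0 = benign, 1 = malicious


def check_unified(source, dim=EMBER_V2_DIM):
    """Raise ValueError if a source does not fit the shared feature space."""
    if len(source.features) != len(source.labels):
        raise ValueError(f"{source.name}: features and labels differ in length")
    for i, row in enumerate(source.features):
        if len(row) != dim:
            raise ValueError(
                f"{source.name}: sample {i} has {len(row)} features, expected {dim}"
            )


def merge_sources(sources, dim=EMBER_V2_DIM):
    """Validate each source, then concatenate them into one (X, y) training set."""
    X, y = [], []
    for src in sources:
        check_unified(src, dim)
        X.extend(src.features)
        y.extend(src.labels)
    return X, y


def build_setups(ember, bodmas, ermds):
    """Build the two training setups named in the abstract."""
    return {
        "EMBER + BODMAS": merge_sources([ember, bodmas]),
        "EMBER + BODMAS + ERMDS": merge_sources([ember, bodmas, ermds]),
    }


def evaluation_plan():
    """Map each setup to the test sets listed in the abstract."""
    external = ["TRITIUM", "INFERNO", "SOREL-20M"]
    return {
        # ERMDS is held out of training here, so it can also serve as a test set.
        "EMBER + BODMAS": external + ["ERMDS"],
        "EMBER + BODMAS + ERMDS": list(external),
    }
```

The dimensionality check runs before merging so that a source with an incompatible feature layout fails loudly instead of silently corrupting the joint training set, which is the feature-compatibility problem the abstract describes.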
Problem

Research questions and friction points this paper is trying to address.

Machine Learning Transferability
Malware Detection
Feature Compatibility
Distribution Shift
Model Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

transferability
malware detection
feature compatibility
distribution shift
preprocessing pipeline
Authors

César Vieira
GECAD, ISEP, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida, 4249-015 Porto, Portugal
João Vitorino
GECAD, ISEP, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida, 4249-015 Porto, Portugal
Eva Maia
GECAD-ISEP
CyberSecurity, Artificial Intelligence, Machine Learning, Industry 4.0, Encryption
Isabel Praça
Professor, ISEP