Synthetic Data: AI's New Weapon Against Android Malware

📅 2025-11-24
🤖 AI Summary
Problem: Android malware detection is bottlenecked by scarce labeled samples, high annotation costs, and data that quickly becomes outdated. Method: We propose MalSynGen, described as the first conditional Generative Adversarial Network (cGAN) framework tailored to synthesizing tabular features of Android malware. It applies cGAN-based modeling of structured features to generate synthetic samples that are statistically faithful and semantically consistent. Contribution/Results: Evaluated across multiple heterogeneous datasets, MalSynGen-generated data boosts classifier accuracy by an average of 5.2%, generalizes well across datasets, and remains computationally efficient. We further introduce an interpretable, multi-dimensional suite of evaluation metrics to keep synthesis quality under control. This work establishes a reproducible, scalable data augmentation approach for Android malware detection under low-resource conditions.

📝 Abstract
The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection methods. Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. Although machine learning has shown promise in malware classification, its success relies heavily on the availability of up-to-date, high-quality datasets. The scarcity and high cost of obtaining and labeling real malware samples present significant challenges in developing robust detection models. In this paper, we propose MalSynGen, a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers. We evaluated the effectiveness of this approach using various datasets and metrics that assess the fidelity of the generated data, its utility in classification, and the computational efficiency of the process. Our experiments demonstrate that MalSynGen can generalize across different datasets, providing a viable solution to address the issues of obsolescence and low-quality data in malware detection.
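
The abstract mentions metrics that assess the fidelity of the generated data, but this page does not list them. As a rough illustration only, the sketch below assumes the tabular features are binary (for example, whether an app requests a given permission) and compares how often each feature is set in real versus synthetic samples. The function names and the toy data are hypothetical, and this is not necessarily one of the metrics used by MalSynGen.

```python
from typing import List, Sequence


def feature_rates(rows: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of samples in which each binary feature is set."""
    if not rows:
        raise ValueError("need at least one sample")
    n_features = len(rows[0])
    totals = [0] * n_features
    for row in rows:
        if len(row) != n_features:
            raise ValueError("all rows must have the same number of features")
        for j, value in enumerate(row):
            totals[j] += value
    return [total / len(rows) for total in totals]


def mean_rate_gap(real: Sequence[Sequence[int]],
                  synthetic: Sequence[Sequence[int]]) -> float:
    """Mean absolute difference between per-feature activation rates."""
    real_rates = feature_rates(real)
    synth_rates = feature_rates(synthetic)
    if len(real_rates) != len(synth_rates):
        raise ValueError("feature count mismatch")
    gaps = [abs(r - s) for r, s in zip(real_rates, synth_rates)]
    return sum(gaps) / len(gaps)


# Toy data (hypothetical): rows are apps, columns are binary features.
real = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
synthetic = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
gap = mean_rate_gap(real, synthetic)
print(f"mean per-feature rate gap: {gap:.4f}")
```

A gap near 0 means the synthetic table reproduces the per-feature frequencies of the real one. It says nothing about correlations between features, so it would be only one part of a fidelity assessment.
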
Problem

Research questions and friction points this paper is trying to address.

Addressing Android malware detection challenges from AI-powered evasion techniques
Overcoming scarcity of high-quality labeled malware datasets for ML models
Solving data obsolescence and quality issues in malware classification systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using cGAN to generate synthetic malware data
Preserving statistical properties of real-world data
Improving performance of Android malware classifiers
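
To make the first point concrete: in a conditional GAN, the generator receives a class label alongside the random noise vector, so a trained model can be asked for samples of a chosen class. The sketch below shows only that conditioning step, using an untrained single-layer generator in plain Python. The dimensions, the label encoding (0 = benign, 1 = malware), and all names are assumptions for illustration, not the MalSynGen architecture, and the outputs are random until such a generator is trained against a discriminator.

```python
import math
import random
from typing import Callable, List

N_CLASSES = 2    # assumed label encoding: 0 = benign, 1 = malware
NOISE_DIM = 4    # hypothetical latent size
N_FEATURES = 6   # hypothetical number of binary app features


def one_hot(label: int, n_classes: int = N_CLASSES) -> List[float]:
    """Encode a class label as a one-hot vector."""
    if not 0 <= label < n_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def make_generator(rng: random.Random) -> Callable[[int], List[int]]:
    """Build an untrained single-layer generator G(z, y) with random weights."""
    in_dim = NOISE_DIM + N_CLASSES
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
               for _ in range(N_FEATURES)]
    biases = [0.0] * N_FEATURES

    def generate(label: int) -> List[int]:
        # Sample noise, then condition by concatenating the one-hot label.
        z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
        x = z + one_hot(label)
        sample = []
        for w_row, b in zip(weights, biases):
            logit = sum(w * v for w, v in zip(w_row, x)) + b
            prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid activation
            sample.append(1 if prob >= 0.5 else 0)  # threshold to a binary feature
        return sample

    return generate


generate = make_generator(random.Random(0))
malware_like = [generate(1) for _ in range(3)]
benign_like = [generate(0) for _ in range(3)]
print(malware_like)
print(benign_like)
```

In a full cGAN the discriminator would receive the same label together with each real or generated row, which is what pushes the generator to produce class-appropriate features.
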
Angelo Gaspar Diniz Nogueira
Federal University of Pampa (UNIPAMPA)
Kayua Oleques Paim
Federal University of Rio Grande do Sul (UFRGS)
Hendrio Bragança
Federal University of Amazonas (UFAM)
Rodrigo Brandão Mansilha
Federal University of Pampa (UNIPAMPA)
Diego Kreutz
Federal University of Pampa (UNIPAMPA)
AutoML&XAI&AML for Cybersecurity, Network Security, Malware & Attack Detection, Blockchains, Systems