MASCOT-Android: A Curated Dataset and Automated Collection Pipeline for Android Malware Source Code Specimens

📅 2026-06-14
🤖 AI Summary
This study addresses the scarcity of Android malware source code and the high cost of manual labeling, both of which hinder the construction of high-quality datasets. The authors propose an automated collection framework that identifies malware source code repositories on GitHub from README content alone. The approach combines character-level TF-IDF features with a LinearSVC classifier to separate malicious from benign projects, and a confidence threshold lets users trade coverage against false positive rate. In local evaluation, the model achieves 96.28% accuracy and a false positive rate of 1.06%, which makes malware source code discovery more scalable and practical.
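The threshold trade-off mentioned in the summary can be illustrated with a minimal sketch. It is not the authors' code: the function name `sweep_threshold`, the toy scores, and the reading of "coverage" as the fraction of true malware repositories that remain flagged are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def sweep_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, fpr, coverage) for each candidate threshold.

    Hypothetical helper. labels: 1 = malware, 0 = benign.
    A repository is flagged as malware when its score >= threshold.
    fpr      = flagged benign  / all benign
    coverage = flagged malware / all malware (assumed definition)
    """
    n_benign = sum(1 for y in labels if y == 0)
    n_malware = sum(1 for y in labels if y == 1)
    if n_benign == 0 or n_malware == 0:
        raise ValueError("need at least one benign and one malware example")

    rows = []
    for t in thresholds:
        false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        rows.append((t, false_pos / n_benign, true_pos / n_malware))
    return rows


# Toy confidence scores (made up), three benign then three malware repos.
toy_scores = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
toy_labels = [0, 0, 0, 1, 1, 1]
rows = sweep_threshold(toy_scores, toy_labels, [0.0, 0.5, 1.0])
for threshold, fpr, coverage in rows:
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  coverage={coverage:.2f}")
```

On this toy data, raising the threshold removes the single false positive but also drops some true malware repositories, which is the kind of trade-off a collector would tune against a manual review budget.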
📝 Abstract
Compared with binaries and decompiled code, malware source code more directly reflects the attackers' original intent. However, the scarcity of source code and the high cost of manual review make such datasets difficult to build and maintain. We propose MASCOT-Android, a curated dataset of Android malware source code and an automated collection framework for scalable malware source code discovery on GitHub. A key finding of our work is that repository-level documentation alone provides a strong signal for malware source code collection. Our model extracts character-level TF-IDF features from 8,772 malware and 25,747 benign README documents and trains a LinearSVC classifier to distinguish malware repositories. This README-only model achieves an accuracy of 96.28% and an FPR of 1.06% in local evaluation. In addition, the model outputs confidence scores, allowing users to adjust the decision threshold to balance FPR and coverage, which is practical in real-world malware source code collection.
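To make "character-level TF-IDF" concrete, here is a minimal standard-library sketch of the feature extraction step. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the n-gram range, so the 2 to 4 character window is a guess; the smoothed IDF and L2 normalization follow a common convention; and the function names and toy README strings are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> List[str]:
    """Split lowercased text into overlapping character n-grams (assumed range)."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def fit_idf(docs: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(char_ngrams(doc)))  # count each n-gram once per doc
    n_docs = len(docs)
    return {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_vector(doc: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse, L2-normalized TF-IDF vector; n-grams unseen at fit time are ignored."""
    term_freq = Counter(g for g in char_ngrams(doc) if g in idf)
    weights = {g: count * idf[g] for g, count in term_freq.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}


# Toy README snippets (made up for illustration).
readmes = ["Android RAT with keylogger", "Android weather app"]
idf = fit_idf(readmes)
vec = tfidf_vector(readmes[0], idf)
print(len(vec), "distinct n-grams in the first README")
```

Vectors of this kind would then be passed to a linear classifier. In scikit-learn, `TfidfVectorizer(analyzer="char")` and `LinearSVC` are the off-the-shelf counterparts, and `LinearSVC.decision_function` returns a signed margin that could serve as the confidence score the abstract describes; whether the authors use exactly these settings is not stated here.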
Problem

Research questions and friction points this paper is trying to address.

Android malware
source code dataset
malware collection
manual review cost
dataset scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

malware source code
automated collection pipeline
README-based classification
TF-IDF features
Android malware dataset
Bojing Li
University of Maryland, Baltimore County
Duo Zhong
University of Maryland, Baltimore County
Prajna Bhandary
PhD candidate
Cybersecurity, Machine Learning
Raguvir S
University of Maryland, Baltimore County
Charles Maxa
University of Maryland, Baltimore County
Robert J Joyce
University of Maryland, Baltimore County
Charles Nicholas
UMBC
computer science