MalVol-25: A Diverse, Labelled and Detailed Volatile Memory Dataset for Malware Detection and Response Testing and Validation

📅 2025-07-05
🤖 AI Summary
Problem: Malware detection research is hindered by the lack of high-quality, heterogeneous memory datasets with fine-grained, semantically rich annotations. Method: We propose a systematic generation framework that automates the execution of malware from multiple families in virtualized environments while capturing memory snapshots and dynamic behavioral traces. Annotation combines automated analysis with manual validation to produce fine-grained behavioral labels across operating systems, covering both clean and infected states, and to support state-transition modeling. Contribution/Results: The resulting dataset is diverse, built with ethical and legal compliance in mind, and documented for replicability. Its state and transition information is intended to support reinforcement learning (RL) based detection and response strategies and their evaluation within agent-based AI frameworks, strengthening the empirical foundation for memory forensics and AI security research.

📝 Abstract
This paper addresses the critical need for high-quality malware datasets that support advanced analysis techniques, particularly machine learning and agentic AI frameworks. Existing datasets often lack the diversity, comprehensive labelling, and complexity needed to train machine learning models and AI agents effectively. To fill this gap, we developed a systematic approach to dataset generation that combines automated malware execution in controlled virtual environments with dynamic monitoring tools. The resulting dataset comprises clean and infected memory snapshots across multiple malware families and operating systems, capturing detailed behavioural and environmental features. Key design decisions include adherence to ethical and legal requirements, thorough validation using both automated and manual methods, and comprehensive documentation to ensure replicability and integrity. The dataset's distinctive features enable the modelling of system states and transitions, facilitating RL-based malware detection and response strategies. This resource is significant for advancing adaptive cybersecurity defences and digital forensic research. Its scope supports diverse malware scenarios and offers potential for broader applications in incident response and automated threat mitigation.

Problem

Research questions and friction points this paper is trying to address.

Lack of diverse, comprehensively labelled malware datasets for advanced analysis
Insufficient complexity in existing datasets for machine learning and agent-based AI training
Need for ethically and legally compliant, validated memory snapshots for malware detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated malware execution in controlled environments
Dynamic monitoring tools for detailed behavioral capture
Modelling of system states and transitions to support RL-based malware detection and response strategies
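
To make the idea of modelling system states and transitions for RL-based response more concrete, the following is a minimal sketch of tabular Q-learning over a two-state toy model. Everything in it is an assumption made for illustration: the state labels (`clean`, `infected`), the actions (`monitor`, `remediate`), the transition table, and the reward values are hypothetical and are not taken from the dataset or the paper, whose actual schema is not described in this summary.

```python
import random

# Hypothetical labels and rewards, chosen only for illustration; the
# dataset's real schema and any reward design are not given in this summary.
STATES = ("clean", "infected")
ACTIONS = ("monitor", "remediate")

# Toy deterministic transition model: (state, action) -> (next_state, reward).
TRANSITIONS = {
    ("clean", "monitor"): ("clean", 1.0),         # normal operation
    ("clean", "remediate"): ("clean", -1.0),      # needless disruption
    ("infected", "monitor"): ("infected", -5.0),  # infection left running
    ("infected", "remediate"): ("clean", 2.0),    # successful response
}


def greedy_action(q, state):
    """Return the action with the highest Q-value (first action wins ties)."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(episodes=200, steps=10, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Run epsilon-greedy tabular Q-learning on the toy model."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for episode in range(episodes):
        # Alternate start states so both are visited.
        state = STATES[episode % len(STATES)]
        for _ in range(steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = greedy_action(q, state)  # exploit
            next_state, reward = TRANSITIONS[(state, action)]
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Standard Q-learning update.
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)]
            )
            state = next_state
    return q


def learned_policy(q):
    """Map each state to its greedy action under the learned Q-table."""
    return {s: greedy_action(q, s) for s in STATES}


if __name__ == "__main__":
    print(learned_policy(train()))
```

In this toy setting the learned policy keeps monitoring a clean system and remediates an infected one. With real data, the states would presumably be derived from features of the labelled memory snapshots rather than from a hand-written table, and the transition model would be estimated from observed clean-to-infected sequences.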