Malware Detection based on API calls

📅 2025-02-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper targets early detection of malware and identification of malware families. It proposes a lightweight, order-invariant approach to behavioral modeling: rather than modeling the temporal order of API calls, it analyzes which calls occur regardless of their sequence. The authors publish a labeled public dataset of over 300,000 samples with their function call parameters (more than 550 GB uncompressed), annotated as benign or malicious, and train machine learning models such as random forests on it. The paper also shows empirically that a subset of the calls, specifically those to ntdll.dll, is enough to identify malware. The models achieve an F1-score above 85% and are reported to run on any machine with minimal performance overhead. The code is open source on GitHub, and the dataset is available on Zenodo.

📝 Abstract

Malware attacks pose a significant threat in today's interconnected digital landscape, causing billions of dollars in damages. Detecting and identifying malware families as early as possible provides an edge in protecting against such malware. We explore a lightweight, order-invariant approach to detecting and mitigating malware threats: analyzing API calls without regard to their sequence. We publish a public dataset of over three hundred thousand samples and their function call parameters for this task, annotated with labels indicating benign or malicious activity. The complete dataset is above 550 GB uncompressed in size. We leverage machine learning algorithms, such as random forests, and conduct behavioral analysis by examining patterns and anomalies in API call sequences. By investigating how the function calls occur regardless of their order, we can identify discriminating features that can help us identify malware early on. The models we've developed are not only effective but also efficient. They are lightweight and can run on any machine with minimal performance overhead, while still achieving an F1-score of over 85%. We also empirically show that we only need a subset of the function call sequence, specifically calls to the ntdll.dll library, to identify malware. Our research demonstrates the efficacy of this approach through empirical evaluations, underscoring its accuracy and scalability. The code is open source and available on GitHub, along with the dataset on Zenodo.
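To make "order-invariant" concrete, here is a minimal sketch of one way to turn an API call trace into a feature vector that ignores call order. It is an illustration under assumptions, not the paper's actual pipeline: the helper names (`build_vocabulary`, `to_presence_vector`) and the example traces are hypothetical, and the abstract does not say whether the authors use presence flags, call counts, or some other encoding.

```python
from typing import Iterable


def build_vocabulary(traces: Iterable[Iterable[str]]) -> list[str]:
    """Collect every distinct API name across all traces, sorted for a stable column order."""
    return sorted({name for trace in traces for name in trace})


def to_presence_vector(trace: Iterable[str], vocabulary: list[str]) -> list[int]:
    """Encode a trace as 1/0 flags: was each vocabulary API called at least once?"""
    called = set(trace)  # a set discards both call order and repetition
    return [1 if name in called else 0 for name in vocabulary]


# Hypothetical traces; the function names are real ntdll.dll exports,
# but the traces themselves are invented for illustration.
trace_a = ["NtCreateFile", "NtWriteFile", "NtClose"]
trace_b = ["NtClose", "NtCreateFile", "NtWriteFile", "NtWriteFile"]  # same calls, different order
trace_c = ["NtOpenProcess", "NtWriteVirtualMemory", "NtClose"]

vocabulary = build_vocabulary([trace_a, trace_b, trace_c])
vector_a = to_presence_vector(trace_a, vocabulary)
vector_b = to_presence_vector(trace_b, vocabulary)
vector_c = to_presence_vector(trace_c, vocabulary)

print(vocabulary)
print(vector_a, vector_b, vector_c)
```

In this sketch `trace_a` and `trace_b` map to the same vector even though their calls arrive in a different order, which is the property the abstract describes. A fixed-length vector of this kind can be fed directly to a tree-based classifier.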

Problem

Research questions and friction points this paper is trying to address.

Detecting malware using API calls

Analyzing API calls without sequence

Identifying malware from calls to the ntdll.dll library alone
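The third point, restricting the analysis to ntdll.dll, can be sketched as a simple filter over a recorded trace. The record layout below (an `ApiCall` tuple with `module` and `function` fields) is an assumption made for illustration; the format of the published dataset may differ.

```python
from typing import NamedTuple


class ApiCall(NamedTuple):
    """Hypothetical trace record: which DLL exported the function, and its name."""
    module: str
    function: str


def ntdll_call_set(trace: list[ApiCall]) -> frozenset[str]:
    """Keep only functions exported by ntdll.dll, as an unordered set of names."""
    return frozenset(
        call.function
        for call in trace
        if call.module.lower() == "ntdll.dll"  # module names are compared case-insensitively
    )


# Invented trace mixing several modules.
trace = [
    ApiCall("kernel32.dll", "CreateFileW"),
    ApiCall("ntdll.dll", "NtCreateFile"),
    ApiCall("NTDLL.DLL", "NtWriteFile"),
    ApiCall("ntdll.dll", "NtCreateFile"),  # repeated call collapses in the set
    ApiCall("ws2_32.dll", "connect"),
]

ntdll_only = ntdll_call_set(trace)
print(sorted(ntdll_only))
```

Filtering this way shrinks the feature space to a single library's exports, which is consistent with the paper's claim that only a subset of calls is needed.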

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes API calls order-invariantly

Utilizes random forests for detection

Focuses on ntdll.dll library calls
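As a rough illustration of how the three ideas above could fit together, the sketch below trains a random forest on order-invariant presence vectors. It assumes scikit-learn is installed and uses invented toy data; the hyperparameters, feature encoding, and labels are placeholders, not the paper's configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Hypothetical feature columns: one presence flag per ntdll.dll function.
api_names = ["NtClose", "NtCreateFile", "NtOpenProcess", "NtWriteVirtualMemory"]

# Invented toy samples: 1 means the function was called at least once.
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
]
y = ["benign"] * 4 + ["malicious"] * 4

# Placeholder hyperparameters; random_state makes the run repeatable.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Scoring on the training data only shows the API usage;
# a real evaluation needs a held-out test set.
predictions = model.predict(X)
train_f1 = f1_score(y, predictions, average="macro")
print(predictions, train_f1)
```

The F1-score printed here says nothing about real-world performance; the figure of over 85% reported in the abstract comes from the authors' own evaluation on their dataset.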
