Explainability-Guided Adversarial Attacks on Transformer-Based Malware Detectors Using Control Flow Graphs

📅 2026-04-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited adversarial robustness of existing Transformer-based malware detectors, such as those built on RoBERTa, that linearize control flow graphs into sequences of function calls. It proposes the first approach that uses an interpretability technique, Integrated Gradients, to guide white-box adversarial attacks. By analyzing token- and word-level attributions, the method identifies critical function calls and replaces them with semantically equivalent synthetic external imports, perturbing the input while preserving the program's overall structure. Experimental results demonstrate that this strategy reliably induces misclassification in high-accuracy models across both small and large Windows PE datasets, exposing a key vulnerability of the graph-to-sequence modeling paradigm that interpretability mechanisms make exploitable.
📝 Abstract
Transformer-based malware detection systems operating on graph modalities such as control flow graphs (CFGs) achieve strong performance by modeling structural relationships in program behavior. However, their robustness to adversarial evasion attacks remains underexplored. This paper examines the vulnerability of a RoBERTa-based malware detector that linearizes CFGs into sequences of function calls, a design choice that enables transformer modeling but may introduce token-level sensitivities and ordering artifacts exploitable by adversaries. By evaluating evasion strategies within this graph-to-sequence framework, we provide insight into the practical robustness of transformer-based malware detectors beyond aggregate detection accuracy. This paper proposes a white-box adversarial evasion attack that leverages explainability mechanisms to identify and perturb the most influential graph components. Using token- and word-level attributions derived from Integrated Gradients, the attack iteratively replaces positively attributed function calls with synthetic external imports, producing adversarial CFG representations without altering the overall program structure. Experimental evaluation on small- and large-scale Windows Portable Executable (PE) datasets demonstrates that the proposed method can reliably induce misclassification, even against models trained to high accuracy. Our results highlight that explainability tools, while valuable for interpretability, can also expose critical attack surfaces in transformer-based malware detectors.
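The attack loop the abstract describes (attribute each function call with Integrated Gradients, replace the most positively attributed call with a synthetic external import, repeat until the detector flips) can be sketched on a toy logistic detector. Everything here is an illustrative assumption: the call names, the weights, and the `detector_score` stand-in are not the paper's RoBERTa model, which operates on linearized CFG token sequences.

```python
import math

# Illustrative per-call weights standing in for a trained detector's
# learned sensitivities (assumed values, not taken from the paper).
WEIGHTS = {
    "CreateRemoteThread": 2.0,
    "WriteProcessMemory": 1.8,
    "VirtualAllocEx": 1.5,
    "printf": -0.5,
}

def detector_score(calls, weights):
    """Toy stand-in for the detector: logistic score over the summed
    weights of the function calls in the linearized CFG sequence."""
    z = sum(weights.get(c, 0.0) for c in calls)
    return 1.0 / (1.0 + math.exp(-z))

def integrated_gradients(calls, weights, steps=64):
    """Numerical Integrated Gradients per call against an empty-sequence
    (all-calls-absent) baseline: attr_i = (x_i - 0) * mean gradient of
    the score along the straight-line path from baseline to input."""
    z_full = sum(weights.get(c, 0.0) for c in calls)
    attrs = []
    for c in calls:
        w = weights.get(c, 0.0)
        grad_sum = 0.0
        for k in range(1, steps + 1):
            s = 1.0 / (1.0 + math.exp(-(k / steps) * z_full))
            grad_sum += w * s * (1.0 - s)  # d(sigmoid)/dx_i at path point
        attrs.append(grad_sum / steps)
    return attrs

def ig_guided_attack(calls, weights, threshold=0.5, max_iters=10):
    """Iteratively replace the most positively attributed call with a
    synthetic external import (weight 0, i.e. unknown to the detector)
    until the score drops below the detection threshold."""
    calls = list(calls)
    for i in range(max_iters):
        if detector_score(calls, weights) < threshold:
            break
        attrs = integrated_gradients(calls, weights)
        j = max(range(len(calls)), key=lambda k: attrs[k])
        if attrs[j] <= 0:  # nothing left pushing toward "malicious"
            break
        calls[j] = f"synthetic_import_{i}"  # hypothetical import name
    return calls
```

Because only positively attributed calls are swapped out, benign calls such as `printf` survive untouched, mirroring the paper's claim that the perturbation preserves the program's overall structure rather than rewriting the whole sequence.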
Problem

Research questions and friction points this paper is trying to address.

adversarial attacks
malware detection
transformer models
control flow graphs
explainability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainability-Guided Attack
Transformer-Based Malware Detection
Control Flow Graph
Adversarial Evasion
Integrated Gradients