PROVSYN: Synthesizing Provenance Graphs for Data Augmentation in Intrusion Detection Systems

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Severe class imbalance in provenance graph data degrades the performance of GNN- and NLP-based APT detection models. Method: This paper proposes PROVSYN, which it presents as the first end-to-end provenance graph synthesis framework. It combines joint structure-semantics modeling, rule-driven topological refinement, and LLM-guided textual attribute generation to produce high-fidelity, semantically correct, and temporally consistent attack graphs. Contribution/Results: The authors design a multidimensional evaluation protocol covering structural, textual, temporal, and embedding properties, augmented by semantic-logical validation to check attack plausibility. Experiments across multiple APT detection tasks report a 12.7% improvement in F1-score after augmentation, reducing recognition bias for rare attack classes. The framework offers a verifiable, graph-centric data augmentation approach for APT detection.

📝 Abstract
Provenance graph analysis plays a vital role in intrusion detection, particularly against Advanced Persistent Threats (APTs), by exposing complex attack patterns. While recent systems combine graph neural networks (GNNs) with natural language processing (NLP) to capture structural and semantic features, their effectiveness is limited by class imbalance in real-world data. To address this, we introduce PROVSYN, an automated framework that synthesizes provenance graphs through a three-phase pipeline: (1) heterogeneous graph structure synthesis with structural-semantic modeling, (2) rule-based topological refinement, and (3) context-aware textual attribute synthesis using large language models (LLMs). PROVSYN includes a comprehensive evaluation framework that integrates structural, textual, temporal, and embedding-based metrics, along with a semantic validation mechanism to assess the correctness of generated attack patterns and system behaviors. To demonstrate practical utility, we use the synthetic graphs to augment training datasets for downstream APT detection models. Experimental results show that PROVSYN produces high-fidelity graphs and improves detection performance through effective data augmentation.
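
The augmentation step mentioned in the abstract can be illustrated with a minimal sketch. It is not PROVSYN's implementation: the function name `synthetic_quota`, the `target_ratio` parameter, and the example labels are all assumptions made for illustration. The sketch only computes how many synthetic graphs each class would need so that rare classes reach a chosen fraction of the largest class.

```python
from collections import Counter


def synthetic_quota(labels, target_ratio=1.0):
    """Return, per class, how many synthetic samples are needed to reach
    target_ratio * (size of the largest class).

    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = Counter(labels)
    # Target size every class should reach (rounded down).
    target = int(max(counts.values()) * target_ratio)
    # Classes already at or above the target need no synthetic samples.
    return {cls: max(0, target - n) for cls, n in counts.items()}


if __name__ == "__main__":
    # Assumed toy label distribution: 95 benign graphs, 5 attack graphs.
    labels = ["benign"] * 95 + ["attack"] * 5
    print(synthetic_quota(labels, target_ratio=0.5))
    # → {'benign': 0, 'attack': 42}
```

In practice the quota for each rare class would be filled with generated graphs before training the downstream detector; how many to add is a tuning choice, and full balancing (`target_ratio=1.0`) is not necessarily the best setting.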

Problem

Research questions and friction points this paper is trying to address.

Addressing class imbalance in intrusion detection data
Synthesizing provenance graphs for APT detection augmentation
Improving detection performance with high-fidelity synthetic graphs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous graph synthesis with structural-semantic modeling
Rule-based topological refinement for provenance graphs
Context-aware attribute synthesis using large language models
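
As a rough illustration of what rule-based topological refinement could look like, the sketch below drops edges from a generated graph whose (source type, relation, target type) triple is not permitted. The rule set `ALLOWED`, the node types, and the function `refine_edges` are hypothetical and are not taken from the paper, which does not list its rules on this page.

```python
# Hypothetical rule set: permitted (source type, relation, target type) triples.
# Real provenance schemas define their own entity and event types.
ALLOWED = {
    ("process", "fork", "process"),
    ("process", "read", "file"),
    ("process", "write", "file"),
    ("process", "connect", "socket"),
}


def refine_edges(node_types, edges):
    """Keep only edges whose endpoints exist and whose typed triple is allowed.

    node_types: dict mapping node id -> type string
    edges: iterable of (src_id, relation, dst_id) tuples
    """
    kept = []
    for src, rel, dst in edges:
        if src not in node_types or dst not in node_types:
            continue  # dangling edge: an endpoint was never generated
        if (node_types[src], rel, node_types[dst]) in ALLOWED:
            kept.append((src, rel, dst))
    return kept


if __name__ == "__main__":
    node_types = {"p1": "process", "p2": "process", "f1": "file"}
    edges = [
        ("p1", "fork", "p2"),   # valid
        ("p2", "write", "f1"),  # valid
        ("f1", "read", "p1"),   # invalid: a file cannot be the actor
        ("p1", "read", "f9"),   # invalid: f9 does not exist
    ]
    print(refine_edges(node_types, edges))
    # → [('p1', 'fork', 'p2'), ('p2', 'write', 'f1')]
```

A filter like this only removes structurally impossible edges. Temporal consistency and attack plausibility, which the summary also mentions, would need separate checks.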
Yi Huang
Key Laboratory of High-Confidence Software Technologies (MOE), School of Computer Science, Peking University
Wajih Ul Hassan
University of Virginia
Yao Guo
Beijing Institute of Technology
Xiangqun Chen
Key Laboratory of High-Confidence Software Technologies (MOE), School of Computer Science, Peking University
Ding Li
Key Laboratory of High-Confidence Software Technologies (MOE), School of Computer Science, Peking University