🤖 AI Summary
To address the challenge of detecting previously unseen malicious behaviors in Advanced Persistent Threat (APT) scenarios, this paper proposes a knowledge-driven, semantically enhanced provenance analysis method. It is the first to systematically integrate the deep semantic understanding of large language models (LLaMA/Qwen) into provenance graph modeling, jointly encoding system call sequences, process-relational graph structures, and event textual semantics to produce fine-grained semantic embeddings. Unlike conventional rule-based or shallow-feature approaches, the method introduces a hybrid framework of supervised and semi-supervised learning, comprising XGBoost and GraphSAGE, to jointly infer execution context, software identity, and behavioral intent. Evaluated on real-world enterprise data, the supervised variant achieves 99.0% precision in detecting known APTs, while the semi-supervised variant attains 96.9% precision in anomaly detection, both substantially outperforming state-of-the-art provenance-based detection methods.
📝 Abstract
Advanced Persistent Threats (APTs) have caused significant losses across a wide range of sectors, including the theft of sensitive data and harm to system integrity. As attack techniques grow increasingly sophisticated and stealthy, the arms race between cyber defenders and attackers continues to intensify. The revolutionary impact of Large Language Models (LLMs) has opened up numerous opportunities in various fields, including cybersecurity. An intriguing question arises: can the extensive knowledge embedded in LLMs be harnessed for provenance analysis and play a positive role in identifying previously unknown malicious events? To seek a deeper understanding of this issue, we propose a new strategy for taking advantage of LLMs in provenance-based threat detection. In our design, state-of-the-art LLMs offer additional detail in provenance data interpretation, leveraging their knowledge of system calls, software identity, and high-level understanding of application execution context. Their advanced contextualized embedding capability is further utilized to capture the rich semantics of event descriptions. We comprehensively examine the quality of the resulting embeddings and find that they offer promising avenues. Machine learning models built upon these embeddings then demonstrate outstanding performance on real-world data. In our evaluation, supervised threat detection achieves a precision of 99.0%, and semi-supervised anomaly detection attains a precision of 96.9%.
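The pipeline the abstract describes, semantic embeddings of provenance events fed into a supervised classifier, can be sketched minimally as follows. This is an illustration only, not the paper's implementation: synthetic random vectors stand in for the LLM-derived event embeddings, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost.

```python
# Hedged sketch of embedding-based threat detection: synthetic vectors
# stand in for LLM semantic embeddings of provenance events, and
# GradientBoostingClassifier stands in for the paper's XGBoost model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend each provenance event was embedded into a 64-dim semantic vector;
# malicious events form a shifted cluster (a toy stand-in for real data).
n_per_class, dim = 300, 64
benign = rng.normal(0.0, 1.0, size=(n_per_class, dim))
malicious = rng.normal(1.5, 1.0, size=(n_per_class, dim))
X = np.vstack([benign, malicious])
y = np.array([0] * n_per_class + [1] * n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Supervised detector trained on the event embeddings.
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.3f}")
```

The semi-supervised variant in the paper additionally propagates embedding information over the provenance graph with GraphSAGE, which this sketch omits.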