Semantic-Aware Advanced Persistent Threat Detection Using Autoencoders on LLM-Encoded System Logs

📅 2026-01-30
🤖 AI Summary
This work proposes an unsupervised detection method that combines large language models with autoencoders to identify "low-and-slow" advanced persistent threat (APT) attacks, which often evade traditional detection techniques. A pre-trained Transformer converts system logs into semantic embeddings that capture high-level operational intent, and an autoencoder then flags anomalous patterns in those embeddings. By bringing semantic-level representations into APT detection, the method addresses a limitation of conventional approaches that rely solely on structural or statistical features. Experiments on the DARPA TC dataset show that the proposed technique achieves higher AUC-ROC than baseline methods such as Isolation Forest, One-Class SVM, and PCA, improving the detection of stealthy APT activity.

📝 Abstract
Advanced Persistent Threats (APTs) are among the most challenging cyberattacks to detect. They are carried out by highly skilled attackers who carefully study their targets and operate in a stealthy, long-term manner. Because APTs exhibit "low-and-slow" behavior, traditional statistical methods and shallow machine learning techniques often fail to detect them. Previous research on APT detection has explored machine learning approaches and provenance graph analysis. However, provenance-based methods often fail to capture the semantic intent behind system activities. This paper proposes a novel anomaly detection approach that leverages semantic embeddings generated by Large Language Models (LLMs). The method enhances APT detection by extracting meaningful semantic representations from unstructured system log data. First, raw system logs are transformed into high-dimensional semantic embeddings using a pre-trained transformer model. These embeddings are then analyzed using an Autoencoder (AE) to identify anomalous and potentially malicious patterns. The proposed method is evaluated using the DARPA Transparent Computing (TC) dataset, which contains realistic APT attack scenarios generated by red teams in live environments. Experimental results show that the AE trained on LLM-derived embeddings outperforms widely used unsupervised baseline methods, including Isolation Forest (IForest), One-Class Support Vector Machine (OC-SVM), and Principal Component Analysis (PCA). Performance is measured using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), where the proposed approach consistently achieves superior results, even in complex threat scenarios. These findings highlight the importance of semantic understanding in detecting non-linear and stealthy attack behaviors that are often missed by conventional detection techniques.
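The scoring stage described in the abstract (train an autoencoder on embeddings, use reconstruction error as the anomaly score, evaluate with AUC-ROC) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the embeddings here are synthetic stand-ins rather than LLM-encoded DARPA TC logs, and the network size, learning rate, and every function name (`synthetic_embeddings`, `train_autoencoder`, `reconstruction_error`, `auc_roc`, `run_demo`) are assumptions chosen for illustration.

```python
import numpy as np

def synthetic_embeddings(seed=0, dim=32, latent=4, n_benign=400, n_attack=40):
    """Hypothetical stand-in for LLM log embeddings (NOT real data).
    Benign vectors lie near a low-dimensional subspace; 'attack' vectors do not."""
    rng = np.random.default_rng(seed)
    basis = rng.normal(size=(latent, dim))
    benign = rng.normal(size=(n_benign, latent)) @ basis
    benign += 0.05 * rng.normal(size=(n_benign, dim))
    attack = rng.normal(size=(n_attack, dim))
    unit = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit(benign), unit(attack)  # L2-normalised, as is common for embeddings

def train_autoencoder(X, hidden=8, epochs=2000, lr=0.1, seed=0):
    """One-hidden-layer tanh autoencoder trained with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # encoder activations
        R = H @ W2 + b2                     # reconstruction
        G = 2.0 * (R - X) / n               # gradient of the batch-averaged squared error w.r.t. R
        GH = (G @ W2.T) * (1.0 - H ** 2)    # backpropagate through tanh
        W2 -= lr * (H.T @ G); b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ GH); b1 -= lr * GH.sum(axis=0)
    return W1, b1, W2, b2

def reconstruction_error(params, X):
    """Per-sample mean squared reconstruction error, used as the anomaly score."""
    W1, b1, W2, b2 = params
    R = np.tanh(X @ W1 + b1) @ W2 + b2
    return ((R - X) ** 2).mean(axis=1)

def auc_roc(scores, labels):
    """AUC-ROC as the probability that a random positive outscores a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def run_demo(seed=0):
    benign, attack = synthetic_embeddings(seed)
    train, held_out = benign[:300], benign[300:]   # train on benign data only
    params = train_autoencoder(train)
    X_test = np.vstack([held_out, attack])
    y_test = np.concatenate([np.zeros(len(held_out), dtype=int),
                             np.ones(len(attack), dtype=int)])
    return auc_roc(reconstruction_error(params, X_test), y_test)

if __name__ == "__main__":
    print(f"AUC-ROC on synthetic data: {run_demo():.3f}")
```

Because the autoencoder only sees benign vectors during training, it learns to reconstruct that distribution and tends to reconstruct out-of-distribution vectors poorly. The number printed here reflects only the synthetic setup and says nothing about the performance reported in the paper.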
Problem

Research questions and friction points this paper is trying to address.

Advanced Persistent Threat
semantic intent
anomaly detection
system logs
cybersecurity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Embedding
Large Language Model (LLM)
Autoencoder
Advanced Persistent Threat (APT)
Anomaly Detection
Waleed Khan Mohammed
Faculty of Artificial Intelligence and Engineering, Multimedia University Cyberjaya, Malaysia
Zahirul Arief Irfan Bin Shahrul Anuar
Faculty of Artificial Intelligence and Engineering, Multimedia University Cyberjaya, Malaysia
Mousa Sufian Mousa Mitani
Faculty of Artificial Intelligence and Engineering, Multimedia University Cyberjaya, Malaysia
Hezerul Abdul Karim
Professor, Faculty of Engineering, Multimedia University, Cyberjaya, Selangor, Malaysia
3D image and video coding, video transmission over cognitive radio, error resilience, telemetry
Nouar AlDahoul
PhD, AI Research Scientist at New York University Abu Dhabi, UAE
Social Science, Large Language Models, Machine Learning, Computer Vision, Internet of Things