LogPurge: Log Data Purification for Anomaly Detection via Rule-Enhanced Filtering

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Log anomaly detection suffers from a scarcity of high-quality, anomaly-free training data: manual labeling is prohibitively expensive, and existing automated cleaning methods neglect log structural characteristics and system-level semantics. To address this, the authors propose LogPurge, a rule-enhanced, two-stage iterative purification framework. In Stage I, a large language model (LLM), guided by system-aware rules such as timestamp consistency and template-frequency constraints, removes clustered anomalous patterns; in Stage II, a divide-and-conquer strategy decomposes the remaining contaminated regions into smaller segments, each re-purified with the Stage I procedure. Evaluated on two public benchmarks and one industrial dataset, LogPurge removes an average of 98.74% of anomalies while preserving 82.39% of normal sequences, and its purified logs improve downstream detector F1-scores by up to 149.72% over state-of-the-art sample-selection methods.

📝 Abstract
Log anomaly detection, which is critical for identifying system failures and preempting security breaches, detects irregular patterns within large volumes of log data, and impacts domains such as service reliability, performance optimization, and database log analysis. Modern log anomaly detection methods rely on training deep learning models on clean, anomaly-free log sequences. However, obtaining such clean log data requires costly and tedious human labeling, and existing automatic cleaning methods fail to fully integrate the specific characteristics and actual semantics of logs in their purification process. In this paper, we propose a cost-aware, rule-enhanced purification framework, LogPurge, that automatically selects a sufficient subset of normal log sequences from contaminated log sequences to train an anomaly detection model. Our approach involves a two-stage filtering algorithm: in the first stage, we use a large language model (LLM) to remove clustered anomalous patterns, incorporating system rules to improve the LLM's understanding of system logs; in the second stage, we utilize a divide-and-conquer strategy that decomposes the remaining contaminated regions into smaller subproblems, allowing each to be effectively purified through the first-stage procedure. Our experiments, conducted on two public datasets and one industrial dataset, show that our method removes an average of 98.74% of anomalies while retaining 82.39% of normal samples. Compared to the latest unsupervised log sample selection algorithms, our method achieves F1-score improvements of 35.7% and 84.11% on the public datasets, and a 149.72% F1-score improvement on the private dataset, demonstrating the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Automatically purifying contaminated log data for anomaly detection training
Reducing costly manual labeling of clean log sequences for AI models
Enhancing log anomaly detection by integrating rule-based filtering with LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rule-enhanced filtering framework purifies log data
Two-stage algorithm uses LLM and divide-and-conquer strategy
Automatically selects normal log sequences for training models
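The two-stage procedure described above can be sketched in Python. This is an illustrative toy, not the paper's implementation: `llm_judge` is a stand-in for the actual LLM call, and the timestamp-ordering and template-frequency checks are simplified versions of the system-aware rules the paper describes. Log sequences are modeled as lists of `(timestamp, template)` pairs.

```python
from collections import Counter

def rule_checks(seq, template_freq, min_freq=2):
    # Simplified system-aware rules: reject sequences whose timestamps
    # go backwards or that contain templates rarer than min_freq.
    times = [t for t, _ in seq]
    if any(b < a for a, b in zip(times, times[1:])):
        return False
    return all(template_freq[tpl] >= min_freq for _, tpl in seq)

def llm_judge(seq):
    # Stand-in for the paper's LLM query. Here, a trivial heuristic that
    # flags sequences containing an "ERROR" template; a real system would
    # prompt an LLM with the log segment plus the system rules.
    return all(tpl != "ERROR" for _, tpl in seq)

def purge(sequences, template_freq=None, depth=0, max_depth=3):
    """Two-stage iterative purification (sketch).

    Stage 1: keep sequences that pass both the LLM judgment and the
    system-aware rules. Stage 2: split each rejected sequence in half
    (divide and conquer) and recurse, salvaging clean sub-segments.
    """
    if template_freq is None:
        # Compute template frequencies once, on the full contaminated set,
        # so recursion on small fragments does not skew the rarity rule.
        template_freq = Counter(tpl for seq in sequences for _, tpl in seq)
    clean, dirty = [], []
    for seq in sequences:
        target = clean if llm_judge(seq) and rule_checks(seq, template_freq) else dirty
        target.append(seq)
    if dirty and depth < max_depth:
        halves = [half for seq in dirty if len(seq) > 1
                  for half in (seq[:len(seq) // 2], seq[len(seq) // 2:])]
        clean += purge(halves, template_freq, depth + 1, max_depth)
    return clean
```

For example, `purge([[(1, "open"), (2, "read"), (3, "close")], [(1, "open"), (2, "ERROR"), (3, "read"), (4, "close")]])` keeps the clean sequence intact and salvages the anomaly-free halves of the contaminated one, discarding only the segment containing the `ERROR` template. The purified output would then serve as training data for a downstream detector.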