Anomaly Detection and Root Cause Analysis for Microservice Systems

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Microservice systems fail frequently, yet existing approaches to anomaly detection and root cause analysis (RCA) often treat the two tasks in isolation, rely on a predefined service call graph, neglect event-based observability data, and lack a unified evaluation benchmark. This thesis proposes end-to-end methods that exploit heterogeneous observability signals, such as metrics and events, both individually and jointly: BARO for metric data, EventADL for event data, and TORAI, a multimodal RCA framework that needs no prior knowledge of service dependencies. It also introduces RCAEval, a comprehensive benchmark for microservice RCA that provides ready-to-use datasets and reproducible baselines. Experiments on real microservice systems demonstrate the effectiveness and robustness of the proposed methods, and a systematic evaluation of existing techniques, especially causal inference-based ones, delineates their performance boundaries and establishes a reproducible foundation for future research.
📝 Abstract
Microservice systems are widely used to build cloud applications, yet their complexity makes failures inevitable, degrading user experience and causing economic loss. Automated anomaly detection and root cause analysis (RCA) are now active research areas, but existing techniques share five limitations. First, most treat anomaly detection and RCA separately, assuming anomalies are detected correctly, and falter when detection is imprecise due to noise or delay. Second, they focus on metrics, logs, and traces, leaving event data such as API calls and configuration changes underexplored. Third, many require a given service call graph and cannot diagnose without one. Fourth, the field lacks standardised datasets and evaluation frameworks, so methods are hard to compare fairly. Fifth, although causal inference-based RCA has become dominant, its effectiveness, efficiency, and robustness remain unclear. This thesis addresses these limitations through two groups of contributions. The first introduces methods that exploit observability data both independently and collectively. BARO is an end-to-end anomaly detection and RCA approach for metric data. EventADL is an end-to-end framework for event data. TORAI is a multimodal RCA framework that requires no service call graph. Extensive experiments on real microservice systems demonstrate their effectiveness and robustness. The second group delivers benchmarking datasets, an evaluation framework, and systematic evaluation efforts. RCAEval is a comprehensive benchmark providing ready-to-use datasets and reproducible baselines for future research. A systematic evaluation of existing RCA methods, especially causal inference-based approaches, offers insights that guide future directions. This thesis thereby advances automated anomaly detection and RCA for microservice failures, enabling future research on incident mitigation and remediation.
Problem

Research questions and friction points this paper is trying to address.

anomaly detection
root cause analysis
microservice systems
causal inference
observability data
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end anomaly detection and RCA
event data exploitation
call-graph-free multimodal RCA
standardized benchmarking
causal inference evaluation