🤖 AI Summary
This study addresses the difficulty of root cause localization in complex cloud networks and the limited generalizability of rule-based automation. It proposes an approach that combines spatiotemporal grouping, an automation ontology, and a time-aware causal graph. The causal graph is built from binary time series using bivariate Granger causality tests and conditional independence tests, and inference assigns each edge a conditional probability that depends on time lag, which enables interpretable root cause scoring via graph traversal. Evaluated on 35 labeled production incidents, the method recalls the correct root cause in 85.7% of incidents and produces an exact match in 74.3%. The deployed system has been used in over 800 real-world incidents and has received positive qualitative feedback from network engineers.
📝 Abstract
Cloud computing relies on large-scale networks, which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal.
We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.
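The time-lagged inference step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the causal graph structure is assumed to be given already (the Granger causality and conditional independence tests are not shown), and the alert names, the empirical-frequency estimator, the product-of-probabilities path score, and the functions `lagged_conditional_prob` and `score_root_causes` are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def lagged_conditional_prob(cause, effect, max_lag):
    """Estimate P(effect[t + lag] = 1 | cause[t] = 1) for lag = 0..max_lag.

    Both inputs are equal-length binary (0/1) time series. This plain
    frequency estimate is an assumption of the sketch, not the paper's method.
    """
    probs = {}
    for lag in range(max_lag + 1):
        pairs = list(zip(cause, effect[lag:]))  # align cause[t] with effect[t + lag]
        fired = sum(c for c, _ in pairs)
        both = sum(1 for c, e in pairs if c and e)
        probs[lag] = both / fired if fired else 0.0
    return probs


def score_root_causes(edges, first_seen, symptom):
    """Walk upstream from `symptom` and score the nodes where the walk ends.

    edges: {(src, dst): {lag: probability}} for an assumed causal graph.
    first_seen: {node: time step at which the node became active}.
    A path score is the product of edge probabilities at the observed lags.
    """
    parents = defaultdict(list)
    for (src, dst), probs in edges.items():
        parents[dst].append((src, probs))

    scores = {}
    stack = [(symptom, 1.0, {symptom})]
    while stack:
        node, score, seen = stack.pop()
        extended = False
        for src, probs in parents[node]:
            if src in seen or src not in first_seen:
                continue  # skip cycles and inactive nodes
            lag = first_seen[node] - first_seen[src]
            p = probs.get(lag, 0.0)
            if p > 0.0:
                extended = True
                stack.append((src, score * p, seen | {src}))
        if not extended and node != symptom:
            # No active upstream explanation: treat as a root cause candidate.
            scores[node] = max(scores.get(node, 0.0), score)
    return scores


# Hypothetical binary alert histories, one sample per time step.
link_down = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
bgp_flap = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
packet_loss = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

edges = {
    ("link_down", "bgp_flap"): lagged_conditional_prob(link_down, bgp_flap, 2),
    ("bgp_flap", "packet_loss"): lagged_conditional_prob(bgp_flap, packet_loss, 2),
}

# Hypothetical incident: each alert first fires one step after its upstream alert.
first_seen = {"link_down": 100, "bgp_flap": 101, "packet_loss": 102}
scores = score_root_causes(edges, first_seen, "packet_loss")
print(scores)
```

In this toy incident the traversal ends at `link_down`, the only active alert with no active upstream parent, and its score is the product of the two edge probabilities at a lag of one step. Keying each edge's probabilities by lag means the same edge contributes differently depending on how far apart the two alerts fired, which is the time-aware behavior the abstract describes.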