🤖 AI Summary
To address the difficulty of localizing anomalies and inferring root causes from multi-modal data (traces, logs, and metrics) in distributed microservices, this paper proposes CHASE, an end-to-end root-cause analysis framework that combines causal modeling with heterogeneous graph and hypergraph learning. Methodologically, it encodes the multi-modal data into embeddings and organizes them in a multimodal invocation graph; detects anomalies on each instance node with attention-based heterogeneous message passing from adjacent metric and log nodes; and localizes root causes by learning on a hypergraph whose hyperedges represent the flow of causality. Its key departure from conventional homogeneous graph representations and purely correlation-based approaches is that it models causal structure explicitly for multi-modal root-cause localization. Evaluated on two public microservice datasets, the framework achieves average gains of up to 36.2% in A@1 and 29.4% in Percentage@1 over the best competing method.
📝 Abstract
In recent years, the widespread adoption of distributed microservice architectures in industry has significantly increased the demand for system availability and robustness. Because service invocation paths and dependencies in enterprise-level microservice systems are complex, it is difficult to locate anomalies promptly during service invocations, which creates intractable problems for normal system operation and maintenance. In this paper, we propose CHASE, a Causal Heterogeneous grAph baSed framEwork for root cause analysis in microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, the relevant information is encoded into representative embeddings and then modeled as a multimodal invocation graph. Anomaly detection is then performed on each instance node through attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from a constructed hypergraph whose hyperedges represent the flow of causality, and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare it with state-of-the-art methods. The results show that CHASE achieves average performance gains of up to 36.2% (A@1) and 29.4% (Percentage@1) over its best counterpart.
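To make the anomaly detection step more concrete, the following is a minimal sketch of attention-weighted message passing from metric and log nodes into a single instance node. It is an illustration under stated assumptions, not the CHASE implementation: the abstract does not specify the attention form, so scaled dot-product attention is assumed, and the names `attentive_aggregate`, `projections`, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_aggregate(instance_vec, neighbors, projections):
    """Aggregate heterogeneous neighbors into one message for an instance node.

    neighbors:   list of (node_type, embedding) pairs, e.g. ("metric", [...]).
    projections: one projection matrix per node type (an assumption), which
                 maps each modality into the instance node's embedding space.
    Returns the attention weights and the aggregated message vector.
    """
    projected = [matvec(projections[node_type], vec) for node_type, vec in neighbors]
    scale = math.sqrt(len(instance_vec))
    # Assumed scoring rule: scaled dot product between instance and neighbor.
    scores = [dot(instance_vec, p) / scale for p in projected]
    weights = softmax(scores)
    message = [
        sum(w * p[i] for w, p in zip(weights, projected))
        for i in range(len(instance_vec))
    ]
    return weights, message


# Toy example: one instance node with one metric neighbor and one log neighbor.
identity = [[1.0, 0.0], [0.0, 1.0]]
projections = {"metric": identity, "log": identity}
instance = [1.0, 0.0]
neighbors = [("metric", [1.0, 0.0]), ("log", [0.0, 1.0])]

weights, message = attentive_aggregate(instance, neighbors, projections)
print(weights, message)
```

In this toy setup the metric neighbor is aligned with the instance embedding, so it receives the larger attention weight. In a learned model the projection matrices would be trained parameters, and the aggregated message would feed a per-instance anomaly classifier.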