🤖 AI Summary
Existing approaches to root cause analysis (RCA) in Kubernetes often rely on scenario-specific heuristics and lack auditability and reproducibility. This work proposes Graph Traversal Agent, a graph-guided RCA agent that encodes operational constraints (read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts) in a typed evidence graph, a LangGraph state machine, and a separate validation stage. The LLM reasons over the graph, while deterministic graph and tool operations over ITBench OpenTelemetry-demo snapshots collect evidence, bound the search, and check proposed verdicts. On a 23-scenario common subset of ITBench, scored by one fixed qwen-plus judge, root-cause-entity F1 rises from 0.6087 for an earlier iteration of the same system to 0.9130. With scenario-specific prompt hints removed, the system retains 0.6958 F1 on a 19-scenario subset. The authors report this surviving gain as benchmark-coupled, concentrated on ChaosMesh scenarios, rather than as evidence of broad cross-cluster generalization.
📝 Abstract
Kubernetes incidents are diagnosed reliably only when a root-cause system's reported gains come from incident evidence rather than scenario-specific shortcuts. We present Graph Traversal Agent, a graph-guided RCA agent that combines LLM reasoning with specialized tools. The model reasons over a typed evidence graph, while deterministic graph and tool operations collect evidence, bound the search, and check proposed verdicts. We map operational constraints, including read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts, to a typed incident graph, a LangGraph traversal state machine, and a separate validation stage. On ITBench snapshots scored by one fixed qwen-plus judge, the audited system raises root-cause-entity F1 over an earlier iteration of the same system from 0.6087 to 0.9130 on a 23-scenario common subset. A prompt-level ablation separates prompt-tuned gains from gains that survive once scenario-specific hints are removed: the stripped-prompt configuration retains 0.6958 F1 on a 19-scenario subset. The surviving gain concentrates on ChaosMesh scenarios whose ground-truth root cause is the injected fault object already present in the evidence graph, so we report it as benchmark-coupled rather than broad cross-cluster RCA evidence. Lightweight checks, including same-judge comparison, prompt-level ablation, cascade-source checking, and a telemetry no-leak test, mark claims as supported, pending, or out of scope. We scope the work to ITBench OpenTelemetry-demo snapshots. Live-cluster trials served as an engineering stress test, but alert state and trace availability did not stay stable enough for controlled scoring, so we make no production-readiness or mean-time-to-repair claim.
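The pipeline described in the abstract (a typed evidence graph, a bounded traversal, and a separate validation stage) can be illustrated with a minimal Python sketch. This is not the authors' implementation and it does not use LangGraph. The node kinds, relation names, example services, and the functions `bounded_traverse`, `validate_verdict`, and `demo_graph` are all hypothetical, chosen only to show how a step budget and an independent reachability check might fit together.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: str
    kind: str  # hypothetical type label, e.g. "Service" or "FaultInjection"


@dataclass
class EvidenceGraph:
    nodes: dict = field(default_factory=dict)  # node id -> Node
    edges: dict = field(default_factory=dict)  # node id -> [(relation, dst id)]

    def add_node(self, node_id, kind):
        self.nodes[node_id] = Node(node_id, kind)
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, relation, dst):
        # Typed edge; both endpoints must already be in the graph.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[src].append((relation, dst))


def bounded_traverse(graph, start, max_steps):
    """Breadth-first walk from the alerting node, visiting at most max_steps nodes.

    The graph is only read, never modified.
    """
    if start not in graph.nodes:
        return []
    visited, seen, queue = [], {start}, deque([start])
    while queue and len(visited) < max_steps:
        current = queue.popleft()
        visited.append(current)
        for _relation, dst in graph.edges[current]:
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return visited


def validate_verdict(graph, start, candidate):
    """Separate check: the candidate must exist and be reachable from the alert.

    Reachability is recomputed here instead of trusting the traversal's output.
    """
    if start not in graph.nodes or candidate not in graph.nodes:
        return False
    stack, seen = [start], {start}
    while stack:
        current = stack.pop()
        if current == candidate:
            return True
        for _relation, dst in graph.edges[current]:
            if dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return False


def demo_graph():
    # Hypothetical incident: symptoms point from the alerting service toward a fault object.
    g = EvidenceGraph()
    for node_id, kind in [("frontend", "Service"), ("checkout", "Service"),
                          ("payment", "Service"), ("cart", "Service"),
                          ("chaos-pod-kill", "FaultInjection")]:
        g.add_node(node_id, kind)
    g.add_edge("frontend", "depends_on", "checkout")
    g.add_edge("checkout", "depends_on", "payment")
    g.add_edge("payment", "affected_by", "chaos-pod-kill")
    return g
```

In this toy graph, a budget of 2 stops the walk after `frontend` and `checkout`, while a larger budget reaches the injected fault object. The validator accepts `chaos-pod-kill` because it is reachable from the alerting node, and rejects `cart` (present but unreachable) and any entity absent from the graph. Keeping the check in a separate function mirrors the abstract's point that verdicts are validated independently of the search that proposed them.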