Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

📅 2026-02-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of interpretable fault-attribution mechanisms in multi-agent reinforcement learning (MARL) systems, which hinders both identification of the initial failure source and understanding of how failures propagate. The authors propose a two-stage, gradient-based explainable diagnostic framework: first, leveraging Taylor-remainder analysis combined with first-order sensitivity and second-order directional curvature derived from the critic, it detects single-agent failures within a causal window; second, it constructs a contagion graph to reveal how upstream policy deviations are amplified through coordinated strategies. The approach provides the first geometric explanation for the "downstream-first-alert" phenomenon in MARL, overcoming the limitations of black-box diagnostics. Evaluated on Simple Spread and StarCraft II using MADDPG and HATRPO, the method achieves Patient-0 detection accuracy of 88.2%–99.4% and offers interpretable geometric evidence to support diagnostic decisions.
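The Stage-1 idea described above can be sketched in a few lines: measure how far the critic's actual response to a policy deviation departs from its first-order Taylor prediction, and declare a Patient-0 candidate at the first threshold crossing. This is a minimal illustrative sketch, not the paper's implementation; the critic interface `q_fn`, the finite-difference gradient, and the threshold `tau` are all assumptions.

```python
import numpy as np

def taylor_remainder(q_fn, a, delta, eps=1e-5):
    """First-order Taylor remainder of the critic along a policy deviation.

    q_fn  : critic evaluated at this agent's action (hypothetical interface)
    a     : the agent's nominal action
    delta : the observed deviation from the nominal policy
    """
    # finite-difference gradient of the critic w.r.t. the agent's action
    g = np.array([(q_fn(a + eps * e) - q_fn(a - eps * e)) / (2 * eps)
                  for e in np.eye(len(a))])
    first_order = q_fn(a) + g @ delta
    # a large remainder signals curvature-dominated (anomalous) behavior
    return abs(q_fn(a + delta) - first_order)

def detect_patient0(remainders, tau):
    """Declare Patient-0 at the first per-agent remainder crossing tau.

    remainders : sequence over time of per-agent remainder values
    Returns (timestep, agent index) of the first crossing, or None.
    """
    for t, per_agent in enumerate(remainders):
        for i, r in enumerate(per_agent):
            if r > tau:
                return t, i
    return None
```

For a quadratic critic the remainder reduces to the pure second-order term, which is why curvature along the deviation direction carries the diagnostic signal.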

Technology Category

Application Category

๐Ÿ“ Abstract
Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives: first-order sensitivity and directional second-order curvature, aggregated over causal windows to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2–99.4% Patient-0 detection accuracy while providing interpretable geometric evidence for detection decisions. By moving beyond black-box detection to interpretable gradient-level forensics, this framework offers practical tools for diagnosing cascading failures in safety-critical MARL systems.
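The Stage-2 contagion graph can be sketched as follows: aggregate cross-agent critic sensitivities over a causal window into a weighted adjacency matrix, then trace the dominant propagation pathway from the Patient-0 candidate. This is an illustrative sketch under stated assumptions; the sensitivity tensor layout and the greedy pathway trace are simplifications of whatever aggregation the paper actually uses.

```python
import numpy as np

def contagion_graph(sensitivity, window):
    """Aggregate per-step cross-agent sensitivities over a causal window.

    sensitivity : array of shape (T, n, n), where sensitivity[t, i, j] is a
                  hypothetical first-order measure of how agent i's deviation
                  at step t moves the critic's evaluation of agent j
    Returns an (n, n) weighted adjacency matrix (the contagion graph).
    """
    T = sensitivity.shape[0]
    w = min(window, T)
    adj = np.abs(sensitivity[T - w:]).sum(axis=0) / w
    np.fill_diagonal(adj, 0.0)   # keep only cross-agent edges
    return adj

def strongest_pathway(adj, src):
    """Greedily trace the dominant propagation path from a source agent."""
    n = adj.shape[0]
    path, visited, cur = [src], {src}, src
    while True:
        candidates = [j for j in range(n) if j not in visited]
        if not candidates:
            break
        nxt = max(candidates, key=lambda j: adj[cur, j])
        if adj[cur, nxt] == 0:
            break                 # no remaining edge carries influence
        path.append(nxt)
        visited.add(nxt)
        cur = nxt
    return path
```

Reading the graph's heaviest outgoing edges from an upstream agent is what lets downstream-first alerts be explained: a downstream agent can cross the detection threshold first precisely when an amplifying pathway feeds it.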
Problem

Research questions and friction points this paper is trying to address.

Multi-Agent Reinforcement Learning
Failure Analysis
Interpretability
Cascading Failures
Fault Attribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretable failure analysis
multi-agent reinforcement learning
gradient-based diagnostics
Patient-0 detection
contagion graph
🔎 Similar Papers
No similar papers found.