Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

📅 2026-02-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of interpretable fault-attribution mechanisms in multi-agent reinforcement learning (MARL) systems, which hinders both identification of the initial failure source and understanding of how failures propagate. The authors propose a two-stage, gradient-based explainable diagnostic framework: first, leveraging Taylor-remainder analysis combined with first-order sensitivity and second-order directional curvature derived from the critic, it detects single-agent failures within a causal window; second, it constructs a contagion graph to reveal how upstream policy deviations are amplified through coordinated strategies. The approach provides the first geometric explanation for the "downstream-first-alert" phenomenon in MARL, overcoming the limitations of black-box diagnostics. Evaluated on Simple Spread and StarCraft II using MADDPG and HATRPO, the method achieves Patient-0 detection accuracy of 88.2%–99.4% and offers interpretable geometric evidence to support diagnostic decisions.
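The Stage-1 idea described above can be sketched in a few lines: measure how far the critic's actual response to a policy deviation departs from its first-order Taylor prediction, and declare a Patient-0 candidate at the first threshold crossing. This is a minimal illustrative sketch, not the paper's implementation; the critic interface `q_fn`, the finite-difference gradient, and the threshold `tau` are all assumptions.

```python
import numpy as np

def taylor_remainder(q_fn, a, delta, eps=1e-5):
    """First-order Taylor remainder of the critic along a policy deviation.

    q_fn  : critic evaluated at this agent's action (hypothetical interface)
    a     : the agent's nominal action
    delta : the observed deviation from the nominal policy
    """
    # finite-difference gradient of the critic w.r.t. the agent's action
    g = np.array([(q_fn(a + eps * e) - q_fn(a - eps * e)) / (2 * eps)
                  for e in np.eye(len(a))])
    first_order = q_fn(a) + g @ delta
    # a large remainder signals curvature-dominated (anomalous) behavior
    return abs(q_fn(a + delta) - first_order)

def detect_patient0(remainders, tau):
    """Declare Patient-0 at the first per-agent remainder crossing tau.

    remainders : sequence over time of per-agent remainder values
    Returns (timestep, agent index) of the first crossing, or None.
    """
    for t, per_agent in enumerate(remainders):
        for i, r in enumerate(per_agent):
            if r > tau:
                return t, i
    return None
```

For a quadratic critic the remainder reduces to the pure second-order term, which is why curvature along the deviation direction carries the diagnostic signal.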

Technology Category

Application Category

๐Ÿ“ Abstract
Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives: first-order sensitivity and directional second-order curvature, aggregated over causal windows to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2–99.4% Patient-0 detection accuracy while providing interpretable geometric evidence for detection decisions. By moving beyond black-box detection to interpretable gradient-level forensics, this framework offers practical tools for diagnosing cascading failures in safety-critical MARL systems.
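The Stage-2 contagion graph can be sketched as follows: aggregate cross-agent critic sensitivities over a causal window into a weighted adjacency matrix, then trace the dominant propagation pathway from the Patient-0 candidate. This is an illustrative sketch under stated assumptions; the sensitivity tensor layout and the greedy pathway trace are simplifications of whatever aggregation the paper actually uses.

```python
import numpy as np

def contagion_graph(sensitivity, window):
    """Aggregate per-step cross-agent sensitivities over a causal window.

    sensitivity : array of shape (T, n, n), where sensitivity[t, i, j] is a
                  hypothetical first-order measure of how agent i's deviation
                  at step t moves the critic's evaluation of agent j
    Returns an (n, n) weighted adjacency matrix (the contagion graph).
    """
    T = sensitivity.shape[0]
    w = min(window, T)
    adj = np.abs(sensitivity[T - w:]).sum(axis=0) / w
    np.fill_diagonal(adj, 0.0)   # keep only cross-agent edges
    return adj

def strongest_pathway(adj, src):
    """Greedily trace the dominant propagation path from a source agent."""
    n = adj.shape[0]
    path, visited, cur = [src], {src}, src
    while True:
        candidates = [j for j in range(n) if j not in visited]
        if not candidates:
            break
        nxt = max(candidates, key=lambda j: adj[cur, j])
        if adj[cur, nxt] == 0:
            break                 # no remaining edge carries influence
        path.append(nxt)
        visited.add(nxt)
        cur = nxt
    return path
```

Reading the graph's heaviest outgoing edges from an upstream agent is what lets downstream-first alerts be explained: a downstream agent can cross the detection threshold first precisely when an amplifying pathway feeds it.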
Problem

Research questions and friction points this paper is trying to address.

Multi-Agent Reinforcement Learning
Failure Analysis
Interpretability
Cascading Failures
Fault Attribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretable failure analysis
multi-agent reinforcement learning
gradient-based diagnostics
Patient-0 detection
contagion graph
🔎 Similar Papers
No similar papers found.