Detecting Silent Failures in Multi-Agentic AI Trajectories

📅 2025-11-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the challenge of detecting “silent failures” (such as objective drift, behavioral loops, and omission of critical details) in multi-agent AI systems built on large language models (LLMs). It introduces the task of anomaly detection in agent trajectories. The authors build a dataset curation pipeline that captures user behavior, agent non-determinism, and LLM output variability, and use it to curate and label two benchmark datasets of 4,275 and 894 trajectories. They model temporal and semantic trajectory features with supervised XGBoost and semi-supervised Support Vector Data Description (SVDD). The two approaches perform comparably, reaching accuracies of up to 98% and 96%, respectively. The work is presented as the first systematic study of anomaly detection in multi-agent AI systems, offering datasets and benchmarks for reliability research.

📝 Abstract

Multi-Agentic AI systems, powered by large language models (LLMs), are inherently non-deterministic and prone to silent failures such as drift, cycles, and missing details in outputs, which are difficult to detect. We introduce the task of anomaly detection in agentic trajectories to identify these failures and present a dataset curation pipeline that captures user behavior, agent non-determinism, and LLM variation. Using this pipeline, we curate and label two benchmark datasets comprising 4,275 and 894 trajectories from Multi-Agentic AI systems. Benchmarking anomaly detection methods on these datasets, we show that supervised (XGBoost) and semi-supervised (SVDD) approaches perform comparably, achieving accuracies up to 98% and 96%, respectively. This work provides the first systematic study of anomaly detection in Multi-Agentic AI systems, offering datasets, benchmarks, and insights to guide future research.
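To make the idea of a "cycle" in a trajectory concrete, here is a minimal sketch of how loop-related features might be computed from a sequence of agent steps. It is an illustration only, not the paper's feature set: the `Step` fields, the feature names, and the `max_period` parameter are all assumptions introduced for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    agent: str   # hypothetical field: which agent acted
    action: str  # hypothetical field: tool call or message type


def longest_repeated_cycle(steps, max_period=4):
    """Largest number of back-to-back repeats of any pattern of up to max_period steps."""
    keys = [(s.agent, s.action) for s in steps]
    best = 1 if keys else 0
    for period in range(1, max_period + 1):
        for start in range(len(keys) - period + 1):
            pattern = keys[start:start + period]
            repeats = 1
            pos = start + period
            # Count how many times the pattern recurs immediately after itself.
            while keys[pos:pos + period] == pattern:
                repeats += 1
                pos += period
            best = max(best, repeats)
    return best


def trajectory_features(steps):
    """Summarize one trajectory as a few numeric features (illustrative choice)."""
    keys = [(s.agent, s.action) for s in steps]
    n = len(keys)
    return {
        "num_steps": n,
        "distinct_ratio": len(Counter(keys)) / n if n else 0.0,
        "max_cycle_repeats": longest_repeated_cycle(steps),
    }


normal = [Step("planner", "plan"), Step("searcher", "search"),
          Step("writer", "draft"), Step("reviewer", "check")]
looping = [Step("planner", "plan")] + [Step("searcher", "search"),
                                       Step("writer", "draft")] * 3

print(trajectory_features(normal))
print(trajectory_features(looping))
```

In this toy input the looping trajectory repeats a search/draft pair three times, so its `max_cycle_repeats` is higher and its `distinct_ratio` lower than the normal one. Features of this kind could then be fed to a detector; drift and missing details would need semantic signals that this sketch does not attempt.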

Problem

Research questions and friction points this paper is trying to address.

Detecting silent failures in multi-agent AI trajectories

Identifying drift, cycles, and missing output details

Providing datasets and benchmarks for anomaly detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Anomaly detection for multi-agent AI trajectories

Dataset curation pipeline capturing behavioral variations

Supervised and semi-supervised methods achieving high accuracy
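The semi-supervised setting can be pictured as learning a boundary from normal trajectories only and flagging anything that falls outside it. SVDD does this by fitting a minimal enclosing hypersphere around the normal data; the sketch below is a deliberately crude stand-in (mean as center, a distance quantile as radius), not SVDD itself and not the paper's implementation. The function names, the `quantile` parameter, and the toy feature vectors are assumptions made for illustration.

```python
import math


def fit_sphere(normal_vectors, quantile=0.95):
    """Fit a center and radius from normal-only feature vectors.

    Simplification: the center is the mean and the radius is a distance
    quantile. Real SVDD solves an optimization problem, usually with kernels.
    """
    n = len(normal_vectors)
    dim = len(normal_vectors[0])
    center = [sum(v[i] for v in normal_vectors) / n for i in range(dim)]
    dists = sorted(math.dist(v, center) for v in normal_vectors)
    index = min(n - 1, int(quantile * n))
    return center, dists[index]


def is_anomalous(vector, center, radius):
    """Flag a vector that lies outside the fitted sphere."""
    return math.dist(vector, center) > radius


# Hypothetical feature vectors: [num_steps, distinct_ratio, max_cycle_repeats]
normal_train = [[4, 1.0, 1], [5, 1.0, 1], [6, 0.9, 1], [5, 0.8, 2]]
center, radius = fit_sphere(normal_train)

print(is_anomalous([5, 1.0, 1], center, radius))
print(is_anomalous([7, 0.43, 3], center, radius))
```

Because the features are on different scales, a practical version would standardize them before measuring distances. A supervised alternative such as XGBoost would instead need labeled anomalous trajectories at training time.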
