AI Summary
Large language model (LLM)-driven agent systems are prone to runtime anomalies that jeopardize their stability and safety, so they need a systematic approach to operations. This work proposes AgentOps, the first operations and maintenance framework designed specifically for LLM-based agent systems, which covers four key phases: monitoring, anomaly detection, root cause localization, and remediation. It also presents the first systematic taxonomy of the internal and interaction-related anomalies inherent to such agents. By combining anomaly categorization, architectural design, and workflow abstraction with an empirical analysis of existing anomalous behaviors, the study delineates the core challenges and outlines critical research directions in agent system operations. AgentOps thus provides both a theoretical foundation and practical guidance for industrial deployment and academic inquiry in this emerging domain.
Abstract
As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause localization, and resolution.