🤖 AI Summary
This work addresses the challenge of diagnosing failures in multi-agent systems powered by large language models (LLMs), where complex agent behaviors render traditional log-based analysis inefficient for root cause identification. To this end, the authors propose DiLLS, a natural language–driven diagnostic framework that constructs structured behavioral summaries across three hierarchical levels—activities, actions, and operations—enabling multi-granular modeling and interpretable visualization of system behavior. Through user studies, DiLLS demonstrates significant improvements in developers’ efficiency and accuracy when identifying, diagnosing, and understanding faults in multi-agent LLM systems, offering a more intuitive and effective approach to system debugging compared to conventional methods.
📝 Abstract
Large language model (LLM)-based multi-agent systems have demonstrated impressive capabilities in handling complex tasks. However, the complexity of agentic behaviors makes these systems difficult to understand. When failures occur, developers often struggle to identify root causes and to determine actionable paths for improvement. Traditional methods that rely on inspecting raw log records are inefficient, given both the large volume and the complexity of the data. To address this challenge, we propose a framework and an interactive system, DiLLS, designed to reveal and structure the behaviors of multi-agent systems. The key idea is to organize information across three levels of query completion: activities, actions, and operations. By probing the multi-agent system through natural language, DiLLS derives and organizes information about planning and execution into a structured, multi-layered summary. Through a user study, we show that DiLLS significantly improves developers' effectiveness and efficiency in identifying, diagnosing, and understanding failures in LLM-based multi-agent systems.