AI Summary
Current testing of LLM-based agents focuses on acceptance-level evaluation, which relies on manual inspection and offers little visibility into an agent's internal structure. This makes automation, root cause analysis, and cost control difficult. To address the gap, the paper applies structural testing principles from software engineering to LLM agents and proposes component-level automated testing built on three mechanisms: execution traces (based on OpenTelemetry) that capture agent trajectories, mocking that makes LLM behavior reproducible, and assertions that automate test verification. The approach supports the test automation pyramid, regression testing, test-driven development, and multi-language testing. In representative case studies, the authors demonstrate automated execution and faster root-cause analysis, and they argue that these methods reduce testing costs and improve agent quality through higher coverage, reusability, and earlier defect detection.
Abstract
LLM-based agents are rapidly being adopted across diverse domains. Since they interact with users without supervision, they must be tested extensively. Current testing approaches focus on acceptance-level evaluation from the user's perspective. While intuitive, these tests require manual evaluation, are difficult to automate, do not facilitate root cause analysis, and require expensive test environments. In this paper, we present methods to enable structural testing of LLM-based agents. Our approach utilizes traces (based on OpenTelemetry) to capture agent trajectories, employs mocking to enforce reproducible LLM behavior, and adds assertions to automate test verification. This enables testing agent components and interactions at a deeper technical level within automated workflows. We demonstrate how structural testing enables the adaptation of software engineering best practices to agents, including the test automation pyramid, regression testing, test-driven development, and multi-language testing. In representative case studies, we demonstrate automated execution and faster root-cause analysis. Collectively, these methods reduce testing costs and improve agent quality through higher coverage, reusability, and earlier defect detection. We provide an open source reference implementation on GitHub.
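To make the three mechanisms concrete, the following is a minimal sketch of how traces, mocking, and assertions might combine in a single component-level test. It is not the paper's reference implementation. Every name in it (`TraceRecorder`, `run_agent`, `lookup_weather`, the span names) is hypothetical, and the in-memory recorder is only a standard-library stand-in for an OpenTelemetry span exporter, assumed here so that the example runs without extra dependencies.

```python
from contextlib import contextmanager
from unittest.mock import Mock


class TraceRecorder:
    """Hypothetical in-memory recorder; a stand-in for an OpenTelemetry exporter."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        record = {"name": name, "attributes": dict(attributes)}
        self.spans.append(record)  # spans are stored in start order
        yield record


def lookup_weather(city):
    # Hypothetical tool the agent can call.
    return f"Sunny in {city}"


def run_agent(llm, recorder, question):
    # Toy agent loop (an assumption): plan with the LLM, optionally call a
    # tool, then ask the LLM for the final answer. Each step opens a span.
    with recorder.span("agent.run", question=question):
        with recorder.span("llm.call", step="plan"):
            plan = llm(question)
        observation = ""
        if plan.get("tool") == "lookup_weather":
            with recorder.span("tool.call", tool="lookup_weather") as span:
                observation = lookup_weather(plan["city"])
                span["attributes"]["result"] = observation
        with recorder.span("llm.call", step="answer"):
            reply = llm(f"Observation: {observation}")
        return reply["text"]


def test_agent_calls_weather_tool():
    # Mocking: scripted LLM replies make the run reproducible.
    llm = Mock(side_effect=[
        {"tool": "lookup_weather", "city": "Oslo"},
        {"text": "It is sunny in Oslo."},
    ])
    recorder = TraceRecorder()

    answer = run_agent(llm, recorder, "Weather in Oslo?")

    # Assertions on the trace check the trajectory, not just the final text.
    names = [s["name"] for s in recorder.spans]
    assert names == ["agent.run", "llm.call", "tool.call", "llm.call"]
    assert recorder.spans[2]["attributes"] == {
        "tool": "lookup_weather",
        "result": "Sunny in Oslo",
    }
    assert llm.call_count == 2
    assert answer == "It is sunny in Oslo."
    return answer, names
```

Because the LLM is mocked, a test like this needs no model endpoint and can run in an ordinary automated workflow. When it fails, the span-level assertion indicates which step of the trajectory diverged, which is the kind of root-cause signal the abstract describes.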