🤖 AI Summary
The system-level behavior of current multi-model AI agent systems on general-purpose tasks is poorly understood, because execution is highly non-deterministic, evaluation is costly, and proprietary models offer limited visibility. To address this, this work introduces GAIATrace, the first token-level trace dataset for general-purpose tasks, which captures complete reasoning tokens, task-level structures, and per-model activity trajectories for two agentic systems (MiroThinker and OWL) running the GAIA benchmark. Complementing the dataset, the authors develop Vidur-Agent, a trace-driven simulator that replays GAIATrace for reproducible, low-cost evaluation across diverse simulated environments. Using both artifacts, the study characterizes how agentic systems handle general tasks and how system design choices shape their behavior, and reports several new empirical findings.
📝 Abstract
Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent architectures, owing to highly non-deterministic execution, prohibitive evaluation costs, and limited visibility into proprietary models. This paper presents GAIATrace, the first token-level trace dataset of two state-of-the-art agentic systems (MiroThinker and OWL) running GAIA, a benchmark composed of a heterogeneous mix of general-purpose tasks. Unlike prior trace datasets, GAIATrace captures full reasoning tokens, task-level structures, and the activity of every major participating LLM, enabling in-depth systems research. Complementing the dataset, we present Vidur-Agent, a trace-driven simulator that can replay GAIATrace to perform reproducible, low-cost system evaluation across diverse simulated environments. Using both artifacts, we characterize how modern agentic systems handle general tasks and how various system design choices shape their behavior, yielding several unique findings.