🤖 AI Summary
Current evaluations of intelligent agents rely predominantly on sandboxed environments, static tasks, and coarse-grained scoring, which capture real-world capabilities poorly. To address this limitation, this work proposes an automated evaluation framework centered on the final system state. Given a task hint, the framework generates a realistic benchmark task by automatically synthesizing its environment configuration, task prompts, ground truth, and programmatic validation logic. By combining state modeling, automatic task generation, environment simulation, and programmatic verification, it establishes an end-to-end evaluation pipeline and introduces a benchmark suite of 40 realistic tasks. An assessment of 11 state-of-the-art models on this benchmark analyzes task success rates, operational costs, tool-calling reliability, and failure modes, with the aim of improving both the fidelity and the scalability of agent evaluation.
📝 Abstract
Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task together with its environment, task prompts, ground truth, and associated verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than by the textual response alone. Using STAGE-Claw, this paper creates a benchmark of 40 challenging real-scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.
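To make the idea of state-based scoring concrete, the following is a minimal sketch of what a verification program of this kind might look like. It is not STAGE-Claw's actual implementation: the `Check` class, the `score_final_state` function, the dictionary layout of the state, and the fraction-of-checks scoring rule are all hypothetical and chosen only for illustration. The point it shows is that the score depends on what the environment looks like after the agent finishes, not on what the agent says it did.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical representation: the environment's final state as a nested dict.
State = Dict[str, Any]


@dataclass
class Check:
    """One programmatic assertion about the final state (illustrative only)."""
    name: str
    predicate: Callable[[State], bool]


def score_final_state(state: State, checks: List[Check]) -> float:
    """Return the fraction of checks that the final state satisfies.

    The fractional scoring rule is an assumption made for this sketch.
    """
    if not checks:
        return 0.0
    passed = sum(1 for check in checks if check.predicate(state))
    return passed / len(checks)


# Hypothetical task: "Archive report.pdf and add a reminder to review it."
checks = [
    Check("report_archived",
          lambda s: "report.pdf" in s["files"].get("/archive", [])),
    Check("report_removed_from_inbox",
          lambda s: "report.pdf" not in s["files"].get("/inbox", [])),
    Check("reminder_created",
          lambda s: any(e["title"] == "Review report" for e in s["calendar"])),
]

# Final state after an agent moved the file but never created the reminder.
final_state: State = {
    "files": {"/inbox": [], "/archive": ["report.pdf"]},
    "calendar": [],
}

score = score_final_state(final_state, checks)
print(f"score = {score:.2f}")
```

In this example the agent would receive partial credit for moving the file even if its textual response claimed the whole task was done, which is the kind of gap between response and state that a final-state check is meant to expose.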