STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

📅 2026-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of intelligent agents predominantly rely on sandboxed environments, static tasks, and coarse-grained scoring, which inadequately capture their real-world capabilities. To address this limitation, this work proposes an automated evaluation framework centered on the final system state. Given a task hint, the framework generates a realistic benchmark task by automatically synthesizing the environment configuration, task prompts, ground-truth outcomes, and programmatic validation logic. By integrating state modeling, automatic task generation, environment simulation, and programmatic verification, it establishes an end-to-end evaluation pipeline and introduces a benchmark suite of 40 realistic tasks. An assessment of 11 frontier models on this benchmark analyzes task scores, operational costs, tool-calling reliability, and failure modes, with the aim of improving both the fidelity and the scalability of agent evaluation.
📝 Abstract
Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE-Claw, this paper creates a benchmark of 40 challenging real-scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.
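As a rough illustration of what "measured by the correctness of the final system state" could mean in practice, the sketch below scores a task by comparing files on disk against a ground-truth mapping and never looks at the agent's textual reply. It is not STAGE-Claw's actual code or API: the function names (`snapshot_state`, `verify_final_state`, `demo`), the file-based notion of state, and the per-file scoring rule are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def snapshot_state(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its text content."""
    return {
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8")
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_final_state(root: Path, expected: dict[str, str]) -> dict:
    """Score the final state against ground truth; the agent's reply is not used.

    Hypothetical scoring rule: fraction of expected files whose content matches.
    """
    actual = snapshot_state(root)
    checks = {path: actual.get(path) == content for path, content in expected.items()}
    passed = sum(checks.values())
    return {"score": passed / len(checks) if checks else 0.0, "checks": checks}


def demo() -> dict:
    """Simulate an agent run that gets one of two expected files right."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Simulated side effects of an agent acting in the environment.
        (root / "notes").mkdir()
        (root / "notes" / "todo.txt").write_text("buy milk\n", encoding="utf-8")
        (root / "config.json").write_text(
            json.dumps({"theme": "light"}), encoding="utf-8"
        )
        # Hypothetical ground truth for the task.
        expected = {
            "notes/todo.txt": "buy milk\n",
            "config.json": json.dumps({"theme": "dark"}),
        }
        return verify_final_state(root, expected)


if __name__ == "__main__":
    print(demo())
```

In this toy run the agent could claim success in its reply and still receive partial credit only, because the check reads the environment rather than the response. A real verifier would likely cover richer state than plain files.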
Problem

Research questions and friction points this paper is trying to address.

agent benchmarking
realistic scenarios
state-based evaluation
personal agents
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

state-based evaluation
automated benchmarking
realistic agent scenarios
system state correctness
personal agent evaluation
Sirui Liang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences; Zhongguancun Academy
Bohan Yu
School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences; Zhongguancun Academy
Peiyu Wang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences
Shiguang Guo
Meituan
Wenxing Hu
Tulane University | Biogen Inc.
Machine learning, computational neuroscience, bioinformatics, scRNA-seq
Pengfei Cao
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Large Language Models, Information Extraction
Jian Zhao
Zhongguancun Institute of Artificial Intelligence
Reinforcement Learning, Multi-Agent System
Cao Liu
Meituan
Ke Zeng
Meituan
Xunliang Cai
Meituan
Kang Liu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences