🤖 AI Summary
Existing mobile-agent evaluation benchmarks rely on manually designed reward functions and hand-crafted evaluation logic, which limits their practicality and scalability. This paper introduces the first end-to-end automated evaluation framework that generates fine-grained UI-state reward signals and assesses agent performance from natural language task descriptions alone. The approach addresses labor-intensive, inflexible evaluation by (1) modeling UI state changes via structured substate representations; (2) implementing an autonomous adjudication mechanism that combines rule-based reasoning with large language models; and (3) automatically generating and validating reward signals. Experiments demonstrate over 93% reward coverage and 94% adjudication accuracy. The framework uncovers capability boundaries and prevalent failure modes of state-of-the-art mobile agents, removing the dependency on manual annotation for evaluation.
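To make the "structured substate" idea concrete, here is a minimal, hypothetical sketch: a substate is modeled as an (element, attribute, value) triple, and reward signals are derived by diffing UI snapshots before and after agent execution. The `SubState` schema and `diff_states` helper are illustrative assumptions, not the paper's actual representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubState:
    """Hypothetical structured substate: one observed UI attribute change."""
    element: str    # UI element identifier, e.g. "wifi_toggle"
    attribute: str  # observed attribute, e.g. "checked"
    value: str      # attribute value after the agent's actions

def diff_states(before: dict, after: dict) -> list[SubState]:
    """Derive substate changes between two flattened UI snapshots,
    keyed by (element, attribute)."""
    changes = []
    for (elem, attr), new_val in after.items():
        if before.get((elem, attr)) != new_val:
            changes.append(SubState(elem, attr, new_val))
    return changes

# Example: flipping the Wi-Fi toggle yields one substate change, which can
# serve as an automatically generated reward signal for the task.
before = {("wifi_toggle", "checked"): "false"}
after  = {("wifi_toggle", "checked"): "true"}
reward_signal = diff_states(before, after)
```

In this reading, a task's reward signals are simply the substate changes a correct execution must produce, so no hand-written evaluation code is needed per task.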
📝 Abstract
Accurate and systematic evaluation of mobile agents can significantly advance their development and real-world applicability. However, existing benchmarks for mobile agents lack practicality and scalability because of the extensive manual effort required to define task reward signals and implement the corresponding evaluation code. To this end, we propose AutoEval, an autonomous agent evaluation framework that tests a mobile agent without any manual effort. First, we design a Structured Substate Representation that describes UI state changes during agent execution, so that task reward signals can be generated automatically. Second, we employ a Judge System that autonomously evaluates an agent's performance against the automatically generated task reward signals. Given only a task description, our framework evaluates agents and provides fine-grained performance feedback on that task without any extra manual effort. We implement a prototype of our framework and validate the automatically generated task reward signals, finding over 93% coverage relative to human-annotated reward signals. Moreover, to demonstrate the effectiveness of our autonomous Judge System, we manually verify its judgments and show that it achieves 94% accuracy. Finally, we evaluate state-of-the-art mobile agents with our framework, providing detailed insights into their performance characteristics and limitations.
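The summary describes the Judge System as combining rule-based reasoning with large language models. A plausible shape for such a hybrid adjudicator, sketched below under stated assumptions: deterministic rules decide the signals they cover, and an LLM is consulted only for the rest. The `llm_judge` function is a stubbed placeholder, and all names (`judge`, `rules`, the signal strings) are illustrative, not the paper's API.

```python
from typing import Callable

def llm_judge(signal: str, ui_state: dict) -> bool:
    # Placeholder for an LLM adjudication call; a real system would prompt a
    # model with the signal and the observed UI state. Stubbed to True here.
    return True

def judge(signals: list[str],
          ui_state: dict,
          rules: dict[str, Callable[[dict], bool]]) -> dict[str, bool]:
    """Adjudicate each reward signal: rule first, LLM fallback."""
    verdicts = {}
    for signal in signals:
        if signal in rules:                 # deterministic rule available
            verdicts[signal] = rules[signal](ui_state)
        else:                               # fall back to LLM reasoning
            verdicts[signal] = llm_judge(signal, ui_state)
    return verdicts

# Illustrative use: one rule-checkable signal, one needing LLM judgment.
rules = {"wifi_enabled": lambda s: s.get("wifi") == "on"}
verdicts = judge(["wifi_enabled", "alarm_label_correct"],
                 {"wifi": "on"}, rules)
task_success = all(verdicts.values())  # task passes only if all signals hold
```

Checking cheap deterministic rules first keeps the evaluation reproducible where possible and reserves the LLM for signals that genuinely require semantic judgment.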