MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mobile agent benchmarks (e.g., AndroidWorld) suffer from saturation, with agents achieving >90% success rates, and from limited app coverage (e.g., lacking e-commerce and enterprise communication apps); they fail to evaluate realistic challenges such as ambiguous instruction understanding, cross-app coordination, and MCP (Model Context Protocol) tool invocation. To address this, we propose MobileWorld, a more challenging benchmark comprising 201 long-horizon tasks across 20 mainstream Android applications. It introduces the first agent-user interaction paradigm and an MCP-augmented task taxonomy. We build a snapshot-based, containerized Android environment with task callback APIs, database state validation, and hybrid GUI/NL/API action interfaces. Furthermore, we design a planner-executor framework supporting both conversational reasoning and MCP tool calling. Experiments reveal that the best-performing agent achieves only a 51.7% success rate, while end-to-end models attain merely 20.9%, exposing fundamental bottlenecks in user intent modeling and tool orchestration. MobileWorld establishes a new standard and roadmap for advancing mobile agent research.

📝 Abstract
Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verifications, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.
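The extended action space described in the abstract (standard GUI manipulation plus agent-user interaction and MCP tool calls) can be sketched as a tagged union that a planner-executor loop dispatches on. This is a minimal illustrative sketch, not the paper's actual interface: the names `GuiTap`, `AskUser`, `McpCall`, and `dispatch` are all assumptions.

```python
from dataclasses import dataclass
from typing import Union

# Three hypothetical action types mirroring the benchmark's hybrid
# GUI / natural-language / API interfaces (names are illustrative).

@dataclass
class GuiTap:
    """Standard GUI manipulation: tap at screen coordinates."""
    x: int
    y: int

@dataclass
class AskUser:
    """Agent-user interaction: ask the user to resolve a vague instruction."""
    question: str

@dataclass
class McpCall:
    """MCP-augmented action: invoke an external tool via MCP."""
    tool: str
    args: dict

Action = Union[GuiTap, AskUser, McpCall]

def dispatch(action: Action) -> str:
    """Route an executor-chosen action to the matching interface (stubbed)."""
    if isinstance(action, GuiTap):
        return f"tap({action.x},{action.y})"
    if isinstance(action, AskUser):
        return f"ask:{action.question}"
    if isinstance(action, McpCall):
        return f"mcp:{action.tool}({sorted(action.args)})"
    raise TypeError(f"unknown action: {action!r}")
```

In a loop of this shape, the planner would emit one `Action` per step and the environment would return an observation; the point is only that user queries and MCP calls sit in the same action space as taps, so a single policy can interleave all three.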
Problem

Research questions and friction points this paper is trying to address.

Introduces MobileWorld to benchmark autonomous agents in realistic mobile-use scenarios
Addresses limitations of AndroidWorld by including e-commerce and enterprise applications
Evaluates agents on long-horizon, cross-application tasks with user interaction and MCP calls
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MobileWorld benchmark with 201 tasks across 20 applications
Extends tasks to include agent-user interaction and MCP-augmented categories
Uses snapshot-based container environment with backend verification APIs
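The backend verification mentioned above can be illustrated with a small sketch: after the agent finishes, a validator inspects the app's backend store and checks the expected post-condition directly, rather than parsing the screen. The table schema and helper below are assumptions for illustration, not the benchmark's actual validators.

```python
import sqlite3

def order_placed(db_path: str, item: str) -> bool:
    """Hypothetical functional check: did the agent's actions leave a
    'placed' order row for `item` in the app's backend database?"""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE item = ? AND status = 'placed'",
            (item,),
        ).fetchone()
    return row[0] > 0
```

Because the check reads durable backend state, it stays deterministic across UI layouts and action traces, which is what makes snapshot-based container resets plus database inspection reproducible.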
👥 Authors
Quyu Kong (Alibaba Cloud)
Xu Zhang (Tongyi Lab, Alibaba Group)
Zhenyu Yang (HKUST (GZ))
Nolan Gao (University of Florida)
Chen Liu (Tongyi Lab, Alibaba Group)
Panrong Tong (Tongyi Lab, Alibaba Group)
Chenglin Cai (Tongyi Lab, Alibaba Group)
Hanzhang Zhou (Nanyang Technological University)
Jianan Zhang (Assistant Professor, Peking University)
Liangyu Chen (Tongyi Lab, Alibaba Group)
Zhidan Liu (The Hong Kong University of Science and Technology (Guangzhou))
Steven Hoi (Tongyi Lab, Alibaba Group)
Yue Wang (Tongyi Lab, Alibaba Group)