MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mobile agent benchmarks (e.g., AndroidWorld) suffer from saturation, with agents achieving >90% success rates, and from limited app coverage (e.g., lacking e-commerce and enterprise communication apps); they fail to evaluate realistic challenges such as ambiguous instruction understanding, cross-app coordination, and MCP (Model Context Protocol) tool invocation. To address this, we propose MobileWorld, a more challenging benchmark comprising 201 long-horizon tasks across 20 mainstream Android applications. It introduces the first agent-user interaction paradigm and an MCP-augmented task taxonomy. We build a snapshot-based, containerized Android environment with task callback APIs, database state validation, and hybrid GUI/NL/API action interfaces. Furthermore, we design a planner-executor framework supporting both conversational reasoning and MCP tool calling. Experiments reveal that the best-performing agent achieves only a 51.7% success rate, while end-to-end models attain merely 20.9%, exposing fundamental bottlenecks in user intent modeling and tool orchestration. MobileWorld establishes a new standard and roadmap for advancing mobile agent research.

📝 Abstract
Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verifications, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.
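The extended action space described in the abstract (standard GUI manipulation plus agent-user interaction and MCP tool calls) can be sketched as a tagged union that a planner-executor loop dispatches on. This is a minimal illustrative sketch, not the paper's actual interface: the names `GuiTap`, `AskUser`, `McpCall`, and `dispatch` are all assumptions.

```python
from dataclasses import dataclass
from typing import Union

# Three hypothetical action types mirroring the benchmark's hybrid
# GUI / natural-language / API interfaces (names are illustrative).

@dataclass
class GuiTap:
    """Standard GUI manipulation: tap at screen coordinates."""
    x: int
    y: int

@dataclass
class AskUser:
    """Agent-user interaction: ask the user to resolve a vague instruction."""
    question: str

@dataclass
class McpCall:
    """MCP-augmented action: invoke an external tool via MCP."""
    tool: str
    args: dict

Action = Union[GuiTap, AskUser, McpCall]

def dispatch(action: Action) -> str:
    """Route an executor-chosen action to the matching interface (stubbed)."""
    if isinstance(action, GuiTap):
        return f"tap({action.x},{action.y})"
    if isinstance(action, AskUser):
        return f"ask:{action.question}"
    if isinstance(action, McpCall):
        return f"mcp:{action.tool}({sorted(action.args)})"
    raise TypeError(f"unknown action: {action!r}")
```

In a loop of this shape, the planner would emit one `Action` per step and the environment would return an observation; the point is only that user queries and MCP calls sit in the same action space as taps, so a single policy can interleave all three.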
Problem

Research questions and friction points this paper is trying to address.

Introduces MobileWorld to benchmark autonomous agents in realistic mobile-use scenarios
Addresses limitations of AndroidWorld by including e-commerce and enterprise applications
Evaluates agents on long-horizon, cross-application tasks with user interaction and MCP calls
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MobileWorld benchmark with 201 tasks across 20 applications
Extends tasks to include agent-user interaction and MCP-augmented categories
Uses snapshot-based container environment with backend verification APIs
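The backend verification mentioned above can be illustrated with a small sketch: after the agent finishes, a validator inspects the app's backend store and checks the expected post-condition directly, rather than parsing the screen. The table schema and helper below are assumptions for illustration, not the benchmark's actual validators.

```python
import sqlite3

def order_placed(db_path: str, item: str) -> bool:
    """Hypothetical functional check: did the agent's actions leave a
    'placed' order row for `item` in the app's backend database?"""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE item = ? AND status = 'placed'",
            (item,),
        ).fetchone()
    return row[0] > 0
```

Because the check reads durable backend state, it stays deterministic across UI layouts and action traces, which is what makes snapshot-based container resets plus database inspection reproducible.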
👥 Authors
Quyu Kong (Alibaba Cloud)
Xu Zhang (Tongyi Lab, Alibaba Group)
Zhenyu Yang (HKUST (GZ))
Nolan Gao (University of Florida)
Chen Liu (Tongyi Lab, Alibaba Group)
Panrong Tong (Tongyi Lab, Alibaba Group)
Chenglin Cai (Tongyi Lab, Alibaba Group)
Hanzhang Zhou (Nanyang Technological University)
Jianan Zhang (Assistant Professor, Peking University)
Liangyu Chen (Tongyi Lab, Alibaba Group)
Zhidan Liu (The Hong Kong University of Science and Technology (Guangzhou))
Steven Hoi (Tongyi Lab, Alibaba Group)
Yue Wang (Tongyi Lab, Alibaba Group)