🤖 AI Summary
Existing mobile agent benchmarks do not model a user's identity, history, or preferences, so they cannot evaluate genuine personal intelligence. The authors propose iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity. It spans 26 newly built iOS apps with connected cross-app data and 133 tasks covering single-app, multi-app, and memory and personalization scenarios. Agents can be evaluated with visual input alone or with vision plus a privileged XML accessibility tree, and the evaluation covers both frontier and open-source models. The best configuration reaches 52% overall but only 37% on multi-app tasks. Adding the XML accessibility tree to visual input improves frontier models by up to 26 percentage points, whereas smaller models do not benefit. The benchmark, including apps, seeded data, tasks, rubrics, and evaluation code, is released as open source.
📝 Abstract
A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.
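The task counts and headline numbers in the abstract can be cross-checked with a little arithmetic. The sketch below is not part of the iOSWorld code. The names `TASK_COUNTS`, `category_shares`, and `implied_accuracy_elsewhere` are hypothetical, and the sketch assumes the overall score is a task-weighted average over all 133 tasks, which the abstract does not state. Because the reported 52% and 37% are rounded, the result is only approximate.

```python
# Task counts per category, as stated in the abstract.
TASK_COUNTS = {
    "single_app": 27,
    "multi_app": 60,
    "memory_personalization": 46,
}


def category_shares(counts):
    """Return each category's share of the full task set."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def implied_accuracy_elsewhere(overall, known_category, known_accuracy, counts):
    """Estimate the combined accuracy on all categories except one.

    ASSUMPTION: `overall` is a task-weighted (micro) average, so
    overall * total = sum over categories of (accuracy * task count).
    """
    total = sum(counts.values())
    rest = total - counts[known_category]
    score_rest = overall * total - known_accuracy * counts[known_category]
    return score_rest / rest


total_tasks = sum(TASK_COUNTS.values())
shares = category_shares(TASK_COUNTS)
rest_accuracy = implied_accuracy_elsewhere(0.52, "multi_app", 0.37, TASK_COUNTS)

print(f"total tasks: {total_tasks}")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of tasks")
print(f"implied accuracy outside multi-app tasks: {rest_accuracy:.1%}")
```

Under that assumption, multi-app tasks make up about 45% of the benchmark, and the best configuration would score roughly 64% on the remaining 73 single-app and memory and personalization tasks combined. The true per-category figures may differ if scores are aggregated differently.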