MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address four key challenges in realistic deployment—the lack of native agent-user interaction, the limits of pure UI-level operations, the absence of a practical deployment architecture, and poor robustness in dynamic environments—this paper introduces a family of foundation GUI agents spanning 2B to 235B parameters, designed for real-world scenarios. The authors propose a self-evolving data pipeline that integrates user interactions and MCP tool invocations; a device-cloud collaborative execution architecture with task-state-driven dynamic routing; and a scalable online reinforcement learning framework supporting ultra-long-context modeling and thousand-scale parallel simulation. Experiments demonstrate state-of-the-art performance: 73.5% GUI element grounding accuracy on ScreenSpot-Pro and a 76.7% mobile navigation success rate on AndroidWorld. The device-cloud system improves on-device performance by 33% and reduces cloud model calls by over 40%, enhancing privacy preservation and practical deployability.
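The task-state-driven dynamic routing in the device-cloud architecture can be sketched as below. The state taxonomy, `TaskState`, and the `route` function are illustrative assumptions for this summary, not the paper's actual interface:

```python
from dataclasses import dataclass
from enum import Enum, auto

class TaskState(Enum):
    """Coarse task states that could drive routing (hypothetical taxonomy)."""
    SIMPLE_UI_STEP = auto()      # single tap/scroll with low ambiguity
    SENSITIVE_CONTENT = auto()   # on-screen private data; keep on device
    COMPLEX_PLANNING = auto()    # multi-app, long-horizon reasoning

@dataclass
class Route:
    target: str   # "device" (small on-device model) or "cloud" (large model)
    reason: str

def route(state: TaskState) -> Route:
    """Route execution by task state: prefer the on-device model,
    escalating to the cloud model only when planning demands it."""
    if state is TaskState.SENSITIVE_CONTENT:
        return Route("device", "privacy: screen contains user data")
    if state is TaskState.COMPLEX_PLANNING:
        return Route("cloud", "capability: long-horizon planning")
    return Route("device", "latency: simple step handled locally")
```

Keeping the default path on device is consistent with the reported effects: fewer cloud calls and stronger privacy preservation.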

📝 Abstract
The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes a new state of the art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro, and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of native agent-user interaction in GUI systems
Overcomes UI-only operation limits with device-cloud collaboration
Mitigates brittleness in dynamic environments via online RL framework
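Moving past UI-only operation amounts to widening the agent's action space from screen operations to tool invocations. A minimal sketch of such a unified action space, with hypothetical action names and a stubbed dispatcher (the paper's real action schema and MCP client are not shown here):

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass
class UIAction:
    """A conventional screen-level operation (tap, scroll, type)."""
    kind: str
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class ToolCall:
    """An MCP-style tool invocation; tool name here is hypothetical."""
    tool: str
    args: dict = field(default_factory=dict)

# The agent's action space is the union of both action types.
Action = Union[UIAction, ToolCall]

def execute(action: Action) -> str:
    """Dispatch an action: UI operations go to the screen driver,
    tool calls go to an MCP client (both stubbed as strings here)."""
    if isinstance(action, UIAction):
        return f"ui:{action.kind}@({action.x},{action.y})"
    return f"mcp:{action.tool}({sorted(action.args)})"
```

A tool call such as creating a calendar event can then replace a long, fragile sequence of taps through a calendar app's UI.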
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving data pipeline expands navigation with user interaction
Native device-cloud collaboration routes execution by task state
Online RL framework scales parallel environments and context length
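The scaling of parallel environments and the per-episode step budget reported above can be sketched with a toy rollout loop. `MockEnv`, the placeholder policy, and the thread-pool scheduling are illustrative assumptions; the paper's actual simulators and RL training loop are not public:

```python
from concurrent.futures import ThreadPoolExecutor

class MockEnv:
    """Stand-in for one GUI simulator instance."""
    def __init__(self, seed: int):
        self.seed = seed
        self.t = 0

    def step(self, action: str) -> tuple[str, float, bool]:
        self.t += 1
        done = self.t >= 3          # tiny fixed-length episode for illustration
        return f"obs@{self.t}", 1.0, done

def rollout(env: MockEnv, policy, step_budget: int) -> float:
    """Collect one episode's return, bounded by the environment step budget."""
    ret, done, obs = 0.0, False, "obs@0"
    while not done and env.t < step_budget:
        obs, reward, done = env.step(policy(obs))
        ret += reward
    return ret

def parallel_rollouts(n_envs: int, step_budget: int) -> list[float]:
    """Run many environments concurrently, mimicking large-scale
    parallel simulation (thousand-scale in the paper)."""
    policy = lambda obs: "tap"      # placeholder policy
    envs = [MockEnv(seed=i) for i in range(n_envs)]
    with ThreadPoolExecutor(max_workers=min(32, n_envs)) as pool:
        return list(pool.map(lambda e: rollout(e, policy, step_budget), envs))
```

More parallel environments give each policy update fresher, more diverse on-policy data, and a larger step budget lets long-horizon tasks run to completion, which is one plausible reading of the reported +5.2 and +4.3 point gains.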