UItron: Foundational GUI Agent with Advanced Perception and Planning

📅 2025-08-29
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
To address three key challenges—scarcity of GUI interaction trajectories, lack of cross-device interaction infrastructure, and insufficient capabilities of foundational models—this paper introduces UItron, the first general-purpose GUI agent supporting automated operation on both mobile and PC platforms. Methodologically, we construct a large-scale dataset of over one million Chinese GUI interaction steps from mainstream applications, design a curriculum-based reinforcement learning framework to jointly optimize vision-language models for GUI perception, element localization, and task planning, and build a dual-platform online interactive environment for training and evaluation. Our contributions include: (i) the first high-performance GUI agent tailored to real-world Chinese GUI environments; (ii) state-of-the-art results across multiple GUI perception, localization, and planning benchmarks; and (iii) superior performance on authentic Chinese mobile applications—marking a critical step toward practical deployment of GUI agents.

📝 Abstract
GUI agents aim to automate operations on mobile and PC devices, an important task on the path toward artificial general intelligence. The rapid advancement of vision-language models (VLMs) has accelerated the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains challenging because of the scarcity of operation trajectories, the limited availability of interactive infrastructure, and the limited initial capabilities of foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to improve training, but also establishes an interactive environment connecting both mobile and PC devices. In training, UItron first applies supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. As a result, UItron achieves superior performance on benchmarks of GUI perception, grounding, and planning. In particular, UItron emphasizes proficiency in interacting with top-tier Chinese mobile apps, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, bringing GUI agents one step closer to real-world application.
Problem

Research questions and friction points this paper is trying to address.

Automating operations on Mobile and PC GUI interfaces
Overcoming data scarcity and infrastructure limitations for GUI agents
Enhancing Chinese mobile app interaction capabilities in agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised finetuning for GUI perception and planning
Curriculum reinforcement learning for complex reasoning
Systemic data engineering and interactive infrastructure development
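
The abstract does not describe how the curriculum is scheduled, so the following is only a generic illustration of what curriculum scheduling can look like, not UItron's method. It assumes tasks are grouped into stages from easy to hard and that a harder stage unlocks once the recent success rate passes a threshold. The class name `CurriculumSampler`, its parameters (`promote_at`, `window`), and the task names are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class CurriculumSampler:
    """Samples tasks from progressively harder stages (hypothetical helper)."""

    stages: list              # stages[i] is a list of task names, easiest first
    promote_at: float = 0.8   # success rate needed to unlock the next stage
    window: int = 20          # number of recent episodes used for the rate
    level: int = 0            # index of the hardest stage unlocked so far
    recent: list = field(default_factory=list)

    def sample(self, rng: random.Random) -> str:
        # Draw only from the stages unlocked so far.
        pool = [t for stage in self.stages[: self.level + 1] for t in stage]
        return rng.choice(pool)

    def report(self, success: bool) -> None:
        # Keep a sliding window of outcomes; promote once the window is
        # full and the success rate reaches the threshold.
        self.recent.append(success)
        self.recent = self.recent[-self.window:]
        full = len(self.recent) == self.window
        rate = sum(self.recent) / len(self.recent)
        if full and rate >= self.promote_at and self.level < len(self.stages) - 1:
            self.level += 1
            self.recent = []  # start a fresh window for the new stage


# Illustrative usage with made-up task names.
sampler = CurriculumSampler(stages=[["tap"], ["search"], ["checkout"]], window=5)
rng = random.Random(0)
first_task = sampler.sample(rng)      # only the easiest stage is available
for _ in range(5):
    sampler.report(True)              # five successes fill the window
```

In a real training loop the `report` call would be driven by episode outcomes from the online environment; here the outcomes are hard-coded purely to show the promotion rule.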
Authors

Zhixiong Zeng, Meituan
Jing Huang, Meituan
Liming Zheng, Meituan
Wenkang Han, Zhejiang University (Vision-Language Model; Agentic Intelligence)
Yufeng Zhong, Meituan (Multimodal LLM; Computer Vision)
Lei Chen, Meituan
Longrong Yang, Zhejiang University (Computer Vision and Pattern Recognition)
Yingjie Chu, Meituan
Yuzhi He, Meituan
Lin Ma, Meituan