Step-GUI Technical Report

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in GUI agent training (scarce high-quality data, prohibitively high annotation costs, and weak privacy protection across devices), this paper proposes Step-GUI, a framework built around a calibration-based step-level reward mechanism and GUI-MCP, a Model Context Protocol for GUI automation. The calibrated reward pipeline converts model-generated trajectories into reliable self-training signals through trajectory-level calibration, while GUI-MCP's hierarchical architecture combines low-level atomic operations with high-level task delegation to local specialist models, enabling high-privacy execution in which sensitive data stays on-device. The paper also introduces AndroidDaily, a benchmark grounded in real-world Android usage patterns. Experiments show that Step-GUI 8B achieves state-of-the-art performance: 80.2% on AndroidWorld, 48.5% on OSWorld, and 62.6% on ScreenSpot-Pro. On AndroidDaily it attains 89.91% static and 52.50% end-to-end accuracy, and the annotation pipeline reaches over 90% accuracy at 10-100x lower cost than conventional approaches.

📝 Abstract
Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenSpot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
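The trajectory-level calibration idea from the abstract can be sketched in a few lines: a step-level reward model scores each step of a model-generated trajectory, and the verified end-of-task outcome is blended in to pull noisy step scores toward ground truth before the trajectory is admitted as training data. All names, the blending weight, and the acceptance threshold below are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    screenshot_id: str
    action: str
    raw_step_reward: float  # score from a step-level reward model


def calibrate_trajectory(steps: List[Step], task_succeeded: bool,
                         alpha: float = 0.5) -> List[float]:
    """Blend per-step reward-model scores with the trajectory-level
    outcome, so each step's label is anchored to the verified result."""
    outcome = 1.0 if task_succeeded else 0.0
    return [alpha * s.raw_step_reward + (1 - alpha) * outcome for s in steps]


def keep_for_training(calibrated: List[float], threshold: float = 0.7) -> bool:
    # Only trajectories whose calibrated step rewards all clear the
    # threshold become self-training data (hypothetical filter rule).
    return min(calibrated) >= threshold
```

With this filter, a successful trajectory whose steps were scored 0.9 and 0.8 calibrates to 0.95 and 0.9 and is kept, while a trajectory containing a low-scoring step is discarded; this is how a self-evolving loop can keep annotation accuracy high without human labels on every step.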
Problem

Research questions and friction points this paper is trying to address.

Efficiently acquire high-quality GUI training data with reliable annotation.
Develop GUI agents with robust performance across diverse platforms and tasks.
Enable practical GUI automation deployment with privacy protection and standardized interfaces.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving pipeline with calibrated step reward system.
Step-GUI models achieve state-of-the-art GUI performance.
GUI-MCP protocol enables high-privacy on-device execution.
Authors
Haolong Yan (GELab-Team, StepFun)
Jia Wang (GELab-Team, StepFun)
Xin Huang (GELab-Team, StepFun)
Yeqing Shen (GELab-Team, StepFun)
Ziyang Meng (GELab-Team, StepFun)
Zhimin Fan (GELab-Team, StepFun)
Kaijun Tan (GELab-Team, StepFun)
Jin Gao (GELab-Team, StepFun)
Lieyu Shi (GELab-Team, StepFun)
Mi Yang (Beijing Jiaotong University)
Shiliang Yang (GELab-Team, StepFun)
Zhirui Wang (Aerospace Information Research Institute, Chinese Academy of Sciences)
Brian Li (GELab-Team, StepFun)
Kang An (GELab-Team, StepFun)
Chenyang Li (GELab-Team, StepFun)
Lei Lei (GELab-Team, StepFun)
Mengmeng Duan (GELab-Team, StepFun)
Danxun Liang (GELab-Team, StepFun)
Guodong Liu (GELab-Team, StepFun)
Hang Cheng (GELab-Team, StepFun)
Hao Wu (GELab-Team, StepFun)
Jie Dong (GELab-Team, StepFun)
Junhao Huang (Victoria University of Wellington)
Mei Chen (GELab-Team, StepFun)
Renjie Yu (GELab-Team, StepFun)