GUICourse: From General Vision Language Models to Versatile GUI Agents

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 35
Influential: 7
🤖 AI Summary
Existing vision-language models (VLMs) show significant limitations in OCR, visual grounding, and knowledge of GUI component functions, which prevents them from serving as practical GUI navigation agents. To address this, the authors contribute GUICourse, a suite of datasets for training vision-based GUI agents from general VLMs: GUIEnv strengthens OCR and grounding, while GUIAct and GUIChat enrich knowledge of GUI components and interactions. Through supervised fine-tuning, the resulting agents outperform their baseline VLMs on common GUI tasks, and even a compact 3.1B-parameter agent performs well on both single-step and multi-step GUI tasks. An ablation study analyzes the effects of different training-stage variants. Code and datasets are publicly released.

📝 Abstract
Utilizing the Graphical User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents that help humans complete GUI navigation tasks. However, current VLMs fall short in fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train vision-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents outperform their baseline VLMs on common GUI tasks. Even a small GUI agent (3.1B parameters) still works well on single-step and multi-step GUI tasks. Finally, we analyze the different training-stage variants of this agent via an ablation study. Our source code and datasets are released at https://github.com/yiye3/GUICourse.
Problem

Research questions and friction points this paper is trying to address.

Enhancing OCR and grounding in Vision Language Models
Improving GUI element function and control knowledge
Developing versatile agents for GUI navigation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops GUICourse dataset suite for GUI agents
Enhances OCR and grounding with GUIEnv dataset
Improves GUI knowledge via GUIAct and GUIChat
Wentong Chen
Renmin University of China
Junbo Cui
Tsinghua University
Jinyi Hu
Tsinghua University
Yujia Qin
ByteDance
Junjie Fang
Xiamen University
Yue Zhao
Beijing University of Posts and Telecommunications
Chongyi Wang
ModelBest Inc.
Jun Liu
Institute of Computing Technology, Chinese Academy of Sciences
Gui-Fang Chen
University of Electronic Science and Technology of China
Yupeng Huo
University of Electronic Science and Technology of China
Yuan Yao
Tsinghua University
Yankai Lin
Associate Professor (Tenure Track), Gaoling School of AI, Renmin University of China
Natural Language Processing · Large Language Models
Zhiyuan Liu
Tsinghua University
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing · Artificial Intelligence · Social Computing