ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor generalization and heavy reliance on human-annotated data in GUI agents operating within open, dynamic environments, this paper proposes an end-to-end trainable vision-language model (VLM) framework for cross-application autonomous navigation and continual exploration. The method introduces two key innovations: (1) a world-model-driven intrinsic curiosity reward that mitigates the cold-start problem in exploration; and (2) a hybrid training strategy that combines Group Relative Policy Optimization (GRPO) with online experience-stream distillation to improve policy stability and knowledge reuse. Experiments show substantial gains in GUI environment adaptability and long-horizon exploration robustness, and the approach exhibits strong zero-shot generalization to unseen applications. This work establishes a scalable, low-supervision learning paradigm for AGI self-evolution in complex, interactive real-world scenarios.
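The paper does not spell out the exact form of the curiosity reward here, but world-model-based intrinsic rewards are commonly instantiated as the prediction error of a learned dynamics model: the worse the model predicts the next screen, the more novel the transition. A minimal sketch of that idea, with a hypothetical `world_model` and toy embeddings (names are illustrative, not from the paper):

```python
import numpy as np

def curiosity_reward(world_model, state_emb, action_emb, next_state_emb):
    """Intrinsic reward = prediction error of a learned world model.

    An ICM-style instantiation (one common choice, assumed here): the
    world model predicts the next screen embedding from the current
    embedding and the action; larger error means a more novel transition.
    """
    pred = world_model(state_emb, action_emb)            # predicted next-state embedding
    return float(np.mean((pred - next_state_emb) ** 2))  # MSE as the novelty signal

# Toy world model that has learned nothing: it predicts "no change".
identity_model = lambda s, a: s

s  = np.zeros(4)                       # current screen embedding
a  = np.ones(4)                        # action embedding (e.g. a click)
s2 = np.array([1.0, 0.0, 0.0, 0.0])   # one feature changed after the click
print(curiosity_reward(identity_model, s, a, s2))  # 0.25
```

During the cold-start phase, this reward is nonzero for almost every action (the model predicts everything poorly), which gives the agent a dense learning signal before any task reward is available.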

📝 Abstract
The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision-language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments. We introduce a world-model-based curiosity reward function to help the agent overcome the cold-start phase of exploration, and distill experience streams to further enhance the model's exploration capabilities. Our training framework enhances model exploration in open GUI environments, with trained models showing better environmental adaptation and sustained exploration compared to statically deployed models. Our findings offer a scalable pathway toward AGI systems with self-improving capabilities in complex interactive settings.
Problem

Research questions and friction points this paper is trying to address.

Generalizing GUI agents to novel environments
Reducing reliance on manually curated datasets
Enhancing exploration in open GUI environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM trained via GRPO in dynamic GUI environments
World-model-based curiosity reward for exploration
Experience streams distillation enhances exploration
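GRPO, the optimization method named above, scores each rollout relative to a group of rollouts sampled for the same prompt rather than against a learned value function. A minimal sketch of the group-relative advantage computation (this uses population standard deviation; some implementations use sample standard deviation, and the exact variant used in the paper is not stated here):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO-style training.

    Each rollout's reward is normalized against the mean and standard
    deviation of its own group, so no critic network is needed.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mu) / sigma for r in rewards]

# Three rollouts for the same screen/prompt; the best one gets a
# positive advantage, the worst a negative one.
print(grpo_advantages([1.0, 2.0, 3.0]))
```

The resulting advantages then weight the policy-gradient update for each rollout's tokens, which is what lets the VLM be trained directly from grouped exploration episodes.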
Runliang Niu
Jilin University, China
Natural language processing · Interpretability
Jinglong Ji
School of Artificial Intelligence, Jilin University
Yi Chang
School of Artificial Intelligence, Jilin University
Qi Wang
School of Artificial Intelligence, Jilin University