ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor generalization and heavy reliance on human-annotated data in GUI agents operating within open, dynamic environments, this paper proposes an end-to-end trainable vision-language model (VLM) framework for cross-application autonomous navigation and continual exploration. The method introduces two key innovations: (1) a world-model-driven intrinsic curiosity reward that mitigates the cold-start problem in exploration; and (2) a hybrid training strategy that combines Group Relative Policy Optimization (GRPO) with online experience-stream distillation to improve policy stability and knowledge reuse. Experiments show substantial gains in GUI environment adaptability and long-horizon exploration robustness, and the approach exhibits strong zero-shot generalization to unseen applications. This work establishes a scalable, low-supervision learning paradigm for AGI self-evolution in complex, interactive real-world scenarios.
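The paper does not spell out the exact form of the curiosity reward here, but world-model-based intrinsic rewards are commonly instantiated as the prediction error of a learned dynamics model: the worse the model predicts the next screen, the more novel the transition. A minimal sketch of that idea, with a hypothetical `world_model` and toy embeddings (names are illustrative, not from the paper):

```python
import numpy as np

def curiosity_reward(world_model, state_emb, action_emb, next_state_emb):
    """Intrinsic reward = prediction error of a learned world model.

    An ICM-style instantiation (one common choice, assumed here): the
    world model predicts the next screen embedding from the current
    embedding and the action; larger error means a more novel transition.
    """
    pred = world_model(state_emb, action_emb)            # predicted next-state embedding
    return float(np.mean((pred - next_state_emb) ** 2))  # MSE as the novelty signal

# Toy world model that has learned nothing: it predicts "no change".
identity_model = lambda s, a: s

s  = np.zeros(4)                       # current screen embedding
a  = np.ones(4)                        # action embedding (e.g. a click)
s2 = np.array([1.0, 0.0, 0.0, 0.0])   # one feature changed after the click
print(curiosity_reward(identity_model, s, a, s2))  # 0.25
```

During the cold-start phase, this reward is nonzero for almost every action (the model predicts everything poorly), which gives the agent a dense learning signal before any task reward is available.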

📝 Abstract
The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision-language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments. We introduce a world-model-based curiosity reward function to help the agent overcome the cold-start phase of exploration, and distill experience streams to further enhance the model's exploration capabilities. Our training framework enhances model exploration in open GUI environments, with trained models showing better environmental adaptation and sustained exploration compared to statically deployed models. Our findings offer a scalable pathway toward AGI systems with self-improving capabilities in complex interactive settings.
Problem

Research questions and friction points this paper is trying to address.

Generalizing GUI agents to novel environments
Reducing reliance on manually curated datasets
Enhancing exploration in open GUI environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM trained via GRPO in dynamic GUI environments
World-model-based curiosity reward for exploration
Experience streams distillation enhances exploration
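GRPO, the optimization method named above, scores each rollout relative to a group of rollouts sampled for the same prompt rather than against a learned value function. A minimal sketch of the group-relative advantage computation (this uses population standard deviation; some implementations use sample standard deviation, and the exact variant used in the paper is not stated here):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO-style training.

    Each rollout's reward is normalized against the mean and standard
    deviation of its own group, so no critic network is needed.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mu) / sigma for r in rewards]

# Three rollouts for the same screen/prompt; the best one gets a
# positive advantage, the worst a negative one.
print(grpo_advantages([1.0, 2.0, 3.0]))
```

The resulting advantages then weight the policy-gradient update for each rollout's tokens, which is what lets the VLM be trained directly from grouped exploration episodes.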
Runliang Niu
Jilin University, China
Natural language processing · Interpretability
Jinglong Ji
School of Artificial Intelligence, Jilin University
Yi Chang
School of Artificial Intelligence, Jilin University
Qi Wang
School of Artificial Intelligence, Jilin University