🤖 AI Summary
To address poor generalization and heavy reliance on human-annotated data in GUI agents operating within open, dynamic environments, this paper proposes an end-to-end trainable vision-language model (VLM) framework for cross-application autonomous navigation and continual exploration. The method introduces two key innovations: (1) a world-model-driven intrinsic curiosity reward that mitigates the cold-start problem in exploration; and (2) a hybrid training strategy that combines Group Relative Policy Optimization (GRPO) with online experience-stream distillation to improve policy stability and knowledge reuse. Experiments demonstrate substantial improvements in GUI environment adaptability and long-horizon exploration robustness. Notably, the approach exhibits strong zero-shot generalization to unseen applications. This work offers a scalable, low-supervision learning paradigm for self-improving AGI systems in complex, interactive real-world scenarios.
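As a rough illustration of what a world-model-driven curiosity reward can look like, the sketch below scores a transition by how badly a world model predicted the next state. The summary does not specify the paper's actual reward formulation, so the mean-squared-error form, the function name `curiosity_reward`, and the flat feature-vector state representation are all assumptions made for illustration.

```python
def curiosity_reward(predicted_next, actual_next):
    """Hypothetical intrinsic reward: mean squared prediction error.

    predicted_next: the world model's predicted next-state features (assumed
                    to be a flat sequence of floats).
    actual_next:    the features actually observed after acting.
    A larger error means the transition was less predictable, i.e. more novel.
    """
    if len(predicted_next) != len(actual_next):
        raise ValueError("prediction and observation must have the same length")
    if not predicted_next:
        return 0.0
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted_next, actual_next)]
    return sum(squared_errors) / len(squared_errors)


# A perfectly predicted screen yields no reward; a surprising one yields more.
familiar = curiosity_reward([0.2, 0.5], [0.2, 0.5])
novel = curiosity_reward([0.2, 0.5], [0.9, 0.1])
```

Under this assumed form, an agent that keeps revisiting well-predicted screens earns little reward, which is one plausible way such a signal could push exploration forward before any task reward is available.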
📝 Abstract
The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision-language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments. We introduce a world-model-based curiosity reward function that helps the agent overcome the cold-start phase of exploration. Additionally, distilling experience streams further enhances the model's exploration capabilities. Our training framework improves exploration in open GUI environments: trained models adapt better to their environment and sustain exploration longer than statically deployed models. Our findings offer a scalable pathway toward AGI systems with self-improving capabilities in complex interactive settings.
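For readers unfamiliar with GRPO, its core idea is to score each sampled rollout relative to the other rollouts in the same group rather than against a learned value baseline. The sketch below shows only that normalization step; the function name, the use of the population standard deviation, and the `eps` value are assumptions, and it is not taken from the ScreenExplorer implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that beat the group average get a positive advantage and the rest get a
    negative one. eps (an assumed value) avoids division by zero when all
    rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical exploration rewards for four rollouts from the same start state.
advantages = group_relative_advantages([0.1, 0.4, 0.4, 0.9])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no policy gradient, which is one reason a dense intrinsic signal such as a curiosity reward can matter early in training.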