GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning

πŸ“… 2025-08-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Supervised fine-tuning (SFT) for GUI Visual Grounding (GUI-VG) suffers from heavy reliance on large-scale annotated data and high training costs. Method: This paper proposes GuirlVG, a systematic framework exploring reinforcement learning (RL) for GUI-VG. It decomposes Reinforcement Fine-Tuning (RFT) into its core components and analyzes the optimal formulation of each, introduces an adversarial KL factor that dynamically stabilizes training and mitigates reward over-optimization, and refines reward modeling and policy-update configurations. Results: Trained on only 5.2K samples, GuirlVG surpasses SFT baselines trained on over 10M samples, achieving a 7.7% improvement on ScreenSpot, a 17.2% improvement on ScreenSpotPro, and 91.9% accuracy on ScreenSpotV2. The approach sharply reduces data dependency while improving generalization across diverse GUI grounding benchmarks.

πŸ“ Abstract
Graphical user interface visual grounding (GUI-VG), a core capability for GUI agents, has primarily relied on supervised fine-tuning (SFT) of multimodal large language models (MLLMs), which demands extensive data curation and significant training costs. However, as MLLMs continue to advance and even cover GUI domains during pretraining, the necessity of exhaustive SFT post-training becomes increasingly questionable. Meanwhile, recent successes of rule-based reinforcement fine-tuning (RFT) suggest a more efficient alternative. Despite this promise, the optimal manner of applying RFT for GUI-VG remains unexplored. To bridge this gap, we introduce GuirlVG, a reinforcement learning-based GUI-VG method built on a systematic empirical study and a novel stabilization technique. We find that naive application of RFT underperforms the SFT baseline, motivating a deeper exploration. First, we decompose RFT into its core components and analyze the optimal formulation of each. Second, we propose a novel Adversarial KL Factor that dynamically stabilizes training to mitigate reward over-optimization. Third, we further explore the training configurations of RFT to enhance effectiveness. Extensive experiments show that GuirlVG, with only 5.2K training samples, outperforms SFT methods trained on over 10M samples, achieving a 7.7% improvement on ScreenSpot, a 17.2% improvement on ScreenSpotPro, and 91.9% accuracy on ScreenSpotV2.
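The abstract describes rule-based RFT with a KL term toward the reference model. As a rough illustration of how such a setup typically looks, the sketch below shows a rule-based grounding reward (click point inside the target bounding box) and a group-relative, KL-penalized objective. The function names, the GRPO-style advantage, and the per-token KL estimate are illustrative assumptions, not the paper's exact formulation; in particular, the paper's Adversarial KL Factor adapts the coefficient dynamically, which is not reproduced here.

```python
# Hypothetical sketch of rule-based RFT for GUI visual grounding.
# All names and the objective form are assumptions; the paper's exact
# method (including the Adversarial KL Factor) is not specified here.

def grounding_reward(pred_point, gt_bbox):
    """Rule-based reward: 1.0 if the predicted click point lies inside
    the ground-truth element bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred_point
    x1, y1, x2, y2 = gt_bbox
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

def rft_objective(rewards, logp_policy, logp_ref, kl_coef):
    """Group-relative objective (GRPO-style, assumed): each sample's
    advantage is its reward minus the group mean, and a KL penalty
    toward the reference (pre-RFT) model discourages drift."""
    n = len(rewards)
    mean_r = sum(rewards) / n
    total = 0.0
    for r, lp, lr in zip(rewards, logp_policy, logp_ref):
        advantage = r - mean_r
        kl = lp - lr  # simple per-sample KL estimate (assumed)
        total += advantage * lp - kl_coef * kl
    return total / n
```

A binary in-box reward like this is what makes the approach "rule-based": no learned reward model is needed, only the ground-truth element box.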
Problem

Research questions and friction points this paper is trying to address.

Optimizing reinforcement fine-tuning for GUI visual grounding
Reducing data and training costs in GUI agent development
Improving accuracy in GUI element localization tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning-based GUI-VG method
Adversarial KL Factor stabilizes training
Optimal RFT configurations enhance effectiveness
πŸ”Ž Similar Papers
No similar papers found.