GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

📅 2025-07-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing reinforcement learning methods for GUI grounding rely on sparse binary rewards that fail to capture spatial continuity. This paper proposes GUI-G², a Gaussian reward framework that models interface elements as 2D Gaussian distributions with adaptive variance, shifting from discrete hit-or-miss evaluation to continuous spatial optimization. GUI-G² combines two dense reward components: (i) a Gaussian point reward that decays exponentially with the distance of the prediction from the element centroid, and (ii) a coverage reward that measures the overlap between the predicted Gaussian distribution and the target region. An adaptive variance mechanism calibrates the reward distribution to element dimensions. Evaluated on the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks, GUI-G² substantially outperforms the state-of-the-art UI-TARS-72B, with the largest improvement, 24.7%, on ScreenSpot-Pro. The analysis also indicates greater robustness to interface variations and better generalization to unseen layouts.

📝 Abstract

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
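
The two reward mechanisms described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the function names, the `(x1, y1, x2, y2)` box format, the scale factor `alpha`, and the choice to compute coverage as the probability mass of an axis-aligned Gaussian inside the target box are all assumptions made for illustration.

```python
import math


def adaptive_sigma(box, alpha=0.5):
    """Per-axis standard deviation proportional to element size.

    `alpha` is a hypothetical scale factor, not a value from the paper.
    Assumes the box has positive width and height.
    """
    x1, y1, x2, y2 = box
    return alpha * (x2 - x1), alpha * (y2 - y1)


def gaussian_point_reward(pred_center, target_box, alpha=0.5):
    """Reward that decays exponentially with distance from the target centroid."""
    x1, y1, x2, y2 = target_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx, sy = adaptive_sigma(target_box, alpha)
    px, py = pred_center
    return math.exp(-0.5 * (((px - cx) / sx) ** 2 + ((py - cy) / sy) ** 2))


def _mass_1d(mu, sigma, lo, hi):
    """Probability that a N(mu, sigma^2) variable falls in [lo, hi]."""
    z = sigma * math.sqrt(2)
    return 0.5 * (math.erf((hi - mu) / z) - math.erf((lo - mu) / z))


def coverage_reward(pred_box, target_box, alpha=0.5):
    """Assumed overlap measure: mass of the predicted Gaussian inside the target box.

    The predicted box is turned into an axis-aligned Gaussian (mean = box
    center, sigma from adaptive_sigma), and its mass over the target
    region is computed in closed form, one axis at a time.
    """
    px1, py1, px2, py2 = pred_box
    mux, muy = (px1 + px2) / 2, (py1 + py2) / 2
    sx, sy = adaptive_sigma(pred_box, alpha)
    tx1, ty1, tx2, ty2 = target_box
    return _mass_1d(mux, sx, tx1, tx2) * _mass_1d(muy, sy, ty1, ty2)


target = (100, 100, 200, 140)
on_center = gaussian_point_reward((150, 120), target)   # exactly 1.0
near_miss = gaussian_point_reward((205, 120), target)   # just outside the box
far_miss = gaussian_point_reward((900, 700), target)    # far from the element
```

Under these assumptions, a prediction just outside the element still earns a sizeable reward while a distant one earns almost none, which is the dense signal the abstract contrasts with hit-or-miss scoring. Because `sigma` scales with the box, a small icon and a wide toolbar receive the same reward at the same relative offset.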

Problem

Research questions and friction points this paper is trying to address.

Binary hit-or-miss rewards produce sparse signals that ignore the continuous nature of spatial interaction

GUI elements vary widely in scale, so a single fixed reward shape does not fit all of them

Grounding models need robustness to interface variations and generalization to unseen layouts
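
The sparsity of a binary hit-or-miss reward can be seen in a small Python sketch. The function name and the `(x1, y1, x2, y2)` box format are illustrative assumptions, not details taken from the paper.

```python
def binary_reward(pred_point, target_box):
    """Hit-or-miss reward: 1.0 if the point lies inside the box, else 0.0."""
    x1, y1, x2, y2 = target_box
    px, py = pred_point
    return 1.0 if (x1 <= px <= x2 and y1 <= py <= y2) else 0.0


target = (100, 100, 200, 140)
hit = binary_reward((150, 120), target)        # inside the element
near_miss = binary_reward((201, 120), target)  # one pixel outside
far_miss = binary_reward((900, 700), target)   # far across the screen
```

Here `near_miss` and `far_miss` receive the same reward, so the signal says nothing about how close a wrong prediction was to the target.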

Innovation

Methods, ideas, or system contributions that make the work stand out.

Models GUI elements as Gaussian distributions

Uses adaptive variance for diverse element scales

Transforms GUI grounding to continuous optimization

🔎 Similar Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling