🤖 AI Summary
This paper addresses the challenge of precisely localizing interactive GUI elements from natural language instructions across diverse interfaces. The proposed "Scanner–Locator" collaborative framework works in two stages: a general-purpose vision-language model (VLM) first performs coarse-grained scanning to identify candidate regions, then a lightweight, task-specific model refines localization via fine-grained pixel-coordinate prediction. The approach decouples generic visual-linguistic representation learning from task-specific localization, incorporating hierarchical search and cross-modal feature propagation to emulate human visual cognition. Evaluated on ScreenSpot-Pro, the framework achieves 35.7% overall accuracy, compared with 2.0% and 3.7% for the standalone Scanner and Locator baselines (roughly 18× and 10× improvements, respectively), and significantly outperforms multiple strong competitors. Results demonstrate robustness and generalization across applications and interface layouts, validating the efficacy of this modular, cognitively inspired design.
📝 Abstract
Grounding natural language queries in graphical user interfaces (GUIs) is a challenging task that requires models to comprehend diverse UI elements across various applications and systems, while also accurately predicting the spatial coordinates for the intended operation. To tackle this problem, we propose GMS: Generalist Scanner Meets Specialist Locator, a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. GMS leverages the complementary strengths of general vision-language models (VLMs) and small, task-specific GUI grounding models by assigning them distinct roles within the framework. Specifically, the general VLM acts as a 'Scanner' to identify potential regions of interest, while the fine-tuned grounding model serves as a 'Locator' that outputs precise coordinates within these regions. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Our whole framework consists of five stages and incorporates hierarchical search with cross-modal communication to achieve promising prediction results. Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only $2.0\%$ and $3.7\%$ accuracy respectively when used independently, their integration within the GMS framework yields an overall accuracy of $35.7\%$, representing a $10\times$ improvement. Additionally, GMS significantly outperforms other strong baselines under various settings, demonstrating its robustness and potential for general-purpose GUI grounding.
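The coarse-to-fine control flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `scanner` and `locator` are hypothetical stubs standing in for the general VLM and the fine-tuned grounding model, and the paper's five-stage pipeline with hierarchical search and cross-modal communication is collapsed into a single scan-then-locate pass.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    # A candidate region of interest in pixel coordinates, with a
    # relevance score assigned by the Scanner.
    x0: int
    y0: int
    x1: int
    y1: int
    score: float

def scanner(screen_size: Tuple[int, int], query: str) -> List[Region]:
    """Stub for the generalist VLM 'Scanner' (hypothetical interface).
    Returns coarse candidate regions ranked by relevance to the query;
    here it just proposes the two top quadrants as a toy example."""
    h, w = screen_size
    return [Region(0, 0, w // 2, h // 2, 0.9),
            Region(w // 2, 0, w, h // 2, 0.6)]

def locator(region: Region, query: str) -> Tuple[int, int]:
    """Stub for the specialist 'Locator' (hypothetical interface).
    Predicts a precise click point inside the cropped region; here it
    simply returns the region's center."""
    return ((region.x0 + region.x1) // 2, (region.y0 + region.y1) // 2)

def gms_ground(screen_size: Tuple[int, int], query: str) -> Tuple[int, int]:
    # Coarse-to-fine grounding: scan for candidate regions, then
    # localize precisely within the highest-scoring one.
    regions = scanner(screen_size, query)
    best = max(regions, key=lambda r: r.score)
    return locator(best, query)

print(gms_ground((1080, 1920), "click the save button"))  # → (480, 270)
```

The key design point is that the Scanner never needs pixel-level precision and the Locator never sees the full screen, so each model operates at the granularity it was trained for.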