🤖 AI Summary
To address the dual requirements of low resource dependency and high localization accuracy for lightweight UI automation agents, this paper proposes a single-step GUI agent based on Florence-2-Base (0.27B parameters). Methodologically, we introduce the first vision-based multi-task training paradigm tailored for UI tasks, integrating coordinate regression fine-tuning with MLLM-driven data augmentation—significantly reducing reliance on annotated data and computational resources. Our approach achieves state-of-the-art performance on the Screenspot and OmniAct benchmarks, requiring only 56 GPU-hours (≈$40) for training and exhibiting ultra-low inference latency, enabling efficient on-device deployment. The core contribution lies in achieving high-precision UI element localization with an exceptionally compact model, thereby breaking the conventional dependence on large-scale annotations and high-end hardware.
📝 Abstract
We present an UI agent for user interface (UI) interaction tasks, using Vision-Language Model Florence-2-Base. The agent's primary task is identifying the screen coordinates of the UI element corresponding to the user's command. It demonstrates very strong performance on Screenspot and OmniAct annotations, while maintaining a very small size of 0.27B parameters and minimal latency. Moreover, training needs small compute budget of 56 GPU-hours (worth about 40 USD). Relevant improvement comes from vision-specific multi-task training and MLLM-based data augmentation. We hope that decreased needs for expensive compute resources and manually annotated data will allow to facilitate more inclusive and sustainable research of UI agents.