TinyClick: Single-Turn Agent for Empowering GUI Automation

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
To address the dual requirements of low resource dependency and high localization accuracy for lightweight UI automation agents, this paper proposes a single-step GUI agent based on Florence-2-Base (0.27B parameters). Methodologically, it introduces a vision-based multi-task training paradigm tailored for UI tasks, integrating coordinate-regression fine-tuning with MLLM-driven data augmentation, which significantly reduces reliance on annotated data and computational resources. The approach achieves state-of-the-art performance on the Screenspot and OmniAct benchmarks while requiring only 56 GPU-hours (≈$40) of training and exhibiting very low inference latency, enabling efficient on-device deployment. The core contribution is high-precision UI element localization with an exceptionally compact model, breaking the conventional dependence on large-scale annotations and high-end hardware.
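The single-turn task described above maps a screenshot plus a user command to a click point. Florence-2-style models typically express coordinates as quantized location tokens rather than raw pixels; a minimal decoding sketch is shown below, assuming `<loc_N>` tokens with 1000 bins per axis (the token format and bin count are assumptions for illustration, not details confirmed by this summary):

```python
import re

def decode_click(output: str, width: int, height: int) -> tuple[int, int]:
    """Map assumed Florence-2-style <loc_N> tokens (N in 0..999) to pixel
    coordinates on a width x height screenshot."""
    bins = [int(m) for m in re.findall(r"<loc_(\d+)>", output)]
    if len(bins) < 2:
        raise ValueError("expected at least two <loc_N> tokens")
    x_bin, y_bin = bins[0], bins[1]
    # Each bin covers 1/1000 of the axis; take the bin center.
    x = round((x_bin + 0.5) / 1000 * width)
    y = round((y_bin + 0.5) / 1000 * height)
    return x, y

# Hypothetical model output for "click the search button" on a 1920x1080 screen.
print(decode_click("click <loc_500><loc_250>", 1920, 1080))  # → (961, 271)
```

Decoding to the bin center (rather than the bin edge) keeps the click inside the region the token denotes regardless of screen resolution.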

📝 Abstract
We present a UI agent for user interface (UI) interaction tasks, built on the Vision-Language Model Florence-2-Base. The agent's primary task is identifying the screen coordinates of the UI element corresponding to the user's command. It demonstrates very strong performance on Screenspot and OmniAct annotations, while maintaining a very small size of 0.27B parameters and minimal latency. Moreover, training requires a small compute budget of 56 GPU-hours (worth about 40 USD). The key improvement comes from vision-specific multi-task training and MLLM-based data augmentation. We hope that the decreased need for expensive compute resources and manually annotated data will facilitate more inclusive and sustainable research on UI agents.
Problem

Research questions and friction points this paper is trying to address.

Identifies UI element coordinates from user commands
Achieves strong performance with minimal size and latency
Reduces compute and annotation costs for UI agent research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Model Florence-2-Base
Employs vision-specific multi-task training
Utilizes MLLM-based data augmentation
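The multi-task training idea above trains one compact model on several UI-centric objectives in a single stream. A minimal sketch of such a sample mixer follows; the task names and sampling weights are illustrative assumptions, not the paper's actual task mix:

```python
import random

def mix_tasks(datasets: dict[str, list], weights: dict[str, float],
              n: int, seed: int = 0) -> list[tuple[str, object]]:
    """Draw n training samples across tasks, with per-task sampling weights.

    Hypothetical helper: the paper describes multi-task training but this
    summary does not specify its tasks or mixing ratios.
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[k] for k in names]
    batch = []
    for _ in range(n):
        task = rng.choices(names, weights=probs, k=1)[0]
        batch.append((task, rng.choice(datasets[task])))
    return batch

samples = mix_tasks(
    {"click": ["cmd1", "cmd2"], "caption": ["img1"], "detect": ["ui1", "ui2"]},
    {"click": 0.5, "caption": 0.25, "detect": 0.25},
    n=8,
)
```

Weighted sampling lets the grounding (click) task dominate the stream while auxiliary tasks such as captioning still contribute gradient signal, which is one common way a single small model can benefit from multi-task data.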
Pawel Pawlowski
K. Zawistowski
Samsung R&D Poland
Wojciech Lapacz
Samsung R&D Poland
Adam Wiacek
Samsung R&D Poland
Marcin Skorupa
Samsung R&D Poland
Sebastien Postansque
Samsung R&D Poland
Jakub Hoscilowicz
Samsung Research, Warsaw University of Technology
AI Agents · Multimodal LLM