TinyClick: Single-Turn Agent for Empowering GUI Automation

📅 2024-10-09

🏛️ arXiv.org

📈 Citations: 6

✨ Influential: 0

🤖 AI Summary

To address the dual requirements of low resource dependency and high localization accuracy for lightweight UI automation agents, this paper proposes a single-step GUI agent based on Florence-2-Base (0.27B parameters). Methodologically, we introduce the first vision-based multi-task training paradigm tailored for UI tasks, integrating coordinate regression fine-tuning with MLLM-driven data augmentation—significantly reducing reliance on annotated data and computational resources. Our approach achieves state-of-the-art performance on the Screenspot and OmniAct benchmarks, requiring only 56 GPU-hours (≈$40) for training and exhibiting ultra-low inference latency, enabling efficient on-device deployment. The core contribution lies in achieving high-precision UI element localization with an exceptionally compact model, thereby breaking the conventional dependence on large-scale annotations and high-end hardware.

Technology Category

Application Category

📝 Abstract

We present an UI agent for user interface (UI) interaction tasks, using Vision-Language Model Florence-2-Base. The agent's primary task is identifying the screen coordinates of the UI element corresponding to the user's command. It demonstrates very strong performance on Screenspot and OmniAct annotations, while maintaining a very small size of 0.27B parameters and minimal latency. Moreover, training needs small compute budget of 56 GPU-hours (worth about 40 USD). Relevant improvement comes from vision-specific multi-task training and MLLM-based data augmentation. We hope that decreased needs for expensive compute resources and manually annotated data will allow to facilitate more inclusive and sustainable research of UI agents.

Problem

Research questions and friction points this paper is trying to address.

Identifies UI element coordinates from user commands

Achieves strong performance with minimal size and latency

Reduces compute and annotation costs for UI agent research

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Model Florence-2-Base

Employs vision-specific multi-task training

Utilizes MLLM-based data augmentation

🔎 Similar Papers

No similar papers found.

Authors to Follow