Retrieval-augmented GUI Agents with Generative Guidelines

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI agents generalize poorly on complex digital tasks because long-tail scenario knowledge is missing from their scarce training data. Method: This paper proposes RAG-GUI, a model-agnostic, plug-and-play inference-time knowledge-augmentation framework that combines web-tutorial retrieval with generative guidance, trained via supervised finetuning (SFT) and further refined by a self-guided rejection sampling finetuning (RSF) strategy to improve robustness on rare scenarios. The framework unifies vision-language modeling, retrieval-augmented generation (RAG), SFT, and RSF for efficient knowledge utilization and lightweight adaptation. Contribution/Results: Experiments across three real-world GUI task categories show that RAG-GUI consistently outperforms baseline agents and surpasses other inference-time baselines by 2.6-13.3% across two model sizes, enhancing zero-shot transfer and long-tail generalization and offering a scalable path toward low-resource GUI automation.
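The summary above describes an inference-time loop: retrieve relevant web tutorials for the task, condense them into a guideline, and pass that guideline to any VLM-based agent alongside the usual inputs. A minimal sketch of that flow is below; the paper's actual retriever, prompts, and agent interface are not given here, so every function name (`retrieve_tutorials`, `generate_guideline`, `agent_act`) and the toy word-overlap retrieval are hypothetical stand-ins.

```python
# Hedged sketch of a plug-and-play inference-time augmentation loop.
# All helper names and the retrieval heuristic are assumptions, not
# the paper's actual API.

def retrieve_tutorials(task, corpus, k=2):
    """Toy lexical retrieval: rank tutorials by word overlap with the task."""
    task_words = set(task.lower().split())
    scored = sorted(corpus,
                    key=lambda t: -len(task_words & set(t.lower().split())))
    return scored[:k]

def generate_guideline(task, tutorials):
    """Stand-in for the model that condenses tutorials into a guideline."""
    return f"To '{task}': follow steps adapted from {len(tutorials)} tutorial(s)."

def agent_act(task, screenshot, guideline=None):
    """Stand-in for any VLM-based GUI agent; the guideline is optional,
    which is what keeps the plug-in model-agnostic."""
    prompt = f"Task: {task}\nScreen: {screenshot}"
    if guideline:
        prompt += f"\nGuideline: {guideline}"
    return prompt  # a real agent would decode a GUI action from this prompt

corpus = ["how to export a report as pdf", "how to change account password"]
task = "export report as pdf"
tutorials = retrieve_tutorials(task, corpus, k=1)
guideline = generate_guideline(task, tutorials)
action = agent_act(task, "<screenshot>", guideline)
```

Because the guideline is just an extra prompt field, the same wrapper can sit in front of different agent backbones without retraining them, matching the "generic plug-in" framing in the abstract.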

📝 Abstract
GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI, a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Automating complex digital tasks with GUI agents
Overcoming limited training data for vision-language models
Addressing rare scenarios requiring long-tailed knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages web tutorials during inference time
Uses supervised finetuning with rejection sampling
Model-agnostic plugin enhancing vision-language agents
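The second innovation, rejection sampling finetuning, can be sketched as: roll out several trajectories per task, keep only those judged successful, and feed the survivors back as supervised finetuning data. The rollout and success judge below are toy, seeded stand-ins; the paper's actual sampling and filtering criteria are not specified here.

```python
import random

# Hedged sketch of rejection sampling finetuning (RSF): sample multiple
# trajectories per task, keep only the successful ones, and reuse them
# as SFT data. The rollout and the success signal are toy stand-ins.

def sample_trajectory(task, seed):
    """Stand-in for rolling out the agent once; returns (trajectory, success)."""
    rng = random.Random(seed)          # seeded for reproducibility
    success = rng.random() > 0.5       # toy stand-in for a task-success judge
    return {"task": task, "actions": [f"step-{seed}"]}, success

def rejection_sample(tasks, n_samples=4):
    """Keep only successful trajectories for further finetuning."""
    kept = []
    for i, task in enumerate(tasks):
        for j in range(n_samples):
            traj, ok = sample_trajectory(task, seed=i * n_samples + j)
            if ok:
                kept.append(traj)
    return kept

dataset = rejection_sample(["book a flight", "rename a file"])
# 'dataset' would then be fed back into supervised finetuning.
```

The "self-guided" aspect suggests the agent's own outputs provide the training signal, so no extra human demonstrations are needed for the refinement stage.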
Authors

Ran Xu (Emory University)
Kaixin Ma (Apple)
Wenhao Yu (Tencent AI Lab)
Hongming Zhang (Tencent AI Lab)
Joyce C. Ho (Emory University)
Carl Yang (Waymo LLC; PhD, University of California, Davis)
Dong Yu (Tencent AI Lab)