Retrieval-augmented GUI Agents with Generative Guidelines

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI agents generalize poorly on complex digital tasks because long-tail scenario knowledge is missing from their scarce training data. Method: This paper proposes RAG-GUI, a model-agnostic, plug-and-play inference-time knowledge-augmentation framework that combines web-tutorial retrieval with generative guidance, trained via supervised finetuning (SFT) and further refined by a self-guided rejection sampling finetuning (RSF) strategy to improve robustness on rare scenarios. The framework unifies vision-language modeling, retrieval-augmented generation (RAG), SFT, and RSF for efficient knowledge utilization and lightweight adaptation. Contribution/Results: Experiments across three real-world GUI task categories show that RAG-GUI consistently outperforms baseline agents and surpasses other inference-time baselines by 2.6-13.3% across two model sizes, enhancing zero-shot transfer and long-tail generalization and offering a scalable path toward low-resource GUI automation.
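The summary above describes an inference-time loop: retrieve relevant web tutorials for the task, condense them into a guideline, and pass that guideline to any VLM-based agent alongside the usual inputs. A minimal sketch of that flow is below; the paper's actual retriever, prompts, and agent interface are not given here, so every function name (`retrieve_tutorials`, `generate_guideline`, `agent_act`) and the toy word-overlap retrieval are hypothetical stand-ins.

```python
# Hedged sketch of a plug-and-play inference-time augmentation loop.
# All helper names and the retrieval heuristic are assumptions, not
# the paper's actual API.

def retrieve_tutorials(task, corpus, k=2):
    """Toy lexical retrieval: rank tutorials by word overlap with the task."""
    task_words = set(task.lower().split())
    scored = sorted(corpus,
                    key=lambda t: -len(task_words & set(t.lower().split())))
    return scored[:k]

def generate_guideline(task, tutorials):
    """Stand-in for the model that condenses tutorials into a guideline."""
    return f"To '{task}': follow steps adapted from {len(tutorials)} tutorial(s)."

def agent_act(task, screenshot, guideline=None):
    """Stand-in for any VLM-based GUI agent; the guideline is optional,
    which is what keeps the plug-in model-agnostic."""
    prompt = f"Task: {task}\nScreen: {screenshot}"
    if guideline:
        prompt += f"\nGuideline: {guideline}"
    return prompt  # a real agent would decode a GUI action from this prompt

corpus = ["how to export a report as pdf", "how to change account password"]
task = "export report as pdf"
tutorials = retrieve_tutorials(task, corpus, k=1)
guideline = generate_guideline(task, tutorials)
action = agent_act(task, "<screenshot>", guideline)
```

Because the guideline is just an extra prompt field, the same wrapper can sit in front of different agent backbones without retraining them, matching the "generic plug-in" framing in the abstract.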

📝 Abstract
GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI, a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Automating complex digital tasks with GUI agents
Overcoming limited training data for vision-language models
Addressing rare scenarios requiring long-tailed knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages web tutorials during inference time
Uses supervised finetuning with rejection sampling
Model-agnostic plugin enhancing vision-language agents
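The second innovation, rejection sampling finetuning, can be sketched as: roll out several trajectories per task, keep only those judged successful, and feed the survivors back as supervised finetuning data. The rollout and success judge below are toy, seeded stand-ins; the paper's actual sampling and filtering criteria are not specified here.

```python
import random

# Hedged sketch of rejection sampling finetuning (RSF): sample multiple
# trajectories per task, keep only the successful ones, and reuse them
# as SFT data. The rollout and the success signal are toy stand-ins.

def sample_trajectory(task, seed):
    """Stand-in for rolling out the agent once; returns (trajectory, success)."""
    rng = random.Random(seed)          # seeded for reproducibility
    success = rng.random() > 0.5       # toy stand-in for a task-success judge
    return {"task": task, "actions": [f"step-{seed}"]}, success

def rejection_sample(tasks, n_samples=4):
    """Keep only successful trajectories for further finetuning."""
    kept = []
    for i, task in enumerate(tasks):
        for j in range(n_samples):
            traj, ok = sample_trajectory(task, seed=i * n_samples + j)
            if ok:
                kept.append(traj)
    return kept

dataset = rejection_sample(["book a flight", "rename a file"])
# 'dataset' would then be fed back into supervised finetuning.
```

The "self-guided" aspect suggests the agent's own outputs provide the training signal, so no extra human demonstrations are needed for the refinement stage.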
Authors

Ran Xu (Emory University)
Kaixin Ma (Apple)
Wenhao Yu (Tencent AI Lab)
Hongming Zhang (Tencent AI Lab)
Joyce C. Ho (Emory University)
Carl Yang (Waymo LLC; PhD, University of California, Davis)
Dong Yu (Tencent AI Lab)