SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

📅 2025-08-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing single-agent approaches suffer from limited capability in modeling GUI structural semantics, while multi-agent reinforcement learning (MARL) enables capability decomposition but incurs inefficient training and poor compatibility with large vision-language models (LVLMs), hindering reliable natural language-to-GUI action mapping. To address these challenges, we propose SWIRL: a staged interleaved reinforcement learning framework that decomposes multi-agent collaboration into sequential single-agent tasks, thereby enhancing training stability and efficiency while guaranteeing theoretical safety, monotonic policy improvement, and reward convergence. SWIRL employs an LVLM-based Navigator-Interactor architecture to decouple high-level semantic planning from low-level action execution. Evaluated on both GUI control and multi-agent mathematical reasoning benchmarks, SWIRL achieves state-of-the-art performance, demonstrating superior training efficiency and strong generalization across diverse interactive tasks.

📝 Abstract
The rapid advancement of large vision-language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, multi-agent reinforcement learning (MARL) has often been hindered by training inefficiency and poor compatibility with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
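The staged interleaved schedule described above — update one agent per stage while the other stays frozen, alternating across rounds — can be sketched as follows. This is a minimal illustration of the training loop's structure only; the `Agent` class, its `update` method, and the round count are hypothetical placeholders, not the paper's actual API or RL update rule.

```python
class Agent:
    """Stand-in for one policy (e.g. Navigator or Interactor)."""

    def __init__(self, name):
        self.name = name
        self.version = 0  # stands in for the policy parameters

    def update(self, frozen_partner):
        # Placeholder for a single-agent RL update (e.g. policy-gradient
        # steps) performed against the partner's fixed policy.
        self.version += 1


def swirl_schedule(navigator, interactor, rounds):
    """Alternate single-agent updates: exactly one agent trains per stage."""
    history = []
    for r in range(rounds):
        # Stage 1: update the Navigator while the Interactor is frozen.
        navigator.update(frozen_partner=interactor)
        history.append((r, navigator.name, navigator.version))
        # Stage 2: update the Interactor while the Navigator is frozen.
        interactor.update(frozen_partner=navigator)
        history.append((r, interactor.name, interactor.version))
    return history


log = swirl_schedule(Agent("Navigator"), Agent("Interactor"), rounds=3)
```

Because each stage is an ordinary single-agent RL problem against a fixed partner, the per-stage analysis (safety bound, monotonic improvement) can be applied one agent at a time, which is what makes the decomposition tractable.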
Problem

Research questions and friction points this paper is trying to address.

Single-agent GUI agents are limited in modeling GUI structural semantics
MARL training is inefficient and poorly compatible with current LVLM architectures
Reliable natural language-to-GUI action mapping requires stable coordination across agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Staged workflow that interleaves single-agent RL updates, training one agent at a time
Theoretical guarantees: stepwise safety bound, cross-round monotonic improvement, and return convergence
Navigator (high-level semantic planning) and Interactor (atomic action execution) agents for GUI control
👥 Authors
Quanfeng Lu — Ph.D. student, Shanghai Jiao Tong University (Multimodal Learning, Large Language Model)
Zhantao Ma — The University of Hong Kong
Shuai Zhong — The University of Hong Kong
Jin Wang — The University of Hong Kong
Dahai Yu — Florida State University (Uncertainty Quantification)
Michael K. Ng — Hong Kong Baptist University
Ping Luo — National University of Defense Technology (Distributed Computing)