D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI agents for mobile task automation face challenges including data scarcity, delayed error detection, and contradictory guidance. This paper proposes a human-cognition-inspired multi-agent collaboration framework structured into three sequential phases: Think, Align, and Reflect. It introduces a fine-grained, app-specific tip retrieval mechanism to inform decision-making; before execution, a Thought-Action Consistency (TAC) Check verifies that each proposed action matches the agent's stated intent, with an Action Correction Agent (ACA) repairing inconsistent actions; after execution, a Status Reflection Agent (SRA) assesses the interface state and enables strategic learning from experience. Crucially, the approach requires no training on large-scale demonstration trajectories, enhancing robustness and generalization. On the AndroidWorld and ScreenSpot-V2 benchmarks it achieves success rates of 75.8% and 96.8%, respectively, setting new state-of-the-art results. Ablation studies confirm the substantial contribution of each component.
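
As a concrete illustration, here is a minimal sketch of one Think–Align–Reflect step. All names (`Action`, `tac_check`, `run_step`, the tip store contents) are illustrative assumptions for exposition, not the paper's actual interfaces.

```python
# Hypothetical sketch of a single Think-Align-Reflect step; the real
# framework delegates these stand-in functions to MLLM-backed agents.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str       # e.g. "tap", "type", "scroll"
    target: str     # UI element label or coordinates
    rationale: str  # the "thought" that motivated the action

def retrieve_app_tips(app: str, instruction: str) -> list[str]:
    """Stand-in for the fine-grained, app-specific tip retrieval step."""
    tip_store = {"clock": ["Alarms live under the 'Alarm' tab, not 'Timer'."]}
    return tip_store.get(app, [])

def think(instruction: str, screen: str, tips: list[str]) -> Action:
    """Stand-in for the tip-informed MLLM decision step (Think phase)."""
    return Action(kind="tap", target="+", rationale="Tap '+' to add a new alarm")

def tac_check(action: Action) -> bool:
    """Thought-Action Consistency: does the action match its own rationale?"""
    return action.target.lower() in action.rationale.lower()

def correct(action: Action) -> Action:
    """Stand-in for the Action Correction Agent: repair before execution."""
    return Action(action.kind, action.target, action.rationale)

def reflect(screen_before: str, screen_after: str) -> str:
    """Stand-in for the Status Reflection Agent: classify the outcome."""
    return "progress" if screen_after != screen_before else "no-op"

def run_step(app: str, instruction: str, screen: str,
             execute: Callable[[Action], str]) -> tuple[Action, str]:
    tips = retrieve_app_tips(app, instruction)  # Think: tip-informed decision
    action = think(instruction, screen, tips)
    if not tac_check(action):                   # Align: pre-execution TAC check
        action = correct(action)                # ...with correction on failure
    screen_after = execute(action)              # act on the device
    status = reflect(screen, screen_after)      # Reflect: post-execution review
    return action, status
```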

📝 Abstract
Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advancements, current approaches are hindered by several critical challenges: a data bottleneck in end-to-end training, the high cost of delayed error detection, and the risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, we present D-Artemis, a novel deliberative framework. D-Artemis leverages a fine-grained, app-specific tip retrieval mechanism to inform its decision-making process. It also employs a proactive Pre-execution Alignment stage, where a Thought-Action Consistency (TAC) Check module and an Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) for GUI tasks without training on complex trajectory datasets, demonstrating strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results across two major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.
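
The Pre-execution Alignment stage can be pictured as a gate between thought and device. Below is a minimal sketch assuming a generic text-completion call `mllm(prompt)`; the paper's actual prompts, models, and message formats are not specified here.

```python
# Hypothetical sketch of the alignment gate: TAC Check, then ACA on failure.
def mllm(prompt: str) -> str:
    """Placeholder for a call to a general-purpose multimodal LLM."""
    return "CONSISTENT"

def tac_check(thought: str, action: str) -> bool:
    """Ask the model whether the proposed action actually realizes the thought."""
    verdict = mllm(
        f"Thought: {thought}\n"
        f"Proposed action: {action}\n"
        "Answer CONSISTENT or INCONSISTENT."
    )
    return verdict.strip().upper().startswith("CONSISTENT")

def action_correction_agent(thought: str, action: str) -> str:
    """Regenerate the action so that it matches the stated thought."""
    return mllm(
        "The action below does not match the thought.\n"
        f"Thought: {thought}\nAction: {action}\n"
        "Return a corrected action."
    )

def align(thought: str, action: str) -> str:
    """Pre-execution alignment: only consistent actions reach the device."""
    if tac_check(thought, action):
        return action
    return action_correction_agent(thought, action)
```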
Problem

Research questions and friction points this paper is trying to address.

Automating GUI tasks without complex trajectory training
Reducing execution failures through proactive alignment mechanisms
Overcoming data bottlenecks and delayed error detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained, app-specific tip retrieval to inform decision-making (see the retrieval sketch after this list)
Pre-execution alignment via a Thought-Action Consistency Check and an Action Correction Agent
Post-execution Status Reflection Agent enabling strategic learning from experience
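
One way to picture the tip retrieval step referenced above: tips are stored per app and ranked against the current instruction. The `TIP_STORE` contents and the `difflib`-based similarity scoring below are illustrative stand-ins, not the paper's retrieval method.

```python
# Hedged sketch of fine-grained, app-specific tip retrieval.
from difflib import SequenceMatcher

TIP_STORE: dict[str, list[str]] = {
    "contacts": [
        "Tap the '+' floating button to create a new contact.",
        "Use the search bar at the top to find an existing contact.",
    ],
    "clock": [
        "Alarms live under the 'Alarm' tab, not 'Timer'.",
    ],
}

def retrieve_tips(app: str, instruction: str, k: int = 2) -> list[str]:
    """Return the k tips for `app` most similar to the instruction."""
    candidates = TIP_STORE.get(app, [])
    scored = sorted(
        candidates,
        key=lambda t: SequenceMatcher(None, instruction.lower(), t.lower()).ratio(),
        reverse=True,
    )
    return scored[:k]

# Example: retrieved tips are prepended to the agent's decision prompt.
print(retrieve_tips("contacts", "Add a new contact named Alice"))
```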
👥 Authors
Hongze Mi (Tianjin University)
Yibo Feng (Didichuxing Co. Ltd)
Wenjie Lu (Didichuxing Co. Ltd)
Yuqi Wang (Didichuxing Co. Ltd)
Jinyuan Li (Tianjin University)
Song Cao (University of Southern California)
He Cui (Didichuxing Co. Ltd)
Tengfei Tian (Didichuxing Co. Ltd)
Xuelin Zhang (Sichuan University)
Haotian Luo (Tianjin University of Science and Technology)
Di Sun (Huazhong Agricultural University)
Naiqiang Tan (Didichuxing Co. Ltd)
Gang Pan (Tianjin University)