D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI agents for mobile task automation face challenges including data scarcity, delayed error detection, and contradictory guidance. This paper proposes a human-cognition-inspired multi-agent collaboration framework structured into three sequential phases: Think, Align, and Reflect. It introduces a fine-grained, app-specific tip retrieval mechanism to inform decision-making; before execution, a Thought-Action Consistency (TAC) Check verifies that each proposed action matches the agent's stated intent, with an Action Correction Agent (ACA) repairing inconsistent actions; after execution, a Status Reflection Agent (SRA) assesses the interface state and enables strategic learning from experience. Crucially, the approach requires no training on large-scale demonstration trajectories, enhancing robustness and generalization. On the AndroidWorld and ScreenSpot-V2 benchmarks it achieves success rates of 75.8% and 96.8%, respectively, setting new state-of-the-art results. Ablation studies confirm the substantial contribution of each component.
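
As a concrete illustration, here is a minimal sketch of one Think–Align–Reflect step. All names (`Action`, `tac_check`, `run_step`, the tip store contents) are illustrative assumptions for exposition, not the paper's actual interfaces.

```python
# Hypothetical sketch of a single Think-Align-Reflect step; the real
# framework delegates these stand-in functions to MLLM-backed agents.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str       # e.g. "tap", "type", "scroll"
    target: str     # UI element label or coordinates
    rationale: str  # the "thought" that motivated the action

def retrieve_app_tips(app: str, instruction: str) -> list[str]:
    """Stand-in for the fine-grained, app-specific tip retrieval step."""
    tip_store = {"clock": ["Alarms live under the 'Alarm' tab, not 'Timer'."]}
    return tip_store.get(app, [])

def think(instruction: str, screen: str, tips: list[str]) -> Action:
    """Stand-in for the tip-informed MLLM decision step (Think phase)."""
    return Action(kind="tap", target="+", rationale="Tap '+' to add a new alarm")

def tac_check(action: Action) -> bool:
    """Thought-Action Consistency: does the action match its own rationale?"""
    return action.target.lower() in action.rationale.lower()

def correct(action: Action) -> Action:
    """Stand-in for the Action Correction Agent: repair before execution."""
    return Action(action.kind, action.target, action.rationale)

def reflect(screen_before: str, screen_after: str) -> str:
    """Stand-in for the Status Reflection Agent: classify the outcome."""
    return "progress" if screen_after != screen_before else "no-op"

def run_step(app: str, instruction: str, screen: str,
             execute: Callable[[Action], str]) -> tuple[Action, str]:
    tips = retrieve_app_tips(app, instruction)  # Think: tip-informed decision
    action = think(instruction, screen, tips)
    if not tac_check(action):                   # Align: pre-execution TAC check
        action = correct(action)                # ...with correction on failure
    screen_after = execute(action)              # act on the device
    status = reflect(screen, screen_after)      # Reflect: post-execution review
    return action, status
```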

📝 Abstract
Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advancements, current approaches are hindered by several critical challenges: a data bottleneck in end-to-end training, the high cost of delayed error detection, and the risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, we present D-Artemis, a novel deliberative framework. D-Artemis leverages a fine-grained, app-specific tip retrieval mechanism to inform its decision-making process. It also employs a proactive Pre-execution Alignment stage, where a Thought-Action Consistency (TAC) Check module and an Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) for GUI tasks without training on complex trajectory datasets, demonstrating strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results across two major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.
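
The Pre-execution Alignment stage can be pictured as a gate between thought and device. Below is a minimal sketch assuming a generic text-completion call `mllm(prompt)`; the paper's actual prompts, models, and message formats are not specified here.

```python
# Hypothetical sketch of the alignment gate: TAC Check, then ACA on failure.
def mllm(prompt: str) -> str:
    """Placeholder for a call to a general-purpose multimodal LLM."""
    return "CONSISTENT"

def tac_check(thought: str, action: str) -> bool:
    """Ask the model whether the proposed action actually realizes the thought."""
    verdict = mllm(
        f"Thought: {thought}\n"
        f"Proposed action: {action}\n"
        "Answer CONSISTENT or INCONSISTENT."
    )
    return verdict.strip().upper().startswith("CONSISTENT")

def action_correction_agent(thought: str, action: str) -> str:
    """Regenerate the action so that it matches the stated thought."""
    return mllm(
        "The action below does not match the thought.\n"
        f"Thought: {thought}\nAction: {action}\n"
        "Return a corrected action."
    )

def align(thought: str, action: str) -> str:
    """Pre-execution alignment: only consistent actions reach the device."""
    if tac_check(thought, action):
        return action
    return action_correction_agent(thought, action)
```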
Problem

Research questions and friction points this paper is trying to address.

Automating GUI tasks without complex trajectory training
Reducing execution failures through proactive alignment mechanisms
Overcoming data bottlenecks and delayed error detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained, app-specific tip retrieval to inform decision-making (see the retrieval sketch after this list)
Pre-execution alignment via a Thought-Action Consistency Check and an Action Correction Agent
Post-execution Status Reflection Agent enabling strategic learning from experience
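
One way to picture the tip retrieval step referenced above: tips are stored per app and ranked against the current instruction. The `TIP_STORE` contents and the `difflib`-based similarity scoring below are illustrative stand-ins, not the paper's retrieval method.

```python
# Hedged sketch of fine-grained, app-specific tip retrieval.
from difflib import SequenceMatcher

TIP_STORE: dict[str, list[str]] = {
    "contacts": [
        "Tap the '+' floating button to create a new contact.",
        "Use the search bar at the top to find an existing contact.",
    ],
    "clock": [
        "Alarms live under the 'Alarm' tab, not 'Timer'.",
    ],
}

def retrieve_tips(app: str, instruction: str, k: int = 2) -> list[str]:
    """Return the k tips for `app` most similar to the instruction."""
    candidates = TIP_STORE.get(app, [])
    scored = sorted(
        candidates,
        key=lambda t: SequenceMatcher(None, instruction.lower(), t.lower()).ratio(),
        reverse=True,
    )
    return scored[:k]

# Example: retrieved tips are prepended to the agent's decision prompt.
print(retrieve_tips("contacts", "Add a new contact named Alice"))
```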
👥 Authors
Hongze Mi (Tianjin University)
Yibo Feng (Didichuxing Co. Ltd)
Wenjie Lu (Didichuxing Co. Ltd)
Yuqi Wang (Didichuxing Co. Ltd)
Jinyuan Li (Tianjin University)
Song Cao (University of Southern California)
He Cui (Didichuxing Co. Ltd)
Tengfei Tian (Didichuxing Co. Ltd)
Xuelin Zhang (Sichuan University)
Haotian Luo (Tianjin University of Science and Technology)
Di Sun (Huazhong Agricultural University)
Naiqiang Tan (Didichuxing Co. Ltd)
Gang Pan (Tianjin University)