MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation

📅 2024-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mobile GUI automation agents suffer from planning failures in dynamic multimodal interfaces, primarily due to strong coupling among textual, visual, and spatial modalities, as well as heterogeneous action spaces across pages and tasks. To address this, we propose the first adaptive MLLM-based agent framework for complex mobile GUIs. Our approach introduces a reflective adaptive planning module with error recovery capabilities, a hierarchical multi-dimensional memory system—integrating short-term operational traces, long-term task patterns, and cross-application experience—and a GUI state reflection mechanism coupled with dynamic action-space alignment. Evaluated on our newly constructed benchmark MobBench and on AndroidArena, our framework achieves an 18.7% absolute improvement in task success rate, significantly enhancing cross-page generalization. It is the first to enable robust end-to-end automation of complex, real-world mobile GUI tasks.

📝 Abstract
Existing Multimodal Large Language Model (MLLM)-based agents face significant challenges in handling complex GUI (Graphical User Interface) interactions on devices. These challenges arise from the dynamic and structured nature of GUI environments, which integrate text, images, and spatial relationships, as well as the variability in action spaces across different pages and tasks. To address these limitations, we propose MobA, a novel MLLM-based mobile assistant system. MobA introduces an adaptive planning module that incorporates a reflection mechanism for error recovery and dynamically adjusts plans to align with real environment contexts and the action module's execution capacity. Additionally, a multifaceted memory module provides comprehensive memory support to enhance adaptability and efficiency. We also present MobBench, a dataset designed for complex mobile interactions. Experimental results on MobBench and AndroidArena demonstrate MobA's ability to handle dynamic GUI environments and perform complex mobile tasks.
Problem

Research questions and friction points this paper is trying to address.

Handling complex GUI interactions on mobile devices
Coping with dynamic and structured GUI environments
Limited adaptability and efficiency in task automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive planning with reflection for error recovery
Multifaceted memory module enhances adaptability
MobBench dataset for complex mobile interactions
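The paper does not include an implementation in this listing, but the adaptive planning idea above — execute a plan step, observe the GUI state, and replan on failure while recording traces into memory — can be sketched roughly as follows. All names (`Memory`, `run_task`, the `planner`/`env` interfaces) are hypothetical illustrations, not MobA's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Hypothetical multifaceted memory: short-term traces + long-term task patterns."""
    short_term: list = field(default_factory=list)   # recent (state, action, ok) steps
    long_term: dict = field(default_factory=dict)    # task -> successful action sequence

    def record(self, step):
        self.short_term.append(step)

    def commit(self, task):
        # Promote the successful trace to a reusable long-term pattern.
        self.long_term[task] = [action for _, action, _ in self.short_term]

def run_task(task, env, planner, max_steps=10):
    """Plan-act-reflect loop: on a failed action, replan from the observed
    GUI state instead of aborting. Returns True once the plan is exhausted."""
    memory = Memory()
    plan = planner.plan(task, memory)
    for _ in range(max_steps):
        if not plan:                      # all steps done: task complete
            memory.commit(task)
            return True
        action = plan.pop(0)
        ok, state = env.execute(action)
        memory.record((state, action, ok))
        if not ok:                        # reflection: revise the remaining plan
            plan = planner.replan(task, state, memory)
    return False                          # step budget exhausted
```

The key design choice this sketch illustrates is that failure triggers replanning from the current state rather than task termination, which is what lets such an agent recover from dynamic GUI changes.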
Zichen Zhu
Shanghai Jiao Tong University
GUI Agents, Multimodal Large Models, Human-Computer Interaction
Hao Tang
Shanghai Jiao Tong University, China
Yansi Li
Shanghai Jiao Tong University
Large Language Models, Reasoning, GUI Agents
Kunyao Lan
Shanghai Jiao Tong University
Natural Language Processing
Yixuan Jiang
Shanghai Jiao Tong University, China
Hao Zhou
Shanghai Jiao Tong University, China
Yixiao Wang
Shanghai Jiao Tong University, China
Situo Zhang
Shanghai Jiao Tong University
Large Language Models, Reinforcement Learning
Liangtai Sun
Master's student, Shanghai Jiao Tong University
NLP, GUI Understanding, Multi-modal
Lu Chen
Shanghai Jiao Tong University, China
Kai Yu
Shanghai Jiao Tong University, China