InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language model (VLM)-driven mobile agents exhibit insufficient understanding and reasoning capabilities in complex environments, leading to safety-critical failures. Method: We propose an interactive mobile agent framework with proactive inquiry capability. We introduce InquireBench—the first benchmark dedicated to evaluating proactive human-in-the-loop querying—and design InquireMobile, a model incorporating a two-stage training strategy and pre-action interactive reasoning, jointly optimized via human feedback and reinforcement learning. Contribution/Results: This work is the first to systematically formalize, evaluate, and enhance mobile agents’ ability to proactively solicit human confirmation at critical decision points. Experiments demonstrate that our approach improves query success rate by 46.8% on InquireBench and achieves significantly higher task success rates than all existing baselines. The code and dataset are fully open-sourced.

📝 Abstract
Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce InquireBench, a comprehensive benchmark specifically designed to evaluate mobile agents' capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, on which most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose InquireMobile, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves a 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. We will open-source all datasets, models, and evaluation codes to facilitate development in both academia and industry.
Problem

Research questions and friction points this paper is trying to address.

Teaching mobile agents to request human assistance safely
Addressing safety risks from insufficient model reasoning capabilities
Developing interactive systems for proactive human confirmation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning fine-tuning for mobile agents
Two-stage training strategy implementation
Interactive pre-action reasoning mechanism
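
The pre-action reasoning idea, as described in the abstract, amounts to gating each proposed action on a safety check and asking the user before anything critical. Below is a minimal, hypothetical Python sketch of such a gate; the names (`Action`, `pre_action_step`, the `risky` flag) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # e.g. "tap", "type", "pay"
    target: str      # description of the UI element acted on
    risky: bool      # flagged by the policy as safety-critical

def pre_action_step(action: Action, confirm) -> str:
    """Ask the user before safety-critical actions; execute otherwise.

    `confirm` is a callback that shows a question to the user and
    returns True only if the user approves.
    """
    if action.risky:
        # Proactive inquiry: surface the intended action and wait
        # for explicit human confirmation before proceeding.
        if not confirm(f"About to {action.kind} '{action.target}'. Proceed?"):
            return "aborted"
    return f"executed {action.kind} on {action.target}"

# A payment action is gated on user confirmation; a plain tap is not.
print(pre_action_step(Action("pay", "Confirm Order", True), lambda q: False))  # aborted
print(pre_action_step(Action("tap", "Search", False), lambda q: True))
```

In a real agent the `risky` flag would come from the model's own reasoning (trained, per the paper, with reinforcement fine-tuning), and `confirm` would route the question through the agent's chat interface rather than a callback.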
Authors
Qihang Ai (Taobao & Tmall Group of Alibaba)
Pi Bu (Taobao & Tmall Group of Alibaba)
Yue Cao (Taobao & Tmall Group of Alibaba)
Yingyao Wang (Alibaba Group, Harbin Institute of Technology)
Jihao Gu (University College London)
Jingxuan Xing (Taobao & Tmall Group of Alibaba)
Zekun Zhu (Taobao & Tmall Group of Alibaba)
Wei Jiang (Taobao & Tmall Group of Alibaba)
Zhicheng Zheng (Taobao & Tmall Group of Alibaba)
Jun Song (Shenzhen University)
Yuning Jiang (Taobao & Tmall Group of Alibaba)
Bo Zheng (Taobao & Tmall Group of Alibaba)