Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

📅 2025-09-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing embodied agents struggle with the ambiguity of human instructions in real-world settings due to their unidirectional execution paradigm and lack of proactive clarification mechanisms. To address this, we propose Ask-to-Clarify, a novel end-to-end embodied interaction framework that pioneers the "ask-first, act-later" paradigm. Our method integrates a vision-language model (VLM) to enable multi-turn collaborative dialogue, employs a diffusion model for low-level action generation, and introduces a connector module with signal-detection routing to dynamically switch between questioning and execution modes. A two-stage knowledge-insulation training strategy ensures modular decoupling and joint optimization. Evaluated on eight real-world tasks, our approach substantially outperforms state-of-the-art vision-language-action (VLA) models, achieving an average 12.6% improvement in task success rate. This demonstrates the critical role of proactive clarification in enhancing the robustness and reliability of human-agent embodied collaboration.


๐Ÿ“ Abstract
The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recent advances in vision-language-action models (VLAs) offer a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue, then generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components: a VLM for collaboration and a diffusion model for action generation. We also introduce a connection module that generates conditions for the diffusion model from the output of the VLM; it adjusts the observation according to the instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component on ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one, preserving the interaction abilities while fine-tuning the diffusion model to generate actions. This training strategy guarantees that our framework first asks questions, then generates actions. During inference, a signal detector functions as a router that switches the framework between asking questions and taking actions. We evaluate the Ask-to-Clarify framework on 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.
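The "ask-first, act-later" control flow in the abstract can be sketched as a toy routing loop. Everything here is an illustrative assumption rather than the paper's actual API: the names (`VLMOutput`, `detect_signal`, `agent_step`), the keyword-based ambiguity heuristic standing in for the fine-tuned VLM, and the arithmetic standing in for the connector and diffusion sampler.

```python
from dataclasses import dataclass

@dataclass
class VLMOutput:
    """Stand-in for the collaboration VLM's text output."""
    text: str

# Toy ambiguity heuristic; the real framework learns this from dialogue data.
AMBIGUOUS_WORDS = {"it", "that", "something"}

def vlm_generate(instruction: str) -> VLMOutput:
    """Stand-in for the collaboration VLM: ask when the instruction is vague,
    otherwise emit an execution signal."""
    if any(w in instruction.lower().split() for w in AMBIGUOUS_WORDS):
        return VLMOutput("Which object do you mean?")
    return VLMOutput("<execute>")

def detect_signal(out: VLMOutput) -> str:
    """Signal detector acting as a router between asking and acting."""
    return "act" if out.text == "<execute>" else "ask"

def agent_step(instruction: str, observation: list) -> dict:
    out = vlm_generate(instruction)
    if detect_signal(out) == "ask":
        return {"mode": "ask", "question": out.text}
    # Connector: condition the action model on an instruction-adjusted observation.
    condition = [x + len(instruction) * 0.01 for x in observation]
    actions = [c * 0.5 for c in condition]  # stand-in for diffusion sampling
    return {"mode": "act", "actions": actions}
```

An ambiguous instruction like "pick it up" routes to a clarifying question, while a fully specified one like "pick up the red cup" routes straight to action generation.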
Problem

Research questions and friction points this paper is trying to address.

Resolving ambiguous instructions through multi-turn dialogue interaction
Enabling embodied agents to ask clarifying questions before execution
Integrating vision-language models with action generation for collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn dialogue resolves ambiguous instructions
VLM and diffusion model integration for actions
Two-stage training with knowledge-insulation strategy
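The two-stage knowledge-insulation schedule (fine-tune the collaboration VLM on dialogue data first, then freeze it and train only the action side) can be illustrated with a minimal parameter-freezing sketch. The parameter names and the dummy update rule are assumptions for illustration, not the paper's implementation.

```python
def make_params():
    # Toy parameter values for the three modules named in the summary.
    return {"vlm.head": 1.0, "connector.proj": 1.0, "diffusion.unet": 1.0}

def train_step(params, trainable_prefixes, lr=0.1):
    """Apply a dummy gradient update only to parameters whose name starts
    with a trainable prefix; frozen parameters are left untouched."""
    grad = 1.0  # stand-in gradient
    return {
        name: (value - lr * grad if name.startswith(tuple(trainable_prefixes)) else value)
        for name, value in params.items()
    }

params = make_params()
# Stage 1: fine-tune the collaboration component on ambiguity-solving dialogue.
params = train_step(params, trainable_prefixes=("vlm.",))
# Stage 2: freeze the collaboration component; train connector + diffusion only.
params = train_step(params, trainable_prefixes=("connector.", "diffusion."))
```

Freezing in stage 2 is what insulates the dialogue ability learned in stage 1 from being overwritten while the action generator is fitted.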
Xingyao Lin
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Xinghao Zhu
UC Berkeley
Robotics · Planning · Machine Learning
Tianyi Lu
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Sicheng Xie
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Hui Zhang
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Xipeng Qiu
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Zuxuan Wu
Fudan University
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis · Embodied AI · Trustworthy AI