Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

📅 2025-09-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing embodied agents struggle with the ambiguity of human instructions in real-world settings due to their unidirectional execution paradigm and lack of proactive clarification mechanisms. To address this, we propose Ask-to-Clarify, a novel end-to-end embodied interaction framework that pioneers the "ask-first, act-later" paradigm. Our method integrates a vision-language model (VLM) to enable multi-turn collaborative dialogue, employs a diffusion model for low-level action generation, and introduces a connector module with signal-detection routing to dynamically switch between questioning and execution modes. A two-stage knowledge-insulation training strategy ensures modular decoupling and joint optimization. Evaluated on eight real-world tasks, our approach substantially outperforms state-of-the-art vision-language-action (VLA) models, achieving an average 12.6% improvement in task success rate. This demonstrates the critical role of proactive clarification in enhancing the robustness and reliability of human-agent embodied collaboration.


๐Ÿ“ Abstract
The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recent advances in vision-language-action models (VLAs) offer a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue, then generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components: a VLM for collaboration and a diffusion model for action generation. We also introduce a connection module that generates conditions for the diffusion model from the output of the VLM; it adjusts the observation according to the instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component on ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one, preserving the interaction abilities while fine-tuning the diffusion model to generate actions. This training strategy guarantees that our framework first asks questions, then generates actions. During inference, a signal detector functions as a router that switches the framework between asking questions and taking actions. We evaluate the Ask-to-Clarify framework on 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.
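The "ask-first, act-later" control flow in the abstract can be sketched as a toy routing loop. Everything here is an illustrative assumption rather than the paper's actual API: the names (`VLMOutput`, `detect_signal`, `agent_step`), the keyword-based ambiguity heuristic standing in for the fine-tuned VLM, and the arithmetic standing in for the connector and diffusion sampler.

```python
from dataclasses import dataclass

@dataclass
class VLMOutput:
    """Stand-in for the collaboration VLM's text output."""
    text: str

# Toy ambiguity heuristic; the real framework learns this from dialogue data.
AMBIGUOUS_WORDS = {"it", "that", "something"}

def vlm_generate(instruction: str) -> VLMOutput:
    """Stand-in for the collaboration VLM: ask when the instruction is vague,
    otherwise emit an execution signal."""
    if any(w in instruction.lower().split() for w in AMBIGUOUS_WORDS):
        return VLMOutput("Which object do you mean?")
    return VLMOutput("<execute>")

def detect_signal(out: VLMOutput) -> str:
    """Signal detector acting as a router between asking and acting."""
    return "act" if out.text == "<execute>" else "ask"

def agent_step(instruction: str, observation: list) -> dict:
    out = vlm_generate(instruction)
    if detect_signal(out) == "ask":
        return {"mode": "ask", "question": out.text}
    # Connector: condition the action model on an instruction-adjusted observation.
    condition = [x + len(instruction) * 0.01 for x in observation]
    actions = [c * 0.5 for c in condition]  # stand-in for diffusion sampling
    return {"mode": "act", "actions": actions}
```

An ambiguous instruction like "pick it up" routes to a clarifying question, while a fully specified one like "pick up the red cup" routes straight to action generation.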
Problem

Research questions and friction points this paper is trying to address.

Resolving ambiguous instructions through multi-turn dialogue interaction
Enabling embodied agents to ask clarifying questions before execution
Integrating vision-language models with action generation for collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn dialogue resolves ambiguous instructions
VLM and diffusion model integration for actions
Two-stage training with knowledge-insulation strategy
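The two-stage knowledge-insulation schedule (fine-tune the collaboration VLM on dialogue data first, then freeze it and train only the action side) can be illustrated with a minimal parameter-freezing sketch. The parameter names and the dummy update rule are assumptions for illustration, not the paper's implementation.

```python
def make_params():
    # Toy parameter values for the three modules named in the summary.
    return {"vlm.head": 1.0, "connector.proj": 1.0, "diffusion.unet": 1.0}

def train_step(params, trainable_prefixes, lr=0.1):
    """Apply a dummy gradient update only to parameters whose name starts
    with a trainable prefix; frozen parameters are left untouched."""
    grad = 1.0  # stand-in gradient
    return {
        name: (value - lr * grad if name.startswith(tuple(trainable_prefixes)) else value)
        for name, value in params.items()
    }

params = make_params()
# Stage 1: fine-tune the collaboration component on ambiguity-solving dialogue.
params = train_step(params, trainable_prefixes=("vlm.",))
# Stage 2: freeze the collaboration component; train connector + diffusion only.
params = train_step(params, trainable_prefixes=("connector.", "diffusion."))
```

Freezing in stage 2 is what insulates the dialogue ability learned in stage 1 from being overwritten while the action generator is fitted.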
Xingyao Lin
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Xinghao Zhu
UC Berkeley
Robotics · Planning · Machine Learning
Tianyi Lu
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Sicheng Xie
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Hui Zhang
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Xipeng Qiu
College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China
Zuxuan Wu
Fudan University
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis · Embodied AI · Trustworthy AI