Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

📅 2024-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In language-driven embodied instance navigation, users must provide lengthy, pre-specified target descriptions—a major usability bottleneck. Method: This paper introduces Collaborative Instance Navigation (CoIN), a novel task wherein an embodied agent actively identifies visual uncertainty during navigation and dynamically seeks human clarification via natural, template-free open-ended dialogue. To this end, we propose AIUTA, a training-agnostic framework integrating self-questioning and uncertainty-aware interaction triggering; construct CoIN-Bench—the first benchmark supporting both simulation and real-world human-agent evaluation; and design a vision-language model (VLM)–large language model (LLM) fusion architecture enabling observation modeling, uncertainty estimation, and interactive decision-making without policy fine-tuning. Contribution/Results: As a zero-training baseline, AIUTA significantly outperforms existing methods: in multi-instance scenarios, it reduces human-agent dialogue turns by over 40%, demonstrating strong efficiency and generalization across diverse settings.

📝 Abstract
Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans. To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation through natural, template-free, open-ended dialogues with humans. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy and focuses on human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate observation description, using a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask the human a question, continue navigation, or halt, minimizing user input. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, while existing language-driven instance navigation methods struggle in complex multi-instance scenes. Code and benchmark will be available upon acceptance.
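The two-module flow described in the abstract — a Self-Questioner that refines the agent's observation through self-dialogue with an uncertainty estimate, followed by an Interaction Trigger that decides whether to ask the human, continue, or halt — can be sketched as a minimal Python loop. This is an illustrative sketch only: the function names, the scalar uncertainty proxy, the thresholds, and the VLM/LLM stubs are assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Observation:
    description: str    # VLM-generated description of the detected object
    uncertainty: float  # assumed scalar uncertainty estimate in [0, 1]

def self_questioner(vlm_describe: Callable[[], str],
                    llm_refine: Callable[[str], tuple[str, float]],
                    rounds: int = 3) -> Observation:
    """Hypothetical self-dialogue: the agent questions its own VLM
    observation and lets an LLM refine the description, tracking an
    uncertainty score until it drops below an assumed confidence floor."""
    desc = vlm_describe()
    uncertainty = 1.0
    for _ in range(rounds):
        desc, uncertainty = llm_refine(desc)
        if uncertainty < 0.2:  # assumed confidence floor
            break
    return Observation(desc, uncertainty)

def interaction_trigger(obs: Observation, matches_target: bool,
                        ask_threshold: float = 0.5) -> str:
    """Decide among the three actions the abstract names: ask the
    human a clarifying question, continue navigating, or halt."""
    if obs.uncertainty > ask_threshold:
        return "ask_human"  # too uncertain: request human clarification
    return "halt" if matches_target else "continue"
```

A usage sketch: with stubbed models, `self_questioner` stops refining once the uncertainty estimate is low enough, and `interaction_trigger` only queries the human when uncertainty exceeds the threshold, which is the mechanism the paper credits for minimizing dialogue turns.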
Problem

Research questions and friction points this paper is trying to address.

How to minimize human-agent dialogue turns in object navigation tasks
How to resolve uncertainty about the target instance during navigation
How to structure human-agent interaction reasoning with Vision-Language and Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

AIUTA enables uncertainty-aware, training-free human-agent dialogue.
A Self-Questioner model refines observation descriptions via self-dialogue.
CoIN-Bench supports both simulated and real human-agent evaluation in multi-instance scenes.