Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Daily assistive robots must interpret ambiguous spoken commands containing deictic expressions (e.g., “hand me that cup”), yet vision-based exophora resolution fails when the user or the target object is outside the robot’s field of view. To address this, the authors propose MIEL, a multimodal framework that combines sound source localization (SSL), semantic mapping, vision-language models (VLMs), and GPT-4o–driven interactive question-answering to resolve references even when the speaker is occluded or out of view. Its key ideas are: (1) using auditory cues to compensate for missing visual input, and (2) using a generative large language model to pose semantically grounded clarifying questions. Real-world experiments show that MIEL achieves roughly 1.3× higher accuracy than baselines without SSL and interactive questioning when the user is visible, and 2.0× higher when the user is not, improving robustness in dynamic real-world environments.

📝 Abstract
Daily life support robots must interpret ambiguous verbal instructions involving demonstratives such as “Bring me that cup,” even when objects or users are out of the robot's view. Existing approaches to exophora resolution primarily rely on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), a multimodal exophora resolution framework leveraging sound source localization (SSL), semantic mapping, vision-language models (VLMs), and interactive questioning with GPT-4o. Our approach first constructs a semantic map of the environment and estimates candidate objects from a linguistic query together with the user's skeletal data. SSL is used to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures and pointing directions. When ambiguities remain, the robot proactively interacts with the user, employing GPT-4o to formulate clarifying questions. Experiments in a real-world environment showed accuracy approximately 1.3 times higher when the user was visible to the robot, and 2.0 times higher when the user was not visible, compared to methods without SSL and interactive questioning. The project website is https://emergentsystemlabstudent.github.io/MIEL/.
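The abstract's pipeline (orient via SSL when the user is out of view, rank candidate objects from the semantic map, then ask a clarifying question if the reference stays ambiguous) can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: all names (`Candidate`, `resolve_exophora`, the 0.1 ambiguity margin, the `ask_user` callback standing in for GPT-4o) are hypothetical.

```python
# Hedged sketch of a MIEL-style resolution loop, as described in the abstract.
# All names and thresholds here are illustrative stand-ins, not the paper's API.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    score: float  # combined semantic-map / VLM / pointing score (illustrative)

def resolve_exophora(query, user_visible, candidates, ask_user):
    """Resolve a deictic query (e.g. "that cup") over scored map candidates."""
    # Step 1: if the user is out of view, SSL would first orient the robot
    # toward the speaker; modeled here as a simple flag flip.
    if not user_visible:
        user_visible = True  # robot turns toward the estimated sound source

    # Step 2: rank candidates (in the paper: semantic map + VLM + gesture cues).
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)

    # Step 3: if the top two scores are close, the reference is ambiguous,
    # so ask a clarifying question (GPT-4o in the paper; a callback here).
    if len(ranked) > 1 and ranked[0].score - ranked[1].score < 0.1:
        return ask_user(query, ranked[:2])
    return ranked[0].name

# Usage: two near-tied cups trigger a clarifying question; the user's
# answer (simulated by the lambda) picks the second candidate.
cands = [Candidate("red cup", 0.52), Candidate("blue cup", 0.48)]
choice = resolve_exophora("bring me that cup", False, cands,
                          lambda q, top: top[1].name)  # user says "the blue one"
```

A clear score margin skips the question entirely, which mirrors the paper's point that interaction is invoked only when ambiguity remains.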
Problem

Research questions and friction points this paper is trying to address.

Resolving ambiguous out-of-view instructions for robots
Handling exophora when objects or users are not visible
Interpreting demonstrative references like 'that cup' accurately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sound source localization for user orientation
Semantic mapping with skeletal data for candidate objects
GPT-4o interactive questioning for ambiguity resolution