🤖 AI Summary
Daily assistive robots must interpret ambiguous spoken commands containing deictic expressions (e.g., “hand me that cup”), yet vision-based exophora resolution fails when the user or the target object is outside the robot’s field of view. To address this, we propose MIEL, a multimodal framework that combines sound source localization (SSL), semantic mapping, vision-language models (VLMs), and GPT-4o-driven interactive questioning to resolve references even when the user or object is not visible. Its key ideas are (1) using auditory cues to compensate for missing visual information and (2) using a generative large language model to ask semantically grounded clarifying questions. In real-world experiments, MIEL performed approximately 1.3 times better than methods without SSL and interactive questioning when the user was visible, and approximately 2.0 times better when the user was not visible.
📝 Abstract
Daily life support robots must interpret ambiguous verbal instructions involving demonstratives such as “Bring me that cup,” even when objects or users are out of the robot's view. Existing approaches to exophora resolution rely primarily on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), a multimodal exophora resolution framework that leverages sound source localization (SSL), semantic mapping, vision-language models (VLMs), and interactive questioning with GPT-4o. Our approach first constructs a semantic map of the environment and estimates candidate objects from the linguistic query and the user's skeletal data. SSL is used to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures and pointing directions. When ambiguities remain, the robot proactively interacts with the user, employing GPT-4o to formulate clarifying questions. Experiments in a real-world environment showed results approximately 1.3 times better when the user was visible to the robot and 2.0 times better when the user was not visible, compared with methods without SSL and interactive questioning. The project website is https://emergentsystemlabstudent.github.io/MIEL/.
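The control flow described in the abstract (turn toward the speaker if they are out of view, then either select an object or ask a clarifying question) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` type, the function names, and the probability-margin test for "ambiguity remains" are all assumptions made for illustration, and the SSL estimate, semantic map, VLM scoring, and GPT-4o call are replaced by plain inputs.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate object with an estimated probability of being the referent."""
    name: str
    probability: float


def turn_angle_toward_sound(robot_yaw: float, sound_azimuth: float) -> float:
    """Signed shortest rotation (radians) from the robot's heading to the sound direction.

    Both angles are assumed to be expressed in the same world frame.
    """
    diff = sound_azimuth - robot_yaw
    # atan2 of (sin, cos) wraps the difference into (-pi, pi].
    return math.atan2(math.sin(diff), math.cos(diff))


def needs_clarification(candidates: list[Candidate], margin: float = 0.2) -> bool:
    """Assumed ambiguity test: the top two candidates are closer than `margin`."""
    ranked = sorted(candidates, key=lambda c: c.probability, reverse=True)
    if not ranked:
        return True  # nothing to select, so the robot has to ask
    if len(ranked) == 1:
        return False
    return ranked[0].probability - ranked[1].probability < margin


def plan_actions(candidates: list[Candidate], user_visible: bool,
                 robot_yaw: float, sound_azimuth: float,
                 margin: float = 0.2) -> list[tuple[str, object]]:
    """Return an ordered list of (action, argument) pairs for one instruction."""
    actions: list[tuple[str, object]] = []
    if not user_visible:
        # The SSL estimate is used to face the user so gestures can be observed.
        actions.append(("turn", turn_angle_toward_sound(robot_yaw, sound_azimuth)))
    if needs_clarification(candidates, margin):
        # In the paper this step is handled by GPT-4o; here it is only a marker.
        actions.append(("ask", None))
    else:
        best = max(candidates, key=lambda c: c.probability)
        actions.append(("select", best.name))
    return actions
```

For example, with two cups scored 0.45 and 0.40 and the user out of view, `plan_actions` returns a `turn` action followed by an `ask` action; with scores 0.8 and 0.1 and the user visible, it returns a single `select` action. In the actual system the scores would depend on what the robot observes after turning, so a real pipeline would re-estimate candidates between the two steps.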