A Multimodal Depth-Aware Method For Embodied Reference Understanding

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the referential ambiguity that arises when multiple candidate objects appear in embodied referring expression comprehension, this paper proposes an end-to-end multimodal framework that jointly leverages language instructions and pointing cues. Methodologically, it introduces a large language model (LLM)-driven data augmentation strategy, integrates a depth-map modality to explicitly model spatial geometric relationships, and designs a depth-aware decision module for cross-modal collaborative reasoning. The framework improves the robustness and accuracy of target localization in complex, cluttered environments. Experiments on two mainstream benchmarks show that the approach consistently outperforms existing baselines, with substantial gains in both referent resolution accuracy and generalization. The work offers a geometrically grounded, multimodal paradigm for aligning vision, language, and action in embodied intelligence.
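The depth-aware decision module described above is not specified in this summary; as a hypothetical illustration of the general idea, a late-fusion decision could weight per-candidate cue scores (language match, pointing alignment, depth consistency) and pick the argmax. The function names, cue names, and weights below are assumptions for illustration, not the paper's actual design.

```python
def decision_score(lang_score, point_score, depth_consistency,
                   weights=(0.5, 0.3, 0.2)):
    """Hypothetical late fusion: weighted sum of per-candidate cues in [0, 1]."""
    wl, wp, wd = weights
    return wl * lang_score + wp * point_score + wd * depth_consistency

def pick_referent(candidates):
    """candidates: list of dicts with keys 'lang', 'point', 'depth'.

    Returns the index of the candidate with the highest fused score.
    """
    return max(range(len(candidates)),
               key=lambda i: decision_score(candidates[i]["lang"],
                                            candidates[i]["point"],
                                            candidates[i]["depth"]))
```

For example, a candidate with a weaker language match but strong pointing and depth agreement can still win the fused decision, which is the disambiguation behavior the summary attributes to combining embodied cues with text.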

📝 Abstract
Embodied Reference Understanding (ERU) requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.
Problem

Research questions and friction points this paper is trying to address.

Identifying target objects using language and pointing cues
Resolving ambiguity when multiple candidate objects exist
Integrating depth information for robust object disambiguation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based data augmentation for training
Depth-map modality integration for perception
Depth-aware decision module for disambiguation
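One way a depth map can "explicitly model spatial geometric relationships" for pointing, as the innovations above suggest, is to back-project the wrist and fingertip pixels into 3-D camera coordinates and rank candidates by alignment with the resulting pointing ray. The sketch below assumes a standard pinhole camera with known intrinsics; it is an illustrative geometric baseline, not the module proposed in the paper.

```python
import numpy as np

def backproject(px, py, z, fx, fy, cx, cy):
    """Lift pixel (px, py) with depth z into 3-D camera coordinates
    using the pinhole model: X = (u - cx) * z / fx, etc."""
    return np.array([(px - cx) * z / fx, (py - cy) * z / fy, z])

def rank_by_pointing(wrist_uvz, fingertip_uvz, candidates_uvz, K):
    """Score candidates by cosine alignment with the wrist-to-fingertip ray.

    wrist_uvz, fingertip_uvz: (u, v, depth) tuples from the depth map.
    candidates_uvz: list of (u, v, depth) candidate object centers.
    K: intrinsics (fx, fy, cx, cy). Returns (best_index, scores).
    """
    fx, fy, cx, cy = K
    wrist = backproject(*wrist_uvz, fx, fy, cx, cy)
    tip = backproject(*fingertip_uvz, fx, fy, cx, cy)
    ray = tip - wrist
    ray /= np.linalg.norm(ray)
    scores = []
    for (u, v, z) in candidates_uvz:
        d = backproject(u, v, z, fx, fy, cx, cy) - wrist
        d /= np.linalg.norm(d)
        scores.append(float(ray @ d))  # higher cosine = better aligned
    return int(np.argmax(scores)), scores
```

Without depth, two candidates can overlap along the same image-plane direction from the hand; the 3-D ray separates them, which is the kind of disambiguation the depth modality is meant to provide.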