AI Summary
This work addresses the challenge of grounding natural-language referring expressions in 3D scenes, where ambiguity may yield multiple matching target objects or none at all. To tackle this, the authors propose HCF-RES, a multimodal framework that preserves object boundaries through hierarchical visual-semantic decomposition and employs a progressive multi-level fusion mechanism for language-guided, cross-modal adaptive weighting and refinement. The method uses SAM instance masks to guide CLIP in dual-granularity encoding at both the pixel and instance levels, further supported by 2D-to-3D projection and intra-modal collaboration strategies. Evaluated on the ScanRefer and Multi3DRefer benchmarks, HCF-RES achieves state-of-the-art performance, improving both the accuracy and robustness of 3D referring expression segmentation.
Abstract
Generalized 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes from natural language, even when a description matches multiple targets or none. Existing methods rely solely on sparse point clouds and therefore lack the rich visual semantics needed for fine-grained descriptions. We propose HCF-RES, a multimodal framework with two key innovations. First, Hierarchical Visual-Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities -- pixel-level and instance-level features -- preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.
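The cross-modal adaptive weighting described above can be pictured as a per-point, language-conditioned blend of the two feature streams. The following is a minimal sketch, not the paper's actual architecture: the function name, feature shapes, and the softmax-over-similarity gating are all illustrative assumptions made for this example.

```python
import math

def adaptive_fuse(feat_2d, feat_3d, lang_feat):
    """Illustrative sketch of language-guided cross-modal weighting.

    For a single point, score each modality's feature (2D semantic,
    3D geometric) by its dot-product similarity to the language
    feature, softmax the two scores into fusion weights, and return
    the weighted combination. Hypothetical, not the paper's method.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Language-conditioned relevance score for each modality.
    s2d = dot(feat_2d, lang_feat)
    s3d = dot(feat_3d, lang_feat)

    # Softmax over the two scores yields adaptive fusion weights.
    e2d, e3d = math.exp(s2d), math.exp(s3d)
    w2d = e2d / (e2d + e3d)
    w3d = 1.0 - w2d

    # Weighted sum gives the fused per-point representation.
    return [w2d * a + w3d * b for a, b in zip(feat_2d, feat_3d)]

# Toy usage: the language feature aligns with the 2D feature,
# so the 2D stream receives the larger weight.
fused = adaptive_fuse([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In a real model the gate would typically be produced by a learned projection of the language embedding rather than a raw dot product, but the shape of the computation, two streams blended by a language-dependent weight per point, is the same.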