Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation

📅 2026-03-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the grounding of natural-language referring expressions in 3D scenes, where a description may match multiple target objects or none. The authors propose HCF-RES, a multimodal framework with two components. Hierarchical visual-semantic decomposition uses SAM instance masks to guide CLIP encoding at pixel and instance granularity, which preserves object boundaries during 2D-to-3D projection. Progressive multi-level fusion then combines intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES reports state-of-the-art results on the ScanRefer and Multi3DRefer benchmarks.
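
The dual-granularity idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes per-pixel feature vectors (standing in for pixel-level CLIP features) and an integer instance-id map (standing in for SAM instance masks); the function name `instance_pool` and the toy data are hypothetical. Instance-level features are obtained here by averaging pixel features inside each mask, so features are never mixed across object boundaries.

```python
from collections import defaultdict


def instance_pool(pixel_feats, instance_ids, background=0):
    """Average per-pixel features inside each instance mask.

    pixel_feats:  H x W grid of feature vectors (lists of floats); a stand-in
                  for pixel-level CLIP features (assumption).
    instance_ids: H x W grid of ints; a stand-in for SAM instance masks
                  (assumption). Pixels equal to `background` are skipped.
    Returns a dict mapping instance id -> mean feature vector.
    """
    sums = {}
    counts = defaultdict(int)
    for feat_row, id_row in zip(pixel_feats, instance_ids):
        for feat, inst in zip(feat_row, id_row):
            if inst == background:
                continue
            if inst not in sums:
                sums[inst] = [0.0] * len(feat)
            # Accumulate the feature vector for this instance.
            sums[inst] = [s + f for s, f in zip(sums[inst], feat)]
            counts[inst] += 1
    return {
        inst: [s / counts[inst] for s in total]
        for inst, total in sums.items()
    }


# Toy 2 x 3 image with 2-dimensional features and two instances (ids 1 and 2).
pixel_feats = [
    [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]],
    [[9.0, 9.0], [0.0, 4.0], [0.0, 6.0]],
]
instance_ids = [
    [1, 1, 2],
    [0, 2, 2],
]
instance_feats = instance_pool(pixel_feats, instance_ids)
```

In this toy case the background pixel with features `[9.0, 9.0]` does not leak into either instance, which is the boundary-preserving property the summary describes. Mean pooling is only one possible aggregation choice.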

šŸ“ Abstract
Generalized 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes based on natural language, even when descriptions match multiple or zero targets. Existing methods rely solely on sparse point clouds and therefore lack the rich visual semantics needed for fine-grained descriptions. We propose HCF-RES, a multi-modal framework with two key innovations. First, Hierarchical Visual Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities (pixel-level and instance-level features), preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.
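
As a rough illustration of cross-modal adaptive weighting, the sketch below blends one point's 2D semantic feature with its 3D geometric feature through a scalar gate. The abstract does not specify the weighting function, so the sigmoid-gate form, the function name `adaptive_fuse`, and the parameters `gate_weights` and `gate_bias` are all assumptions made for illustration; in a trained model such parameters would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(feat_2d, feat_3d, gate_weights, gate_bias=0.0):
    """Blend one point's 2D semantic and 3D geometric features.

    A scalar gate g in (0, 1) is computed from the concatenated features,
    then fused = g * feat_2d + (1 - g) * feat_3d. The gate form and its
    parameters are hypothetical, not taken from the paper.
    Returns (fused_feature, g).
    """
    if len(feat_2d) != len(feat_3d):
        raise ValueError("2D and 3D features must share a dimension")
    concat = feat_2d + feat_3d
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    g = sigmoid(logit)
    fused = [g * a + (1.0 - g) * b for a, b in zip(feat_2d, feat_3d)]
    return fused, g


feat_2d = [1.0, 0.0]  # toy 2D semantic feature
feat_3d = [0.0, 1.0]  # toy 3D geometric feature

# Zero gate weights give an even blend of the two modalities.
fused_even, g_even = adaptive_fuse(feat_2d, feat_3d, [0.0, 0.0, 0.0, 0.0])

# A large weight on the first 2D channel pushes the gate toward the 2D feature.
fused_2d, g_2d = adaptive_fuse(feat_2d, feat_3d, [5.0, 0.0, 0.0, 0.0])
```

Because the gate is a convex combination, the fused feature always lies between the two inputs, and the weighting can differ from point to point depending on which modality is more informative there.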

Problem

Research questions and friction points this paper is trying to address.

3D Referring Expression Segmentation
visual semantics
point clouds
fine-grained description
instance-aware segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Visual Semantic Decomposition
Progressive Multi-level Fusion
3D Referring Expression Segmentation
Multi-modal Fusion
SAM-guided CLIP Encoding