AI Summary
This work addresses the challenge of grounding natural-language referring expressions in 3D scenes, where ambiguity may yield multiple matching target objects or none at all. To tackle this, the authors propose HCF-RES, a multimodal framework that preserves object boundaries through hierarchical visual-semantic decomposition and employs a progressive multi-level fusion mechanism for language-guided, cross-modal adaptive weighting and refinement. The method uses SAM instance masks to guide CLIP in dual-granularity encoding at both the pixel and instance levels, further supported by 2D-to-3D projection and intra-modal collaboration strategies. Evaluated on the ScanRefer and Multi3DRefer benchmarks, HCF-RES achieves state-of-the-art performance, improving both the accuracy and robustness of 3D referring expression segmentation.
Abstract
Generalized 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes from natural language, even when a description matches multiple targets or none. Existing methods rely solely on sparse point clouds and therefore lack the rich visual semantics needed for fine-grained descriptions. We propose HCF-RES, a multimodal framework with two key innovations. First, Hierarchical Visual-Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities -- pixel-level and instance-level features -- preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.
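The cross-modal adaptive weighting described above can be pictured as a per-point, language-conditioned blend of the two feature streams. The following is a minimal sketch, not the paper's actual architecture: the function name, feature shapes, and the softmax-over-similarity gating are all illustrative assumptions made for this example.

```python
import math

def adaptive_fuse(feat_2d, feat_3d, lang_feat):
    """Illustrative sketch of language-guided cross-modal weighting.

    For a single point, score each modality's feature (2D semantic,
    3D geometric) by its dot-product similarity to the language
    feature, softmax the two scores into fusion weights, and return
    the weighted combination. Hypothetical, not the paper's method.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Language-conditioned relevance score for each modality.
    s2d = dot(feat_2d, lang_feat)
    s3d = dot(feat_3d, lang_feat)

    # Softmax over the two scores yields adaptive fusion weights.
    e2d, e3d = math.exp(s2d), math.exp(s3d)
    w2d = e2d / (e2d + e3d)
    w3d = 1.0 - w2d

    # Weighted sum gives the fused per-point representation.
    return [w2d * a + w3d * b for a, b in zip(feat_2d, feat_3d)]

# Toy usage: the language feature aligns with the 2D feature,
# so the 2D stream receives the larger weight.
fused = adaptive_fuse([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In a real model the gate would typically be produced by a learned projection of the language embedding rather than a raw dot product, but the shape of the computation, two streams blended by a language-dependent weight per point, is the same.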