Segment and Select: Vision-Language Segmentation in 3D Scenarios

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of coarse-grained superpoint representations in 3D vision-language segmentation, which reduce computational cost but often lead to messy object boundaries and poor segmentation quality. The authors propose SEGA3D, a segment-and-select paradigm that operates directly on fine-grained visual information rather than on superpoints. SEGA3D combines three components: a mask candidate generator, a Semantic-Spatial Selector (SSS) driven by semantic and spatial information from a large language model, and a Loopback Verification Module (LVM) that yields the final mask from the selected candidates. The authors report competitive performance on the ScanRefer, ScanNet, and Matterport3D benchmarks, with gains of 8.3 and 5.3 mIoU over the top-performing prior method on ScanNet and Matterport3D, respectively.
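For readers unfamiliar with the metric behind the mIoU gains quoted above, the following is a minimal sketch of how mean IoU over per-point masks is commonly computed. It assumes mIoU means the average of per-sample IoU between a predicted and a ground-truth point mask, reported on a 0-100 scale; this page does not define the metric, and the function names `mask_iou` and `mean_iou` are illustrative, not taken from the paper.

```python
from typing import Sequence


def mask_iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Intersection-over-union of two boolean masks over the same points."""
    if len(pred) != len(target):
        raise ValueError("masks must cover the same set of points")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(preds: Sequence[Sequence[bool]],
             targets: Sequence[Sequence[bool]]) -> float:
    """Average per-sample IoU, scaled to 0-100 (assumed reporting scale)."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need a non-empty, equal number of predictions and targets")
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(ious) / len(ious)


# Two toy samples: one perfect prediction, one with no overlap.
preds = [[True, True], [True, False]]
targets = [[True, True], [False, True]]
print(mean_iou(preds, targets))  # 50.0
```

Under this reading, a gain of 8.3 mIoU corresponds to an 8.3-point increase in the averaged score.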

📝 Abstract

3D vision-language segmentation aims to segment target objects in 3D scenarios according to linguistic instructions and visual observations. Prior art relies heavily on the coarse superpoint representation to reduce computational complexity, which leads to poor segmentation quality and messy object boundaries. In this paper, we propose the SEGment-And-select (SEGA3D) paradigm for 3D vision-language segmentation, which operates directly on fine-grained visual information and is free from the superpoint dependency. Specifically, we first leverage a mask candidate generator to provide fine-grained categorical mask candidates, substantially improving the quality of candidate masks over their superpoint counterparts. Then, a Large Language Model (LLM) is used to generate semantic and spatial information based on the linguistic description and visual features. The LLM output and visual features are fed to the Semantic-Spatial Selector (SSS) to produce the top-ranking mask candidates. Finally, the Loopback Verification Module (LVM) is designed to yield the segmentation mask from the selected candidate masks. Our SEGA3D attains competitive performance on the ScanRefer, ScanNet and Matterport3D benchmarks. Notably, SEGA3D surpasses the top-performing counterpart by 8.3 mIoU and 5.3 mIoU on ScanNet and Matterport3D, respectively. Code will be available upon publication.
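The "select" step in the abstract, where candidates are ranked and only the top-ranking masks are kept, can be illustrated with a generic sketch. The abstract does not say how the SSS scores candidates, so everything here is an assumption: cosine similarity between a query vector and per-candidate feature vectors stands in for the real scoring, and `cosine` and `select_top_k` are hypothetical helpers, not the authors' implementation.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_top_k(query: Sequence[float],
                 candidate_features: Sequence[Sequence[float]],
                 k: int) -> List[int]:
    """Return indices of the k candidates most similar to the query.

    Stand-in scoring only (assumption): the paper's selector is not
    specified on this page.
    """
    scores = [cosine(query, feat) for feat in candidate_features]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: three candidate masks described by 2-D feature vectors.
query = [1.0, 0.0]
features = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(select_top_k(query, features, k=2))  # [1, 2]
```

In a pipeline shaped like the one described, the returned indices would identify the candidate masks passed on to a later verification stage, which then produces the final segmentation mask.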

Problem

Research questions and friction points this paper is trying to address.

3D vision-language segmentation

superpoint representation

object boundary

segmentation quality

fine-grained visual information

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D vision-language segmentation

superpoint-free

mask candidate generation