🤖 AI Summary
To address two key bottlenecks in few-shot/zero-shot 3D point cloud semantic segmentation—over-reliance on pretraining and underutilization of textual supervision—this paper proposes the first pretraining-free, end-to-end language-guided framework. Our method jointly models visual and linguistic modalities for zero-shot generalization without any pretrained vision or language backbones. Key contributions include: (1) a Language-Guided Prototype Embedding (LGPE) module that aligns category-specific textual descriptions with point-wise features; and (2) Prototype-Enhanced Register Attention (ProERA) coupled with Dual Relative Position Encoding (DRPE), which improves cross-class prototype matching accuracy and scene-level generalizability. Evaluated on S3DIS and ScanNet, our approach achieves new state-of-the-art mIoU scores, outperforming prior methods by +5.68% and +3.82%, respectively. This is the first work to empirically validate the feasibility and superiority of pretraining-free paradigms for few-shot and zero-shot point cloud semantic segmentation.
📝 Abstract
Recent approaches for few-shot 3D point cloud semantic segmentation typically require a two-stage learning process, i.e., a pre-training stage followed by a few-shot training stage. While effective, these methods face overreliance on pre-training, which hinders model flexibility and adaptability. Some models tried to avoid pre-training yet failed to capture ample information. In addition, current approaches focus on visual information in the support set and neglect or do not fully exploit other useful data, such as textual annotations. This inadequate utilization of support information impairs the performance of the model and restricts its zero-shot ability. To address these limitations, we present a novel pre-training-free network, named Efficient Point Cloud Semantic Segmentation for Few- and Zero-shot scenarios. Our EPSegFZ incorporates three key components. A Prototype-Enhanced Registers Attention (ProERA) module and a Dual Relative Positional Encoding (DRPE)-based cross-attention mechanism for improved feature extraction and accurate query-prototype correspondence construction without pre-training. A Language-Guided Prototype Embedding (LGPE) module that effectively leverages textual information from the support set to improve few-shot performance and enable zero-shot inference. Extensive experiments show that our method outperforms the state-of-the-art method by 5.68% and 3.82% on the S3DIS and ScanNet benchmarks, respectively.