HOLA: Holistic Multi-Modal Alignment for Open-Set 3D Recognition

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of open-set 3D recognition models to rare or unseen categories by proposing a multi-modal alignment framework built around a decoupled contrastive loss. The approach aligns each point cloud with multi-view images and multiple textual descriptions, and adds a lightweight text adapter to narrow the domain gap between web captions and curated annotations. Its core contribution is a decoupled multi-positive contrastive loss that separates positive aggregation from negative competition, so that multiple positives do not crowd one another inside a shared softmax. On long-tail open-vocabulary 3D benchmarks, the method reports state-of-the-art zero-shot accuracy while sustaining high frame rates.

📝 Abstract

Open-set 3D recognition requires models that generalize to rare or unseen categories. Recent approaches address this by distilling language-vision knowledge into 3D encoders, typically relying on heavy 2D ViTs and aligning each point cloud with a single image or caption, thus anchoring representations to partial views. We propose aligning each point cloud with multiple images and textual descriptions to capture a more holistic understanding of 3D objects. To realize this idea, it is essential to design a loss function capable of jointly aligning a 3D instance with multiple matched signals (multi-view images and multiple texts) while separating positive aggregation from negative competition. We introduce such a function, termed the decoupled multi-positive contrastive loss. Our formulation enhances the loss's hardness-aware focus on challenging negatives, avoiding the "spotlight crowding" that occurs when many positives share the same softmax with all the negatives. Complementing this, we present a lightweight text adapter applied only to web captions, reducing the domain gap to curated annotations and enabling effective use of large-scale unsupervised text. Our model demonstrates state-of-the-art open-vocabulary performance on long-tail benchmarks, yielding substantial zero-shot improvements while sustaining high frame rates.
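
The abstract does not give the loss formula, so the following is only a minimal sketch of one plausible reading of "separating positive aggregation from negative competition", not the authors' definition. It contrasts a conventional multi-positive InfoNCE loss, where every positive sits in the same softmax as all other positives and negatives, with an assumed decoupled form in which positives are averaged in one term and only negatives enter the log-sum-exp. The function names, the temperature `tau=0.07`, and the exact decoupled form are assumptions made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def coupled_multi_positive_loss(pos_sims, neg_sims, tau=0.07):
    """Baseline for comparison: each positive shares one softmax
    with all other positives and all negatives."""
    pos = [s / tau for s in pos_sims]
    neg = [s / tau for s in neg_sims]
    log_denom = logsumexp(pos + neg)
    return -sum(p - log_denom for p in pos) / len(pos)


def decoupled_multi_positive_loss(pos_sims, neg_sims, tau=0.07):
    """Assumed decoupled form (hypothetical, not taken from the paper):
    positives are averaged in an aggregation term, and only negatives
    appear in the log-sum-exp competition term."""
    pos = [s / tau for s in pos_sims]
    neg = [s / tau for s in neg_sims]
    aggregation = -sum(pos) / len(pos)   # pull the anchor toward every positive
    competition = logsumexp(neg)         # push away negatives, hardest weighted most
    return aggregation + competition


if __name__ == "__main__":
    # Cosine similarities between one 3D anchor and its matched / unmatched signals.
    positives = [0.8, 0.6]        # e.g. one rendered view and one caption
    negatives = [0.1, -0.2, 0.3]  # signals belonging to other objects
    print(coupled_multi_positive_loss(positives, negatives))
    print(decoupled_multi_positive_loss(positives, negatives))
```

Under this assumed form, duplicating a positive leaves the decoupled loss unchanged, whereas the coupled loss grows because the extra positive enlarges the shared denominator. The gradient with respect to the negatives is also a softmax over negatives alone, so the hardest negatives keep their full weight regardless of how many positives are attached to the anchor.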

Problem

Research questions and friction points this paper is trying to address.

open-set 3D recognition

multi-modal alignment

holistic understanding

zero-shot learning

long-tail benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal alignment

open-set 3D recognition

decoupled multi-positive contrastive loss