TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot 3D object retrieval suffers from inadequate 3D feature representation due to cross-modal distribution shift when leveraging 2D pre-trained vision-language models (e.g., CLIP). Method: We propose the first test-time adaptation framework explicitly designed for 3D feature learning. It integrates multi-view projection with CLIP-based feature extraction, introduces a self-enhanced iterative optimization mechanism, and leverages InternVL to generate descriptive text prompts—enabling text-guided cross-modal feature space alignment and fusion—all without fine-tuning or additional training. Contribution/Results: Our method achieves significant improvements over state-of-the-art approaches on four open-set 3D retrieval benchmarks. Notably, it demonstrates strong robustness on the Objaverse-LVIS depth-map evaluation, providing the first empirical validation of pure test-time adaptation’s effectiveness and generalizability in zero-shot 3D retrieval.

📝 Abstract
Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization capabilities. Yet, they are limited in extracting suitable 3D representations due to substantial gaps between their 2D training and 3D testing distributions. To address these challenges, we propose Testing-time Distribution Alignment (TeDA), a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. To our knowledge, it is the first work that studies the test-time adaptation of a vision-language model for 3D feature learning. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings with an iterative optimization strategy by confident query-target sample pairs in a self-boosting manner. Additionally, TeDA integrates textual descriptions generated by a multimodal language model (InternVL) to enhance 3D object understanding, leveraging CLIP's aligned feature space to fuse visual and textual cues. Extensive experiments on four open-set 3D object retrieval benchmarks demonstrate that TeDA greatly outperforms state-of-the-art methods, even those requiring extensive training. We also experimented with depth maps on Objaverse-LVIS, further validating its effectiveness. Code is available at https://github.com/wangzhichuan123/TeDA.
Problem

Research questions and friction points this paper is trying to address.

Enhancing zero-shot 3D object retrieval with vision-language models
Bridging 2D-3D distribution gaps for better 3D representation
Adapting CLIP for test-time 3D feature learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Testing-time Distribution Alignment (TeDA) for 3D retrieval
Multi-view image projection and CLIP feature extraction
Iterative optimization with confident query-target pairs
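The pipeline outlined above can be sketched in a few lines. This is an illustrative, simplified version only: the encoders are replaced by random vectors standing in for frozen CLIP image/text features, and the fusion weight `alpha` and mean-pooling of views are assumptions, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 512  # CLIP embedding dimension (assumed)
V = 12   # number of rendered views per 3D object (assumed)

def l2norm(x, axis=-1):
    """Normalize vectors to unit length, as CLIP features typically are."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-ins for CLIP outputs: per-view image features of one query object
# and the text feature of its InternVL-generated caption.
view_feats = l2norm(rng.normal(size=(V, D)))
text_feat = l2norm(rng.normal(size=(D,)))

# Step 1: pool multi-view image features into one visual embedding.
visual_feat = l2norm(view_feats.mean(axis=0))

# Step 2: fuse visual and textual cues in CLIP's shared feature space.
alpha = 0.5  # fusion weight (illustrative; the paper's scheme may differ)
query = l2norm(alpha * visual_feat + (1 - alpha) * text_feat)

# Step 3: retrieve targets by cosine similarity against a gallery.
gallery = l2norm(rng.normal(size=(100, D)))  # stand-in target embeddings
scores = gallery @ query
ranking = np.argsort(-scores)  # gallery indices, most similar first
```

The highest-scoring query-target pairs from such a ranking are the "confident" pairs that TeDA's iterative optimization uses to refine the query embeddings at test time.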
Zhichuan Wang
Huazhong Agricultural University, Wuhan, Hubei, China
Yang Zhou
Shenzhen University, Shenzhen, Guangdong, China
Jinhai Xiang
College of Informatics, Huazhong Agricultural University
Computer Vision · Machine Learning
Yulong Wang
Huazhong Agricultural University, Wuhan, Hubei, China
Xinwei He
Huazhong Agricultural University, Wuhan, Hubei, China