Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of 3D vision-language foundation models (VLFMs) to noisy, incomplete, or distribution-shifted point clouds, this paper proposes Uni-Adapter, a training-free online test-time adaptation method. The approach enables the first training-free test-time optimization for 3D VLFMs, comprising: (i) dynamic prototype learning that continuously refines class-wise centroid representations; (ii) graph-structure-guided label smoothing to enhance prediction consistency; and (iii) a 3D cache mechanism combined with similarity-driven logit recalibration and entropy-weighted prediction fusion. Evaluated on three major robustness benchmarks (ModelNet-40C, ScanObjectNN-C, and ShapeNet-C), the method improves classification accuracy by 10.55%, 8.26%, and 4.49%, respectively, over the source 3D VLFMs. These gains demonstrate substantially better generalization to degraded real-world point cloud data, without any parameter updates or retraining.
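The graph-structure-guided label smoothing in (ii) can be sketched as one propagation step over a similarity graph of prototypes. This is an illustrative reconstruction under assumptions, not the paper's code: the function name `smooth_labels`, the cosine-affinity graph, and the single propagation step with mixing weight `alpha` are all assumed details.

```python
import numpy as np

def smooth_labels(prototypes, labels, alpha=0.5):
    """Mix each prototype's one-hot label with its graph neighbors' labels.

    prototypes: (n, dim) array of cached prototype features.
    labels: (n,) integer class labels of the prototypes.
    alpha: how strongly neighbor labels are mixed in (assumed hyperparameter).
    """
    # Cosine-similarity graph among prototypes.
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    W = P @ P.T
    np.fill_diagonal(W, 0.0)          # no self-loops
    W = np.clip(W, 0.0, None)         # keep only positive affinities
    A = W / (W.sum(axis=1, keepdims=True) + 1e-12)  # row-normalised adjacency

    # One-hot labels, then a single label-propagation step.
    Y = np.eye(prototypes.shape[0])[labels]
    return (1.0 - alpha) * Y + alpha * (A @ Y)
```

Prototypes with similar features thus pull each other's soft labels together, which is one simple way to enforce the label consistency the summary describes.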

📝 Abstract
3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
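The cache-based logit computation and entropy-weighted aggregation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the cosine-similarity scoring, and the `exp(-entropy)` weighting are assumptions consistent with the abstract's description.

```python
import numpy as np

def cache_logits(feature, prototypes):
    """Similarity scores of a test feature against cached class prototypes.

    prototypes: (num_classes, dim) cluster centers stored in the 3D cache.
    """
    f = feature / np.linalg.norm(feature)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return P @ f  # cosine similarity per class

def entropy_weight(logits):
    """Confidence weight from softmax entropy: lower entropy -> larger weight."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return np.exp(-entropy)

def fuse(model_logits, cache_scores):
    """Entropy-weighted aggregation of the VLFM's logits and cache scores."""
    w_m = entropy_weight(model_logits)
    w_c = entropy_weight(cache_scores)
    return (w_m * model_logits + w_c * cache_scores) / (w_m + w_c)
```

The fusion simply trusts whichever branch (source model or cache) is more confident on the current sample, which is the role the abstract assigns to entropy-weighted aggregation.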
Problem

Research questions and friction points this paper is trying to address.

Addresses performance drop in 3D vision-language models with noisy data
Mitigates distribution shifts in point cloud processing without retraining
Enables online adaptation to heterogeneous data through dynamic prototypes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free online adaptation for 3D vision-language models
Dynamic prototype learning with continuously updated 3D cache
Entropy-weighted aggregation unifies original and refined predictions
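The dynamic prototype idea in these bullets can be sketched as a momentum-updated class-wise cache. A minimal illustration under assumptions: the paper maintains multiple cluster centers per class to capture intra-class variability, whereas this sketch keeps one prototype per class, and the class name, momentum value, and update rule are illustrative.

```python
import numpy as np

class DynamicPrototypeCache:
    """Class-wise centroids updated online from incoming test features."""

    def __init__(self, num_classes, dim, momentum=0.5):
        self.protos = np.zeros((num_classes, dim))
        self.counts = np.zeros(num_classes, dtype=int)
        self.momentum = momentum

    def update(self, feature, pred_class):
        """Fold a new (normalised) feature into its predicted class prototype."""
        f = feature / np.linalg.norm(feature)
        if self.counts[pred_class] == 0:
            self.protos[pred_class] = f  # first sample initialises the prototype
        else:
            self.protos[pred_class] = (
                self.momentum * self.protos[pred_class]
                + (1.0 - self.momentum) * f
            )
        self.counts[pred_class] += 1
```

Because the cache is updated arithmetically rather than by gradient descent, the whole adaptation loop stays training-free, matching the bullets above.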
Mehran Tamjidi
School of Computer Science, University of Technology Sydney, Sydney, Australia
Hamidreza Dastmalchi
PhD at York University
Mohammadreza Alimoradijazi
Business School, The University of New South Wales, Sydney, Australia
Ali Cheraghian
School of Engineering, Macquarie University, Sydney, Australia
Aijun An
Tier 1 York Research Chair, Professor of Computer Science, York University
Morteza Saberi
School of Computer Science, University of Technology Sydney, Sydney, Australia