Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of 3D vision-language foundation models (VLFMs) to noisy, incomplete, or distribution-shifted point clouds, this paper proposes Uni-Adapter, a training-free online test-time adaptation method. The approach enables the first training-free test-time optimization for 3D VLFMs, comprising: (i) dynamic prototype learning that continuously refines class-wise centroid representations; (ii) graph-structure-guided label smoothing to enhance prediction consistency; and (iii) a 3D cache mechanism combined with similarity-driven logit recalibration and entropy-weighted prediction fusion. Evaluated on three major robustness benchmarks (ModelNet-40C, ScanObjectNN-C, and ShapeNet-C), the method improves classification accuracy by 10.55%, 8.26%, and 4.49%, respectively, over the source 3D VLFMs. These gains demonstrate substantially better generalization to degraded real-world point cloud data, without any parameter updates or retraining.
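The graph-structure-guided label smoothing in (ii) can be sketched as one propagation step over a similarity graph of prototypes. This is an illustrative reconstruction under assumptions, not the paper's code: the function name `smooth_labels`, the cosine-affinity graph, and the single propagation step with mixing weight `alpha` are all assumed details.

```python
import numpy as np

def smooth_labels(prototypes, labels, alpha=0.5):
    """Mix each prototype's one-hot label with its graph neighbors' labels.

    prototypes: (n, dim) array of cached prototype features.
    labels: (n,) integer class labels of the prototypes.
    alpha: how strongly neighbor labels are mixed in (assumed hyperparameter).
    """
    # Cosine-similarity graph among prototypes.
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    W = P @ P.T
    np.fill_diagonal(W, 0.0)          # no self-loops
    W = np.clip(W, 0.0, None)         # keep only positive affinities
    A = W / (W.sum(axis=1, keepdims=True) + 1e-12)  # row-normalised adjacency

    # One-hot labels, then a single label-propagation step.
    Y = np.eye(prototypes.shape[0])[labels]
    return (1.0 - alpha) * Y + alpha * (A @ Y)
```

Prototypes with similar features thus pull each other's soft labels together, which is one simple way to enforce the label consistency the summary describes.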

📝 Abstract
3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
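The cache-based logit computation and entropy-weighted aggregation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the cosine-similarity scoring, and the `exp(-entropy)` weighting are assumptions consistent with the abstract's description.

```python
import numpy as np

def cache_logits(feature, prototypes):
    """Similarity scores of a test feature against cached class prototypes.

    prototypes: (num_classes, dim) cluster centers stored in the 3D cache.
    """
    f = feature / np.linalg.norm(feature)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return P @ f  # cosine similarity per class

def entropy_weight(logits):
    """Confidence weight from softmax entropy: lower entropy -> larger weight."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return np.exp(-entropy)

def fuse(model_logits, cache_scores):
    """Entropy-weighted aggregation of the VLFM's logits and cache scores."""
    w_m = entropy_weight(model_logits)
    w_c = entropy_weight(cache_scores)
    return (w_m * model_logits + w_c * cache_scores) / (w_m + w_c)
```

The fusion simply trusts whichever branch (source model or cache) is more confident on the current sample, which is the role the abstract assigns to entropy-weighted aggregation.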
Problem

Research questions and friction points this paper is trying to address.

Addresses performance drop in 3D vision-language models with noisy data
Mitigates distribution shifts in point cloud processing without retraining
Enables online adaptation to heterogeneous data through dynamic prototypes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free online adaptation for 3D vision-language models
Dynamic prototype learning with continuously updated 3D cache
Entropy-weighted aggregation unifies original and refined predictions
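The dynamic prototype idea in these bullets can be sketched as a momentum-updated class-wise cache. A minimal illustration under assumptions: the paper maintains multiple cluster centers per class to capture intra-class variability, whereas this sketch keeps one prototype per class, and the class name, momentum value, and update rule are illustrative.

```python
import numpy as np

class DynamicPrototypeCache:
    """Class-wise centroids updated online from incoming test features."""

    def __init__(self, num_classes, dim, momentum=0.5):
        self.protos = np.zeros((num_classes, dim))
        self.counts = np.zeros(num_classes, dtype=int)
        self.momentum = momentum

    def update(self, feature, pred_class):
        """Fold a new (normalised) feature into its predicted class prototype."""
        f = feature / np.linalg.norm(feature)
        if self.counts[pred_class] == 0:
            self.protos[pred_class] = f  # first sample initialises the prototype
        else:
            self.protos[pred_class] = (
                self.momentum * self.protos[pred_class]
                + (1.0 - self.momentum) * f
            )
        self.counts[pred_class] += 1
```

Because the cache is updated arithmetically rather than by gradient descent, the whole adaptation loop stays training-free, matching the bullets above.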
Mehran Tamjidi
School of Computer Science, University of Technology Sydney, Sydney, Australia
Hamidreza Dastmalchi
PhD at York University
Mohammadreza Alimoradijazi
Business School, The University of New South Wales, Sydney, Australia
Ali Cheraghian
School of Engineering, Macquarie University, Sydney, Australia
Aijun An
Tier 1 York Research Chair, Professor of Computer Science, York University
Morteza Saberi
School of Computer Science, University of Technology Sydney, Sydney, Australia