Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (e.g., CLIP, Grounding DINO) suffer significant performance degradation under distribution shifts in recognition and detection tasks. To address this, we propose BCA+, a training-free, backpropagation-free test-time adaptation framework. BCA+ introduces a dynamic cache that models history-guided adaptive priors alongside feature-similarity likelihoods and integrates them via uncertainty-weighted fusion, calibrating both semantic and spatial context. It also incorporates class-embedding alignment and multi-scale spatial matching to improve cross-task generalization. Evaluated across multiple recognition and detection benchmarks, BCA+ achieves state-of-the-art performance with low latency, improving robustness and real-time applicability under distribution shifts.

📝 Abstract
Vision-language models (VLMs) such as CLIP and Grounding DINO have achieved remarkable success in object recognition and detection. However, their performance often degrades under real-world distribution shifts. Test-time adaptation (TTA) aims to mitigate this issue by adapting models during inference. Existing methods either rely on computationally expensive backpropagation, which hinders real-time deployment, or focus solely on likelihood adaptation, which overlooks the critical role of the prior. Our prior work, Bayesian Class Adaptation (BCA), addressed these shortcomings for object recognition by introducing a training-free framework that incorporates adaptive priors. Building upon this foundation, we now present Bayesian Class Adaptation plus (BCA+), a unified, training-free framework for TTA for both object recognition and detection. BCA+ introduces a dynamic cache that adaptively stores and updates class embeddings, spatial scales (for detection), and, crucially, adaptive class priors derived from historical predictions. We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction. This cache-based prediction combines a dynamically updated likelihood (measuring feature and scale similarity) and a prior (reflecting the evolving class distribution). This dual-adaptation mechanism, coupled with uncertainty-guided fusion, enables BCA+ to correct both the model's semantic understanding and its contextual confidence. As a training-free method requiring no backpropagation, BCA+ is highly efficient. Extensive experiments demonstrate that BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.
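The Bayesian formulation in the abstract can be illustrated with a minimal sketch. It assumes a cosine-similarity likelihood with a softmax temperature and an entropy-based fusion weight; the abstract does not specify either, so the function names (`cache_posterior`, `fuse`), the `temperature` value, and the `exp(-entropy)` weighting are hypothetical choices rather than the authors' implementation. Spatial scales for detection are omitted.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cache_posterior(feature, class_embeddings, priors, temperature=0.1):
    # Assumed likelihood: softmax over cosine similarity to cached embeddings.
    likelihood = softmax(
        [cosine(feature, e) / temperature for e in class_embeddings]
    )
    # Bayes' rule: posterior is proportional to likelihood times prior.
    unnorm = [l * p for l, p in zip(likelihood, priors)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def fuse(initial_probs, cache_probs):
    # Assumed uncertainty weighting: lower entropy earns a larger weight.
    w_init = math.exp(-entropy(initial_probs))
    w_cache = math.exp(-entropy(cache_probs))
    total = w_init + w_cache
    return [
        (w_init * a + w_cache * b) / total
        for a, b in zip(initial_probs, cache_probs)
    ]

# Usage: a feature close to class 0, with uniform priors.
posterior = cache_posterior([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
final = fuse([0.4, 0.6], posterior)
```

In this sketch the fused output is a convex combination of two distributions, so it remains a valid probability distribution, and the more confident of the two sources dominates the result.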
Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language models to distribution shifts during inference
Overcoming computational limitations of backpropagation in real-time deployment
Integrating both likelihood adaptation and dynamic prior updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free Bayesian framework for object recognition and detection
Dynamic cache updating embeddings, scales, and adaptive priors
Uncertainty-guided fusion combining initial output with cache predictions
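One way a dynamic cache of class embeddings and adaptive priors might be maintained is sketched here. The update rules are assumptions: a running mean for each class embedding and count-based priors seeded with uniform pseudo-counts. The class name `DynamicCache` and its methods are hypothetical, and the spatial scales stored for detection are left out.

```python
class DynamicCache:
    """Per-class running-mean embeddings plus count-based priors (illustrative)."""

    def __init__(self, class_embeddings):
        # Seed the cache with the initial class embeddings (e.g. text embeddings).
        self.embeddings = [list(e) for e in class_embeddings]
        # One pseudo-count per class gives a uniform initial prior.
        self.counts = [1.0] * len(class_embeddings)

    def priors(self):
        # Prior for each class is its share of historical predictions.
        total = sum(self.counts)
        return [c / total for c in self.counts]

    def update(self, feature, posterior):
        # Assign the sample to its most probable class.
        k = max(range(len(posterior)), key=posterior.__getitem__)
        self.counts[k] += 1.0
        n = self.counts[k]
        # Incremental running mean of the embedding for class k.
        self.embeddings[k] = [
            e + (f - e) / n for e, f in zip(self.embeddings[k], feature)
        ]
        return k

# Usage: one test sample predicted as class 0 shifts both embedding and prior.
cache = DynamicCache([[1.0, 0.0], [0.0, 1.0]])
cache.update([0.8, 0.2], [0.9, 0.1])
```

No gradients are computed anywhere in this update, which is consistent with the training-free, backpropagation-free design described above.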
Lihua Zhou
Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, CAS
Machine Learning, Transfer Learning
Mao Ye
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Shuaifeng Li
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Domain Adaptation, Object Detection
Nianxin Li
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Computer Vision, Object Detection, Domain Adaptation
Jinlin Wu
Institute of Automation, Chinese Academy of Sciences
Xiatian Zhu
University of Surrey
Machine Learning, Computer Vision
Lei Deng
School of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Hongbin Liu
Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong, China
Jiebo Luo
University of Rochester and performed this work while on a sabbatical leave at the Hong Kong Institute of Science and Innovation
Zhen Lei
Associate Professor, OSCO Research Chair in Off-site Construction
Offsite Construction, Construction Engineering and Management