🤖 AI Summary
This work addresses the fragility of multi-label recognition under distribution shift, particularly in zero-shot settings with frozen vision-language models such as CLIP, where scoring labels independently ignores label co-occurrence structure and degrades performance. To tackle this, the authors propose Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that models label dependencies through an anchor-conditioned, closed-form Bayesian refinement without updating the backbone network. For each test image, the approach selects a high-confidence anchor label and combines the zero-shot logits with anchor-conditioned priors estimated online from second-order co-occurrence statistics of the unlabeled test stream. The update is carried out in logit space, adds little overhead beyond a single forward pass, and admits an interpretation via pointwise mutual information. Experiments show that BCP outperforms existing test-time adaptation methods across standard multi-label benchmarks and multiple CLIP backbones, raising average mAP from 57.31 to 69.22 for RN50 and from 62.61 to 71.79 for ViT-B/16.
📝 Abstract
Multi-label recognition with frozen Vision-Language Models (VLMs) is brittle under distribution shift: standard zero-shot inference scores labels independently, ignoring co-occurrence structure and producing incoherent label sets where dominant concepts suppress weaker but compatible labels. We introduce Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency without tuning the backbone. BCP views zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and attributes shift-induced errors mainly to a mismatched label prior. For each test image, it selects a high-confidence anchor label and applies an anchor-conditioned Bayesian refinement. This update is closed-form in logit space and admits a pointwise mutual information (PMI) interpretation, explicitly promoting compatible labels and suppressing incompatible ones. BCP operates without target annotations by estimating anchor-conditioned priors online from the unlabeled test stream via lightweight second-order co-occurrence statistics, adding negligible overhead beyond a single forward pass. Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, e.g., improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.
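To make the anchor-conditioned update more concrete, here is a minimal Python sketch of one way such a refinement could look. It is not the authors' implementation: the abstract does not give the exact formula, so the class name `AnchorPMIAdapter`, the sigmoid pseudo-labeling with a fixed `threshold`, the Laplace `smoothing`, and the `strength` weight are all assumptions made for illustration. The sketch accumulates second-order co-occurrence counts from the unlabeled stream, picks the highest-scoring label as the anchor, and adds the smoothed PMI term log P(j | anchor) - log P(j) to each logit.

```python
import numpy as np


class AnchorPMIAdapter:
    """Hypothetical sketch of an anchor-conditioned prior correction in logit space.

    All hyperparameters here are illustrative assumptions, not values from the paper.
    """

    def __init__(self, num_labels, strength=1.0, threshold=0.5, smoothing=1.0):
        self.strength = strength      # assumed weight on the PMI correction
        self.threshold = threshold    # assumed pseudo-label cutoff on sigmoid(logit)
        self.smoothing = smoothing    # Laplace smoothing to avoid log(0)
        self.cooc = np.zeros((num_labels, num_labels))  # second-order counts
        self.n_seen = 0

    def update(self, logits):
        """Accumulate co-occurrence counts from one unlabeled test sample."""
        logits = np.asarray(logits, dtype=float)
        probs = 1.0 / (1.0 + np.exp(-logits))
        pseudo = (probs >= self.threshold).astype(float)
        self.cooc += np.outer(pseudo, pseudo)  # diagonal holds per-label counts
        self.n_seen += 1

    def pmi_row(self, anchor):
        """Smoothed PMI(j; anchor) = log P(j | anchor) - log P(j) for every label j."""
        s = self.smoothing
        counts = np.diag(self.cooc)
        p_label = (counts + s) / (self.n_seen + 2.0 * s)
        p_given_anchor = (self.cooc[anchor] + s) / (counts[anchor] + 2.0 * s)
        return np.log(p_given_anchor) - np.log(p_label)

    def refine(self, logits):
        """Return (refined logits, anchor index) for one test sample."""
        logits = np.asarray(logits, dtype=float)
        anchor = int(np.argmax(logits))        # high-confidence anchor label
        refined = logits + self.strength * self.pmi_row(anchor)
        refined[anchor] = logits[anchor]       # leave the anchor's own score as-is
        return refined, anchor
```

In a streaming setting one would call `refine` and then `update` on each incoming sample's zero-shot logits. Labels that frequently co-occur with the anchor receive a positive PMI term and are promoted, while labels rarely seen with it are pushed down; no gradients or backbone updates are involved.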