🤖 AI Summary
This work addresses the inherent multimodal ambiguity and ill-posedness in vision-language mapping for language-guided image classification, revealing that existing deterministic embedding-based approaches introduce bias and suffer from limited interpretability when generating reference attention maps. To this end, we propose the first probabilistic attention regularization framework: leveraging pretrained vision-language models, it explicitly models attention distribution uncertainty to generate confidence-calibrated probabilistic reference attention maps. By incorporating uncertainty-aware attention regularization and cross-modal alignment optimization, the framework achieves more robust semantic alignment. Evaluated on multiple benchmarks, our method significantly improves classification accuracy, mitigates class bias, enhances prediction consistency, and demonstrates superior robustness against adversarial perturbations and distributional shifts.
📝 Abstract
Language-guided attention frameworks have significantly enhanced both interpretability and performance in image classification; however, the reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed characteristics of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively while incorporating uncertainty estimates, as compared to their deterministic counterparts. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.