PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models

📅 2025-03-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inherent multimodal ambiguity and ill-posedness in vision-language mapping for language-guided image classification, revealing that existing deterministic embedding-based approaches introduce bias and suffer from limited interpretability when generating reference attention maps. To this end, we propose the first probabilistic attention regularization framework: leveraging pretrained vision-language models, it explicitly models attention distribution uncertainty to generate confidence-calibrated probabilistic reference attention maps. By incorporating uncertainty-aware attention regularization and cross-modal alignment optimization, the framework achieves more robust semantic alignment. Evaluated on multiple benchmarks, our method significantly improves classification accuracy, mitigates class bias, enhances prediction consistency, and demonstrates superior robustness against adversarial perturbations and distributional shifts.

Technology Category

Application Category

📝 Abstract

Language-guided attention frameworks have significantly enhanced both interpretability and performance in image classification; however, the reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed characteristics of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively while incorporating uncertainty estimates, as compared to their deterministic counterparts. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.

Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of deterministic embeddings in vision-language models.

Introduces probabilistic framework for better cross-modal alignment.

Improves accuracy, reduces bias, and enhances robustness in classification.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic framework for visual attention

Generates probabilistic reference attention maps

Improves accuracy, robustness, and bias mitigation

🔎 Similar Papers

No similar papers found.

Authors to Follow