PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vision Language Models

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inherent ambiguity and ill-posedness of cross-modal mappings in language-guided image classification, observing that existing deterministic embedding-based approaches introduce bias and offer limited interpretability when generating reference attention maps. To this end, the authors propose the first probabilistic attention regularization framework: leveraging pre-trained vision-language models, it explicitly models uncertainty in the attention distribution to generate confidence-calibrated probabilistic reference attention maps. By combining uncertainty-aware attention regularization with cross-modal alignment optimization, the framework achieves more robust semantic alignment. Across multiple benchmarks, the method improves classification accuracy, mitigates class bias, yields more consistent predictions, and shows stronger robustness to adversarial perturbations and distribution shifts.

📝 Abstract
Language-guided attention frameworks have significantly enhanced both interpretability and performance in image classification; however, the reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed characteristics of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively while incorporating uncertainty estimates, as compared to their deterministic counterparts. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of deterministic embeddings in vision-language models.
Introduces probabilistic framework for better cross-modal alignment.
Improves accuracy, reduces bias, and enhances robustness in classification.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic framework for visual attention
Generates probabilistic reference attention maps
Improves accuracy, robustness, and bias mitigation
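The idea in the bullets above can be sketched in a few lines. This is an illustrative reconstruction only, not the paper's actual implementation: the Gaussian parameterization of the text embedding, the Monte-Carlo aggregation, the function names, and the inverse-variance weighting of the regularizer are all assumptions about how a "probabilistic reference attention map" with uncertainty estimates could be formed from a frozen CLIP-style encoder.

```python
import numpy as np

def probabilistic_reference_attention(img_feats, text_mu, text_logvar,
                                      n_samples=8, rng=None):
    """Monte-Carlo reference attention from a probabilistic text embedding.

    img_feats:   (P, D) patch features from a frozen vision encoder (assumed)
    text_mu:     (D,)   mean of the text-embedding distribution (assumed)
    text_logvar: (D,)   log-variance of that distribution (assumed)
    Returns the mean attention map over P patches and its per-patch variance,
    which serves as the uncertainty estimate.
    """
    rng = rng or np.random.default_rng(0)
    std = np.exp(0.5 * text_logvar)
    maps = []
    for _ in range(n_samples):
        # Reparameterized sample of a text embedding.
        t = text_mu + std * rng.standard_normal(text_mu.shape)
        sim = img_feats @ t                      # (P,) patch-text similarity
        e = np.exp(sim - sim.max())              # numerically stable softmax
        maps.append(e / e.sum())                 # attention over patches
    maps = np.stack(maps)                        # (n_samples, P)
    return maps.mean(axis=0), maps.var(axis=0)

def attention_regularization(model_attn, ref_mean, ref_var, eps=1e-6):
    """Uncertainty-aware penalty: patches where the reference map is
    confident (low variance) constrain the classifier's attention more."""
    w = 1.0 / (ref_var + eps)
    w = w / w.sum()
    return float(np.sum(w * (model_attn - ref_mean) ** 2))
```

Because each sampled map is a softmax over patches, their mean is still a valid attention distribution, and the variance directly exposes where the language guidance is ambiguous.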
Mayank Nautiyal
Department of Information Technology, Uppsala University, Uppsala, Sweden
Stela Arranz Gheorghe
IT University of Copenhagen, Copenhagen, Denmark
Kristiana Stefa
IT University of Copenhagen, Copenhagen, Denmark
Li Ju
Department of Information Technology, Uppsala University
Federated Learning · Distributed Optimization · Uncertainty Quantification · Multimodal Language Models
I. Sintorn
Department of Information Technology, Uppsala University, Uppsala, Sweden
Prashant Singh
Department of Information Technology, Uppsala University, Uppsala, Sweden; Science for Life Laboratory, Uppsala University, Uppsala, Sweden