Learning to Look: Cognitive Attention Alignment with Vision-Language Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
CNNs often rely on superficial statistical shortcuts rather than semantic content for prediction, undermining decision reliability and interpretability. To address this, we propose a scalable, annotation-free framework that leverages vision-language models (VLMs) to automatically generate semantic attention maps from natural-language prompts. Through a cognitively inspired attention alignment mechanism and an auxiliary loss, our method explicitly constrains CNN convolutional attention to high-level semantic regions. By integrating explanation regularization without requiring expert-annotated concepts, it significantly improves model generalization and human-aligned attention. Experiments demonstrate state-of-the-art performance on ColorMNIST, accuracy competitive with strong supervised baselines on DecoyMNIST, and attention distributions that better conform to human intuition.

📝 Abstract
Convolutional Neural Networks (CNNs) frequently "cheat" by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on two challenging datasets, ColorMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColorMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.
Problem

Research questions and friction points this paper is trying to address.

CNNs exploit superficial correlations instead of meaningful features
Existing attention methods require labor-intensive expert annotations
Need scalable approach for cognitively plausible model decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language models for automatic attention generation
Uses natural language prompts to create semantic attention maps
Aligns CNN attention with language-guided maps via auxiliary loss
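The alignment idea in the bullets above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the way the CNN attention map is extracted (channel-mean of activations), the loss form (mean squared error), and the weighting factor `lam` are all assumptions for the sketch; in practice the VLM-derived map would come from a model such as CLIP queried with a natural-language prompt.

```python
import numpy as np

def spatial_attention(features):
    """Collapse CNN features of shape (C, H, W) into a spatial attention
    map of shape (H, W): mean absolute activation over channels,
    normalized to sum to 1 (an assumed, simple choice)."""
    attn = np.abs(features).mean(axis=0)
    return attn / attn.sum()

def alignment_loss(cnn_attn, vlm_attn):
    """Auxiliary loss penalizing divergence between the CNN's attention
    and the VLM-derived semantic map (MSE here, as an illustration)."""
    return float(((cnn_attn - vlm_attn) ** 2).mean())

def total_loss(task_loss, cnn_attn, vlm_attn, lam=0.5):
    """Task loss plus the weighted attention-alignment term."""
    return task_loss + lam * alignment_loss(cnn_attn, vlm_attn)

# Toy example: 8 channels of 7x7 features; a hypothetical VLM map
# that places all semantic mass on the central region.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 7, 7))
vlm_map = np.zeros((7, 7))
vlm_map[2:5, 2:5] = 1.0
vlm_map /= vlm_map.sum()

cnn_map = spatial_attention(feats)
loss = total_loss(task_loss=1.2, cnn_attn=cnn_map, vlm_attn=vlm_map)
```

During training, the alignment term is minimized jointly with the classification loss, so gradients push the CNN's spatial attention toward the semantic regions named in the prompt rather than toward shortcut features like background color.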