VISaGE: Understanding Visual Generics and Exceptions

📅 2025-10-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how vision-language models (VLMs) reason about concepts when the image is congruent with the text (typical images) versus incongruent with it (atypical or text-image mismatched images), focusing on the interplay between semantic priors (properties that generally hold for instances of a category) and pragmatic priors (the assumption that the textual and visual inputs are both relevant). To this end, the authors introduce VISaGE, a benchmark for evaluating reasoning about generics and exceptions, and run balanced experiments contrasting typical and exceptional image inputs. The results show that text-image incongruency substantially degrades conceptual understanding, and that the pragmatic prior has a stronger effect than the semantic prior when the model is queried about individual instances. Consequently, VLMs tend to misjudge visual content that is semantically valid but pragmatically anomalous. The work characterizes VLMs' reasoning biases under prior conflict and provides an evaluation benchmark for studying robustness and interpretability.

📝 Abstract

While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.
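The balanced comparison described in the abstract can be pictured as a small grid of conditions. The sketch below is only an illustration under stated assumptions: it supposes each evaluation record carries an image type (typical or exceptional), a congruency flag, and a correctness flag, then computes accuracy per cell and the drop from congruent to incongruent inputs. The names `Trial`, `accuracy_by_condition`, and `congruency_gap` are hypothetical, the two-factor layout is an assumed reading of the setup, and the records are toy values, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One hypothetical evaluation record (field names are illustrative)."""
    image_type: str   # "typical" or "exceptional"
    congruent: bool   # does the image match the category named in the text?
    correct: bool     # did the model's answer match the gold label?


def accuracy_by_condition(trials):
    """Group trials by (image_type, congruent) and return accuracy per cell."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t in trials:
        key = (t.image_type, t.congruent)
        totals[key] += 1
        hits[key] += int(t.correct)
    return {key: hits[key] / totals[key] for key in totals}


def congruency_gap(acc, image_type):
    """Accuracy drop when moving from congruent to incongruent images."""
    return acc[(image_type, True)] - acc[(image_type, False)]


# Toy records invented to exercise the functions; NOT results from the paper.
toy_trials = [
    Trial("typical", True, True),
    Trial("typical", True, True),
    Trial("typical", False, True),
    Trial("typical", False, False),
    Trial("exceptional", True, True),
    Trial("exceptional", True, False),
    Trial("exceptional", False, False),
    Trial("exceptional", False, False),
]

acc = accuracy_by_condition(toy_trials)
for image_type in ("typical", "exceptional"):
    print(image_type, "congruency gap:", congruency_gap(acc, image_type))
```

Under this framing, comparing the congruency gap with the typical-versus-exceptional gap is one way to ask which prior has the larger effect; the paper's actual metrics and protocol may differ.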

Problem

Research questions and friction points this paper is trying to address.

Understanding how VLMs handle visual generics and exceptions in atypical instances

Investigating tension between pragmatic and semantic priors in vision-language models

Evaluating conceptual degradation when congruency assumptions are violated

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing VISaGE dataset for VLM evaluation

Analyzing pragmatic and semantic priors in VLMs

Testing VLMs with typical and exceptional images
