🤖 AI Summary
This work investigates the robustness of mainstream vision-language models (VLMs) to linguistic perturbations in image–text matching: models should be invariant to meaning-preserving rewrites while remaining sensitive to semantic flips (e.g., changes to object category, color, or count). To this end, we propose Language-Guided Invariance Probing (LGIP), a systematic evaluation framework that automatically constructs semantically controlled rewrite and flip samples from 40k MS COCO images, each with five human-annotated captions. LGIP introduces summary metrics—including an invariance error and a semantic sensitivity gap—that expose deficiencies obscured by conventional retrieval metrics. Evaluating nine VLMs reveals that EVA02-CLIP and large OpenCLIP variants exhibit strong robustness, whereas SigLIP variants anomalously prefer flipped texts, demonstrating LGIP's efficacy in diagnosing language-understanding biases in VLMs.
📝 Abstract
Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic.
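The flip construction and the three summary statistics can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the color-swap table, the `score` inputs, and the aggregation details are assumptions made for the example.

```python
import statistics

# Hypothetical color-swap table for one kind of rule-based flip
# (the paper also flips object category and count).
COLOR_FLIPS = {"red": "blue", "blue": "red", "black": "white", "white": "black"}

def flip_color(caption):
    """Swap the first color word found; return None if no color is present."""
    words = caption.split()
    for i, w in enumerate(words):
        if w in COLOR_FLIPS:
            words[i] = COLOR_FLIPS[w]
            return " ".join(words)
    return None

def lgip_stats(orig_scores, para_scores, flip_scores):
    """Summarize image-text matching scores for one model.

    orig_scores: per-image score of the original human caption
    para_scores: per-image list of scores for its paraphrases
    flip_scores: per-image score of the flipped caption
    """
    # Invariance error: how much the score moves under
    # meaning-preserving paraphrases (lower is better).
    inv_err = statistics.mean(
        statistics.mean(abs(p - o) for p in ps)
        for o, ps in zip(orig_scores, para_scores)
    )
    # Semantic sensitivity gap: the original caption should
    # outscore its semantic flip (higher is better).
    gap = statistics.mean(o - f for o, f in zip(orig_scores, flip_scores))
    # Positive rate: fraction of pairs where the original wins.
    pos_rate = statistics.mean(o > f for o, f in zip(orig_scores, flip_scores))
    return inv_err, gap, pos_rate
```

A model that prefers flipped captions, as SigLIP does for object and color edits, would show a negative gap and a positive rate below 0.5 even while its standard retrieval numbers look healthy.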
Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show markedly larger invariance errors and often prefer flipped captions over the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.