CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding

📅 2024-09-02
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study investigates the visual grounding capability of vision-language (VL) transformers for verb phrases, specifically examining whether they can integrate social commonsense knowledge with visual context to interpret context-dependent verbs (e.g., "begging"): actions that cannot be inferred from pixel-level image features alone and require external world knowledge. To this end, the authors introduce CV-Probes, a benchmark explicitly designed to evaluate verb context dependency, and apply MM-SHAP, a multimodal attribution method, to quantify the contribution of verb tokens to model predictions. Experiments across prominent VL models (including CLIP, ALPRO, and BLIP-2) reveal systematic failures in grounding context-dependent verbs, exposing a bottleneck in the joint integration of world knowledge and lexical representations. By categorizing verbs according to their context dependency, the work offers a framework for fine-grained linguistic understanding assessment and interpretable analysis of VL models.

📝 Abstract
This study investigates the ability of various vision-language (VL) models to ground context-dependent and non-context-dependent verb phrases. To do that, we introduce the CV-Probes dataset, designed explicitly for studying context understanding, containing image-caption pairs with context-dependent verbs (e.g., "beg") and non-context-dependent verbs (e.g., "sit"). We employ the MM-SHAP evaluation to assess the contribution of verb tokens towards model predictions. Our results indicate that VL models struggle to ground context-dependent verb phrases effectively. These findings highlight the challenges in training VL models to integrate context accurately, suggesting a need for improved methodologies in VL model training and evaluation.
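The MM-SHAP evaluation mentioned above assigns each input token a Shapley value: its average marginal contribution to the model's prediction over random orderings of the tokens. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation; the toy `score` function (where only the verb token affects a fake image-text matching score) stands in for a real VL model's alignment logit.

```python
import random

def shapley_token_contributions(tokens, score_fn, n_samples=200, seed=0):
    """Monte Carlo estimate of each token's Shapley value w.r.t. score_fn.

    score_fn takes a set of "present" token indices and returns a scalar
    (in MM-SHAP, this would be a VL model's image-text matching score
    with the absent tokens masked out).
    """
    rng = random.Random(seed)
    n = len(tokens)
    contrib = [0.0] * n
    for _ in range(n_samples):
        perm = rng.sample(range(n), n)   # random token ordering
        present = set()
        prev = score_fn(present)
        for i in perm:
            present.add(i)
            cur = score_fn(present)
            contrib[i] += cur - prev     # marginal contribution of token i
            prev = cur
    return [c / n_samples for c in contrib]

# Toy matching score: only the verb "begging" matters, mimicking a
# caption whose verb carries the decisive cross-modal evidence.
tokens = ["a", "man", "begging", "on", "the", "street"]
VERB = tokens.index("begging")
score = lambda present: 1.0 if VERB in present else 0.0

phi = shapley_token_contributions(tokens, score)
```

Under this toy score, all attribution concentrates on the verb token; a low verb attribution from a real model on a context-dependent caption is exactly the failure mode the paper reports.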
Problem

Research questions and friction points this paper is trying to address.

How VL models ground verb phrases with contextual knowledge
Assessing VL models' attention to verb tokens in captions
Improving methodologies for VL model training and evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CV-Probes dataset for verb understanding
Analyzes VL models using explainable AI techniques
Highlights need for better VL training methodologies