COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

📅 2025-07-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the weak visual entailment reasoning of existing vision-language models (VLMs) in crowded scenes. It introduces COREVQA (Crowd Observations and Reasoning Entailment), a visual entailment benchmark built on images from the CrowdHuman dataset. COREVQA comprises 5,608 pairs of images and synthetically generated true/false statements, so each item asks the model to accept or refute a hypothesis about the image. Experiments show that even the top-performing VLMs score below 80% accuracy, while other models range from 39.98% to 69.95%. These results point to limits in how current VLMs reason over crowded scenes, a capability that existing VQA benchmarks rarely test.

📝 Abstract

Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test a model's ability to perform visual entailment, that is, to accept or refute a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5,608 pairs of images and synthetically generated true/false statements, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs' ability to reason over certain types of image-question pairs in crowded scenes.

Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' visual entailment reasoning in crowded scenes

Addressing the lack of benchmarks for hypothesis acceptance/refutation in VQA

Assessing model accuracy on synthetic true/false statement-image pairs
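
As a rough illustration of how accuracy on true/false statement-image pairs can be scored, here is a minimal sketch. It is not the authors' evaluation code: the `EntailmentItem` fields, the `parse_answer` rule, and the `predict` callable (standing in for a VLM call) are all assumptions made for this example, as is the choice to count unparseable answers as wrong.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class EntailmentItem:
    # All field names are hypothetical, chosen for this sketch only.
    image_path: str  # path to a crowded-scene image
    statement: str   # the true/false hypothesis about the image
    label: bool      # True if the image supports the statement


def parse_answer(raw: str) -> Optional[bool]:
    """Map a free-text model reply to True/False, or None if unclear."""
    text = raw.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None


def accuracy(
    items: Iterable[EntailmentItem],
    predict: Callable[[EntailmentItem], str],
) -> float:
    """Fraction of items answered correctly.

    Assumption: a reply that cannot be parsed counts as incorrect.
    """
    total = 0
    correct = 0
    for item in items:
        total += 1
        if parse_answer(predict(item)) == item.label:
            correct += 1
    if total == 0:
        raise ValueError("no items to evaluate")
    return correct / total
```

Passing the model in as a plain callable keeps the scoring logic independent of any particular VLM API, so the same function can be reused across the models being compared.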

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetically generated true/false statements paired with images

Crowded-scene images sourced from the CrowdHuman dataset

Visual entailment reasoning benchmark
