🤖 AI Summary
This work addresses the weak visual entailment reasoning of existing vision-language models (VLMs) in crowded scenes. To this end, the authors introduce COREVQA (Crowd Observations and Reasoning Entailment), a visual entailment benchmark designed specifically for dense crowd scenarios. Built on the CrowdHuman dataset, COREVQA comprises 5,608 pairs, each consisting of an image and a synthetically generated true/false statement, produced via controllable image-caption composition, and it combines visual question answering with logical entailment verification in its evaluation. Experiments show that even the top-performing VLMs stay below 80% accuracy on COREVQA, while other models score between 39.98% and 69.95%. This is substantially below their performance on standard VQA benchmarks and points to fundamental limitations in modeling complex spatial relationships and in keeping answers semantically and logically consistent. COREVQA fills a gap in visual entailment evaluation under occlusion-dense conditions and provides a benchmark for advancing fine-grained spatial reasoning and semantic coherence in VLMs.
📝 Abstract
Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test a model's ability to perform visual entailment accurately, that is, to accept or refute a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5,608 pairs, each consisting of an image and a synthetically generated true/false statement. The images are drawn from the CrowdHuman dataset and are chosen to provoke visual entailment reasoning on challenging crowded scenes. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs' ability to reason over certain types of image-question pairs in crowded scenes.
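
As an illustration of the evaluation setting described above, the following is a minimal sketch of how accuracy over image and true/false statement pairs could be scored. It is not the authors' evaluation code: the `Example` record, the `parse_answer` rule (a reply counts only if it starts with "true" or "false"), and the choice to count unparseable replies as wrong are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    # All field names are hypothetical; the benchmark's real schema may differ.
    image_id: str    # identifier of a crowded-scene image
    statement: str   # synthetically generated hypothesis about the image
    label: bool      # True if the image supports the statement


def parse_answer(raw: str) -> Optional[bool]:
    """Map a model's free-text reply to True/False, or None if unparseable."""
    text = raw.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None


def accuracy(examples: List[Example], answers: List[str]) -> float:
    """Fraction of replies matching the gold label (unparseable counts as wrong)."""
    if len(examples) != len(answers):
        raise ValueError("examples and answers must have the same length")
    if not examples:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(parse_answer(a) == ex.label for ex, a in zip(examples, answers))
    return correct / len(examples)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    examples = [
        Example("img_001", "At least one person is wearing a hat.", True),
        Example("img_002", "Nobody in the image is seated.", False),
        Example("img_003", "A child stands in front of the crowd.", True),
        Example("img_004", "Everyone is facing the camera.", False),
    ]
    answers = ["True", "false.", "False", "I am not sure"]
    print(accuracy(examples, answers))  # 0.5
```

Counting unparseable replies as incorrect is one possible convention; a different evaluation could instead exclude them or report them separately.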