Vision language models are blind

📅 2024-07-09
🏛️ arXiv.org
📈 Citations: 28
Influential: 1
📄 PDF
🤖 AI Summary
This work identifies a structural deficiency in state-of-the-art vision-language models (VLMs): although their vision encoders capture low-level geometric information, their language decoders fail to use it, leaving the models far weaker than humans at simple spatial judgments. To diagnose this, the authors introduce BlindTest, a benchmark of seven elementary tasks that require precise spatial-relation judgments. Four state-of-the-art VLMs average only 58.07% accuracy (best: 77.84%), far below expected human performance; accuracy rises to nearly 100% when shapes are separated by more space, pointing to spatial interference between nearby primitives as the key bottleneck. Robustness analysis across image resolutions and line widths, together with linear probing, indicates that the geometric information is encoded in VLM visual representations but is not successfully decoded by the downstream language model. The result is a diagnostic framework for attributing multimodal spatial-reasoning failures to the vision-to-language interface.

📝 Abstract
While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, score high on many vision-understanding benchmarks, they still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks, including identifying (a) whether two circles overlap; (b) how many times two lines intersect; (c) which letter is being circled in a word; and (d) the number of circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.07% accurate on average. Claude 3.5 Sonnet performs the best at 77.84% accuracy, far from the human expected accuracy of 100%. Across different image resolutions and line widths, VLMs including slow-thinking models consistently struggle with those tasks that require precise spatial information when geometric primitives overlap or are close. Yet, VLMs perform at near-100% accuracy when much more space is added to separate shapes and letters. Linear probing experiments show that vision encoders contain sufficient visual information to solve BlindTest and that language models fail to decode this information into correct answers. Code and data are at: https://vlmsareblind.github.io
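The circle-overlap task described in the abstract has a fully programmatic ground truth: two circles overlap exactly when the distance between their centers is less than the sum of their radii. The following sketch generates stimulus specifications in that spirit; it is an illustrative reconstruction, not the authors' released code, and the function names, canvas size, and radius range are assumptions.

```python
import math
import random

def make_circle_pair(canvas=224, r_min=10, r_max=30, rng=None):
    """Sample two circles on a square canvas and label whether they overlap.

    Hypothetical generator in the spirit of BlindTest's circle-overlap task.
    """
    rng = rng or random.Random()

    def circle():
        r = rng.uniform(r_min, r_max)
        # Keep the circle fully inside the canvas.
        x = rng.uniform(r, canvas - r)
        y = rng.uniform(r, canvas - r)
        return x, y, r

    (x1, y1, r1), (x2, y2, r2) = circle(), circle()
    dist = math.hypot(x2 - x1, y2 - y1)
    # Ground truth: centers closer than the sum of radii means overlap.
    overlap = dist < r1 + r2
    return {"c1": (x1, y1, r1), "c2": (x2, y2, r2), "overlap": overlap}

def separated_pair(gap, canvas=224, r=20):
    """Two equal circles whose edges are exactly `gap` pixels apart.

    Models the paper's key manipulation: adding space between shapes,
    under which VLM accuracy recovers to near 100%.
    """
    x1 = canvas / 2 - r - gap / 2
    x2 = canvas / 2 + r + gap / 2
    return {"c1": (x1, canvas / 2, r),
            "c2": (x2, canvas / 2, r),
            "overlap": gap <= 0}
```

Rendering these specifications (e.g., with matplotlib or PIL) and sweeping `gap` would reproduce the separation manipulation the abstract describes.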
Problem

Research questions and friction points this paper is trying to address.

VLMs fail to translate precise spatial visual features into words
VLMs struggle with low-level vision tasks when geometric primitives overlap or sit close together
VLMs perform poorly on simple spatial reasoning tasks that humans solve near-perfectly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces BlindTest, a benchmark of seven low-level vision tasks
Reveals VLMs' lack of spatial precision via controlled geometric stimuli
Uses linear probing to show vision encoders contain the information that language decoding fails to extract
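The linear-probing diagnostic listed above trains a simple linear classifier on frozen vision-encoder features: if the probe succeeds where the full VLM fails, the information was present but not decoded by the language model. A minimal NumPy sketch of such a probe follows; the synthetic features stand in for real encoder embeddings, and all names and hyperparameters here are assumptions, not the paper's setup.

```python
import numpy as np

def train_linear_probe(feats, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe on frozen features via gradient descent.

    In the paper's setup, `feats` would be embeddings from a frozen VLM
    vision encoder and `labels` the BlindTest ground truth (e.g. overlap
    yes/no); here synthetic features stand in for them.
    """
    n, d = feats.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid predictions
        w -= lr * (feats.T @ (p - labels)) / n      # gradient step on weights
        b -= lr * float(np.mean(p - labels))        # gradient step on bias
    return w, b

def probe_accuracy(w, b, feats, labels):
    preds = (feats @ w + b) > 0
    return float(np.mean(preds == labels))

# Toy demo: one feature dimension linearly encodes the label, mirroring the
# claim that geometric information is linearly decodable from the encoder.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(float)
w, b = train_linear_probe(X, y)
```

A probe like this reaching high accuracy on features from which the full model cannot answer is the paper's evidence that the bottleneck lies in language-side decoding rather than visual encoding.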