Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit weak visual perception, and critical visual misperceptions are often masked by spuriously correct answers. Method: We introduce "Do You See Me"—a psychology-inspired, multidimensional visual perception benchmark—comprising 1,758 images and 2,612 questions spanning seven 2D/3D subtasks. It features a controllable-complexity vision–cognition joint evaluation framework grounded in human visual cognition theory, enabling structured image–question generation and unified quantitative assessment of both open- and closed-weight MLLMs. Contribution/Results: Our framework uncovers a 29% visual misperception rate beneath correct answers in a leading MLLM. Evaluation across eight state-of-the-art MLLMs (three closed-source, five open-source) shows humans reaching 96.49% accuracy versus a model average of only 48.7%, and the gap widens sharply with task complexity (e.g., from 12% to 45% on the visual form constancy subtask). Analysis identifies attention misalignment and unstable fine-grained visual representations as core bottlenecks.
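The headline "misperception beneath correct answers" number is a ratio over paired perception and reasoning judgments. As a rough illustration only (this is not the paper's released evaluation code, and the record fields are assumptions), the metric can be computed like this:

```python
# Illustrative sketch: fraction of correct reasoning answers whose paired
# visual-perception check failed. Field names are assumptions, not the
# benchmark's actual schema.
from dataclasses import dataclass

@dataclass
class Record:
    reasoning_correct: bool   # did the MLLM answer the reasoning question correctly?
    perception_correct: bool  # did it also pass the paired perception check?

def misperception_rate_behind_correct(records: list[Record]) -> float:
    """Share of correct reasoning answers that still carry a perception error."""
    correct = [r for r in records if r.reasoning_correct]
    if not correct:
        return 0.0
    return sum(not r.perception_correct for r in correct) / len(correct)

# Toy data: 7 of 10 reasoning answers correct, 2 of those with perception errors.
records = [Record(True, True)] * 5 + [Record(True, False)] * 2 + [Record(False, False)] * 3
print(f"misperception rate among correct answers: {misperception_rate_behind_correct(records):.0%}")
```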

📝 Abstract
Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.
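The abstract's point about details "at or below encoder patch resolution" can be made concrete with a back-of-the-envelope check: once an image is resized to a ViT encoder's input size, a small visual detail may span less than one patch. The patch size of 14 matches common CLIP-style encoders; the image and feature sizes below are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the "at or below encoder patch resolution" regime.
def patches_covered(feature_px: float, image_px: int,
                    encoder_input_px: int = 336, patch_px: int = 14) -> float:
    """How many encoder patches a visual detail spans after resizing."""
    scale = encoder_input_px / image_px      # resize factor applied before the ViT
    return (feature_px * scale) / patch_px   # extent of the detail, in patch units

# A 10-px gap in a 1024-px image shrinks to ~0.23 of a patch after resizing,
# i.e. well below the encoder's effective spatial resolution.
print(f"{patches_covered(feature_px=10, image_px=1024):.2f} patches")
```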
Problem

Research questions and friction points this paper is trying to address.

Evaluating visual perception errors in Multimodal LLMs despite correct reasoning answers
Assessing MLLM performance gaps in 2D and 3D human-psychology inspired subtasks
Identifying root causes of visual perception failures like misallocated attention in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a scalable, controllable-complexity benchmark with 1,758 images and 2,612 questions
Evaluates MLLMs across seven human-psychology-inspired subtasks in 2D and 3D
Identifies misallocated visual attention and unstable fine-grained representations as root causes of failures