🤖 AI Summary
This work challenges whether multimodal large models (MLMs) genuinely *understand* images or merely exhibit superficial perceptual competence. Inspired by the "Chinese Room" thought experiment, the authors propose a novel "Visual Room" conceptual framework and a perception–cognition dual-layer evaluation paradigm, using irony comprehension, a demanding semantic reasoning task, as a litmus test for genuine understanding. They introduce the first high-quality, human-annotated, and independently verified multimodal irony dataset, comprising 924 static images and 100 videos, and establish the first benchmark that explicitly disentangles perceptual accuracy from cognitive understanding. Systematic assessment of eight state-of-the-art MLMs reveals high perceptual accuracy but an average irony-comprehension error rate of ~16.1%, with core bottlenecks identified in affective reasoning, commonsense inference, and contextual alignment.
📝 Abstract
Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the **Visual Room** argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. To operationalize this argument, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual content, while the cognition component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising 924 static images and 100 dynamic videos. All sarcasm labels are annotated by the original authors and verified by independent reviewers to ensure clarity and consistency. We evaluate eight state-of-the-art (SoTA) MLMs. Our results highlight three key findings: (1) MLMs perform well on perception tasks; (2) even with correct perception, models exhibit an average error rate of ~16.1% in sarcasm understanding, revealing a significant gap between seeing and understanding; (3) error analysis attributes this gap to deficiencies in emotional reasoning, commonsense inference, and context alignment. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.