🤖 AI Summary
This work challenges whether multimodal large models (MLMs) genuinely *understand* images or merely exhibit superficial perceptual competence. Inspired by the "Chinese Room" thought experiment, the authors propose a novel "Visual Room" conceptual framework and a perception–cognition dual-layer evaluation paradigm, using irony comprehension, a demanding semantic reasoning task, as a litmus test for genuine understanding. They introduce the first high-quality, human-annotated, and independently verified multimodal irony dataset, comprising 924 static images and 100 videos, and establish the first benchmark that explicitly disentangles perceptual accuracy from cognitive understanding. Systematic assessment of eight state-of-the-art MLMs reveals high perceptual accuracy but an average irony-comprehension error rate of ~16.1%, with core bottlenecks identified in affective reasoning, commonsense inference, and contextual alignment.
📝 Abstract
Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the **Visual Room** argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. To operationalize this argument, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual content, while the cognition component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising 924 static images and 100 dynamic videos. All sarcasm labels are annotated by the original authors and verified by independent reviewers to ensure clarity and consistency. We evaluate eight state-of-the-art (SoTA) MLMs. Our results highlight three key findings: (1) MLMs perform well on perception tasks; (2) even with correct perception, models exhibit an average error rate of ~16.1% in sarcasm understanding, revealing a significant gap between seeing and understanding; (3) error analysis attributes this gap to deficiencies in emotional reasoning, commonsense inference, and context alignment. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.