Vision Language Models Know Law of Conservation without Understanding More-or-Less

📅 2024-10-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether vision-language models (VLMs) genuinely comprehend physical conservation laws—a cornerstone of human cognitive development. To this end, we introduce ConserveBench, a novel benchmark comprising 365 controlled cognitive experiments spanning four conservation dimensions: volume, substance, length, and number. Crucially, ConserveBench is the first to distinguish *transformation tasks*—requiring reversible reasoning about dynamic object manipulations—from *non-transformation tasks*, which only demand static quantity judgments. We employ multimodal prompt engineering and zero-shot evaluation, using synthetically generated image–text pairs annotated with cognitive-logical ground truth. Results reveal a striking dissociation between reversibility-based and quantity-based understanding: VLMs achieve 78.3% accuracy on transformation tasks but only 41.9% on non-transformation tasks. This indicates that VLMs capture superficial behavioral patterns of conservation without acquiring deep, compositional semantic representations of quantity—challenging the classical unidimensional developmental account of conservation competence in cognitive psychology.
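The headline result is a per-task-type accuracy split over the benchmark's items. As a minimal sketch of that aggregation step (the record schema, `accuracy_by_task` name, and toy values are illustrative assumptions, not the paper's actual code or data):

```python
from collections import defaultdict

# Hypothetical ConserveBench-style evaluation records: each item carries its
# conservation dimension, its task type, and whether the model answered correctly.
results = [
    {"dimension": "volume",         "task": "transformation",     "correct": True},
    {"dimension": "solid quantity", "task": "transformation",     "correct": True},
    {"dimension": "length",         "task": "non-transformation", "correct": False},
    {"dimension": "number",         "task": "non-transformation", "correct": True},
]

def accuracy_by_task(records):
    """Aggregate accuracy per task type from per-item evaluation records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["correct"])
    return {task: hits[task] / totals[task] for task in totals}

print(accuracy_by_task(results))
# → {'transformation': 1.0, 'non-transformation': 0.5}
```

A dissociation like the one reported would show up here as a large gap between the two entries of the returned dictionary.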

📝 Abstract
Conservation is a critical milestone of cognitive development, considered to be supported by both the understanding of quantitative concepts and the reversibility of operations. To assess whether this critical component of human intelligence has emerged in Vision Language Models, we curated ConserveBench, a battery of 365 cognitive experiments across four dimensions of physical quantities: volume, solid quantity, length, and number. The former two involve transformational tasks, which require an understanding of reversibility. The latter two involve non-transformational tasks, which assess the understanding of quantity. Surprisingly, we find that while Vision Language Models are generally good at transformational tasks, they tend to fail at non-transformational tasks. This reveals a dissociation between understanding the reversibility of operations and understanding quantity, both of which are believed to be cornerstones of the understanding of the law of conservation in humans. Website: https://growing-ai-like-a-child.github.io/pages/Conservation/
Problem

Research questions and friction points this paper is trying to address.

Assessing Vision Language Models' understanding of the law of conservation
Evaluating performance on transformational vs. non-transformational tasks
Investigating the dissociation between reversibility understanding and quantity understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

ConserveBench, a battery of 365 cognitive experiments for assessing conservation understanding
Transformational tasks test reversibility understanding
Non-transformational tasks evaluate quantity comprehension
Dezhi Luo, University of Michigan (cognitive science, philosophy, AI)
Haiyun Lyu, University of North Carolina at Chapel Hill
Qingying Gao, Johns Hopkins University
Haoran Sun, Johns Hopkins University
Yijiang Li, Argonne National Laboratory
Hokin Deng, Johns Hopkins University (cognition)