Vision Language Models Know Law of Conservation without Understanding More-or-Less

📅 2024-10-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether vision-language models (VLMs) genuinely comprehend physical conservation laws—a cornerstone of human cognitive development. To this end, we introduce ConserveBench, a novel benchmark comprising 365 controlled cognitive experiments spanning four conservation dimensions: volume, substance, length, and number. Crucially, ConserveBench is the first to distinguish *transformation tasks*—requiring reversible reasoning about dynamic object manipulations—from *non-transformation tasks*, which only demand static quantity judgments. We employ multimodal prompt engineering and zero-shot evaluation, using synthetically generated image–text pairs annotated with cognitive-logical ground truth. Results reveal a striking dissociation between reversibility-based and quantity-based understanding: VLMs achieve 78.3% accuracy on transformation tasks but only 41.9% on non-transformation tasks. This indicates that VLMs capture superficial behavioral patterns of conservation without acquiring deep, compositional semantic representations of quantity—challenging the classical unidimensional developmental account of conservation competence in cognitive psychology.
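The headline result is a per-task-type accuracy split over the benchmark's items. As a minimal sketch of that aggregation step (the record schema, `accuracy_by_task` name, and toy values are illustrative assumptions, not the paper's actual code or data):

```python
from collections import defaultdict

# Hypothetical ConserveBench-style evaluation records: each item carries its
# conservation dimension, its task type, and whether the model answered correctly.
results = [
    {"dimension": "volume",         "task": "transformation",     "correct": True},
    {"dimension": "solid quantity", "task": "transformation",     "correct": True},
    {"dimension": "length",         "task": "non-transformation", "correct": False},
    {"dimension": "number",         "task": "non-transformation", "correct": True},
]

def accuracy_by_task(records):
    """Aggregate accuracy per task type from per-item evaluation records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["correct"])
    return {task: hits[task] / totals[task] for task in totals}

print(accuracy_by_task(results))
# → {'transformation': 1.0, 'non-transformation': 0.5}
```

A dissociation like the one reported would show up here as a large gap between the two entries of the returned dictionary.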

📝 Abstract
Conservation is a critical milestone of cognitive development, considered to be supported by both the understanding of quantitative concepts and the reversibility of operations. To assess whether this critical component of human intelligence has emerged in Vision Language Models, we curated ConserveBench, a battery of 365 cognitive experiments across four dimensions of physical quantities: volume, solid quantity, length, and number. The former two involve transformational tasks, which require an understanding of reversibility. The latter two involve non-transformational tasks, which assess the understanding of quantity. Surprisingly, we find that while Vision Language Models are generally good at transformational tasks, they tend to fail at non-transformational tasks. This reveals a dissociation between understanding the reversibility of operations and understanding quantity, both of which are believed to be cornerstones of the understanding of the law of conservation in humans. Website: https://growing-ai-like-a-child.github.io/pages/Conservation/
Problem

Research questions and friction points this paper is trying to address.

Assessing Vision Language Models' understanding of the law of conservation
Evaluating performance on transformational vs. non-transformational tasks
Investigating the dissociation between reversibility understanding and quantity understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

ConserveBench, a battery of 365 cognitive experiments for assessing conservation understanding
Transformational tasks test reversibility understanding
Non-transformational tasks evaluate quantity comprehension
Dezhi Luo, University of Michigan (cognitive science, philosophy, AI)
Haiyun Lyu, University of North Carolina at Chapel Hill
Qingying Gao, Johns Hopkins University
Haoran Sun, Johns Hopkins University
Yijiang Li, Argonne National Laboratory
Hokin Deng, Johns Hopkins University (cognition)