Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This survey addresses the fundamental question of whether large vision-language models possess genuine reasoning capabilities, offering a unified framework that organizes diverse visual puzzle benchmarks into a cognitive diagnostic tool. The framework systematically covers inductive, analogical, algorithmic, deductive, and geometric/spatial reasoning. By establishing explicit mappings between reasoning types and puzzle design principles and synthesizing empirical evidence across models, the work reveals limitations prevalent in current systems: brittle generalization, tight entanglement between perception and reasoning, and inconsistency between fluent explanations and faithful execution. The survey thereby provides a foundational evaluation paradigm and concrete directions for developing multimodal systems with robust and reliable reasoning abilities.

📝 Abstract
Puzzles have long served as compact and revealing probes of human cognition, isolating abstraction, rule discovery, and systematic reasoning with minimal reliance on prior knowledge. Leveraging these properties, visual puzzles have recently emerged as a powerful diagnostic tool for evaluating the reasoning abilities of Large Vision-Language Models (LVLMs), offering controlled, verifiable alternatives to open-ended multimodal benchmarks. This survey provides a unified perspective of visual puzzle reasoning in LVLMs. We frame visual puzzles through a common abstraction and organize existing benchmarks by the reasoning mechanisms they target (inductive, analogical, algorithmic, deductive, and geometric/spatial), thereby linking puzzle design to the cognitive operations required for solving. Synthesizing empirical evidence across these categories, we identify consistent limitations in current models, including brittle generalization, tight entanglement between perception and reasoning, and a persistent gap between fluent explanations and faithful execution. By framing visual puzzles as diagnostic instruments rather than task formats, this survey elaborates on the state of LVLM reasoning and outlines key directions for future benchmarks and reasoning-aware multimodal systems.
Problem

Research questions and friction points this paper is trying to address.

visual puzzles
Large Vision-Language Models
reasoning
pattern matching
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual puzzles
large vision-language models
reasoning mechanisms
diagnostic benchmarking
multimodal reasoning
Maria Lymperaiou
National Technical University of Athens
Deep Learning, Natural Language Processing, Explainability, Multimodal Learning
Vasileios Karampinis
National Technical University of Athens
Giorgos Filandrianos
Postdoctoral researcher
Explainable AI, NLP
Angelos Vlachos
National Technical University of Athens
Chrysoula Zerva
Instituto de Telecomunicações, Lisbon, Portugal
Athanasios Voulodimos
National Technical University of Athens