Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This survey addresses the fundamental question of whether large vision-language models possess genuine reasoning capabilities, offering a unified framework that organizes diverse visual puzzle benchmarks into a cognitive diagnostic tool. The framework systematically covers inductive, analogical, algorithmic, deductive, and geometric/spatial reasoning. By establishing explicit mappings between reasoning types and puzzle design principles and synthesizing empirical evidence across models, the work reveals limitations prevalent in current systems: brittle generalization, tight entanglement between perception and reasoning, and inconsistency between fluent explanations and faithful execution. The survey thereby provides a foundational evaluation paradigm and concrete directions for developing multimodal systems with robust and reliable reasoning abilities.

📝 Abstract
Puzzles have long served as compact and revealing probes of human cognition, isolating abstraction, rule discovery, and systematic reasoning with minimal reliance on prior knowledge. Leveraging these properties, visual puzzles have recently emerged as a powerful diagnostic tool for evaluating the reasoning abilities of Large Vision-Language Models (LVLMs), offering controlled, verifiable alternatives to open-ended multimodal benchmarks. This survey provides a unified perspective of visual puzzle reasoning in LVLMs. We frame visual puzzles through a common abstraction and organize existing benchmarks by the reasoning mechanisms they target (inductive, analogical, algorithmic, deductive, and geometric/spatial), thereby linking puzzle design to the cognitive operations required for solving. Synthesizing empirical evidence across these categories, we identify consistent limitations in current models, including brittle generalization, tight entanglement between perception and reasoning, and a persistent gap between fluent explanations and faithful execution. By framing visual puzzles as diagnostic instruments rather than task formats, this survey elaborates on the state of LVLM reasoning and outlines key directions for future benchmarks and reasoning-aware multimodal systems.
Problem

Research questions and friction points this paper is trying to address.

visual puzzles
Large Vision-Language Models
reasoning
pattern matching
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual puzzles
large vision-language models
reasoning mechanisms
diagnostic benchmarking
multimodal reasoning
Maria Lymperaiou
National Technical University of Athens
Deep Learning, Natural Language Processing, Explainability, Multimodal Learning
Vasileios Karampinis
National Technical University of Athens
Giorgos Filandrianos
Postdoctoral researcher
Explainable AI, NLP
Angelos Vlachos
National Technical University of Athens
Chrysoula Zerva
Instituto de Telecomunicações, Lisbon, Portugal
Athanasios Voulodimos
National Technical University of Athens