🤖 AI Summary
This study evaluates the reasoning capabilities of vision-language models (VLMs) on complex natural images using the Bongard-OpenWorld benchmark. To this end, we systematically introduce three human-cognition-inspired paradigms (holistic analysis, deductive rule learning, and componential analysis) and formalize them, for the first time, into a computable multi-paradigm evaluation framework. Our method integrates prompt engineering, rule-abstraction modeling, and structural image parsing. Experiments demonstrate that componential analysis substantially enhances structured reasoning, yielding a 12.7% average accuracy gain. Notably, GPT-4o and Gemini outperform human baselines on multiple metrics. However, critical bottlenecks persist: accuracy drops by 23.4% on synthetic images, and models struggle with fine-grained discrimination and context-sensitive reasoning. This work establishes a novel, empirically grounded paradigm for the fine-grained assessment of VLMs' visual reasoning capabilities.
📝 Abstract
Evaluating the reasoning capabilities of Vision-Language Models (VLMs) on complex visual tasks provides valuable insight into their potential and limitations. In this work, we assess the performance of VLMs on the challenging Bongard-OpenWorld benchmark, which requires reasoning over natural images. We propose and evaluate three human-inspired paradigms: holistic analysis (global context processing), deductive rule learning (explicit rule derivation and application), and componential analysis (structured decomposition of images into components). Our results demonstrate that state-of-the-art models, including GPT-4o and Gemini, not only surpass human baselines but also excel at structured reasoning tasks, with componential analysis proving especially effective. However, ablation studies reveal key challenges: handling synthetic images, making fine-grained distinctions, and interpreting nuanced contextual information. These insights underscore the need for further advances in model robustness and generalization, while highlighting the transformative potential of structured reasoning approaches in enhancing VLM capabilities.
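To make the componential-analysis paradigm concrete, the sketch below shows one plausible way such a prompt could be assembled for a Bongard-style episode: each image is first decomposed into a list of visual components, and the model is then asked to find the components that separate positives from negatives before classifying the query image. This is an illustrative assumption, not the paper's actual implementation; the function name `build_componential_prompt` and the prompt wording are hypothetical.

```python
# Hypothetical sketch of componential-analysis prompting for a
# Bongard-OpenWorld-style episode. The decomposition of each image
# into components is assumed to come from an upstream parsing step
# (e.g., a captioner or object detector); here it is given as lists.

def build_componential_prompt(positive_descs, negative_descs, query_desc):
    """Compose a componential-analysis prompt from per-image component lists."""
    lines = ["Each image below is decomposed into its visual components."]
    # Enumerate the positive set, whose images all share the hidden concept.
    for i, comps in enumerate(positive_descs, 1):
        lines.append(f"Positive image {i}: " + ", ".join(comps))
    # Enumerate the negative set, whose images lack the concept.
    for i, comps in enumerate(negative_descs, 1):
        lines.append(f"Negative image {i}: " + ", ".join(comps))
    lines.append("Query image: " + ", ".join(query_desc))
    # Structured reasoning steps: isolate discriminative components,
    # abstract the concept, then apply it to the query image.
    lines.append(
        "Step 1: Identify components shared by all positive images but "
        "absent from the negatives. Step 2: State the underlying concept. "
        "Step 3: Decide whether the query image fits the concept."
    )
    return "\n".join(lines)

prompt = build_componential_prompt(
    positive_descs=[["dog", "frisbee", "grass"], ["dog", "frisbee", "beach"]],
    negative_descs=[["dog", "grass"], ["cat", "beach"]],
    query_desc=["dog", "frisbee", "park"],
)
print(prompt)
```

In this toy episode the discriminative component is "frisbee", so a model following the three steps should accept the query image. The structured decomposition is what distinguishes this paradigm from holistic analysis, where the model would see only whole-image descriptions with no explicit component comparison step.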