Reasoning Limitations of Multimodal Large Language Models. A case study of Bongard Problems

📅 2024-11-02

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

This study investigates the limitations of multimodal large language models (MLLMs) in abstract visual reasoning (AVR), focusing on Bongard Problems (BPs), a classic challenge that combines visual reasoning with verbal description. Method: We propose a set of MLLM-suited strategies for solving BPs and evaluate eight proprietary and open models, including GPT-4o, Gemini 1.5 Pro, and InternVL2-8B, on the original synthetic BPs and on two real-world-image datasets, Bongard-HOI and Bongard-OpenWorld. We also introduce Bongard-RWR, a new dataset that re-expresses the concepts of the hand-crafted synthetic BPs in real-world images, so that the effect of image domain can be separated from general reasoning ability. Contribution/Results: The models struggle on the classical synthetic BPs despite their visual simplicity. Performance improves on the real-world datasets, but the models still have difficulty using new information to improve their predictions and using the dialog context window effectively. Results on Bongard-RWR suggest that the poor performance on classical BPs reflects general AVR limitations rather than domain specificity.

📝 Abstract

Abstract visual reasoning (AVR) encompasses a suite of tasks whose solution requires discovering the common concepts underlying a set of pictures through an analogy-making process, similarly to human IQ tests. Bongard Problems (BPs), proposed in 1968, constitute a fundamental challenge in this domain, mainly because they require combining visual reasoning with verbal description. This work asks whether multimodal large language models (MLLMs), which are inherently designed to combine vision and language, are capable of tackling BPs. To this end, we propose a set of diverse MLLM-suited strategies to tackle BPs and examine four popular proprietary MLLMs: GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro, and Claude 3.5 Sonnet, and four open models: InternVL2-8B, LLaVa-1.6 Mistral-7B, Phi-3.5-Vision, and Pixtral 12B. These MLLMs are compared on three BP datasets: a set of original BP instances relying on synthetic, geometry-based images and two recent datasets based on real-world images, i.e., Bongard-HOI and Bongard-OpenWorld. The experiments reveal significant limitations of MLLMs in solving BPs. In particular, the models struggle to solve the classical set of synthetic BPs, despite their visual simplicity. Though their performance improves on the real-world concepts expressed in Bongard-HOI and Bongard-OpenWorld, the models still have difficulty using new information to improve their predictions, as well as using a dialog context window effectively. To capture the reasons for the performance discrepancy between the synthetic and real-world AVR domains, we propose Bongard-RWR, a new BP dataset consisting of real-world images that translates concepts from hand-crafted synthetic BPs to real-world concepts. The MLLMs' results on Bongard-RWR suggest that their poor performance on classical BPs is not due to domain specificity but rather reflects their general AVR limitations.

Problem

Research questions and friction points this paper is trying to address.

Investigating MLLMs' ability to solve abstract visual reasoning problems

Exploring performance gaps in synthetic vs real-world Bongard Problems

Identifying general limitations in MLLMs' abstract reasoning capabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulated diverse MLLM-suited solution strategies

Introduced Bongard-RWR, a dataset of real-world images that expresses concepts from the synthetic BPs

Tested proprietary and open-access MLLMs on BP datasets
