The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 1 · Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) lack a basic component of human-like intelligence: the dynamic association between current perception and prior experiential memory. This capability has so far been neither formally defined nor assessed. Method: We introduce the first systematic definition and evaluation of semantic association ability in MLLMs, proposing a label-free, semantics-driven benchmark construction paradigm. We design ACBench, the first MLLM-specific association capability benchmark, covering single-step, synchronous, and asynchronous association tasks, integrated with memory-strategy analysis, zero-shot evaluation across diverse architectures (open-source, closed-source, and MoE), and comparison with human experts. Contribution/Results: Experiments reveal that state-of-the-art MLLMs, including GPT-4V, fall significantly short of humans on association tasks, marking association as a fundamental cognitive bottleneck. Our work establishes a novel metric and scalable evaluation infrastructure for advancing cognitive-alignment research in MLLMs.

📝 Abstract
Multi-modal Large Language Models (MLLMs) have exhibited impressive capabilities. However, many deficiencies of MLLMs relative to human intelligence have recently been found, e.g., hallucination. To drive MLLM research forward, the community has dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked intelligence: association, a human's basic capability to link observations with memory of prior practice. To comprehensively investigate MLLM performance on association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient annotation-free construction method that transforms a general dataset into our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous association. Moreover, we conduct a comprehensive investigation into MLLMs' zero-shot association capabilities across multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability on our association tasks, and even the state-of-the-art GPT-4V(vision) has a significant gap compared to humans. We believe our benchmark will pave the way for future MLLM studies. Our data and code are available at: https://mvig-rhos.com/llm_inception.
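The annotation-free construction described in the abstract can be pictured as reusing labels a general dataset already carries: images sharing an adjective or verb concept become query/answer pairs, so no new human annotation is needed. Below is a minimal sketch of that idea; the record layout, field names, and pairing rule are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

# Illustrative records: (image_id, concept) pairs reusing labels that a
# general-purpose dataset already provides (adjective/verb tags).
records = [
    ("img_001", "wooden"), ("img_002", "wooden"),
    ("img_003", "running"), ("img_004", "running"),
    ("img_005", "shiny"),
]

def build_single_step_tasks(records, num_distractors=2, seed=0):
    """Group images by shared concept and emit association queries:
    given a query image, pick the candidate sharing its concept.
    Existing tags drive the pairing; no fresh annotation is required."""
    rng = random.Random(seed)
    by_concept = defaultdict(list)
    for image_id, concept in records:
        by_concept[concept].append(image_id)

    all_images = [image_id for image_id, _ in records]
    tasks = []
    for concept, images in by_concept.items():
        if len(images) < 2:
            continue  # need at least one query/answer pair
        query, answer = images[0], images[1]
        distractors = [i for i in all_images if i not in (query, answer)]
        rng.shuffle(distractors)
        tasks.append({
            "query": query,
            "answer": answer,
            "concept": concept,
            "candidates": [answer] + distractors[:num_distractors],
        })
    return tasks

for task in build_single_step_tasks(records):
    print(task)
```

A refinement pass like the one the paper mentions would then filter out ambiguous concepts (e.g., images carrying several overlapping tags) before the tasks are finalized.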
Problem

Research questions and friction points this paper is trying to address.

Addresses deficiencies in Multi-modal Large Language Models (MLLMs) compared to human intelligence.
Proposes a benchmark for evaluating MLLMs' association capabilities using adjective and verb semantic concepts.
Investigates zero-shot association capabilities of MLLMs, including various memory strategies and model types.
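The zero-shot evaluation varies how much of the earlier association chain the model gets to see. The paper compares three memory strategies but this page does not name them, so the strategies below are hypothetical stand-ins; the harness structure is a sketch, not the authors' code. Swapping `echo_model` for a real MLLM client would reproduce the general setup.

```python
from typing import Callable, List

# Hypothetical memory strategies (placeholders; the paper's three
# strategies are not specified on this page).
def no_memory(history: List[str]) -> str:
    return ""

def full_history(history: List[str]) -> str:
    return "\n".join(history)

def last_step_only(history: List[str]) -> str:
    return history[-1] if history else ""

def run_association_chain(model: Callable[[str], str],
                          steps: List[str],
                          memory: Callable[[List[str]], str]) -> List[str]:
    """Drive a multi-step association chain zero-shot: at each step the
    model sees the current query plus whatever the memory strategy
    retains from earlier steps."""
    history, answers = [], []
    for step in steps:
        prompt = f"{memory(history)}\nAssociate: {step}".strip()
        answer = model(prompt)
        answers.append(answer)
        history.append(f"Q: {step} -> A: {answer}")
    return answers

# Stub model that just echoes the query (Python 3.9+ for removeprefix).
echo_model = lambda p: p.splitlines()[-1].removeprefix("Associate: ")
print(run_association_chain(echo_model, ["wooden", "running"], last_step_only))
```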
Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotation-free dataset construction for association tasks
Three-level association tasks: single-step, synchronous, asynchronous (see the sketch after this list)
Comprehensive zero-shot association capability investigation
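To make the three task levels concrete, here is an illustrative encoding of them as data structures. The reading assumed here (single-step = one link; synchronous = a chain posed in one query; asynchronous = a chain extended turn by turn, stressing memory) is an interpretation for illustration, not the paper's formal definition.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AssociationLink:
    """One association step: two observations linked by a shared concept."""
    source: str   # e.g., an image id
    target: str
    concept: str  # shared adjective/verb concept, e.g., "wooden"

@dataclass
class AssociationTask:
    level: str                     # "single-step" | "synchronous" | "asynchronous"
    links: List[AssociationLink] = field(default_factory=list)

# Single-step: one link, presented in isolation.
single = AssociationTask("single-step",
                         [AssociationLink("img_001", "img_002", "wooden")])

# Synchronous (assumed reading): the whole chain appears in one query.
sync = AssociationTask("synchronous", [
    AssociationLink("img_001", "img_002", "wooden"),
    AssociationLink("img_002", "img_003", "old"),
])

# Asynchronous (assumed reading): the same chain grows across turns, so the
# model must carry earlier links in memory -- hence memory strategy matters.
asyn = AssociationTask("asynchronous", sync.links.copy())

for task in (single, sync, asyn):
    print(task.level, "->", [(l.source, l.target, l.concept) for l in task.links])
```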
👥 Authors

Hong Li
Shanghai Jiao Tong University

Nanxi Li
Institute of Microelectronics, A*STAR
Photonics, Applied Physics, Optoelectronics, Sensors

Yuanjie Chen
Shanghai Jiao Tong University

Jianbin Zhu
Shanghai Jiao Tong University

Qinlu Guo
Shanghai Jiao Tong University

Cewu Lu
Shanghai Jiao Tong University

Yong-Lu Li
Associate Professor, Shanghai Jiao Tong University/Shanghai Innovation Institute
Physical Reasoning, Robotics, Computer Vision, Machine Learning, Embodied AI