The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 1 · Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) lack a basic component of human-like intelligence: the dynamic association between current perception and prior experiential memory. This capability has so far been neither formally defined nor assessed. Method: We introduce the first systematic definition and evaluation of semantic association ability in MLLMs, proposing a label-free, semantics-driven benchmark construction paradigm. We design ACBench, the first MLLM-specific association capability benchmark, covering single-step, synchronous, and asynchronous association tasks, integrated with memory-strategy analysis, zero-shot evaluation across diverse architectures (open-source, closed-source, and MoE), and comparison with human experts. Contribution/Results: Experiments reveal that state-of-the-art MLLMs, including GPT-4V, fall significantly short of humans on association tasks, marking association as a fundamental cognitive bottleneck. Our work establishes a novel metric and scalable evaluation infrastructure for advancing cognitive-alignment research in MLLMs.

📝 Abstract
Multi-modal Large Language Models (MLLMs) have exhibited impressive capabilities. However, many deficiencies of MLLMs relative to human intelligence have recently been found, e.g., hallucination. To drive MLLM research forward, the community has dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked intelligence: association, a human's basic capability to link observations with memory of prior practice. To comprehensively investigate MLLM performance on association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient annotation-free construction method that transforms a general dataset into our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous association. Moreover, we conduct a comprehensive investigation into MLLMs' zero-shot association capabilities across multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability on our association tasks, and even the state-of-the-art GPT-4V(vision) has a significant gap compared to humans. We believe our benchmark will pave the way for future MLLM studies. Our data and code are available at: https://mvig-rhos.com/llm_inception.
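The annotation-free construction described in the abstract can be pictured as reusing labels a general dataset already carries: images sharing an adjective or verb concept become query/answer pairs, so no new human annotation is needed. Below is a minimal sketch of that idea; the record layout, field names, and pairing rule are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

# Illustrative records: (image_id, concept) pairs reusing labels that a
# general-purpose dataset already provides (adjective/verb tags).
records = [
    ("img_001", "wooden"), ("img_002", "wooden"),
    ("img_003", "running"), ("img_004", "running"),
    ("img_005", "shiny"),
]

def build_single_step_tasks(records, num_distractors=2, seed=0):
    """Group images by shared concept and emit association queries:
    given a query image, pick the candidate sharing its concept.
    Existing tags drive the pairing; no fresh annotation is required."""
    rng = random.Random(seed)
    by_concept = defaultdict(list)
    for image_id, concept in records:
        by_concept[concept].append(image_id)

    all_images = [image_id for image_id, _ in records]
    tasks = []
    for concept, images in by_concept.items():
        if len(images) < 2:
            continue  # need at least one query/answer pair
        query, answer = images[0], images[1]
        distractors = [i for i in all_images if i not in (query, answer)]
        rng.shuffle(distractors)
        tasks.append({
            "query": query,
            "answer": answer,
            "concept": concept,
            "candidates": [answer] + distractors[:num_distractors],
        })
    return tasks

for task in build_single_step_tasks(records):
    print(task)
```

A refinement pass like the one the paper mentions would then filter out ambiguous concepts (e.g., images carrying several overlapping tags) before the tasks are finalized.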
Problem

Research questions and friction points this paper is trying to address.

Addresses deficiencies in Multi-modal Large Language Models (MLLMs) compared to human intelligence.
Proposes a benchmark for evaluating MLLMs' association capabilities using adjective and verb semantic concepts.
Investigates zero-shot association capabilities of MLLMs, including various memory strategies and model types.
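The zero-shot evaluation varies how much of the earlier association chain the model gets to see. The paper compares three memory strategies but this page does not name them, so the strategies below are hypothetical stand-ins; the harness structure is a sketch, not the authors' code. Swapping `echo_model` for a real MLLM client would reproduce the general setup.

```python
from typing import Callable, List

# Hypothetical memory strategies (placeholders; the paper's three
# strategies are not specified on this page).
def no_memory(history: List[str]) -> str:
    return ""

def full_history(history: List[str]) -> str:
    return "\n".join(history)

def last_step_only(history: List[str]) -> str:
    return history[-1] if history else ""

def run_association_chain(model: Callable[[str], str],
                          steps: List[str],
                          memory: Callable[[List[str]], str]) -> List[str]:
    """Drive a multi-step association chain zero-shot: at each step the
    model sees the current query plus whatever the memory strategy
    retains from earlier steps."""
    history, answers = [], []
    for step in steps:
        prompt = f"{memory(history)}\nAssociate: {step}".strip()
        answer = model(prompt)
        answers.append(answer)
        history.append(f"Q: {step} -> A: {answer}")
    return answers

# Stub model that just echoes the query (Python 3.9+ for removeprefix).
echo_model = lambda p: p.splitlines()[-1].removeprefix("Associate: ")
print(run_association_chain(echo_model, ["wooden", "running"], last_step_only))
```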
Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotation-free dataset construction for association tasks
Three-level association tasks: single-step, synchronous, asynchronous (see the sketch after this list)
Comprehensive zero-shot association capability investigation
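To make the three task levels concrete, here is an illustrative encoding of them as data structures. The reading assumed here (single-step = one link; synchronous = a chain posed in one query; asynchronous = a chain extended turn by turn, stressing memory) is an interpretation for illustration, not the paper's formal definition.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AssociationLink:
    """One association step: two observations linked by a shared concept."""
    source: str   # e.g., an image id
    target: str
    concept: str  # shared adjective/verb concept, e.g., "wooden"

@dataclass
class AssociationTask:
    level: str                     # "single-step" | "synchronous" | "asynchronous"
    links: List[AssociationLink] = field(default_factory=list)

# Single-step: one link, presented in isolation.
single = AssociationTask("single-step",
                         [AssociationLink("img_001", "img_002", "wooden")])

# Synchronous (assumed reading): the whole chain appears in one query.
sync = AssociationTask("synchronous", [
    AssociationLink("img_001", "img_002", "wooden"),
    AssociationLink("img_002", "img_003", "old"),
])

# Asynchronous (assumed reading): the same chain grows across turns, so the
# model must carry earlier links in memory -- hence memory strategy matters.
asyn = AssociationTask("asynchronous", sync.links.copy())

for task in (single, sync, asyn):
    print(task.level, "->", [(l.source, l.target, l.concept) for l in task.links])
```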
👥 Authors

Hong Li
Shanghai Jiao Tong University

Nanxi Li
Institute of Microelectronics, A*STAR
Photonics, Applied Physics, Optoelectronics, Sensors

Yuanjie Chen
Shanghai Jiao Tong University

Jianbin Zhu
Shanghai Jiao Tong University

Qinlu Guo
Shanghai Jiao Tong University

Cewu Lu
Shanghai Jiao Tong University

Yong-Lu Li
Associate Professor, Shanghai Jiao Tong University/Shanghai Innovation Institute
Physical Reasoning, Robotics, Computer Vision, Machine Learning, Embodied AI