Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities

📅 2024-08-15
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the robustness of multimodal large language models (MLLMs) in cross-modal causal reasoning when visual details implicitly encode causal cues, a capability inadequately assessed by existing benchmarks. To address this gap, we introduce MuCR, the first dedicated benchmark for multimodal causal reasoning, constructed from synthetic siamese image–text pairs and spanning three granularities: image-level matching, phrase-level understanding, and sentence-level explanation. We further propose Visual-enhanced Chain-of-Thought (VcCoT), a prompting method that explicitly guides MLLMs to attend to critical visual causal cues. Experiments reveal that current MLLMs underperform significantly in multimodal causal reasoning compared to text-only settings; accurate visual causal cue identification constitutes the primary bottleneck for cross-modal generalization; and VcCoT boosts average accuracy across mainstream MLLMs by 12.7%. The paper thus contributes a new benchmark, evaluation paradigm, and reasoning mechanism for multimodal causal inference.
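The summary describes VcCoT only at a high level: a prompt that surfaces visual causal cues before the causal question is posed. Below is a minimal sketch of how such a two-stage, cue-first query might be issued against a generic MLLM client. The `query_mllm` callable, the prompt wording, and the cause/candidate image layout are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List

def vccot_causal_query(
    query_mllm: Callable[[List[str], str], str],  # hypothetical MLLM client: (image paths, prompt) -> text
    cause_image: str,
    candidate_images: List[str],
) -> str:
    """Two-stage, visual-cue-first prompting sketch (not the paper's exact wording)."""
    # Stage 1: ask the model to enumerate salient visual cues in every image
    # before any causal judgement is requested.
    cue_prompt = (
        "List the salient objects, actions, and state changes visible in each image. "
        "Do not draw any causal conclusion yet."
    )
    cues = query_mllm([cause_image] + candidate_images, cue_prompt)

    # Stage 2: condition the causal question on the extracted cues, so the model
    # reasons over explicitly stated visual evidence rather than raw pixels alone.
    reasoning_prompt = (
        f"Visual cues:\n{cues}\n\n"
        "Image 0 shows a potential cause. Which of the remaining images depicts its most "
        "plausible effect? Answer with the image index and a one-sentence explanation."
    )
    return query_mllm([cause_image] + candidate_images, reasoning_prompt)
```

The point of the ordering is that the model commits to the visual evidence first, so the causal judgement in the second turn is grounded in cues it has already made explicit.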

๐Ÿ“ Abstract
Multimodal Large Language Models (MLLMs) have showcased exceptional Chain-of-Thought (CoT) reasoning ability in complex textual inference tasks, including causal reasoning. However, will these causal links remain easy to identify when crucial hints hide in visual details? If not, what factors might influence cross-modal generalization? And can we effectively enhance their capacity for robust causal inference across both text and vision? Motivated by these questions, we introduce MuCR, a novel Multimodal Causal Reasoning benchmark that leverages synthetic siamese images and text pairs to challenge MLLMs. Additionally, we develop tailored metrics from multiple perspectives, including image-level matching, phrase-level understanding, and sentence-level explanation, to comprehensively assess MLLMs' comprehension abilities. Our experiments reveal that current MLLMs fall short in multimodal causal reasoning compared to their performance in purely textual settings. We also find that identifying visual cues across images is key to effective cross-modal generalization. Finally, we propose a VcCoT strategy that better highlights visual cues, and our results confirm its efficacy in enhancing multimodal causal reasoning. The project is available at: https://github.com/Zhiyuan-Li-John/MuCR
Problem

Research questions and friction points this paper is trying to address.

Assessing MLLMs' ability to discern causal links across text and vision
Identifying factors influencing cross-modal generalization in causal reasoning
Enhancing MLLMs' robustness in multimodal causal inference tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic siamese images and text pairs
Tailored image-, phrase-, and sentence-level metrics for comprehensive assessment (see the scoring sketch after this list)
VcCoT strategy highlighting visual cues
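To make the three evaluation granularities concrete, here is a toy scoring sketch. The data classes and the exact-match / keyword-overlap proxies are illustrative assumptions; the benchmark's real metrics are defined in the paper and are likely more forgiving (e.g. similarity-based) than this sketch.

```python
import re
from dataclasses import dataclass

@dataclass
class Prediction:
    chosen_image: int   # index of the image the model selected as the effect
    cue_phrase: str     # short phrase naming the causal cue
    explanation: str    # free-form causal explanation

@dataclass
class Reference:
    gold_image: int
    gold_phrases: set[str]
    gold_keywords: set[str]  # keywords an acceptable explanation should mention

def score(pred: Prediction, ref: Reference) -> dict[str, float]:
    """Toy three-level scoring: image-level match, phrase-level understanding,
    sentence-level explanation (keyword overlap as a crude proxy)."""
    image_level = float(pred.chosen_image == ref.gold_image)
    phrase_level = float(pred.cue_phrase.lower() in {p.lower() for p in ref.gold_phrases})
    words = set(re.findall(r"[a-z]+", pred.explanation.lower()))
    keywords = {k.lower() for k in ref.gold_keywords}
    sentence_level = len(words & keywords) / max(len(keywords), 1)
    return {"image": image_level, "phrase": phrase_level, "sentence": sentence_level}

# Example: the model picked image 2 and named "broken window" as the causal cue.
pred = Prediction(chosen_image=2, cue_phrase="broken window",
                  explanation="The ball shattered the window.")
ref = Reference(gold_image=2, gold_phrases={"broken window"},
                gold_keywords={"ball", "window", "shattered"})
print(score(pred, ref))  # {'image': 1.0, 'phrase': 1.0, 'sentence': 1.0}
```

Averaging such per-item scores over the benchmark would yield the kind of image-, phrase-, and sentence-level numbers the abstract refers to.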
🔎 Similar Papers
No similar papers found.
Zhiyuan Li
School of Computer Science, The University of Sydney
Heng Wang
School of Computer Science, The University of Sydney
Dongnan Liu
The University of Sydney
computer vision, large language model, medical image analysis
Chaoyi Zhang
School of Computer Science, The University of Sydney
Ao Ma
JD.com
Generative AI, Video Generation
Jieting Long
School of Computer Science, The University of Sydney
Weidong Cai
Clinical Associate Professor, Stanford University School of Medicine
functional neuroimaging, machine learning, cognitive, developmental, clinical neuroscience