Seeing What's Not There: Spurious Correlation in Multimodal LLMs

📅 2025-03-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a critical spurious-vision bias in multimodal large language models (MLLMs): spurious visual cues degrade object recognition accuracy and amplify hallucination rates by more than 10x. To diagnose this, we propose SpurLens, an unsupervised framework that combines GPT-4-assisted prompting with open-set object detection to automatically discover and quantify spurious correlations across diverse MLLMs and benchmarks. We show empirically how such visual spuriousness exacerbates hallucinations and establish a reproducible robustness evaluation protocol. Beyond diagnosis, we explore prompt ensembling and reasoning-based prompting as mitigation strategies that substantially reduce hallucination incidence. Extensive experiments on leading MLLMs, including LLaVA and Qwen-VL, across multiple standard benchmarks confirm both the prevalence of the problem and the effectiveness of the mitigations, advancing trustworthy evaluation and robustness research for MLLMs.

📝 Abstract
Unimodal vision models are known to rely on spurious correlations, but it remains unclear to what extent Multimodal Large Language Models (MLLMs) exhibit similar biases despite language supervision. In this paper, we investigate spurious bias in MLLMs and introduce SpurLens, a pipeline that leverages GPT-4 and open-set object detectors to automatically identify spurious visual cues without human supervision. Our findings reveal that spurious correlations cause two major failure modes in MLLMs: (1) over-reliance on spurious cues for object recognition, where removing these cues reduces accuracy, and (2) object hallucination, where spurious cues amplify hallucination rates by over 10x. We validate our findings across various MLLMs and datasets. Beyond diagnosing these failures, we explore potential mitigation strategies, such as prompt ensembling and reasoning-based prompting, and conduct ablation studies to examine the root causes of spurious bias in MLLMs. By exposing the persistence of spurious correlations, our study calls for more rigorous evaluation methods and mitigation strategies to enhance the reliability of MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Investigates spurious bias in Multimodal Large Language Models (MLLMs).
Identifies spurious visual cues using GPT-4 and object detectors.
Explores mitigation strategies for spurious correlations in MLLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SpurLens pipeline for spurious cue identification
GPT-4 and open-set object detectors integration
Prompt ensembling and reasoning-based mitigation strategies
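The prompt-ensembling mitigation can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_mllm` is a hypothetical stand-in for a real model call (e.g., to LLaVA or Qwen-VL), and the prompt wordings are assumptions. The idea is to ask the same presence question in several phrasings and take a majority vote, so no single spurious-cue-sensitive phrasing dominates.

```python
from collections import Counter

def query_mllm(image, prompt):
    """Hypothetical stand-in for a real MLLM call.

    A real implementation would send (image, prompt) to a model such as
    LLaVA or Qwen-VL and parse a yes/no answer from its response.
    """
    return "yes"  # stub answer for illustration

def ensemble_is_present(image, obj, n_votes=3):
    """Ask whether `obj` is in `image` using several paraphrased prompts
    and return the majority-vote answer."""
    prompts = [
        f"Is there a {obj} in this image? Answer yes or no.",
        f"Does this image contain a {obj}? Answer yes or no.",
        f"Look carefully: can you see a {obj}? Answer yes or no.",
    ][:n_votes]
    votes = [query_mllm(image, p).strip().lower() for p in prompts]
    # Majority vote over normalized yes/no answers
    return Counter(votes).most_common(1)[0][0] == "yes"
```

Reasoning-based prompting would instead prepend an instruction asking the model to describe the visual evidence before answering; both strategies reduce reliance on any one spurious cue.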