VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Even when multimodal large language models (MLLMs) produce correct answers, it remains unclear whether their predictions actually rely on task-critical visual evidence. To address this, this work introduces VisualFLIP, a paired benchmark comprising 1,374 images arranged as same-question pairs, where a minimal perturbation alters only the key visual evidence while the question stays unchanged, so the ground-truth answer deterministically flips. This setup tests directly whether a model's answer depends on the visual evidence. The study evaluates 24 MLLMs with two metrics: pair accuracy, which requires solving both sides of a pair, and Collapse Rate (CR), which measures how often a model that solves at least one side gives the same non-empty answer for both images. Results show that capable models can still fail to update their answers after task-critical visual changes, and that collapse becomes more severe for some models in a sequential setting where the edited image follows an earlier answer. Paired correctness and evidence dependence are therefore related but distinct.

📝 Abstract

When a multimodal large language model answers a visual reasoning question correctly, is the prediction actually supported by the task-critical visual evidence? Correct answers can coexist with flawed reasoning, making accuracy alone an incomplete test of grounding. We introduce VisualFLIP, a paired benchmark with 1,374 images arranged as same-question perturbation pairs across cardinality, attribute, spatial, and logic tasks. Each pair keeps the question fixed but minimally changes the evidence so the gold answer deterministically flips. We evaluate 24 MLLMs with pair accuracy, which requires solving both sides of a pair, and Collapse Rate (CR), which measures how often a model that solves at least one side repeats the same non-empty answer for both images. Together, these metrics show that paired correctness and evidence dependence are related but distinct: capable models can still fail to update after task-critical visual changes, and collapse becomes more severe for some models when the edited image follows an earlier answer in a sequential setting. Further details are available on our project page: https://didizhu-judy.github.io/VisualFLIP/
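The two metrics described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' evaluation code: the `PairResult` record, the exact-match comparison after lowercasing and trimming, and the choice of denominator for CR (pairs where at least one side is solved) are assumptions based only on the wording of the abstract.

```python
from typing import List, NamedTuple


class PairResult(NamedTuple):
    """Hypothetical record for one same-question image pair."""
    gold_a: str  # gold answer for the original image
    gold_b: str  # gold answer for the perturbed image (differs from gold_a)
    pred_a: str  # model answer for the original image
    pred_b: str  # model answer for the perturbed image


def _norm(answer: str) -> str:
    # Assumed normalization: trim whitespace and ignore case.
    return answer.strip().lower()


def _correct(pred: str, gold: str) -> bool:
    return _norm(pred) == _norm(gold)


def pair_accuracy(results: List[PairResult]) -> float:
    """Fraction of pairs where BOTH sides are answered correctly."""
    if not results:
        return 0.0
    solved = sum(
        1 for r in results
        if _correct(r.pred_a, r.gold_a) and _correct(r.pred_b, r.gold_b)
    )
    return solved / len(results)


def collapse_rate(results: List[PairResult]) -> float:
    """Among pairs with at least one side solved, the fraction where the
    model gives the same non-empty answer for both images."""
    eligible = [
        r for r in results
        if _correct(r.pred_a, r.gold_a) or _correct(r.pred_b, r.gold_b)
    ]
    if not eligible:
        return 0.0
    collapsed = sum(
        1 for r in eligible
        if _norm(r.pred_a) != "" and _norm(r.pred_a) == _norm(r.pred_b)
    )
    return collapsed / len(eligible)


# Toy data, invented for illustration only.
example = [
    PairResult("3", "4", "3", "4"),              # both sides solved
    PairResult("red", "blue", "red", "red"),     # one side solved, answer repeated
    PairResult("left", "right", "up", "down"),   # neither side solved
    PairResult("yes", "no", "yes", ""),          # one side solved, other empty
]
```

On this toy data, `pair_accuracy(example)` is 0.25 (one of four pairs fully solved) and `collapse_rate(example)` is 1/3 (three pairs have at least one side solved, and one of them repeats the same non-empty answer). Because the gold answer flips within a pair, a repeated answer can be correct on at most one side, which is why a model can score well on single-image accuracy while still collapsing.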

Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning

visual evidence

task-critical

grounding

model reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

VisualFLIP

multimodal reasoning

evidence dependence