See or Recall: A Sanity Check for the Role of Vision in Solving Visualization Question Answer Tasks with Multimodal LLMs

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current VisQA evaluation methodologies fail to disentangle whether models rely on genuine visual understanding or linguistic priors (“recalling” answers without visual input), leading to inaccurate assessments of multimodal reasoning capabilities. To address this, we propose the first vision–recall disentanglement framework for VisQA, integrating rule-driven decision trees, vision–text alignment visualization, ablation-based prompt control, statistical significance testing, and an interpretable verification checklist. Experiments reveal that mainstream multimodal large language models (MLLMs) correctly answer over 60% of VisQA questions even when images are withheld; prominent benchmarks—including VizWiz and PlotQA—exhibit substantial recall bias. This work not only exposes a fundamental flaw in existing VisQA evaluation paradigms but also establishes a reproducible, interpretable, and principled disentanglement methodology. By rigorously isolating visual reasoning from linguistic memorization, our framework lays the groundwork for developing truly vision-grounded MLLM evaluation benchmarks.

📝 Abstract
Recent developments in multimodal large language models (MLLMs) have equipped language models to reason about vision and language jointly. This permits MLLMs to both perceive and answer questions about data visualizations across a variety of designs and tasks. Applying MLLMs to a broad range of visualization tasks requires us to properly evaluate their capabilities, and the most common way to conduct evaluation is by measuring a model's visualization reasoning capability, analogous to how we would evaluate human understanding of visualizations (e.g., visualization literacy). However, we found that in the context of visualization question answering (VisQA), how an MLLM perceives and reasons about visualizations can be fundamentally different from how humans approach the same problem. During evaluation, even without the visualization, the model could correctly answer a substantial portion of the visualization test questions, regardless of whether any selection options were provided. We hypothesize that the vast amount of knowledge encoded in the language model permits factual recall that supersedes the need to seek information from the visual signal. This raises concerns that current VisQA evaluation may not fully capture models' visualization reasoning capabilities. To address this, we propose a comprehensive sanity check framework that integrates a rule-based decision tree and a sanity check table to disentangle the effects of "seeing" (visual processing) and "recall" (reliance on prior knowledge). This validates VisQA datasets for evaluation, highlighting where models are truly "seeing", positively or negatively affected by factual recall, or relying on inductive biases for question answering. Our study underscores the need for careful consideration in designing future visualization understanding studies when utilizing MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' true visualization reasoning vs. factual recall
Disentangling visual processing and prior knowledge in VisQA tasks
Ensuring VisQA datasets accurately assess models' visual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint vision–language reasoning with multimodal LLMs
Rule-based decision tree to verify genuine visual processing
Sanity check table disentangling "seeing" from "recall"
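The core idea behind the disentanglement can be illustrated with a small sketch: answer each VisQA question twice, once with and once without the visualization, then classify the question with a simple decision tree. This is a minimal illustration under assumed semantics, not the paper's actual implementation; the function names, category labels, and result format are assumptions for illustration.

```python
# Hypothetical sketch of a "see vs. recall" sanity check for VisQA:
# compare a model's answers with and without the visualization, then
# apply a rule-based decision tree to label each question.

def classify_question(correct_with_image: bool, correct_without_image: bool) -> str:
    """Classify one question by how the model behaves with/without the image."""
    if correct_with_image and correct_without_image:
        return "recall"           # prior knowledge suffices; the image is not needed
    if correct_with_image and not correct_without_image:
        return "seeing"           # the answer genuinely requires the visual signal
    if correct_without_image and not correct_with_image:
        return "negative recall"  # the image degrades an otherwise correct answer
    return "neither"              # the model fails either way

def recall_bias(results) -> float:
    """Fraction of questions answerable blind -- a dataset-level recall bias."""
    answered_blind = sum(1 for r in results if r["correct_without_image"])
    return answered_blind / len(results)

# Toy per-question results (in practice, gathered from paired model runs)
results = [
    {"correct_with_image": True,  "correct_without_image": True},
    {"correct_with_image": True,  "correct_without_image": False},
    {"correct_with_image": False, "correct_without_image": True},
]
labels = [
    classify_question(r["correct_with_image"], r["correct_without_image"])
    for r in results
]
```

In this framing, a benchmark where a large share of questions lands in the "recall" bucket (the paper reports over 60% for mainstream MLLMs) is measuring linguistic priors rather than visual reasoning; the paper additionally applies statistical significance testing to these dataset-level rates, which this sketch omits.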