🤖 AI Summary
This work addresses a gap in existing evaluations, which cannot separate the ability of large vision-language models (LVLMs) to read visualizations from their reliance on factual priors acquired during training. The authors propose the Counterfactual Visualization Literacy Assessment Test (CVLAT), which constructs items where the visual evidence conflicts with established facts, so that a model's dependence on visual input can be separated from its dependence on memorized knowledge and quantified. They introduce the visual-factual reliance index (VFRI) as a normalized metric and complement it with a human baseline (N=30) and prompt-based interventions. Experiments with 15 LVLMs show that most models are visualization-oriented while a minority are factual knowledge-oriented, with several near-zero cases that warrant caution; that high accuracy does not necessarily indicate faithful visual reasoning; and that prompt interventions have model-dependent, direction-asymmetric effects. Together, these findings point to validity limitations in current visualization literacy assessments.
📝 Abstract
Large Vision-Language Models (LVLMs) show strong visualization interpretation abilities, yet it is unclear whether their responses reflect genuine reasoning over visual evidence or recall of factual priors learned during training. Current evaluations mix these two sources, obscuring when correct visual interpretation is overridden by memorized facts. We present a framework that isolates visual correctness from factual correctness, revealing validity limitations in existing visualization literacy assessments. Across three experiments with 15 state-of-the-art LVLMs: (1) Several models reach human-level performance on standard tests (VLAT), but this may reflect factual recall rather than visual understanding, while randomized-data tests (reVLAT) underestimate literacy when correct visual interpretation is superseded by factual priors. (2) Using our Counterfactual Visualization Literacy Assessment Test (CVLAT) with capability-normalized arbitration metrics, we classify models by the sign of their visual-factual reliance index (VFRI), revealing a visualization-oriented majority and a factual knowledge-oriented minority, though several near-zero cases warrant caution. A human baseline (N=30) on the same counterfactual items confirms that people overwhelmingly follow the chart under conflict, providing a human reference point. (3) Prompt-based intervention can shift prioritization, but its effectiveness is highly model-dependent and direction-asymmetric, and high chart-reading capability does not predict prompt-controllability. Overall, high visualization accuracy is not sufficient evidence of faithful visual reasoning: reliable integration into visual analytics requires evaluating not only visualization literacy but also how models arbitrate between visual evidence and factual priors when the two diverge. Benchmark and code: https://github.com/JaeyoungKim-HCIL/CVLAT
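The sign-based classification described in (2) can be illustrated with a minimal sketch. The abstract does not give the VFRI formula, so the function below takes an already-computed VFRI value as input. Several details are assumptions made for illustration rather than the authors' definitions: the function name `classify_by_vfri`, the convention that a positive VFRI means the model follows the chart, and the `tolerance` band used to flag near-zero cases.

```python
from typing import Literal

Orientation = Literal[
    "visualization-oriented",
    "factual-knowledge-oriented",
    "inconclusive",
]


def classify_by_vfri(vfri: float, tolerance: float = 0.05) -> Orientation:
    """Label a model by the sign of its VFRI.

    Assumptions (not taken from the paper):
    - positive VFRI means the model sides with the visual evidence,
      negative VFRI means it sides with factual priors;
    - `tolerance` is a hypothetical half-width for the near-zero band
      in which no orientation is asserted.
    """
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    # Near-zero values are treated as inconclusive rather than forced
    # into either group.
    if abs(vfri) <= tolerance:
        return "inconclusive"
    if vfri > 0:
        return "visualization-oriented"
    return "factual-knowledge-oriented"
```

Keeping an explicit inconclusive band reflects the abstract's caution about near-zero cases; the appropriate width would depend on the variance of the actual measurements.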