🤖 AI Summary
This study investigates why multimodal large language models (MLLMs) verify scientific claims substantially less accurately when the evidence is a chart than when it is a table of the same underlying data. Using layer-wise linear probing and attention analysis on three open-weight vision-language models, the authors compare how the models process identical data presented as tables versus charts. They find that chart information is encoded in intermediate representations but does not reach the prediction position, a gap that is absent for tables. The table-chart performance gap therefore stems from this routing failure rather than from insufficient encoding, and attention analysis shows that the failure takes two architecturally distinct forms across model families. These findings identify the routing of encoded visual information at prediction time as a key bottleneck in current MLLMs.
📝 Abstract
Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of the same underlying data. This raises a question: do models fail to extract information from charts, or do they extract it but fail to use it when forming their prediction? We study this question through layer-wise linear probing and attention analysis on three open-weight VLMs, using table and chart evidence that represent the same underlying data. We find consistent evidence for the latter. Chart information is encoded in the models' intermediate representations but does not reach the prediction position, a gap that is absent for tables and holds across all conditions tested. Attention analysis further reveals that this disconnect takes two architecturally distinct forms across model families. These findings reframe the table-chart gap as a failure of how encoded visual information is routed at prediction time, rather than a failure of encoding itself.
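To make the probing idea concrete, here is a minimal sketch of layer-wise linear probing. It is not the authors' code. It assumes a closed-form ridge probe, synthetic hidden states, and hypothetical names (`layerwise_probe`, `synthetic_hidden`); the numbers it produces are toy values, not results from the paper. The toy setup plants a label signal at a stand-in "evidence" position in the later layers and none at a stand-in "prediction" position, mimicking the encoded-but-not-routed pattern the abstract describes.

```python
import numpy as np

def fit_ridge_probe(X, y, l2=1.0):
    """Closed-form ridge probe; binary labels in {0, 1} are mapped to {-1, +1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    targets = 2.0 * y - 1.0
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ targets)

def probe_accuracy(w, X, y):
    """Fraction of examples whose label the linear probe predicts correctly."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    pred = (Xb @ w > 0).astype(int)
    return float((pred == y).mean())

def layerwise_probe(hidden, labels, train_frac=0.7, seed=0):
    """hidden has shape (n_layers, n_examples, d); returns held-out accuracy per layer."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(hidden.shape[1])
    cut = int(train_frac * hidden.shape[1])
    train_idx, test_idx = perm[:cut], perm[cut:]
    accs = []
    for layer in hidden:
        w = fit_ridge_probe(layer[train_idx], labels[train_idx])
        accs.append(probe_accuracy(w, layer[test_idx], labels[test_idx]))
    return accs

def synthetic_hidden(labels, signal_per_layer, d=16, seed=0):
    """Toy stand-in for model activations: Gaussian noise plus a label-aligned direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    signed = 2.0 * labels - 1.0
    layers = []
    for strength in signal_per_layer:
        noise = rng.normal(size=(labels.shape[0], d))
        layers.append(noise + strength * signed[:, None] * direction[None, :])
    return np.stack(layers)

labels = np.random.default_rng(1).integers(0, 2, size=400)
# Assumed toy scenario: the label is decodable at the evidence position from
# layer 1 onward, but never at the prediction position.
evidence = synthetic_hidden(labels, [0.0, 3.0, 3.0, 3.0], seed=2)
prediction = synthetic_hidden(labels, [0.0, 0.0, 0.0, 0.0], seed=3)

evidence_acc = layerwise_probe(evidence, labels)
prediction_acc = layerwise_probe(prediction, labels)
for i, (e, p) in enumerate(zip(evidence_acc, prediction_acc)):
    print(f"layer {i}: evidence probe acc={e:.2f}, prediction probe acc={p:.2f}")
```

With a real model, the `hidden` array would instead be filled with per-layer activations at the chosen token position (for example, obtained via `output_hidden_states=True` in Hugging Face Transformers), and the probe target would be whatever property of the evidence is being tested. A probe that decodes well at evidence positions but stays near chance at the prediction position is the kind of gap the abstract attributes to routing rather than encoding.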