🤖 AI Summary
This work addresses a critical gap in existing visual question answering (VQA) benchmarks: their inability to evaluate models’ understanding of the underlying data distributions behind scientific charts, particularly the complex, non-bijective relationships between visual representations and their source data. To this end, the authors propose the first data-distribution-centered VQA paradigm, introducing a novel benchmark constructed from synthetically generated histograms grounded in real underlying data. The dataset includes annotations for distribution parameters, raw data points, and bounding boxes of visual elements. Questions are carefully designed to probe distributional reasoning, combining human-authored and large language model–generated queries to comprehensively assess models’ grasp of the data generation process. The released open-source dataset not only exposes significant limitations of current VQA models in distributional reasoning but also establishes a robust foundation for future research.
📝 Abstract
Visual Question Answering (VQA) has become an important benchmark for assessing how large multimodal models (LMMs) interpret images. However, most VQA datasets focus on real-world images or simple diagrammatic analysis, and few target the interpretation of complex scientific charts. Indeed, many VQA datasets that do analyze charts either omit the underlying data behind those charts or assume a 1-to-1 correspondence between chart marks and underlying data. In reality, charts are transformations (i.e., analysis, simplification, or modification) of data. This distinction introduces a reasoning challenge in VQA that current datasets do not capture. In this paper, we argue for a dedicated VQA benchmark for scientific charts in which there is no 1-to-1 correspondence between chart marks and underlying data. To that end, we survey existing VQA datasets and highlight limitations of the current field. We then generate synthetic histogram charts from ground-truth data and pose questions to both humans and a large reasoning model whose precise answers depend on access to the underlying data. We release the open-source dataset, including figures, underlying data, the distribution parameters used to generate the data, and bounding boxes for all figure marks and text, for future research.
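To make the core point concrete, the sketch below illustrates why a histogram is a non-bijective transformation of its source data: many raw points collapse into each bar, so a precise quantity such as the mean is recoverable exactly only from the raw data, not from the chart marks. This is a minimal illustration, not the authors' actual generation pipeline; the distribution, its parameters, and the bin count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground-truth parameters (not taken from the paper's dataset).
mu, sigma, n = 5.0, 2.0, 1000
data = rng.normal(mu, sigma, n)

# The chart is a transformation of the data: many points map to one bar,
# so the raw values cannot be reconstructed from the bars alone.
counts, edges = np.histogram(data, bins=10)

# A question like "what is the mean?" has an exact answer from the raw
# data; reading it off the bars yields only a bin-center approximation.
true_mean = data.mean()
centers = (edges[:-1] + edges[1:]) / 2
approx_mean = (counts * centers).sum() / counts.sum()
```

The gap between `true_mean` and `approx_mean` is exactly the kind of information loss the benchmark targets: a model that sees only the rendered histogram cannot, in general, answer distributional questions precisely without access to the underlying data.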