🤖 AI Summary
This work addresses the low ecological validity of existing chart question answering (CQA) benchmarks by introducing VizNoteQA, the first CQA dataset derived from real-world visualization notebooks. Its construction jointly extracts multi-view charts and natural language questions anchored in analytical narratives, coupling visual presentation with complex, multi-step reasoning to raise both the authenticity and the difficulty of the task. Methodologically, the authors propose a notebook-structure-guided multimodal alignment strategy that integrates chart understanding and NLP techniques to establish fine-grained semantic correspondences between visual elements and narrative text. Evaluation on VizNoteQA reveals that state-of-the-art multimodal large language models (e.g., GPT-4.1) achieve only 69.3% accuracy, exposing systematic limitations in reasoning coherence, cross-view integration, and narrative-driven comprehension under realistic analytical settings. The result is a new, ecologically grounded benchmark and methodological paradigm for chart understanding research.
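The summary does not spell out how the notebook-guided alignment works, but as a rough illustration, one could pair each chart emitted by a code cell with its nearest preceding markdown narrative. The sketch below is a minimal Python version of that heuristic; the function name and the nearest-markdown pairing rule are our assumptions, not the authors' actual strategy.

```python
import json

def extract_chart_narrative_pairs(notebook_path):
    """Pair chart images in a Jupyter notebook with nearby narrative text.

    A crude stand-in for the paper's notebook-structure-guided alignment:
    the nearest-preceding-markdown heuristic is an assumption, not the
    authors' method.
    """
    with open(notebook_path, encoding="utf-8") as f:
        cells = json.load(f)["cells"]

    pairs = []
    last_markdown = None
    for cell in cells:
        if cell["cell_type"] == "markdown":
            # Remember the most recent narrative block.
            last_markdown = "".join(cell["source"])
        elif cell["cell_type"] == "code":
            for output in cell.get("outputs", []):
                # display_data / execute_result outputs carry a MIME bundle;
                # stream and error outputs do not, so .get() skips them.
                png = output.get("data", {}).get("image/png")
                if png and last_markdown:
                    pairs.append({
                        "narrative": last_markdown,      # candidate analytical context
                        "chart_png_base64": png,         # the rendered chart
                        "code": "".join(cell["source"]), # the code that drew it
                    })
    return pairs
```

A real pipeline would replace this positional heuristic with the fine-grained semantic matching between visual elements and narrative text that the summary describes.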
📝 Abstract
We present a new dataset for chart question answering (CQA) constructed from visualization notebooks. The dataset features real-world, multi-view charts paired with natural language questions grounded in analytical narratives. Unlike prior benchmarks, our data reflects ecologically valid reasoning workflows. Benchmarking state-of-the-art multimodal large language models reveals a substantial performance gap: GPT-4.1 achieves only 69.3% accuracy, underscoring the challenges posed by this more authentic CQA setting.