RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning

📅 2025-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chart understanding methods primarily optimize answer accuracy while neglecting explicit localization of the visual elements that support a prediction, and thus fail to model the interplay between visual and numerical information in charts. To address this, we introduce RefChartQA, a benchmark that integrates chart question answering (ChartQA) with visual grounding, enabling models to refer to chart elements at multiple granularities. Methodologically, we instruction-tune five state-of-the-art vision-language models (VLMs) with spatially aware grounding supervision; incorporating this spatial awareness improves answer accuracy by over 15%, reduces hallucinations, and improves model reliability. We also identify key factors influencing text-spatial alignment, such as TinyChart's token-merging module for enhanced feature fusion. All data, models, and code are publicly released to advance interpretable chart understanding.

📝 Abstract
Recently, Vision Language Models (VLMs) have increasingly emphasized document visual grounding to achieve better human-computer interaction, accessibility, and detailed understanding. However, its application to visualizations such as charts remains under-explored due to the inherent complexity of interleaved visual-numerical relationships in chart images. Existing chart understanding methods primarily focus on answering questions without explicitly identifying the visual elements that support their predictions. To bridge this gap, we introduce RefChartQA, a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding, enabling models to refer to elements at multiple granularities within chart images. Furthermore, we conduct a comprehensive evaluation by instruction-tuning five state-of-the-art VLMs across different categories. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%, reduces hallucinations, and improves model reliability. Additionally, we identify key factors influencing text-spatial alignment, such as architectural improvements in TinyChart, which leverages a token-merging module for enhanced feature fusion. Our dataset is open-sourced for community development and further advancements. All models and code will be publicly available at https://github.com/moured/RefChartQA.
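To make the grounding setup concrete, below is a minimal sketch of what a grounded ChartQA training record and its instruction-tuning prompt might look like. The field names, the [x0, y0, x1, y1] pixel-box format, and the prompt template are illustrative assumptions for this sketch, not the benchmark's actual schema.

```python
# Hypothetical RefChartQA-style record: a chart question paired with its answer
# and the bounding boxes of the chart elements that support it.
# Field names, example values, and the [x0, y0, x1, y1] box format are assumptions.
sample = {
    "image": "charts/bar_chart_0421.png",
    "question": "Which country had the highest GDP growth in 2020?",
    "answer": "Ireland",
    "grounding": [
        {"label": "bar", "bbox": [312, 88, 348, 410]},     # the supporting bar
        {"label": "x-tick", "bbox": [305, 418, 355, 436]},  # its axis label
    ],
}


def to_instruction(record: dict) -> str:
    """Format one record as an instruction-tuning prompt that asks the model
    to answer and to emit the boxes of the elements it relied on."""
    boxes = "; ".join(f"{g['label']} {g['bbox']}" for g in record["grounding"])
    return (
        f"<image>\nQuestion: {record['question']}\n"
        f"Answer with the value and the supporting regions.\n"
        f"Answer: {record['answer']} | Regions: {boxes}"
    )


print(to_instruction(sample))
```

Training on prompts of this shape is what ties each textual answer to explicit spatial evidence, which is the spatial awareness the experiments credit for the accuracy gains.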
Problem

Research questions and friction points this paper is trying to address.

VLMs lack visual grounding for chart elements in QA tasks
Existing methods do not identify the visual elements that support their predictions
Need for multi-granularity spatial awareness in chart understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates ChartQA with visual grounding
Instruction-tunes 5 state-of-the-art VLMs
Identifies TinyChart's token-merging module as a key factor in feature fusion (a simplified sketch follows below)
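The token-merging idea referenced above can be illustrated with a minimal sketch: repeatedly fuse the two most similar visual tokens so the language model attends over a shorter, denser sequence. This is a simplified stand-in under assumed tensor shapes, not TinyChart's actual module, which merges tokens with a more elaborate matching scheme inside the vision encoder.

```python
import torch
import torch.nn.functional as F


def merge_tokens(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Greedy token merging: r times, average the most similar pair of tokens.

    tokens: (N, D) visual token embeddings from the vision encoder.
    Simplified illustration only; the real module differs in how pairs are matched.
    """
    tokens = tokens.clone()
    for _ in range(r):
        feats = F.normalize(tokens, dim=-1)
        sim = feats @ feats.T                  # cosine similarity between tokens
        sim.fill_diagonal_(float("-inf"))      # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.size(1))
        tokens[i] = (tokens[i] + tokens[j]) / 2          # fuse the closest pair
        keep = [k for k in range(tokens.size(0)) if k != j]
        tokens = tokens[keep]                  # drop the merged-away token
    return tokens


# e.g. shrink 576 ViT patch tokens to 512 before they reach the language model
visual_tokens = torch.randn(576, 1024)
print(merge_tokens(visual_tokens, r=64).shape)  # torch.Size([512, 1024])
```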
Alexander Vogel
CV:HCI lab, Karlsruhe Institute of Technology, Germany.
Omar Moured
Karlsruhe Institute of Technology
Computer Vision, Vision-Language Models, Document Analysis, Assistive Tech
Yufan Chen
CV:HCI lab, Karlsruhe Institute of Technology, Germany.
Jiaming Zhang
CV:HCI lab, Karlsruhe Institute of Technology, Germany; CVG, ETH, Switzerland.
Rainer Stiefelhagen
Karlsruhe Institute of Technology, Karlsruhe, Germany
Computer vision, Multimodal interaction, Accessibility