ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chart understanding benchmarks focus primarily on single-chart interpretation, offering limited means to evaluate models' comparative reasoning across multiple charts. This work proposes ChartDiff, the first large-scale benchmark for cross-chart comparative summarization, comprising 8,541 chart pairs annotated with human-validated descriptions of their differences along dimensions such as trends, volatility, and anomalies. The authors employ a hybrid annotation pipeline that combines large language model generation with rigorous human verification, enabling systematic evaluation of general-purpose vision-language models, chart-specialized architectures, and pipeline-based approaches. Experiments reveal significant shortcomings in current models when comparing multi-series charts. Notably, automatic metrics such as ROUGE are misaligned with human judgment: specialized models achieve higher ROUGE scores yet receive lower human ratings, while state-of-the-art general-purpose models lead in GPT-based evaluations.
📝 Abstract
Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.
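The mismatch the abstract reports between ROUGE and human-aligned quality follows from ROUGE being a lexical-overlap metric. A minimal sketch illustrates this: a factually wrong summary that reuses the reference's wording can outscore a correct paraphrase. The `rouge_l_f1` function below is a self-contained ROUGE-L implementation (LCS-based F-measure), not the paper's evaluation code, and the example sentences are invented for illustration.

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: longest common subsequence of tokens, as an F-measure."""
    ref, cand = reference.split(), candidate.split()

    # Classic dynamic-programming table for LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference  = "series A rises steadily while series B falls after 2020"
paraphrase = "after 2020 the second series declines as the first climbs"  # correct, but reworded
template   = "series A rises steadily while series B rises after 2020"    # wrong trend, high overlap

print(f"paraphrase: {rouge_l_f1(reference, paraphrase):.2f}")  # low score despite being correct
print(f"template:   {rouge_l_f1(reference, template):.2f}")    # high score despite the factual error
```

Here the template summary, which gets series B's trend wrong, scores far higher than the faithful paraphrase, which is exactly the failure mode that motivates the paper's GPT-based and human evaluations alongside ROUGE.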
Problem

Research questions and friction points this paper is trying to address.

chart understanding, comparative reasoning, multi-chart analysis, vision-language models, benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChartDiff, comparative chart understanding, vision-language models, cross-chart summarization, benchmark dataset