🤖 AI Summary
Scientific fact-checking has largely overlooked charts, the primary vehicle for quantitative evidence, and has lacked large-scale benchmarks supporting statistical reasoning and claim verification. Method: We introduce ClimateViz, the first large-scale benchmark for scientific chart fact-checking, built on expert-curated charts from climate science. It comprises 2,896 expert-annotated charts and 49,862 claims, each labeled with a ternary veracity judgment (support/refute/not enough information) and a structured knowledge graph explanation capturing trends, comparisons, and causal relations. We formally define the chart-level scientific fact-checking task, propose a visual-semantic alignment annotation framework, and develop a knowledge-graph-driven, interpretable evaluation methodology. Contribution/Results: Zero-shot and few-shot experiments with state-of-the-art multimodal models (e.g., Gemini 2.5, InternVL 2.5) yield label-only accuracies of just 76.2-77.8%, well below human performance (89.3-92.7%). Knowledge graph explanations improve accuracy for some models. All data, code, and evaluation protocols are publicly released.
📝 Abstract
Scientific fact-checking has mostly focused on text and tables, overlooking scientific charts, which are central to presenting quantitative evidence and statistical reasoning. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking over expert-curated scientific charts. ClimateViz contains 49,862 claims linked to 2,896 visualizations, each labeled as support, refute, or not enough information. To improve interpretability, each example includes a structured knowledge graph explanation covering trends, comparisons, and causal relations. We evaluate state-of-the-art multimodal language models, both proprietary and open-source, in zero-shot and few-shot settings. Results show that current models struggle with chart-based reasoning: even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to 77.8 percent accuracy in label-only settings, far below human performance (89.3 and 92.7 percent). Explanation-augmented outputs improve performance in some models. We release our dataset and code alongside the paper.
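As a rough illustration of the task described above, a minimal sketch of one example record and the label-only accuracy metric follows. Field names, the sample claim, and the triple format are assumptions for illustration, not the released ClimateViz schema:

```python
# Hypothetical ClimateViz-style example: a claim paired with a chart, a
# ternary veracity label, and a knowledge-graph explanation. All field
# names and values here are illustrative assumptions.
EXAMPLE = {
    "chart_id": "chart_0001",
    "claim": "Global mean temperature rose steadily between 1980 and 2020.",
    "label": "support",  # support / refute / not enough information
    "explanation_kg": [
        # (subject, relation, object) triples covering trends,
        # comparisons, and causal relations
        ("global_mean_temperature", "trend", "increasing"),
        ("observation_period", "spans", "1980-2020"),
    ],
}

LABELS = {"support", "refute", "not enough information"}

def accuracy(predictions, gold):
    """Label-only accuracy over ternary veracity judgments."""
    assert all(p in LABELS for p in predictions), "unknown label"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["support", "refute", "not enough information", "support"]
gold = ["support", "refute", "refute", "support"]
print(accuracy(preds, gold))  # → 0.75
```

In the label-only setting a model emits just one of the three labels per claim; the explanation-augmented setting additionally asks for the knowledge-graph triples, which the paper reports helps some models.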