🤖 AI Summary
Current large vision-language models (LVLMs) face two key bottlenecks in scientific chart understanding: (1) training data that pairs charts with annotations for only a few chart types, resulting in poor generalization to the broader range of chart types; and (2) no targeted pretraining for aligning visual chart representations with the underlying numerical data. To address these, the authors introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. An efficient data-generation pipeline synthesizes paired chart–data examples for a wide range of chart types, and a novel Dual-Path training strategy lets the model capture essential data details while preserving robust reasoning by incorporating reasoning over the underlying data. The paper also establishes ChartDQA, a new benchmark that evaluates both question answering at multiple levels and understanding of the underlying data. Experiments show that ChartScope significantly improves comprehension across a wide range of chart types. Code and data are publicly released.
📝 Abstract
Recent methods for customizing Large Vision-Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations. First, they rely on paired data from only a few chart types, limiting generalization to the wide range of chart types encountered in practice. Second, they lack targeted pre-training for chart-data alignment, which hampers the model's understanding of the underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enables the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying data. Lastly, we establish ChartDQA, a new benchmark for evaluating not only question answering at different levels but also understanding of the underlying data. Experimental results demonstrate that ChartScope significantly enhances comprehension across a wide range of chart types. The code and data are available at https://davidhalladay.github.io/chartscope_demo.
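The abstract's data-generation pipeline pairs each synthetic chart with its underlying data and derived questions. As an illustration only (the paper's actual pipeline and formats are not specified here), the sketch below samples a random data table, serializes it as CSV, and derives a question whose answer is computed directly from the sampled values; rendering the matching chart image (e.g. with a plotting library) is omitted. All function and field names are hypothetical.

```python
import csv
import io
import random

def make_synthetic_pair(n_series=2, n_points=4, seed=0):
    """Hypothetical sketch of synthesizing one (data table, QA) example.

    Samples random numeric series, serializes them as CSV (the
    "underlying data"), and derives a QA pair grounded in that data.
    The chart-rendering step is intentionally left out.
    """
    rng = random.Random(seed)
    categories = [f"cat{i}" for i in range(n_points)]
    series = {
        f"series{j}": [round(rng.uniform(0, 100), 1) for _ in range(n_points)]
        for j in range(n_series)
    }

    # Serialize the underlying data table as CSV text.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["category"] + list(series))
    for i, cat in enumerate(categories):
        writer.writerow([cat] + [series[name][i] for name in series])
    table_csv = buf.getvalue()

    # Derive a question whose answer is computed from the sampled data,
    # so the QA pair is guaranteed to be consistent with the table.
    target = "series0"
    best = max(range(n_points), key=lambda i: series[target][i])
    qa = {
        "question": f"Which category has the highest value in {target}?",
        "answer": categories[best],
    }
    return {"table_csv": table_csv, "qa": qa}
```

Because the answer is computed from the same sampled values that would be rendered into the chart, pairs generated this way are consistent by construction, which is one plausible reason such pipelines scale across many chart types.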