MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding

📅 2024-07-06
📈 Citations: 4 (influential: 0)
🤖 AI Summary
Existing scientific chart datasets are limited to simple graphics or single-discipline domains and lack the interdisciplinary, expert-level visualizations (schematics, microscopic images, experimental data) that require graduate-level domain knowledge to interpret. Method: We introduce the first multimodal scientific understanding dataset derived from peer-reviewed articles in *Nature Communications*, spanning 72 scientific disciplines and systematically incorporating such high-difficulty figures. Annotation was performed by domain experts, and 19 state-of-the-art multimodal models were evaluated on the benchmark. We also conducted task-specific fine-tuning and interleaved image-text continual pretraining with Qwen2-VL-7B. Contribution/Results: The fine-tuned model achieves higher accuracy than both GPT-4o and human experts on the multiple-choice scientific figure comprehension task, and continual pretraining notably improves downstream performance in domains such as materials science. The dataset is publicly released, providing infrastructure for AI-powered scientific assistants.
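
For concreteness, here is a minimal sketch of what the multiple-choice evaluation described above could look like in code, using the public Qwen/Qwen2-VL-7B-Instruct checkpoint through Hugging Face transformers. The prompt template, the letter-answer format, and the `answer_multiple_choice` helper are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: multiple-choice figure comprehension with Qwen2-VL-7B.
# The prompt format below is an assumption for illustration; the paper's
# exact evaluation prompts may differ.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def answer_multiple_choice(figure_path: str, question: str, options: list[str]) -> str:
    """Ask the model to pick one lettered option for a scientific figure."""
    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": f"{question}\n{option_block}\nAnswer with a single letter."},
        ],
    }]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[prompt], images=[Image.open(figure_path)], return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=8)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
```

Benchmark accuracy is then just the fraction of items where the returned letter matches the gold answer.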

📝 Abstract
Scientific figure interpretation is a crucial capability for AI-driven scientific assistants built on advanced Large Vision-Language Models. However, current datasets and benchmarks primarily focus on simple charts or other relatively straightforward figures from limited science domains. To address this gap, we present a comprehensive dataset compiled from peer-reviewed Nature Communications articles covering 72 scientific fields, encompassing complex visualizations such as schematic diagrams, microscopic images, and experimental data, which require graduate-level expertise to interpret. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation. Our analysis revealed significant task challenges and performance gaps among models. Beyond serving as a benchmark, the dataset is also a valuable resource for large-scale training. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations. Furthermore, continual pre-training on our interleaved article and figure data substantially enhanced the model's downstream task performance in materials science. We have released our dataset to support further research.
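
The "interleaved article and figure data" used for continual pre-training suggests a preprocessing step like the one sketched below, which flattens each article into an alternating stream of text and image references. The input field names (`sections`, `figures`, `caption`, `image_path`) are hypothetical; the released data may use a different schema.

```python
# Hypothetical preprocessing sketch: flatten articles into interleaved
# text/image records for continual pretraining. Field names are assumptions.
import json

def build_interleaved_record(article: dict) -> dict:
    """Turn one article into an alternating text/image segment list."""
    segments: list[dict] = []
    for section in article["sections"]:
        segments.append({"type": "text", "value": section["text"]})
        # Place each figure (image plus its caption) directly after the
        # section text that discusses it, so text and figures stay in context.
        for figure in section.get("figures", []):
            segments.append({"type": "image", "value": figure["image_path"]})
            segments.append({"type": "text", "value": figure["caption"]})
    return {"id": article["id"], "segments": segments}

if __name__ == "__main__":
    with open("articles.jsonl") as fin, open("interleaved.jsonl", "w") as fout:
        for line in fin:
            record = build_interleaved_record(json.loads(line))
            fout.write(json.dumps(record) + "\n")
```
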
Problem

Research questions and friction points this paper is trying to address.

Graduate-level multimodal scientific understanding
Complex visualizations interpretation
Performance gaps in scientific figure tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal scientific dataset creation
Graduate-level figure interpretation
Fine-tuning enhances model performance
👥 Authors
Zekun Li, University of California, Santa Barbara
Xianjun Yang, University of California, Santa Barbara
Kyuri Choi, POSCO HOLDINGS
Wanrong Zhu, Adobe Research (Vision and Language; Natural Language Processing)
Ryan Hsieh, University of California, Santa Barbara
Hyeonjung Kim, POSCO HOLDINGS
Jin Hyuk Lim, POSCO HOLDINGS
Sungyoung Ji, POSCO HOLDINGS
Byungju Lee, POSCO HOLDINGS and KIST
Xifeng Yan, Professor, Computer Science, University of California, Santa Barbara (Artificial Intelligence; Data Mining)
Linda R. Petzold, University of California, Santa Barbara
Stephen D. Wilson, University of California, Santa Barbara
Woosang Lim, POSCO HOLDINGS
William Yang Wang, Mellichamp Chair Professor, University of California, Santa Barbara (Natural Language Processing; Machine Learning; Artificial Intelligence; Language and Vision)