🤖 AI Summary
Existing scientific figure datasets are limited to simple charts or to a narrow set of disciplines. They lack interdisciplinary, expert-level visualizations (such as schematics, microscopic images, and experimental data) that require graduate-level domain knowledge to interpret.
Method: We introduce the first multimodal scientific understanding dataset derived from peer-reviewed articles in *Nature Communications*, spanning 72 scientific disciplines and systematically incorporating high-difficulty figures. We evaluated 19 proprietary and open-source multimodal models on two benchmark tasks, figure captioning and multiple-choice question answering, alongside annotation by human domain experts. Using Qwen2-VL-7B as the base model, we performed task-specific fine-tuning and continued pre-training on interleaved article text and figures.
Contribution/Results: Our fine-tuned model achieves higher accuracy than both GPT-4o and human experts on multiple-choice scientific figure comprehension; continued pre-training on the interleaved article and figure data substantially improves downstream task performance in materials science. The dataset is publicly released as a resource for building AI-powered scientific assistants.
📝 Abstract
Scientific figure interpretation is a crucial capability for AI-driven scientific assistants built on advanced Large Vision Language Models. However, current datasets and benchmarks primarily focus on simple charts or other relatively straightforward figures from limited science domains. To address this gap, we present a comprehensive dataset compiled from peer-reviewed *Nature Communications* articles covering 72 scientific fields, encompassing complex visualizations such as schematic diagrams, microscopic images, and experimental data, all of which require graduate-level expertise to interpret. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Our analysis revealed significant task challenges and performance gaps among models. Beyond serving as a benchmark, this dataset is a valuable resource for large-scale training. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations. Furthermore, continued pre-training on our interleaved article and figure data substantially enhanced the model's downstream task performance in materials science. We have released our dataset to support further research.