S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the significant semantic gap between scientific figures and their original captions, which hinders the application of multimodal models in scientific research. To bridge this gap, the authors construct a large-scale, multidisciplinary dataset comprising 15.5 million high-quality image–text pairs spanning physics, biology, engineering, and other domains. They further propose the first AI-ready semantic enrichment pipeline tailored for scientific figures: by integrating paper abstracts and citation contexts, they leverage the Qwen-VL multimodal large language model to rewrite figure captions, substantially enhancing semantic richness and cross-modal alignment. Evaluations using SciBERT and CLIP demonstrate that the enriched captions exhibit lower pseudo-perplexity and an 18.21% improvement in CLIP image–text alignment scores, providing a high-quality foundational resource for AI-driven scientific discovery.
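The pseudo-perplexity metric cited above can be sketched as follows. This is a minimal illustration of the metric itself, not the paper's evaluation code; the per-token log-probabilities are hypothetical stand-ins for what SciBERT would assign to each caption token when it is masked in turn. A lower value means the caption reads as more fluent and less ambiguous to the language model.

```python
import math

def pseudo_perplexity(token_log_probs):
    """Pseudo-perplexity: exponentiate the negative mean of per-token
    masked-token log-probabilities. Lower is better (less ambiguous)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Hypothetical log-probs: a fluent caption's tokens are more predictable,
# so its pseudo-perplexity comes out lower than an ambiguous caption's.
fluent = [-0.5, -0.8, -0.3, -0.6]
ambiguous = [-2.1, -3.0, -1.8, -2.5]
print(pseudo_perplexity(fluent) < pseudo_perplexity(ambiguous))  # True
```

In the full pipeline the log-probabilities would come from masking each caption token in turn and scoring it with SciBERT; the formula above is the aggregation step only.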

📝 Abstract
Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.
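The CLIP alignment score used in the technical validation reduces to cosine similarity between an image embedding and a text embedding. The sketch below illustrates only that scoring step, assuming the embeddings have already been produced by CLIP's image and text encoders; the vectors here are hypothetical stand-ins, and the numbers are not the paper's results.

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """CLIP-style image-text alignment: cosine similarity of the two
    embedding vectors. Higher means the caption matches the figure better."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(i @ t)

# Hypothetical embeddings standing in for real CLIP encoder outputs:
image = np.array([0.9, 0.1, 0.4])
raw_caption = np.array([0.2, 0.9, 0.1])       # weakly aligned original caption
enriched_caption = np.array([0.8, 0.2, 0.5])  # recaptioned with added context

# The enriched caption scores higher against the same figure embedding.
print(clip_score(image, raw_caption), clip_score(image, enriched_caption))
```

The paper's reported 18.21% figure is the relative improvement of this score averaged over the dataset after recaptioning, not a property of any single pair.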
Problem

Research questions and friction points this paper is trying to address.

scientific figure-text understanding
semantic gap
multimodal learning
image-text alignment
scientific discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal dataset
semantic enhancement
scientific figure-text alignment
Qwen-VL
AI for Science
He Wang
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Longteng Guo
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Pengkang Huo
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Xuanxu Lin
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yichen Yuan
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jie Jiang
Institute of Automation, Chinese Academy of Sciences
Semantic Segmentation · Computer Vision
Jing Liu
Institute of Theoretical Physics, Chinese Academy of Sciences
Statistical Physics · Machine Learning