S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the significant semantic gap between scientific figures and their original captions, which hinders the application of multimodal models in scientific research. To bridge this gap, the authors construct a large-scale, multidisciplinary dataset comprising 15.5 million high-quality image–text pairs spanning physics, biology, engineering, and other domains. They further propose the first AI-ready semantic enrichment pipeline tailored for scientific figures: by integrating paper abstracts and citation contexts, they leverage the Qwen-VL multimodal large language model to rewrite figure captions, substantially enhancing semantic richness and cross-modal alignment. Evaluations using SciBERT and CLIP demonstrate that the enriched captions exhibit lower pseudo-perplexity and an 18.21% improvement in CLIP image–text alignment scores, providing a high-quality foundational resource for AI-driven scientific discovery.
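The pseudo-perplexity metric cited above can be sketched as follows. This is a minimal illustration of the metric itself, not the paper's evaluation code; the per-token log-probabilities are hypothetical stand-ins for what SciBERT would assign to each caption token when it is masked in turn. A lower value means the caption reads as more fluent and less ambiguous to the language model.

```python
import math

def pseudo_perplexity(token_log_probs):
    """Pseudo-perplexity: exponentiate the negative mean of per-token
    masked-token log-probabilities. Lower is better (less ambiguous)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Hypothetical log-probs: a fluent caption's tokens are more predictable,
# so its pseudo-perplexity comes out lower than an ambiguous caption's.
fluent = [-0.5, -0.8, -0.3, -0.6]
ambiguous = [-2.1, -3.0, -1.8, -2.5]
print(pseudo_perplexity(fluent) < pseudo_perplexity(ambiguous))  # True
```

In the full pipeline the log-probabilities would come from masking each caption token in turn and scoring it with SciBERT; the formula above is the aggregation step only.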

📝 Abstract
Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.
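The CLIP alignment score used in the technical validation reduces to cosine similarity between an image embedding and a text embedding. The sketch below illustrates only that scoring step, assuming the embeddings have already been produced by CLIP's image and text encoders; the vectors here are hypothetical stand-ins, and the numbers are not the paper's results.

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """CLIP-style image-text alignment: cosine similarity of the two
    embedding vectors. Higher means the caption matches the figure better."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(i @ t)

# Hypothetical embeddings standing in for real CLIP encoder outputs:
image = np.array([0.9, 0.1, 0.4])
raw_caption = np.array([0.2, 0.9, 0.1])       # weakly aligned original caption
enriched_caption = np.array([0.8, 0.2, 0.5])  # recaptioned with added context

# The enriched caption scores higher against the same figure embedding.
print(clip_score(image, raw_caption), clip_score(image, enriched_caption))
```

The paper's reported 18.21% figure is the relative improvement of this score averaged over the dataset after recaptioning, not a property of any single pair.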
Problem

Research questions and friction points this paper is trying to address.

scientific figure-text understanding
semantic gap
multimodal learning
image-text alignment
scientific discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal dataset
semantic enhancement
scientific figure-text alignment
Qwen-VL
AI for Science
He Wang
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Longteng Guo
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Pengkang Huo
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Xuanxu Lin
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yichen Yuan
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jie Jiang
Institute of Automation, Chinese Academy of Sciences
Semantic Segmentation · Computer Vision
Jing Liu
Institute of Theoretical Physics, Chinese Academy of Sciences
Statistical Physics · Machine Learning