🤖 AI Summary
This work addresses key challenges in biomedical image-text datasets extracted from scientific literature: figure captions that are short and context-dependent, and structural noise introduced by automatic extraction (e.g., missing captions, residual markup, duplicated context, and incoherent descriptions). It proposes PMC-InterCPT, a context-grounded interleaved corpus that pairs each figure's caption with the body text that references the figure. The construction pipeline recovers missing captions, cleans caption and context text, reconstructs coherent interleaved samples, and filters records with LLM-supervised classifiers for medical relevance and quality. To counter the modality imbalance found in the resulting corpus, the authors introduce a four-bucket evidence taxonomy and a modality-aware resampling strategy. With continued pretraining (CPT) followed by supervised fine-tuning (SFT) on Qwen3.5-4B-Base, the corpus improves performance on both medical and general multimodal tasks while using fewer CPT tokens than the raw source pool, which indicates that data quality and modality play complementary roles in medical multimodal CPT.
📝 Abstract
Large-scale biomedical image-text datasets extracted from scientific literature provide valuable resources for medical multimodal model training. These datasets are commonly organized as image-caption pairs; however, figure captions are often short, context-dependent, and only partially informative without the surrounding article text. At the same time, large-scale automatic extraction introduces structural noise such as missing captions, residual markup, duplicated context, and incoherent multi-paragraph figure descriptions. We revisit data construction for medical multimodal continued pretraining (CPT) and present PMC-InterCPT, a context-grounded biomedical interleaved corpus that incorporates figure-referencing body text in addition to captions. Our pipeline recovers missing captions, cleans caption and context text, reconstructs coherent interleaved image-text samples, and applies LLM-supervised medical relevance and quality classifiers to filter noisy records. We further reveal a strong modality imbalance in the resulting corpus and introduce a four-bucket evidence taxonomy for modality-aware resampling. With CPT followed by supervised fine-tuning (SFT) on Qwen3.5-4B-Base, PMC-InterCPT improves medical and general multimodal performance while using fewer CPT tokens than the raw source pool. The experimental results also illustrate the complementarity between data quality and modality for medical multimodal CPT.
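The abstract does not specify how the modality-aware resampling is computed, so the following is only a minimal sketch of one common way to rebalance a bucketed corpus. It assumes each record already carries a `bucket` label from the four-bucket taxonomy; the bucket names, the `resample_by_bucket` function, and the exponent `alpha` are all hypothetical and are not taken from the paper. Bucket sampling weights are set proportional to `count ** alpha`, so `alpha = 1` keeps the original distribution and `alpha = 0` samples the buckets uniformly.

```python
import random
from collections import defaultdict


def resample_by_bucket(records, target_size, alpha=0.5, seed=0):
    """Draw `target_size` records (with replacement) after smoothing bucket frequencies.

    Hypothetical illustration, not the authors' method: each bucket is chosen
    with probability proportional to (bucket size) ** alpha, then one record
    is drawn uniformly from the chosen bucket.
    """
    rng = random.Random(seed)

    # Group records by their (assumed) modality bucket label.
    by_bucket = defaultdict(list)
    for rec in records:
        by_bucket[rec["bucket"]].append(rec)

    buckets = sorted(by_bucket)
    weights = [len(by_bucket[b]) ** alpha for b in buckets]

    # Pick a bucket for every output slot, then a record within that bucket.
    chosen = rng.choices(buckets, weights=weights, k=target_size)
    return [rng.choice(by_bucket[b]) for b in chosen]


# Toy corpus with a strong imbalance across four placeholder buckets.
corpus = (
    [{"id": f"a{i}", "bucket": "bucket_a"} for i in range(900)]
    + [{"id": f"b{i}", "bucket": "bucket_b"} for i in range(60)]
    + [{"id": f"c{i}", "bucket": "bucket_c"} for i in range(30)]
    + [{"id": f"d{i}", "bucket": "bucket_d"} for i in range(10)]
)
balanced = resample_by_bucket(corpus, target_size=4000, alpha=0.0)
```

Because the sketch samples with replacement, small buckets are upsampled by repetition. A real pipeline might instead cap the repetition per record or downsample the dominant bucket, which is a design choice the abstract leaves open.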