A PubMed-Scale Dataset of Structured Biomedical Abstracts

📅 2026-06-09
🤖 AI Summary
This study addresses the lack of section structure in most PubMed abstracts, which hinders information retrieval and knowledge integration in biomedicine. It structures the complete PubMed corpus into a dataset of over 23.2 million abstracts: 5.9 million author-structured abstracts parsed from the official XML files and 17.2 million originally unstructured abstracts labeled by a large language model (LLM). All records are mapped to a unified five-section schema and retain their original metadata (PubMed identifier, publication type, and publication date). The construction is hybrid: it combines XML parsing of author-provided section labels with a verbatim-extraction LLM pipeline, and aligns both to a standardized schema. The resource supports sentence classification, evaluation of text segmentation, and large-scale section-wise information extraction.
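As a rough illustration of the XML-parsing and schema-alignment step, the sketch below reads section labels from a PubMed-style record. It relies on the `AbstractText` element and its `NlmCategory` attribute as they appear in PubMed XML exports. The five section names in `SECTIONS` are an assumption, since the summary does not list the schema, and `parse_structured_abstract` and the sample record are hypothetical, not the authors' code.

```python
import xml.etree.ElementTree as ET

# Assumed target schema: the source says "five-section" but does not name the sections.
SECTIONS = ("BACKGROUND", "OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS")


def parse_structured_abstract(article_xml: str) -> dict:
    """Collect author-labeled abstract sections from one PubMed-style XML record."""
    root = ET.fromstring(article_xml)
    pmid = root.findtext(".//PMID")
    sections = {}
    for node in root.iter("AbstractText"):
        category = (node.get("NlmCategory") or "").upper()
        if category not in SECTIONS:
            continue  # skip unlabeled or UNASSIGNED paragraphs
        # itertext() keeps text nested inside inline markup such as <i> or <sup>
        text = "".join(node.itertext()).strip()
        # Several author labels can map to the same category; concatenate them
        sections[category] = (sections.get(category, "") + " " + text).strip()
    return {"pmid": pmid, "sections": sections}


# Invented toy record, not taken from the dataset
SAMPLE = """<PubmedArticle><MedlineCitation><PMID>0</PMID><Article><Abstract>
<AbstractText Label="AIMS" NlmCategory="OBJECTIVE">To test <i>X</i>.</AbstractText>
<AbstractText Label="METHODS" NlmCategory="METHODS">We did Y.</AbstractText>
<AbstractText Label="FUNDING" NlmCategory="UNASSIGNED">None.</AbstractText>
</Abstract></Article></MedlineCitation></PubmedArticle>"""

record = parse_structured_abstract(SAMPLE)
```

Here the free-form author label `AIMS` is aligned to `OBJECTIVE` through its `NlmCategory`, which is one plausible way to harmonize heterogeneous headings. The actual mapping used for the dataset may differ.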
📝 Abstract
Structured abstracts are important for biomedical literature processing because they facilitate information retrieval, text mining, and knowledge synthesis. However, a large portion of the abstracts indexed in PubMed remains unstructured, which creates a significant bottleneck for downstream text-processing workflows and applications. To address this limitation, we introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database and encompassing over 23.2 million research-article records. The corpus is divided into two subsets: 5.9 million author-structured abstracts parsed from the official XML files, and 17.2 million originally unstructured abstracts that were labeled automatically by a verbatim-extraction large language model (LLM) pipeline. Every record is harmonized under a unified five-section schema and linked to its original PubMed identifier, publication type, and publication date. The dataset can be used to train sentence-classification models, benchmark text-segmentation architectures, and perform section-specific information extraction at PubMed-wide scale.
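A verbatim-extraction pipeline implies a property that is easy to check: every labeled section should appear character for character in the original abstract, in order. The sketch below shows one way to verify this and recover character offsets. The function name `locate_verbatim_sections`, the section labels, and the toy abstract are all assumptions made for illustration. The authors' actual validation procedure is not described in the abstract.

```python
def locate_verbatim_sections(abstract: str, sections: list[tuple[str, str]]):
    """Map each (label, text) pair to a (label, start, end) span in `abstract`.

    Raises ValueError if any section text is empty or is not an exact,
    in-order, non-overlapping substring of the abstract.
    """
    spans = []
    cursor = 0  # search only after the previous section to enforce order
    for label, text in sections:
        if not text:
            raise ValueError(f"section {label!r} is empty")
        start = abstract.find(text, cursor)
        if start == -1:
            raise ValueError(f"section {label!r} is not a verbatim, in-order span")
        end = start + len(text)
        spans.append((label, start, end))
        cursor = end
    return spans


# Invented toy abstract and hypothetical model output
abstract = "Sepsis is common. We enrolled 40 patients. Mortality fell."
labeled = [
    ("BACKGROUND", "Sepsis is common."),
    ("METHODS", "We enrolled 40 patients."),
    ("RESULTS", "Mortality fell."),
]
spans = locate_verbatim_sections(abstract, labeled)
```

The resulting spans can also serve as gold boundaries when benchmarking a text-segmentation model, because each boundary is an offset into the unmodified abstract.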
Problem

Research questions and friction points this paper is trying to address.

structured abstracts
PubMed
biomedical literature
text processing
information extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured abstracts
large language model
verbatim extraction
PubMed-scale dataset
text segmentation