A PubMed-Scale Dataset of Structured Biomedical Abstracts

📅 2026-06-09

🤖 AI Summary

This study addresses the lack of section structure in most PubMed abstracts, which hinders information retrieval and knowledge integration in biomedicine. It structures the complete PubMed corpus into a dataset of 23.2 million abstracts, comprising 5.9 million author-structured abstracts and 17.2 million abstracts structured by a large language model (LLM), all mapped to a unified five-section schema with the original metadata preserved. The construction is hybrid: author-provided section labels are parsed from PubMed XML, while originally unstructured abstracts are segmented by an LLM pipeline that extracts section text verbatim, and both subsets are aligned to the same schema. The resource supports fine-grained, cross-document tasks such as sentence classification, text segmentation evaluation, and large-scale section-wise information extraction.
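One way to read the verbatim-extraction constraint is that every section produced by the LLM must be an exact substring of the source abstract, with sections appearing in order and without overlap. The sketch below checks that property. It is an illustration under that assumption, not the authors' pipeline; the function name `locate_verbatim_sections`, the section labels, and the toy abstract are all hypothetical.

```python
def locate_verbatim_sections(abstract, sections):
    """Return (label, start, end) spans if every section text occurs verbatim,
    in order and without overlap, inside `abstract`; otherwise return None.

    `sections` is a list of (label, text) pairs, e.g. the parsed output of an
    LLM asked to split an abstract. This is a hypothetical post-check, not a
    step documented in the paper.
    """
    spans = []
    cursor = 0  # search resumes after the previous section to enforce ordering
    for label, text in sections:
        start = abstract.find(text, cursor)
        if not text or start == -1:
            return None  # empty, paraphrased, or out-of-order section
        end = start + len(text)
        spans.append((label, start, end))
        cursor = end
    return spans


# Toy abstract and labels, invented for illustration only.
abstract = (
    "Sepsis is common. We enrolled 40 patients. "
    "Mortality fell. Early treatment helps."
)
sections = [
    ("BACKGROUND", "Sepsis is common."),
    ("METHODS", "We enrolled 40 patients."),
    ("RESULTS", "Mortality fell."),
    ("CONCLUSIONS", "Early treatment helps."),
]

spans = locate_verbatim_sections(abstract, sections)
paraphrased = locate_verbatim_sections(abstract, [("RESULTS", "Mortality decreased.")])
```

Here `spans` holds character offsets into the original abstract, so each section can be recovered by slicing, while `paraphrased` is `None` because the reworded sentence does not occur in the source text.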

📝 Abstract

Structured abstracts are important for biomedical literature processing because they facilitate information retrieval, text mining, and knowledge synthesis. However, a large share of the abstracts indexed in PubMed remains unstructured, which is a significant bottleneck for downstream text-processing workflows and applications. To address this limitation, we introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database, encompassing over 23.2 million research-article records. The corpus is divided into two subsets: a collection of 5.9 million author-structured abstracts parsed from official XML files, and an automatically labeled collection of 17.2 million originally unstructured abstracts structured via a verbatim-extraction Large Language Model pipeline. Every record is harmonized under a unified five-section schema and mapped to its original PubMed identifier, publication type, and publication date. The dataset can be used to train sentence-classification models, benchmark text-segmentation architectures, and perform large-scale, section-specific information extraction at PubMed-wide scale.
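To make the author-structured subset concrete, the sketch below parses section labels from a PubMed-style XML record using Python's standard library. PubMed XML stores structured abstracts as `AbstractText` elements carrying `Label` and `NlmCategory` attributes. The five section names used here are an assumption borrowed from the `NlmCategory` values; the abstract does not name the paper's schema. The sample record, its placeholder PMID, and the function `parse_structured_abstract` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed five-section schema (taken from PubMed's NlmCategory values);
# the paper's actual section names are not stated in the abstract.
SECTIONS = ("BACKGROUND", "OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS")

# Minimal made-up record with a placeholder PMID, for illustration only.
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>00000000</PMID>
    <Article>
      <Abstract>
        <AbstractText Label="AIMS" NlmCategory="OBJECTIVE">To test a parser.</AbstractText>
        <AbstractText Label="METHODS" NlmCategory="METHODS">We parsed one record.</AbstractText>
        <AbstractText Label="FINDINGS" NlmCategory="RESULTS">It worked.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def parse_structured_abstract(xml_text):
    """Map each labeled AbstractText element onto the assumed schema."""
    root = ET.fromstring(xml_text.strip())
    sections = {}
    for node in root.iter("AbstractText"):
        category = (node.get("NlmCategory") or "").upper()
        if category not in SECTIONS:
            continue  # skip unlabeled or unassigned segments in this sketch
        # itertext() also collects text nested in inline markup such as <i>.
        text = "".join(node.itertext()).strip()
        # Several author labels may map to the same category; concatenate them.
        sections[category] = (sections.get(category, "") + " " + text).strip()
    return {"pmid": root.findtext(".//PMID"), "sections": sections}


record = parse_structured_abstract(SAMPLE_XML)
```

Keying on `NlmCategory` rather than the free-text `Label` is a design choice of this sketch: author labels such as "AIMS" or "FINDINGS" vary widely, whereas the category attribute is already normalized to a small fixed set.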

Problem

Research questions and friction points this paper is trying to address.

structured abstracts

PubMed

biomedical literature

text processing

information extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured abstracts

large language model

verbatim extraction