Automating Early Disease Prediction Via Structured and Unstructured Clinical Data

📅 2026-03-30
🤖 AI Summary
This study addresses the challenge of early disease prediction hindered by missing structured data in electronic health records by proposing an end-to-end, fully automated pipeline that, for the first time, integrates unstructured discharge summaries with structured clinical data without manual intervention. Leveraging natural language processing, the framework automatically extracts key information to enable patient cohort selection, dataset construction, and outcome label generation, substantially enhancing both data quality and clinical relevance for predictive modeling. Evaluated on atrial fibrillation progression prediction, models trained on this enriched dataset outperform those using only structured data in both accuracy and alignment with actual clinical outcomes, and further surpass conventional clinical risk scoring systems.
📝 Abstract
This study presents a fully automated methodology for early prediction studies in clinical settings, leveraging information extracted from unstructured discharge reports. The proposed pipeline uses discharge reports to support the three main steps of early prediction: cohort selection, dataset generation, and outcome labeling. By processing discharge reports with natural language processing techniques, we can efficiently identify relevant patient cohorts, enrich structured datasets with additional clinical variables, and generate high-quality labels without manual intervention. This approach addresses the frequent issue of missing or incomplete data in codified electronic health records (EHR), capturing clinically relevant information that is often underrepresented. We evaluate the methodology in the context of predicting atrial fibrillation (AF) progression, showing that predictive models trained on datasets enriched with discharge report information achieve higher accuracy and correlation with true outcomes compared to models trained solely on structured EHR data, while also surpassing traditional clinical scores. These results demonstrate that automating the integration of unstructured clinical text can streamline early prediction studies, improve data quality, and enhance the reliability of predictive models for clinical decision-making.
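The three steps named in the abstract (cohort selection, dataset generation, outcome labeling) can be illustrated with a minimal sketch. This is not the authors' implementation: the regular expressions, the `Patient` record, the `lvef` field, and all function names are hypothetical stand-ins, and simple keyword matching replaces whatever natural language processing models the paper actually uses. It also ignores negation (for example "no atrial fibrillation"), which a real system would need to handle.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword patterns; stand-ins for a real clinical NLP extractor.
AF_PATTERN = re.compile(r"\batrial fibrillation\b", re.IGNORECASE)
PROGRESSION_PATTERN = re.compile(
    r"\b(?:persistent|permanent) atrial fibrillation\b", re.IGNORECASE
)
LVEF_PATTERN = re.compile(r"\bLVEF\s*(?:of|:)?\s*(\d{1,3})\s*%", re.IGNORECASE)


@dataclass
class Patient:
    patient_id: str
    structured: dict  # codified EHR fields, possibly incomplete
    reports: list     # discharge report texts, oldest first


def select_cohort(patients):
    """Cohort selection: keep patients whose first report mentions AF."""
    return [p for p in patients if p.reports and AF_PATTERN.search(p.reports[0])]


def enrich(patient):
    """Dataset generation: fill a missing structured value from the first report."""
    row = dict(patient.structured)
    if row.get("lvef") is None:
        match = LVEF_PATTERN.search(patient.reports[0])
        if match:
            row["lvef"] = int(match.group(1))
    return row


def label(patient):
    """Outcome labeling: 1 if any later report mentions progressed AF, else 0."""
    return int(any(PROGRESSION_PATTERN.search(t) for t in patient.reports[1:]))


def build_dataset(patients):
    """Run the three steps and return (features, label) pairs."""
    return [(enrich(p), label(p)) for p in select_cohort(patients)]
```

In this sketch the text fills a structured field only when the field is missing, so codified values take precedence over extracted ones. That precedence is a design assumption made for the example, not something the abstract specifies.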
Problem

Research questions and friction points this paper is trying to address.

early disease prediction
unstructured clinical data
electronic health records
data incompleteness
clinical text
Innovation

Methods, ideas, or system contributions that make the work stand out.

early disease prediction
unstructured clinical data
natural language processing
electronic health records
automated cohort selection
Ane G Domingo-Aldama
University of the Basque Country (EHU)
Marcos Merino Prado
University of the Basque Country (EHU)
Alain García Olea
Basurto University Hospital
Josu Goikoetxea
Associate Professor at UPV/EHU
Koldo Gojenola
Researcher and Teacher of Computer Science, University of the Basque Country UPV/EHU
Natural Language Processing, Artificial Intelligence
Aitziber Atutxa
University of the Basque Country (EHU)