Extracting Post-Acute Sequelae of SARS-CoV-2 Infection Symptoms from Clinical Notes via Hybrid Natural Language Processing

📅 2025-08-17
🤖 AI Summary
Post-acute sequelae of SARS-CoV-2 infection (PASC) show highly heterogeneous symptoms that change over time, which makes them hard to identify accurately from unstructured electronic health records. Method: The authors developed an end-to-end hybrid NLP pipeline that combines rule-based named entity recognition with a BERT-based model for assertion classification, supported by clinical text normalization and a curated PASC-specific lexicon. Results: For assertion detection, the system achieved an average F1 score of 0.82 in single-site internal validation and 0.76 in ten-site external validation, with an average processing time of about 2.45 seconds per note. Spearman correlation tests showed ρ > 0.83 for positive mentions and ρ > 0.72 for negative mentions (both *P* < 0.0001). The key contribution is the combination of structured linguistic rules with deep-learning-based assertion modeling for PASC symptom extraction, which is intended to improve cross-site generalizability and clinical interpretability and to support large-scale PASC epidemiological studies.

📝 Abstract
Accurately and efficiently diagnosing Post-Acute Sequelae of COVID-19 (PASC) remains challenging because its many symptoms evolve over long and variable time intervals. To address this issue, we developed a hybrid natural language processing pipeline that integrates rule-based named entity recognition with BERT-based assertion detection modules to extract PASC symptoms from clinical notes and determine their assertion status. We developed a comprehensive PASC lexicon with clinical specialists. From 11 health systems of the RECOVER initiative network across the U.S., we curated 160 intake progress notes for model development and evaluation, and collected 47,654 progress notes for a population-level prevalence study. We achieved an average F1 score of 0.82 in one-site internal validation and 0.76 in 10-site external validation for assertion detection. Our pipeline processed each note in $2.448 \pm 0.812$ seconds on average. Spearman correlation tests showed $\rho > 0.83$ for positive mentions and $\rho > 0.72$ for negative ones, both with $P < 0.0001$. These results demonstrate the effectiveness and efficiency of our models and their potential for improving PASC diagnosis.
Problem

Research questions and friction points this paper is trying to address.

Extracting PASC symptoms from clinical notes accurately
Developing hybrid NLP for symptom and assertion detection
Improving PASC diagnosis efficiency with validated models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid NLP pipeline for symptom extraction
BERT-based assertion detection modules
Comprehensive PASC lexicon with specialists
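The shape of the hybrid pipeline listed above can be sketched as follows: a lexicon-driven, rule-based matcher finds symptom mentions, and a separate step assigns each mention an assertion label. This is a minimal sketch under stated assumptions, not the authors' implementation. The mini-lexicon, the `Mention` type, and the function names are hypothetical, and the BERT-based assertion module is replaced here by a simple negation-cue rule so the example stays self-contained.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon (surface form -> normalized symptom).
# The paper's lexicon was curated with clinical specialists and is not shown here.
LEXICON = {
    "fatigue": "fatigue",
    "tired": "fatigue",
    "shortness of breath": "dyspnea",
    "dyspnea": "dyspnea",
    "brain fog": "cognitive impairment",
    "loss of smell": "anosmia",
}

# Negation cues used by the placeholder assertion step (an assumption).
NEGATION_CUES = ("no", "denies", "denied", "without", "negative for")

# Longest terms first so multi-word entries win over shorter overlaps.
_TERM_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in sorted(LEXICON, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)
_CUE_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in NEGATION_CUES) + r")\b",
    re.IGNORECASE,
)


@dataclass
class Mention:
    symptom: str    # normalized lexicon concept
    text: str       # surface form as written in the note
    start: int      # character offsets into the full note
    end: int
    assertion: str  # "positive" or "negative"


def classify_assertion(sentence: str, mention_start: int) -> str:
    """Stand-in for the BERT module: negative if a cue precedes the mention."""
    return "negative" if _CUE_RE.search(sentence[:mention_start]) else "positive"


def extract_mentions(note: str) -> list[Mention]:
    """Rule-based NER over naively split sentences, then assertion labeling."""
    mentions = []
    for sent in re.finditer(r"[^.;\n]+", note):  # naive split on . ; newline
        sentence, offset = sent.group(), sent.start()
        for m in _TERM_RE.finditer(sentence):
            mentions.append(
                Mention(
                    symptom=LEXICON[m.group().lower()],
                    text=m.group(),
                    start=offset + m.start(),
                    end=offset + m.end(),
                    assertion=classify_assertion(sentence, m.start()),
                )
            )
    return mentions


note = "Patient reports fatigue and brain fog. Denies shortness of breath; no loss of smell."
for mention in extract_mentions(note):
    print(mention.symptom, mention.assertion)
```

Keeping the mention finder and the assertion step behind separate functions means the cue rule could be swapped for a learned classifier without touching the lexicon matching, which is the general appeal of a hybrid design.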
Authors

Zilong Bai
Population Health Sciences, Weill Cornell Medicine, New York, USA.
Zihan Xu
Arizona State University
Machine Learning, Neuromorphic Computing, Memory
Cong Sun
Population Health Sciences, Weill Cornell Medicine, New York, USA.
Chengxi Zang
Weill Cornell Medicine, Cornell University
AI4Health, RWD/RWE, AI for Drug Discovery, AI for Drug Development, Health Data Science
H. Timothy Bunnell
Nemours Children’s Health, Wilmington, USA.
Catherine Sinfield
Population Health Sciences, Weill Cornell Medicine, New York, USA.
Jacqueline Rutter
RECOVER Patient, Caregiver, or Community Advocate Representative, New York, USA.
Aaron Thomas Martinez
RECOVER Patient, Caregiver, or Community Advocate Representative, New York, USA.
L. Charles Bailey
Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, USA.
Mark Weiner
Population Health Sciences, Weill Cornell Medicine, New York, USA.
Thomas R. Campion
Population Health Sciences, Weill Cornell Medicine, New York, USA.
Thomas Carton
Louisiana Public Health Institute, New Orleans, USA.
Christopher B. Forrest
Professor of Pediatrics, University of Pennsylvania
Child Health, Thriving, Applied Clinical Research, Life Course Health Science
Rainu Kaushal
Professor and Chair, Department of Healthcare Policy and Research, Weill Cornell
information science, health services research, patient safety, quality
Fei Wang
Population Health Sciences, Weill Cornell Medicine, New York, USA.
Yifan Peng
Population Health Sciences, Weill Cornell Medicine, New York, USA.