Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of unreliable automated diagnosis classification in electronic health records (EHRs), where structured (e.g., ICD codes) and unstructured (e.g., clinical notes) data coexist. We systematically evaluate five models—GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5, and BioBERT—on a multi-class cancer diagnosis classification task involving both ICD-coded and free-text inputs. Ground-truth labels provided by domain experts serve as the gold standard for rigorous evaluation. Results show BioBERT achieves the highest performance on structured ICD classification (F1 = 84.2%), whereas GPT-4o excels on free-text classification (F1 = 71.8%, accuracy = 81.9%). Crucially, we identify distinct error patterns across models and data modalities—first reported in this context—providing empirical guidance for model selection in EHR-based clinical decision support systems: large language models are suitable for research and administrative applications, while high-stakes clinical deployment necessitates human-in-the-loop validation.

📝 Abstract
Electronic health records contain inconsistently structured or free-text data, requiring efficient preprocessing to enable predictive health care models. Although artificial intelligence-driven natural language processing tools show promise for automating diagnosis classification, their comparative performance and clinical reliability require systematic evaluation. The aim of this study is to evaluate the performance of 4 large language models (GPT-3.5, GPT-4o, Llama 3.2, and Gemini 1.5) and BioBERT in classifying cancer diagnoses from structured and unstructured electronic health records data. We analyzed 762 unique diagnoses (326 International Classification of Diseases (ICD) code descriptions, 436 free-text entries) from 3456 records of patients with cancer. Models were tested on their ability to categorize diagnoses into 14 predefined categories. Two oncology experts validated classifications. BioBERT achieved the highest weighted macro F1-score for ICD codes (84.2) and matched GPT-4o in ICD code accuracy (90.8). For free-text diagnoses, GPT-4o outperformed BioBERT in weighted macro F1-score (71.8 vs 61.5) and achieved slightly higher accuracy (81.9 vs 81.6). GPT-3.5, Gemini, and Llama showed lower overall performance on both formats. Common misclassification patterns included confusion between metastasis and central nervous system tumors, as well as errors involving ambiguous or overlapping clinical terminology. Although current performance levels appear sufficient for administrative and research use, reliable clinical applications will require standardized documentation practices alongside robust human oversight for high-stakes decision-making.
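The headline metric above, the weighted macro F1-score, averages per-class F1 weighted by each class's support. A minimal sketch of that computation is shown below; the diagnosis labels are hypothetical toy values, not the study's data or its 14 categories.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted macro F1: per-class F1, averaged with weights
    proportional to each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in support:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[c] / total) * f1
    return score

# Toy example (hypothetical labels):
true = ["breast", "lung", "lung", "cns", "metastasis"]
pred = ["breast", "lung", "cns", "cns", "cns"]
print(round(weighted_f1(true, pred), 3))  # → 0.567
```

Note that, unlike plain macro F1, the weighted variant lets frequent classes dominate the score, which matters in imbalanced EHR label distributions like the one evaluated here.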
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs and BioBERT for cancer diagnosis classification from EHR data
Comparing model performance on structured ICD codes and unstructured free-text entries
Assessing clinical reliability of automated diagnosis categorization systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using multiple large language models for diagnosis classification
Evaluating BioBERT and LLMs on structured and unstructured data
Validating classifications with oncology experts for reliability
Soheil Hashtarkhani
University of Tennessee Health Science Center
Biomedical Informatics
Rezaur Rashid
UT Health Science Center, UNC Charlotte
Data Science, Causal AI, XAI, GNN, Cancer Research
Christopher L Brett
University of Tennessee Graduate School of Medicine, Knoxville, TN, United States
Lokesh Chinthala
Center for Biomedical Informatics, Department of Pediatrics, College of Medicine, University of Tennessee Health Science Center, Memphis, TN, United States
Fekede Asefa Kumsa
School of Public Health, College of Health and Medical Sciences, Haramaya University, Harar
Maternal Health, Gestational Weight Gain, Epidemiology, Cancer Epidemiology, Social Determinants of Health
Janet A Zink
Center for Biomedical Informatics, Department of Pediatrics, College of Medicine, University of Tennessee Health Science Center, Memphis, TN, United States
Robert L Davis
Center for Biomedical Informatics, Department of Pediatrics, College of Medicine, University of Tennessee Health Science Center, Memphis, TN, United States
David L Schwartz
Center for Biomedical Informatics, Department of Pediatrics, College of Medicine, University of Tennessee Health Science Center, Memphis, TN, United States
Arash Shaban-Nejad
Center for Biomedical Informatics, University of Tennessee Health Science Center
Software as Medical Device (SaMD), Precision Public Health, Explainable AI, Digital Epidemiology