Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models

📅 2025-04-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the reliability of open-source large language models (LLMs) in extracting critical information—such as admission diagnosis, significant in-hospital events, and follow-up recommendations—from clinical discharge summaries. We propose the first fine-grained hallucination classification and attribution evaluation framework specifically designed for clinical summarization. Our methodology integrates BERTScore/ROUGE metrics, customized event-matching rules, double-blind human annotation, adversarial prompting, and uncertainty calibration. Empirical evaluation reveals that mainstream open-source LLMs exhibit hallucination rates of 31–67%, predominantly manifesting as structural omissions, temporal misalignment, and fabricated diagnoses. To mitigate these issues, we introduce a lightweight post-processing module that operates without modifying the underlying model architecture. This intervention improves F1 score for key event extraction by 12.4%, substantially enhancing the clinical credibility and factual consistency of generated summaries.

Technology Category

Application Category

📝 Abstract
Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, such as reasons for hospital admission, significant in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive numerical simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization.
Problem

Research questions and friction points this paper is trying to address.

Assessing open-source LLMs for key event extraction in medical discharge reports
Evaluating hallucination prevalence in clinical summaries generated by LLMs
Measuring accuracy and reliability of LLMs in clinical text summarization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source LLMs for medical text summarization
Detecting hallucinations in clinical summaries
Numerical simulations for performance evaluation
🔎 Similar Papers
No similar papers found.