A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Factual, diagnostic, and management errors in clinical documentation pose serious threats to patient safety, yet existing large language models (LLMs) suffer from limited generalizability and static prompting strategies in error detection and correction. To address this, we propose Retrieval-Augmented Dynamic Prompting (RDP), a framework integrating retrieval-augmented generation (RAG), dynamic in-context example selection, and instruction tuning—designed for cross-model compatibility with GPT, Claude, Gemini, and others. Compared to zero-shot and static prompting baselines, RDP significantly improves reasoning consistency and contextual sensitivity: on the MEDEC dataset, it reduces false positive rates by 15% on average and increases recall by 5–10%, while generating more accurate, clinically appropriate corrections. Comprehensive evaluation using ROUGE-1, BLEURT, and BERTScore confirms substantial gains in correction accuracy and robustness. RDP establishes a scalable, verifiable paradigm for clinical text quality assurance.

Technology Category

Application Category

📝 Abstract
Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
Problem

Research questions and friction points this paper is trying to address.

Evaluating prompting strategies for detecting medical documentation errors
Comparing LLM performance on error flag detection and correction tasks
Analyzing retrieval-augmented dynamic prompting for clinical error processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented dynamic prompting reduces false positives
Dynamic prompting improves recall in error sentence detection
Retrieved exemplars enhance contextual accuracy for corrections
🔎 Similar Papers
No similar papers found.