ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning

📅 2025-12-29
🤖 AI Summary
Existing LLM-based clinical evaluation benchmarks rely on static question-answering, failing to capture the dynamic, iterative nature of real-world clinical reasoning—such as multi-turn patient history gathering, differential diagnosis refinement, and prioritized test ordering—while also suffering from data contamination and coarse-grained assessment. Method: The authors propose the first dynamic diagnostic dialogue evaluation framework: (1) automatically generating realistic patient cases from a disease knowledge graph; (2) simulating authentic interactions with hybrid rule-based and generative patient agents; (3) employing a doctor agent to drive multi-turn diagnostic reasoning; and (4) introducing a fine-grained quality scoring system that assesses hypothesis generation, test prioritization, and iterative differential diagnosis, alongside response-efficiency metrics. Results: Experiments expose systematic deficits in the dynamic clinical reasoning of state-of-the-art LLMs, while the framework itself improves the clinical fidelity, interpretability, and transparency of the diagnostic evaluation process.
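The four-stage loop summarized above (case generation, patient agent, doctor agent, scoring) can be sketched in miniature. Everything below is illustrative only: the class names, the fixed query order, and the profile-matching heuristic are assumptions, not the paper's actual implementation, and a real setup would back the doctor agent with an LLM rather than a script.

```python
from dataclasses import dataclass

@dataclass
class PatientCase:
    """A synthetic case, e.g. sampled from a disease knowledge graph."""
    disease: str
    findings: dict  # finding name -> present (True/False)

class PatientAgent:
    """Rule-based patient: reveals only the findings the doctor asks about."""
    def __init__(self, case: PatientCase):
        self.case = case

    def answer(self, query: str) -> str:
        present = self.case.findings.get(query)
        if present is None:
            return "unknown"
        return "yes" if present else "no"

class ScriptedDoctor:
    """Toy doctor agent: queries findings in a fixed priority order, then
    diagnoses the disease whose expected profile best matches the answers."""
    def __init__(self, knowledge: dict, query_order: list):
        self.knowledge = knowledge      # disease -> expected findings
        self.query_order = query_order
        self.observed = {}

    def next_query(self):
        for q in self.query_order:
            if q not in self.observed:
                return q
        return None  # nothing left to ask

    def record(self, query: str, reply: str):
        self.observed[query] = (reply == "yes")

    def diagnose(self) -> str:
        def match(profile):
            return sum(self.observed.get(k) == v for k, v in profile.items())
        return max(self.knowledge, key=lambda d: match(self.knowledge[d]))

def run_dialogue(doctor, patient, max_turns=5):
    """Drive the multi-turn interaction and report accuracy plus an
    efficiency proxy (number of turns used)."""
    turns = 0
    while turns < max_turns:
        q = doctor.next_query()
        if q is None:
            break
        doctor.record(q, patient.answer(q))
        turns += 1
    dx = doctor.diagnose()
    return {"diagnosis": dx, "turns": turns,
            "correct": dx == patient.case.disease}
```

A minimal run: with a two-disease knowledge base (`flu` vs. `measles`) and a case of measles, the scripted doctor asks three questions and recovers the correct diagnosis, so `run_dialogue` returns `correct=True` after 3 turns. ClinDEF's rubric-based scoring would sit on top of such a transcript, grading hypothesis generation and test prioritization rather than just the final label.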

📝 Abstract
Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, determine examinations, and refine differential diagnoses based on patients' responses. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks that focus on static question-answering. To address these gaps, recent methods explore dynamic medical frameworks involving interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks poorly represent dynamic clinical reasoning processes in LLMs
Current methods lack granular evaluation and use contamination-prone datasets
Need for nuanced assessment beyond diagnostic accuracy in clinical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic framework simulating diagnostic dialogues
Generates patient cases from disease knowledge graph
Multi-level evaluation including efficiency and quality
👥 Authors
Yuqi Tang
Duke University
Medical Imaging · Computer Vision · Image Quality
Jing Yu
Northwestern University
Sustainability · Life Cycle Analysis · Transportation Management · Operations Research
Zichang Su
ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University
Kehua Feng
Ph.D. student, Zhejiang University
Natural Language Processing · Language Model · AI for Science
Zhihui Zhu
Assistant Professor, Ohio State University
Machine Learning · Data Science · Signal Processing · Optimization
Libin Wang
ZJU-UIUC Institute, Zhejiang University
Lei Liang
Ant Group
Knowledge Graph · AI
Qiang Zhang
ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University
Keyan Ding
ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University
Huajun Chen
ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University