🤖 AI Summary
Existing evaluations of large language models (LLMs) lack dynamic, quantifiable methods for multi-turn clinical dialogues. Method: This study introduces a medical-knowledge-driven multi-agent simulation framework that automatically transforms static medical QA data into temporally structured, history-aware virtual patient profiles and generates interactive dialogues grounded in authentic clinical contexts. It proposes the novel CARE evaluation framework, comprising Clinical Accuracy, Adaptive Reasoning Efficiency, Empathy, and Robustness, and integrates guideline-constrained response scoring with expert-calibrated automated assessment. Contribution/Results: Validated by clinical experts, the framework effectively discriminates performance differences among leading healthcare LLMs in complex, multi-turn scenarios. It establishes the first reproducible, scalable, and high-fidelity dynamic clinical dialogue benchmark, enabling rigorous, standardized evaluation of conversational medical AI systems.
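To make the pipeline concrete, here is a minimal sketch of how a static QA item might be turned into an interactive virtual patient and driven through a multi-turn simulation. Everything here is an assumption for illustration: the names (`PatientProfile`, `qa_to_profile`, `simulate_dialogue`), the turn-taking protocol, and the "Diagnosis:" stopping convention are hypothetical and not taken from the paper's implementation.

```python
# Hypothetical sketch of an AutoMedic-style simulation loop.
# All names and conventions are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # any text-in/text-out model call

@dataclass
class PatientProfile:
    """Virtual patient built from one static QA item."""
    presenting_complaint: str   # derived from the question stem
    hidden_findings: List[str]  # facts revealed only when asked
    ground_truth_dx: str        # the QA item's answer, used for scoring

def qa_to_profile(question: str, answer: str) -> PatientProfile:
    # Assumed transformation: the first sentence becomes the chief
    # complaint; remaining details are withheld as "hidden" history.
    first, *rest = question.split(". ")
    return PatientProfile(first, rest, answer)

def simulate_dialogue(profile: PatientProfile,
                      clinician: LLM, patient: LLM,
                      max_turns: int = 10) -> List[Tuple[str, str]]:
    """Alternate clinician questions and patient replies until the
    clinician commits to a diagnosis or the turn budget runs out."""
    transcript: List[Tuple[str, str]] = []
    patient_msg = profile.presenting_complaint
    for _ in range(max_turns):
        doc_msg = clinician(f"Patient says: {patient_msg}\nAsk or diagnose:")
        transcript.append(("clinician", doc_msg))
        if doc_msg.lower().startswith("diagnosis:"):
            break
        patient_msg = patient(
            f"You are this patient: {profile.hidden_findings}\n"
            f"Answer only what is asked: {doc_msg}")
        transcript.append(("patient", patient_msg))
    return transcript

# Toy run with stub "models" standing in for real LLM calls.
profile = qa_to_profile("Chest pain on exertion. ST depression on ECG.",
                        "stable angina")
log = simulate_dialogue(profile,
                        clinician=lambda p: "Diagnosis: stable angina",
                        patient=lambda p: "Yes.")
```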
📝 Abstract
Evaluating large language models (LLMs) has recently emerged as a critical issue for their safe and trustworthy application in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in dynamic, interactive multi-turn clinical conversations and multi-faceted evaluation strategies that go beyond simple accuracy. However, formally evaluating such dynamic, interactive clinical situations is hindered by the vast combinatorial space of possible patient states and interaction trajectories, which makes these scenarios difficult to standardize and measure quantitatively. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic, clinically grounded multi-turn dialogues between LLM agents. The performance of clinical conversational agents is then assessed with our CARE metric, which provides a multi-faceted evaluation standard covering clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents and offer practical guidelines for the effective development of LLMs in conversational medical applications.
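As a rough illustration of how a multi-axis metric like CARE could be aggregated into a single score, consider the sketch below. The four axis names follow the abstract, but the per-axis scorers, the [0, 1] scale, and the equal weighting are assumptions for illustration, not the paper's published formula.

```python
# Illustrative CARE-style aggregation; equal weights are an assumption.
from typing import Dict, Optional

def care_score(axis_scores: Dict[str, float],
               weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-axis scores (each assumed in [0, 1]) into one value."""
    axes = ["accuracy", "efficiency", "empathy", "robustness"]
    weights = weights or {a: 0.25 for a in axes}  # assumed equal weights
    return sum(weights[a] * axis_scores[a] for a in axes)

# Example: a judge model or expert rubric supplies the axis scores.
print(care_score({"accuracy": 0.9, "efficiency": 0.7,
                  "empathy": 0.8, "robustness": 0.6}))  # -> 0.75
```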