🤖 AI Summary
Problem: The lack of specialized evaluation benchmarks impedes the assessment of large language models (LLMs) in Chinese medical ethics. Method: We introduce MedEthicEval, the first domain-specific benchmark for this purpose. It features a three-tier difficulty taxonomy grounded in expert consensus (blatant violations, priority dilemmas, equilibrium dilemmas), a two-dimensional evaluation framework (knowledge mastery and scenario-based application), and three expert-annotated Chinese datasets. Evaluation employs structured prompting and multi-granularity scoring to quantify ethical reasoning capability. Contribution/Results: MedEthicEval fills a gap in the evaluation of AI on Chinese medical ethics. An assessment of 12 mainstream Chinese and English LLMs reveals a pronounced weakness in resolving equilibrium dilemmas, the complex trade-off scenarios that require balanced moral judgment. The benchmark is fully reproducible and accompanied by targeted alignment strategies, supporting the ethical safety and responsible deployment of medical LLMs.
📝 Abstract
Large language models (LLMs) demonstrate significant potential in advancing medical applications, yet their capabilities in addressing medical ethics challenges remain underexplored. This paper introduces MedEthicEval, a novel benchmark designed to systematically evaluate LLMs in the domain of medical ethics. Our framework encompasses two key components: knowledge, assessing the models' grasp of medical ethics principles, and application, focusing on their ability to apply these principles across diverse scenarios. To support this benchmark, we consulted with medical ethics researchers and developed three datasets addressing distinct ethical challenges: blatant violations of medical ethics, priority dilemmas with clear inclinations, and equilibrium dilemmas without obvious resolutions. MedEthicEval serves as a critical tool for understanding LLMs' ethical reasoning in healthcare, paving the way for their responsible and effective use in medical contexts.