TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks for Traditional Chinese Medicine (TCM) suffer from coarse granularity and insufficient cultural contextualization. To address this, we propose TCM-5CEval—the first fine-grained, five-dimensional benchmark assessing comprehensive clinical research competence, covering core TCM knowledge, classical text comprehension, clinical decision-making, pharmacology of Chinese herbs, and non-pharmacological therapies. We innovatively introduce *permutation consistency testing* to quantify positional bias and reasoning fragility in tasks such as classical text interpretation and clinical case reasoning. Leveraging diverse tasks—including multi-hop question answering, classical text exegesis, herb property identification, and therapy classification—we systematically evaluate 15 state-of-the-art LLMs. Results reveal robust foundational knowledge acquisition but pronounced deficits in deep classical text understanding and reasoning stability; even top-performing models (e.g., DeepSeek-R1, Gemini 2.5 Pro) exhibit significant positional bias. TCM-5CEval establishes a reproducible, culturally grounded, and mechanistically interpretable paradigm for TCM AI evaluation.

📝 Abstract
Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but also exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the Medbench platform, joining its predecessor in the "In-depth Challenge for Comprehensive TCM Abilities" special track.
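The permutation-based consistency test described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the number of permutations, and the modal-agreement score are all assumptions. The idea is to shuffle a question's answer options, map each model choice back to the original option index, and measure how often the model selects the same underlying option.

```python
import random

def permute_options(options, rng):
    """Return a shuffled copy of the options plus the index mapping."""
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order

def consistency_score(answer_fn, question, options, n_perms=8, seed=0):
    """Fraction of option permutations on which the model picks the same
    underlying option. answer_fn(question, options) -> chosen index."""
    rng = random.Random(seed)
    picks = []
    for _ in range(n_perms):
        shuffled, order = permute_options(options, rng)
        chosen = answer_fn(question, shuffled)
        picks.append(order[chosen])  # map back to the original option index
    modal = max(set(picks), key=picks.count)
    return picks.count(modal) / len(picks)

# A position-biased stub "model" that always picks the first option:
biased = lambda q, opts: 0
score = consistency_score(biased, "Which herb...?", ["a", "b", "c", "d"])
```

A content-grounded model scores 1.0 (it picks the same option no matter where it appears), while a position-biased model's mapped picks vary across permutations, yielding a lower score.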
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' clinical competence across five Traditional Chinese Medicine dimensions
Identifying performance gaps in classical text interpretation and clinical reasoning
Assessing model robustness against positional bias in specialized medical contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended benchmark for evaluating LLMs in Traditional Chinese Medicine
Assesses five critical clinical dimensions including classical literacy
Reveals model fragility through permutation-based consistency testing
Tianai Huang
School of Artificial Intelligence in Traditional Chinese Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China
Jiayuan Chen
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Lu Lu
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Pengcheng Chen
University of Washington, Seattle, Washington, US
Tianbin Li
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Bing Han
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Wenchao Tang
School of Artificial Intelligence in Traditional Chinese Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China
Jie Xu
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Ming Li
School of Artificial Intelligence in Traditional Chinese Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China