🤖 AI Summary
Current large language models (LLMs) for Traditional Chinese Medicine (TCM) face three critical bottlenecks: limited scenario adaptability, a lack of standardized evaluation benchmarks, and high computational resource demands.
Method: We propose TianHui, a domain-specific LLM tailored to diverse TCM scenarios. Our approach builds a high-quality, large-scale TCM corpus; integrates context-aware reasoning with structured domain knowledge; and employs a lightweight, efficient training paradigm combining QLoRA, DeepSpeed Stage 2, and Flash Attention 2. A two-stage optimization strategy systematically tunes the key hyperparameters: LoRA rank (128), alpha (256), epochs (4), dropout (0.2), and sequence length (2048).
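To put the reported LoRA hyperparameters in perspective, the sketch below computes the parameter budget they imply. The formula is standard LoRA (the update is factorized as B·A of rank r); the 4096-dimensional layer shape is a hypothetical assumption for illustration, not a detail from the paper.

```python
# Parameter budget implied by the paper's LoRA hyperparameters
# (rank=128, alpha=256, dropout=0.2). Layer dims below are assumed.

def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """LoRA factorizes the weight update as B @ A, with A: (rank, d_in)
    and B: (d_out, rank), so the added parameters are rank * (d_in + d_out)."""
    return rank * (d_in + d_out)

RANK, ALPHA, DROPOUT = 128, 256, 0.2
SCALING = ALPHA / RANK  # LoRA scales the update by alpha/rank (= 2.0 here)

# Hypothetical 4096x4096 attention projection (typical of ~7B models):
added = lora_param_count(4096, 4096, RANK)
full = 4096 * 4096
print(f"scaling={SCALING}, added={added}, fraction={added / full:.4f}")
# → scaling=2.0, added=1048576, fraction=0.0625
```

With alpha set to twice the rank, the effective update scaling is 2.0, and each adapted 4096-dim projection trains about 6% of the parameters a full fine-tune would touch.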
Contribution/Results: Evaluated on 12 TCM benchmark tasks, the model ranks top-3 on all metrics for six datasets and achieves state-of-the-art (SOTA) results on the remaining six. All code, model weights, and training corpora are publicly released, advancing systematic knowledge modeling and scalable intelligent applications in TCM.
📝 Abstract
Domain-specific LLMs in TCM face limitations in research settings due to constrained adaptability, insufficient evaluation datasets, and limited computational resources. This study presents TianHui, a specialized TCM LLM built through contextual data integration and domain knowledge fusion. We constructed a large-scale TCM corpus (0.97 GB of unsupervised data plus 611,312 QA pairs) and employed a two-stage training strategy with QLoRA, DeepSpeed Stage 2, and Flash Attention 2. Evaluation on 12 benchmarks showed that TianHui ranked in the top three on all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW) and achieved top results on the other six (TCMEE, APR, GCPMI, TCMKQA, TCMRC, ADTG). The optimal configuration was LoRA rank=128, alpha=256, epochs=4, dropout=0.2, max length=2048. TianHui enables systematic preservation and scalable application of TCM knowledge. All resources are open-sourced.
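The training stack named in the abstract (QLoRA, DeepSpeed Stage 2, Flash Attention 2) is typically wired together through configuration dictionaries like the sketch below. Only the hyperparameters explicitly reported (rank 128, alpha 256, dropout 0.2, 4 epochs, max length 2048) come from the paper; batch sizes, precision, and every other value are illustrative assumptions, not the authors' released configuration.

```python
# Illustrative sketch of a QLoRA + DeepSpeed ZeRO Stage 2 + Flash
# Attention 2 setup. Values marked (paper) are reported; the rest
# are assumptions for illustration.

lora_config = {
    "r": 128,               # LoRA rank (paper)
    "lora_alpha": 256,      # scaling numerator (paper)
    "lora_dropout": 0.2,    # (paper)
    "task_type": "CAUSAL_LM",
}

deepspeed_config = {
    # ZeRO Stage 2 shards optimizer states and gradients across GPUs,
    # while each device keeps a full copy of the model parameters.
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},              # assumption
    "train_micro_batch_size_per_gpu": 4,    # assumption
    "gradient_accumulation_steps": 8,       # assumption
}

train_args = {
    "num_train_epochs": 4,                        # (paper)
    "max_seq_length": 2048,                       # (paper)
    "attn_implementation": "flash_attention_2",   # Flash Attention 2
    "load_in_4bit": True,  # the "Q" in QLoRA: 4-bit quantized base weights
}
```

In a real run these dictionaries would be passed to the corresponding PEFT, DeepSpeed, and Trainer entry points; they are shown here only to make the division of labor among the three components concrete.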