🤖 AI Summary
This study addresses the low credibility of liver MRI reports generated by Chinese large language models (LLMs). We propose the first multidimensional credibility assessment framework designed specifically for medical imaging reports, together with a clinical-context-driven, institution-level prompt optimization methodology. Using the SiliconFlow platform, we systematically evaluate leading open-weight Chinese LLMs (Kimi-K2, Qwen3-235B, DeepSeek-V3, and ByteDance-Seed-OSS) across diverse clinical scenarios and show how prompt design differentially affects diagnostic accuracy, terminology standardization, logical coherence, and clinical interpretability. Results show that our framework significantly improves report accuracy (+18.7%) and inter-model consistency (Cohen’s κ = 0.82). It establishes a reproducible, verifiable evaluation paradigm and an engineering-oriented optimization pathway for AI-assisted radiology report generation.
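For readers unfamiliar with the consistency metric cited above, Cohen's κ compares the observed agreement between two raters, p_o, with the agreement expected by chance, p_e, as κ = (p_o − p_e) / (1 − p_e). The following is a minimal sketch of that calculation for two models' diagnostic labels. The function name `cohens_kappa`, the label lists, and the diagnoses in them are hypothetical illustrations, not data or code from the study, and the summary does not state how the reported value was computed or which model pairs it covers.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label for every case.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical diagnoses from two models on the same six cases (not study data).
model_a = ["HCC", "HCC", "cyst", "hemangioma", "cyst", "HCC"]
model_b = ["HCC", "cyst", "cyst", "hemangioma", "cyst", "HCC"]
print(round(cohens_kappa(model_a, model_b), 3))  # prints 0.739
```

In this toy example the models agree on five of six cases (p_o = 5/6) while chance agreement is p_e = 13/36, giving κ = 17/23 ≈ 0.739.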
📝 Abstract
Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, and quality control. However, systematic guidance on optimizing prompt design across different clinical contexts is still lacking. Moreover, a comprehensive and standardized framework for assessing the trustworthiness of LLM-generated radiology reports has yet to be established. This study aims to enhance the trustworthiness of LLM-generated liver MRI reports by introducing a Multi-Dimensional Credibility Assessment (MDCA) framework and providing guidance on institution-specific prompt optimization. Using the SiliconFlow platform, we apply the proposed framework to evaluate and compare several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, and ByteDance-Seed-OSS-36B-Instruct.