From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports

📅 2025-10-27
🤖 AI Summary
This study addresses the low credibility of liver MRI reports generated by Chinese large language models (LLMs). We propose the first multidimensional credibility assessment framework specifically designed for medical imaging reporting and introduce a clinical-context-driven, institution-level prompt optimization methodology. Leveraging the SiliconFlow platform, we systematically evaluate leading open-weight Chinese LLMs—including Kimi-K2, Qwen3-235B, DeepSeek-V3, and ByteDance-Seed-OSS—across diverse clinical scenarios, uncovering how prompt design differentially impacts diagnostic accuracy, terminology standardization, logical coherence, and clinical interpretability. Results demonstrate that our framework significantly improves report accuracy (+18.7%) and inter-model consistency (Cohen’s κ = 0.82). It establishes a reproducible, verifiable evaluation paradigm and an engineering-oriented optimization pathway for radiology AI-assisted report generation.
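
The inter-model consistency figure above is reported as Cohen's κ, which corrects raw agreement between two raters for the agreement expected by chance. A minimal sketch of that statistic follows, assuming each model's diagnostic conclusion has already been reduced to one categorical label per case. The function name and the example labels are illustrative and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters (here: two models) labelling the same cases.

    labels_a, labels_b: equal-length sequences of categorical labels.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both label sequences must be non-empty and equal in length")

    n = len(labels_a)
    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two raters' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four cases from two models (illustration only).
model_a = ["HCC", "HCC", "cyst", "cyst"]
model_b = ["HCC", "cyst", "cyst", "cyst"]
kappa = cohens_kappa(model_a, model_b)
```

With these toy labels the observed agreement is 0.75 and the chance agreement is 0.5, so κ works out to 0.5. Comparing more than two models would require either pairwise κ values or a multi-rater statistic; the source does not say which was used.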

📝 Abstract
Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, and quality control. However, systematic guidance on how to optimize prompt design across different clinical contexts remains underexplored. Moreover, a comprehensive and standardized framework for assessing the trustworthiness of LLM-generated radiology reports is yet to be established. This study aims to enhance the trustworthiness of LLM-generated liver MRI reports by introducing a Multi-Dimensional Credibility Assessment (MDCA) framework and providing guidance on institution-specific prompt optimization. The proposed framework is applied to evaluate and compare the performance of several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, and ByteDance-Seed-OSS-36B-Instruct, using the SiliconFlow platform.
Problem

Research questions and friction points this paper is trying to address.

Optimizing prompt design for clinical contexts
Establishing credibility assessment for LLM reports
Enhancing trustworthiness of liver MRI reports
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Dimensional Credibility Assessment framework for evaluation
Institution-specific prompt optimization guidance
Comparative analysis of advanced LLMs on SiliconFlow platform
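
As a rough illustration of what a multi-dimensional assessment can look like in practice, the sketch below averages per-report ratings into a per-dimension profile for one model. The dimension names reuse the four aspects mentioned in the summary (diagnostic accuracy, terminology standardization, logical coherence, clinical interpretability). The 0 to 1 rating scale, the plain averaging, and every identifier are assumptions made for this example and should not be read as the paper's MDCA scoring procedure.

```python
from statistics import mean

# Dimension names borrowed from the summary; treating them as the framework's
# scoring axes is an assumption made for illustration.
DIMENSIONS = (
    "diagnostic_accuracy",
    "terminology_standardization",
    "logical_coherence",
    "clinical_interpretability",
)


def credibility_profile(ratings):
    """Average per-report ratings into one score per dimension.

    ratings: list of dicts, each mapping every dimension name to a score
    in [0, 1] (hypothetical scale) for a single generated report.
    """
    if not ratings:
        raise ValueError("at least one rated report is required")
    for r in ratings:
        missing = [d for d in DIMENSIONS if d not in r]
        if missing:
            raise ValueError(f"rating is missing dimensions: {missing}")
    return {d: mean(r[d] for r in ratings) for d in DIMENSIONS}


# Two hypothetical rated reports from one model.
example_ratings = [
    {"diagnostic_accuracy": 1.0, "terminology_standardization": 0.5,
     "logical_coherence": 1.0, "clinical_interpretability": 0.5},
    {"diagnostic_accuracy": 0.5, "terminology_standardization": 0.5,
     "logical_coherence": 1.0, "clinical_interpretability": 1.0},
]
profile = credibility_profile(example_ratings)
```

Keeping the dimensions separate, rather than collapsing them into a single number, makes it possible to see which aspect of a report a given prompt change helps or hurts.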
Authors

Qiuli Wang
Yu-Yue Pathology Research Center, Jinfeng Laboratory, Chongqing, China
Xiaoming Li
7T Magnetic Resonance Imaging Translational Medical Center, Department of Radiology, Southwest Hospital, Army Medical University, Chongqing, China
Jie Chen
Yongxu Liu
Yu-Yue Pathology Research Center, Jinfeng Laboratory, Chongqing, China
Xingpeng Zhang
School of Computer Science and Software Engineering, Southwest Petroleum University
Computer Vision, Deep Learning, Chaos, Image Processing
Chen Liu
Wei Chen
7T Magnetic Resonance Imaging Translational Medical Center, Department of Radiology, Southwest Hospital, Army Medical University, Chongqing, China