From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports

📅 2025-10-27

🤖 AI Summary

This study addresses the low credibility of liver MRI reports generated by Chinese large language models (LLMs). It proposes a multidimensional credibility assessment framework designed specifically for medical imaging reports, described as the first of its kind, and introduces a clinical-context-driven, institution-level prompt optimization methodology. Using the SiliconFlow platform, the authors systematically evaluate leading open-weight Chinese LLMs (Kimi-K2, Qwen3-235B, DeepSeek-V3, and ByteDance-Seed-OSS) across diverse clinical scenarios and show that prompt design affects diagnostic accuracy, terminology standardization, logical coherence, and clinical interpretability to different degrees. The reported results indicate that the framework significantly improves report accuracy (+18.7%) and inter-model consistency (Cohen’s κ = 0.82). The work offers a reproducible, verifiable evaluation paradigm and an engineering-oriented optimization pathway for AI-assisted radiology report generation.
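Inter-model consistency is reported above as Cohen's κ, which corrects the raw agreement rate between two raters for the agreement expected by chance. The following is a minimal sketch of the standard two-rater formula, not the authors' evaluation code; the function name `cohens_kappa` and the example diagnostic labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each rater's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of cases where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over categories of the product of marginals.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    # Both raters used a single identical label for every case.
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical diagnostic conclusions from two models on four reports.
model_a = ["HCC", "HCC", "HCC", "cyst"]
model_b = ["HCC", "HCC", "cyst", "cyst"]
print(cohens_kappa(model_a, model_b))
```

In this toy example the two models agree on three of four cases (p_o = 0.75) while chance agreement is 0.5, so κ is 0.5. Comparing more than two models would require either pairwise κ values or a multi-rater statistic; the summary does not say which was used.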

📝 Abstract

Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, and quality control. However, systematic guidance on how to optimize prompt design across different clinical contexts is still lacking. Moreover, a comprehensive and standardized framework for assessing the trustworthiness of LLM-generated radiology reports has yet to be established. This study aims to enhance the trustworthiness of LLM-generated liver MRI reports by introducing a Multi-Dimensional Credibility Assessment (MDCA) framework and providing guidance on institution-specific prompt optimization. The proposed framework is applied to evaluate and compare the performance of several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, and ByteDance-Seed-OSS-36B-Instruct, using the SiliconFlow platform.

Problem

Research questions and friction points this paper is trying to address.

Optimizing prompt design for clinical contexts

Establishing credibility assessment for LLM reports

Enhancing trustworthiness of liver MRI reports

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Dimensional Credibility Assessment framework for evaluation

Institution-specific prompt optimization guidance

Comparative analysis of advanced LLMs on the SiliconFlow platform
