🤖 AI Summary
This work addresses the limited clinical reasoning capability and poor interpretability of general-purpose vision-language models (VLMs) in radiological diagnosis. We propose a weakly supervised paradigm that requires no annotated image-lesion pairs, leveraging only free-text radiology reports. Our method automatically parses unstructured reports into structured, stepwise Chain-of-Thought (CoT) reasoning paths, then integrates contrastive image-report alignment with multi-granularity clinical reward-guided reinforcement fine-tuning. To our knowledge, this is the first framework to distill stepwise diagnostic supervision signals—aligned with radiologists’ cognitive reasoning—from raw text reports alone. Zero-shot evaluation on MIMIC-CXR demonstrates substantial improvements: +0.24 in disease classification AUC, +0.23 in lesion localization mIoU, and +0.22 in report generation BLEU score, outperforming state-of-the-art methods. The approach establishes a novel, interpretable, and scalable paradigm for training medical VLMs.
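The report-parsing step described above could, in its simplest form, look like the following toy sketch. Everything here is a hypothetical illustration: the function `report_to_cot`, the section-header heuristic, and the sentence-level "observation" steps are assumptions for exposition, not the paper's actual parser.

```python
import re

def report_to_cot(report: str) -> list[str]:
    """Toy illustration: turn a free-text radiology report into coarse,
    stepwise reasoning (observations -> conclusion). Hypothetical, not
    the paper's actual parsing pipeline."""
    sections = {}
    for name in ("FINDINGS", "IMPRESSION"):
        # Capture text after a section header up to the next header or end.
        m = re.search(rf"{name}:\s*(.+?)(?=\n[A-Z]+:|$)", report, re.S)
        if m:
            sections[name] = m.group(1).strip()
    steps = []
    if "FINDINGS" in sections:
        # Treat each finding sentence as one observation step.
        for i, sent in enumerate(re.split(r"(?<=\.)\s+", sections["FINDINGS"]), 1):
            steps.append(f"Step {i} (observation): {sent.strip()}")
    if "IMPRESSION" in sections:
        steps.append(f"Conclusion: {sections['IMPRESSION']}")
    return steps

report = ("FINDINGS: Heart size is normal. There is a right lower lobe opacity.\n"
          "IMPRESSION: Findings concerning for pneumonia.")
for step in report_to_cot(report):
    print(step)
```

Real reports are far messier (negations, hedging, missing headers), so the actual framework would need a much more robust parser; this only shows the shape of the text-to-CoT transformation.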
📝 Abstract
This study presents DiagCoT, a multi-stage fine-tuning framework that adapts general-purpose vision-language models (VLMs) to emulate radiologists' stepwise diagnostic reasoning using only free-text reports. DiagCoT combines contrastive image-report tuning for domain alignment, chain-of-thought supervision to capture inferential logic, and reinforcement fine-tuning with clinical reward signals to enhance factual accuracy and fluency. On the MIMIC-CXR benchmark, DiagCoT improved zero-shot disease classification AUC from 0.52 to 0.76 (absolute gain of 0.24), pathology grounding mIoU from 0.08 to 0.31 (absolute gain of 0.23), and report generation BLEU from 0.11 to 0.33 (absolute gain of 0.22). It outperformed state-of-the-art models including LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets. By converting unstructured clinical narratives into structured supervision, DiagCoT offers a scalable approach for developing interpretable and diagnostically competent AI systems for radiology.
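Contrastive image-report tuning of this kind is typically implemented as a CLIP-style symmetric InfoNCE objective over paired image and report embeddings. The sketch below assumes that formulation; the function name `info_nce` and the temperature value are illustrative, and the paper's exact loss may differ.

```python
import numpy as np

def info_nce(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings.
    Matched pairs sit on the diagonal of the similarity matrix; the loss
    pulls them together and pushes mismatched pairs apart."""
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent_diag(l: np.ndarray) -> float:
        # Cross-entropy with the diagonal (matched pair) as the target class.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return float(-np.log(np.diag(p)).mean())

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

# Perfectly aligned embeddings -> near-zero loss.
eye = np.eye(4)
print(info_nce(eye, eye))
```

With identical, well-separated embeddings the loss approaches zero, while mismatched or random pairings yield a loss near `log(batch_size)`; this gap is what drives the domain alignment stage.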