🤖 AI Summary
This study addresses the challenge of generating clinically relevant diagnostic descriptions from medical images across modalities (CT, MRI, and X-ray). We propose a lightweight region-enhanced Swin-BART model that employs a Swin Transformer as the image encoder and BART as the text decoder. A lightweight region attention module is inserted before cross-modal attention to explicitly strengthen the visual representations of diagnostically critical regions, improving vision–language alignment and the interpretability of generated captions. Evaluated on the ROCO dataset, the model achieves a ROUGE-L of 0.603 and a BERTScore of 0.807 (mean over three seeds), significantly outperforming ResNet-CNN and BLIP2-OPT baselines. The method combines high accuracy, strong attribution (faithful localization of diagnostic regions), and computational efficiency, making it well suited for integration into clinical radiology reporting workflows.
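The region attention step described above can be sketched as a softmax gating over encoder patch embeddings. This is a minimal illustration, not the paper's exact implementation: the shapes, the scoring vector `w`, and the residual amplification are assumptions.

```python
import numpy as np

def region_attention(patch_feats: np.ndarray, w: np.ndarray):
    """Re-weight encoder patch features by a per-region saliency score.

    patch_feats: (N, D) patch embeddings from the image encoder
    w:           (D,)  scoring vector (learnable in the real model)
    Returns the amplified features and the per-region attention weights.
    """
    scores = patch_feats @ w                        # (N,) one score per region
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over regions
    # Residual amplification: salient regions are boosted, none are zeroed,
    # so downstream cross-attention still sees every patch.
    enhanced = patch_feats * (1.0 + alpha[:, None])
    return enhanced, alpha
```

The weights `alpha` can also be rendered as a heatmap over the image grid, which is one way to produce the regional attributions mentioned above.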
📝 Abstract
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder–decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean ± std over three seeds and include 95% confidence intervals. Compared with baselines, our approach improves ROUGE-L (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR scores. We further provide ablations (regional attention on/off and a token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search with beam size 4, length penalty 1.1, `no_repeat_ngram_size` = 3, and maximum length 128. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
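The decoding settings listed above map directly onto the generation arguments of a typical Hugging Face `generate()` call. The dictionary below is a sketch of that configuration; the surrounding `model`/`inputs` names and the `early_stopping` flag are assumptions, not details taken from the paper.

```python
# Decoding configuration matching the stated settings; keys follow the
# Hugging Face `transformers` generate() API. A call would look like:
#   model.generate(**inputs, **decoding_config)   # model/inputs assumed
decoding_config = {
    "num_beams": 4,               # beam search, beam size 4
    "length_penalty": 1.1,        # mildly favors longer captions
    "no_repeat_ngram_size": 3,    # blocks repeated trigrams
    "max_length": 128,            # caption length cap in tokens
    "early_stopping": True,       # stop when all beams finish (assumption)
}
```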