Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning

📅 2025-11-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of generating clinically relevant diagnostic descriptions from medical imaging modalities (CT, MRI, and X-ray). We propose a lightweight region-enhanced Swin-BART model, which employs a Swin Transformer as the image encoder and BART as the text decoder. A lightweight region attention module is embedded prior to cross-modal attention to explicitly strengthen visual representations of diagnostically critical regions, thereby improving vision–language alignment and generation interpretability. Evaluated on the ROCO dataset, our model achieves ROUGE-L of 0.603 ± std and BERTScore of 0.807 ± std—significantly outperforming ResNet-CNN and BLIP2-OPT baselines. The method delivers high accuracy, strong attribution capability (i.e., faithful localization of diagnostic regions), and computational efficiency, making it well-suited for integration into clinical radiology reporting workflows.

Technology Category

Application Category

📝 Abstract
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean$pm$std over three seeds and include $95%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, $no_repeat_ngram_size$ $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
Problem

Research questions and friction points this paper is trying to address.

Automated medical image captioning generates diagnostic narratives from radiological images
A regional attention module enhances diagnostically salient regions before cross-attention
The system achieves state-of-the-art semantic fidelity while remaining interpretable
Innovation

Methods, ideas, or system contributions that make the work stand out.

Swin-BART encoder-decoder for medical captioning
Lightweight regional attention enhances diagnostic regions
Model achieves state-of-the-art semantic fidelity
🔎 Similar Papers
Z
Zubia Naz
Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea
F
Farhan Asghar
Electrical and Computer Engineering Department, Chonnam National University, Yeosu, South Korea
M
Muhammad Ishfaq Hussain
AI Convergence Department, Gwangju Institute of Science and Technology, Gwangju, South Korea
Y
Yahya Hadadi
Department of Information Systems, CCSIT, King Faisal University, Hofuf, Saudi Arabia
M
M. Rafique
Department of Information Systems, CCSIT, King Faisal University, Hofuf, Saudi Arabia
Wookjin Choi
Wookjin Choi
Department of Radiation Oncology, Thomas Jefferson University, Philadelphia, United States
Moongu Jeon
Moongu Jeon
Gwangju Institute of Science and Technology
Artificial intelligenceMachine learningComputer visionAutonomous driving