🤖 AI Summary
This study addresses the challenge of generating clinically relevant diagnostic descriptions from medical images across modalities (CT, MRI, and X-ray). We propose a lightweight region-enhanced Swin-BART model that employs a Swin Transformer as the image encoder and BART as the text decoder. A lightweight region attention module is inserted before cross-modal attention to explicitly strengthen the visual representations of diagnostically critical regions, improving vision–language alignment and the interpretability of generated captions. Evaluated on the ROCO dataset, the model achieves a ROUGE-L of 0.603 and a BERTScore of 0.807 (mean over three seeds), significantly outperforming ResNet-CNN and BLIP2-OPT baselines. The method combines high accuracy, strong attribution (faithful localization of diagnostic regions), and computational efficiency, making it well suited for integration into clinical radiology reporting workflows.
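The region attention step described above can be sketched as a softmax gating over encoder patch embeddings. This is a minimal illustration, not the paper's exact implementation: the shapes, the scoring vector `w`, and the residual amplification are assumptions.

```python
import numpy as np

def region_attention(patch_feats: np.ndarray, w: np.ndarray):
    """Re-weight encoder patch features by a per-region saliency score.

    patch_feats: (N, D) patch embeddings from the image encoder
    w:           (D,)  scoring vector (learnable in the real model)
    Returns the amplified features and the per-region attention weights.
    """
    scores = patch_feats @ w                        # (N,) one score per region
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over regions
    # Residual amplification: salient regions are boosted, none are zeroed,
    # so downstream cross-attention still sees every patch.
    enhanced = patch_feats * (1.0 + alpha[:, None])
    return enhanced, alpha
```

The weights `alpha` can also be rendered as a heatmap over the image grid, which is one way to produce the regional attributions mentioned above.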
📝 Abstract
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder–decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean ± std over three seeds and include 95% confidence intervals. Compared with baselines, our approach improves ROUGE-L (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR scores. We further provide ablations (regional attention on/off and a token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search with beam size 4, length penalty 1.1, `no_repeat_ngram_size` = 3, and maximum length 128. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
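The decoding settings listed above map directly onto the generation arguments of a typical Hugging Face `generate()` call. The dictionary below is a sketch of that configuration; the surrounding `model`/`inputs` names and the `early_stopping` flag are assumptions, not details taken from the paper.

```python
# Decoding configuration matching the stated settings; keys follow the
# Hugging Face `transformers` generate() API. A call would look like:
#   model.generate(**inputs, **decoding_config)   # model/inputs assumed
decoding_config = {
    "num_beams": 4,               # beam search, beam size 4
    "length_penalty": 1.1,        # mildly favors longer captions
    "no_repeat_ngram_size": 3,    # blocks repeated trigrams
    "max_length": 128,            # caption length cap in tokens
    "early_stopping": True,       # stop when all beams finish (assumption)
}
```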