MedBLIP: Fine-tuning BLIP for Medical Image Captioning

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical image captioning suffers from insufficient clinical accuracy and poor domain generalizability. This work systematically investigates domain adaptation of the BLIP model for radiological imaging on the ROCO dataset. The authors propose a decoder-only fine-tuning strategy that freezes the visual encoder and reaches 98% of full-parameter fine-tuning performance while reducing training time by 5%. To enhance interpretability, they introduce cross-modal attention visualizations together with a controlled-variable ablation analysis. Experiments show that the fine-tuned model significantly outperforms zero-shot BLIP as well as vision-language baselines such as BLIP-2 and ViT-GPT2 on standard metrics including CIDEr and BLEU-4, making decoder-only fine-tuning an attractive efficiency–accuracy trade-off for radiology-specific captioning.
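BLEU-4 is one of the standard metrics named above. As a rough illustration of how it scores a generated caption against a reference, here is a minimal, single-reference, unsmoothed sentence-level sketch using only the Python standard library; production evaluations would instead use an established implementation (e.g. a corpus-level BLEU with smoothing), and the `bleu4` helper below is purely illustrative, not the paper's evaluation code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with brevity penalty.

    Single reference, uniform weights, no smoothing -- a minimal sketch,
    not the full corpus-level metric typically reported in papers.
    """
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped n-gram matches: each candidate n-gram counts at most
        # as often as it appears in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / 4)


score = bleu4("chest x ray shows clear lungs", "chest x ray shows clear lungs")
```

An identical candidate and reference score 1.0; a caption sharing no unigrams with the reference scores 0.0.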

📝 Abstract
Medical image captioning is a challenging task that requires generating clinically accurate and semantically meaningful descriptions of radiology images. While recent vision-language models (VLMs) such as BLIP, BLIP-2, Gemini, and ViT-GPT2 show strong performance on natural image datasets, they often produce generic or imprecise captions when applied to specialized medical domains. In this project, we explore the effectiveness of fine-tuning the BLIP model on the ROCO dataset for improved radiology captioning. We compare the fine-tuned BLIP against its zero-shot version, BLIP-2 base, BLIP-2 Instruct, and a ViT-GPT2 transformer baseline. Our results demonstrate that domain-specific fine-tuning of BLIP significantly improves performance across both quantitative and qualitative evaluation metrics. We also visualize decoder cross-attention maps to assess interpretability and conduct an ablation study to evaluate the contributions of encoder-only and decoder-only fine-tuning. Our findings highlight the importance of targeted adaptation for medical applications and suggest that decoder-only fine-tuning (with the encoder frozen) offers a strong baseline with 5% lower training time than full fine-tuning, while full-model fine-tuning still yields the best results overall.
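The decoder-only strategy described above amounts to freezing every vision-encoder parameter so that gradient updates reach only the text decoder. A minimal PyTorch sketch of that pattern is below; `ToyCaptioner` is a tiny stand-in (not the real BLIP architecture), and in practice the same two lines of freezing logic would be applied to something like `transformers.BlipForConditionalGeneration`, whose `vision_model` attribute holds the ViT encoder.

```python
import torch
import torch.nn as nn

# Tiny stand-in for an encoder-decoder captioner. Illustrative only:
# the real model would be BlipForConditionalGeneration, which exposes
# .vision_model (ViT encoder) and .text_decoder submodules.
class ToyCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_model = nn.Linear(16, 8)   # stands in for the ViT encoder
        self.text_decoder = nn.Linear(8, 32)   # stands in for the caption decoder

model = ToyCaptioner()

# Decoder-only fine-tuning: freeze every vision-encoder parameter so the
# optimizer updates only the text decoder.
for param in model.vision_model.parameters():
    param.requires_grad = False

# Pass only the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
```

Freezing the encoder shrinks the optimizer state and backward pass accordingly, which is where the reported training-time savings come from.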
Problem

Research questions and friction points this paper is trying to address.

Improving medical image captioning accuracy via BLIP fine-tuning
Addressing generic captions in specialized medical domains
Evaluating encoder-decoder fine-tuning impact on performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning BLIP for medical image captioning
Domain-specific adaptation improves radiology descriptions
Decoder-only fine-tuning reduces training time effectively