Adapting Lightweight Vision Language Models for Radiological Visual Question Answering

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Radiology visual question answering (VQA) faces three key challenges: scarcity of annotated data, complexity of medical imaging modalities, and lack of standardized evaluation frameworks. To address these, we propose a lightweight 3B-parameter multimodal model built upon a novel pipeline integrating synthetic QA pair generation and multi-stage domain-specific fine-tuning. We further introduce the first saliency-map-based lightweight diagnostic tool for radiology VQA, enabling experts to localize pathological failure modes of the model. Our method combines vision-language model adaptation, synthetic data augmentation, and fine-tuning on domain-specialized datasets—including ROCO v2.0 and MedPix v2.0—alongside interpretability analysis techniques. Despite its compact parameter count and limited training data, our model achieves performance on par with LLaVA-Med on both open- and closed-ended questions. Most notably, it is the first radiology VQA system to provide interpretable, failure-mode-aware diagnostics.

📝 Abstract
Recent advancements in vision-language systems have improved the accuracy of Radiological Visual Question Answering (VQA) models. However, challenges remain across each stage of model development: limited expert-labeled images hinder data procurement at scale; the intricate and nuanced patterns of radiological images make modeling inherently difficult; and the lack of evaluation efforts makes it difficult to identify cases where the model might be ill-conditioned. In this study, we fine-tune a lightweight 3B-parameter vision-language model for Radiological VQA, demonstrating that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We propose a cost-effective training pipeline spanning synthetic question-answer pair generation to multi-stage fine-tuning on specialised radiological domain-targeted datasets (e.g., ROCO v2.0, MedPix v2.0). Our results show that despite operating at a fraction of the scale of state-of-the-art models such as LLaVA-Med, our model achieves promising performance given its small parameter size and the limited scale of training data. We introduce a lightweight saliency-based diagnostic tool that enables domain experts to inspect VQA model performance and identify ill-conditioned failure modes through saliency analysis.
Problem

Research questions and friction points this paper is trying to address.

Limited expert-labeled radiological images for scalable data procurement
Complex patterns in radiological images make modeling difficult
Lack of evaluation efforts to identify ill-conditioned model cases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tune lightweight 3B parameter vision-language model
Cost-effective training pipeline with synthetic data
Lightweight saliency-based diagnostic tool for analysis
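The paper does not publish the diagnostic tool's internals, but a saliency-based failure-mode inspector of the kind described can be sketched with occlusion attribution: mask each image region, measure the drop in the model's answer confidence, and treat large drops as salient. The `answer_confidence` function below is a toy stand-in for a real VQA model's scoring head; all names here are illustrative assumptions, not the authors' code.

```python
def answer_confidence(image):
    # Toy stand-in for a VQA model's answer score: confidence is
    # driven entirely by the bright top-left 2x2 region of the "image".
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0

def occlusion_saliency(image, score_fn, baseline=0.0):
    """Per-pixel score drop when that pixel is occluded.

    Higher values mean the region mattered more to the prediction,
    letting an expert check whether the model attends to pathology
    or to spurious background.
    """
    base = score_fn(image)
    rows, cols = len(image), len(image[0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            occluded = [row[:] for row in image]  # copy, then mask one cell
            occluded[r][c] = baseline
            saliency[r][c] = base - score_fn(occluded)
    return saliency

image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
sal = occlusion_saliency(image, answer_confidence)
```

In a real pipeline the inner loop would occlude patches of a radiograph and call the fine-tuned 3B model; the resulting map is what a radiologist would overlay on the image to spot ill-conditioned behavior, e.g. high saliency on annotations or borders rather than anatomy.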
Aditya Shourya
Department of Advanced Computing Sciences, Maastricht University
Michel Dumontier
Distinguished Professor of Data Science, Maastricht University
data science, artificial intelligence, biomedical informatics, semantic web, ontology
Chang Sun
Institute of Data Science, Maastricht University; Department of Advanced Computing Sciences, Maastricht University