Adapting Lightweight Vision Language Models for Radiological Visual Question Answering

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Radiology visual question answering (VQA) faces three key challenges: scarcity of annotated data, complexity of medical imaging modalities, and lack of standardized evaluation frameworks. To address these, we propose a lightweight 3B-parameter multimodal model built upon a novel pipeline integrating synthetic QA pair generation and multi-stage domain-specific fine-tuning. We further introduce the first saliency-map-based lightweight diagnostic tool for radiology VQA, enabling experts to localize pathological failure modes of the model. Our method combines vision-language model adaptation, synthetic data augmentation, and fine-tuning on domain-specialized datasets—including ROCO v2.0 and MedPix v2.0—alongside interpretability analysis techniques. Despite its compact parameter count and limited training data, our model achieves performance on par with LLaVA-Med on both open- and closed-ended questions. Most notably, it is the first radiology VQA system to provide interpretable, failure-mode-aware diagnostics.

📝 Abstract
Recent advancements in vision-language systems have improved the accuracy of Radiological Visual Question Answering (VQA) models. However, challenges remain across each stage of model development: limited expert-labeled images hinder data procurement at scale; the intricate and nuanced patterns of radiological images make modeling inherently difficult; and the lack of evaluation efforts makes it difficult to identify cases where the model might be ill-conditioned. In this study, we fine-tune a lightweight 3B-parameter vision-language model for Radiological VQA, demonstrating that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We propose a cost-effective training pipeline spanning synthetic question-answer pair generation to multi-stage fine-tuning on specialised radiological domain-targeted datasets (e.g., ROCO v2.0, MedPix v2.0). Our results show that despite operating at a fraction of the scale of state-of-the-art models such as LLaVA-Med, our model achieves promising performance given its small parameter size and the limited scale of training data. We introduce a lightweight saliency-based diagnostic tool that enables domain experts to inspect VQA model performance and identify ill-conditioned failure modes through saliency analysis.
Problem

Research questions and friction points this paper is trying to address.

Limited expert-labeled radiological images for scalable data procurement
Complex patterns in radiological images make modeling difficult
Lack of evaluation efforts to identify ill-conditioned model cases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tune lightweight 3B parameter vision-language model
Cost-effective training pipeline with synthetic data
Lightweight saliency-based diagnostic tool for analysis
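The paper does not publish the diagnostic tool's internals, but a saliency-based failure-mode inspector of the kind described can be sketched with occlusion attribution: mask each image region, measure the drop in the model's answer confidence, and treat large drops as salient. The `answer_confidence` function below is a toy stand-in for a real VQA model's scoring head; all names here are illustrative assumptions, not the authors' code.

```python
def answer_confidence(image):
    # Toy stand-in for a VQA model's answer score: confidence is
    # driven entirely by the bright top-left 2x2 region of the "image".
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0

def occlusion_saliency(image, score_fn, baseline=0.0):
    """Per-pixel score drop when that pixel is occluded.

    Higher values mean the region mattered more to the prediction,
    letting an expert check whether the model attends to pathology
    or to spurious background.
    """
    base = score_fn(image)
    rows, cols = len(image), len(image[0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            occluded = [row[:] for row in image]  # copy, then mask one cell
            occluded[r][c] = baseline
            saliency[r][c] = base - score_fn(occluded)
    return saliency

image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
sal = occlusion_saliency(image, answer_confidence)
```

In a real pipeline the inner loop would occlude patches of a radiograph and call the fine-tuned 3B model; the resulting map is what a radiologist would overlay on the image to spot ill-conditioned behavior, e.g. high saliency on annotations or borders rather than anatomy.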
Aditya Shourya
Department of Advanced Computing Sciences, Maastricht University
Michel Dumontier
Distinguished Professor of Data Science, Maastricht University
data science, artificial intelligence, biomedical informatics, semantic web, ontology
Chang Sun
Institute of Data Science, Maastricht University; Department of Advanced Computing Sciences, Maastricht University