Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

📅 2025-10-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses wound care in medical visual question answering (MedVQA) by proposing a lightweight multimodal retrieval-augmented generation (RAG) method that requires no model fine-tuning or complex re-ranking. Instead, it retrieves indexed text–image dual-modal examples and injects them as context to strengthen large language models' (LLMs) clinical reasoning and structured-output capabilities. The approach pairs instruction-tuned LLMs with a multimodal RAG framework to jointly generate free-text responses and standardized wound attributes (e.g., size, exudate, margins). Evaluation employs dBLEU, ROUGE, BERTScore, and LLM-based metrics; on the MEDIQA-WV 2025 benchmark, the method achieves a mean score of 41.37, ranking third among 19 teams (51 submissions) and demonstrating improvements in clinical relevance, factual accuracy, and adherence to structured output conventions.
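The retrieve-then-inject loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exemplar store, the bag-of-words similarity (a stand-in for a learned embedding), and the prompt template are all hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical exemplar index: past wound-care queries paired with
# reference answers in the structured-attribute style the task requires.
EXEMPLARS = [
    {"query": "How large is the wound and is there exudate?",
     "answer": "Size: 2x3 cm; exudate: moderate serous; margins: irregular."},
    {"query": "Are the wound margins well defined?",
     "answer": "Margins: well defined; size: 1x1 cm; exudate: none."},
]

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, standing in for a real text encoder."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(query: str, k: int = 1) -> str:
    """Retrieve the top-k exemplars and inject them into the LLM prompt."""
    ranked = sorted(EXEMPLARS, key=lambda e: bow_cosine(query, e["query"]),
                    reverse=True)
    context = "\n".join(f"Q: {e['query']}\nA: {e['answer']}"
                        for e in ranked[:k])
    return f"Examples:\n{context}\n\nQ: {query}\nA:"

prompt = build_prompt("What is the wound size and exudate level?")
print(prompt)
```

Because the retrieval layer sits entirely at inference time, the base LLM stays frozen; only the prompt changes as exemplars are added.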

๐Ÿ“ Abstract
Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.
Problem

Research questions and friction points this paper is trying to address.

Generating medical answers from clinical images and patient questions
Improving response quality through multimodal retrieval-augmented generation
Addressing wound-care visual question answering with a lightweight framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented generation framework with multimodal examples
Lightweight RAG using general-purpose LLMs without training
Simple indexing and fusion for clinical exemplar integration
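The "simple indexing and fusion" idea could look like the late-fusion sketch below. The feature vectors, case IDs, and fusion weight are invented for illustration; in practice the text and image vectors would come from separate encoders over the in-domain exemplars.

```python
def cosine(u, v):
    """Plain cosine similarity over two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical dual-modal index: (exemplar id, text vector, image vector).
INDEX = [
    ("case-01", [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    ("case-02", [0.1, 0.9, 0.2], [0.2, 0.7, 0.6]),
]

def retrieve(text_q, image_q, alpha=0.5, k=1):
    """Score each exemplar in both modalities, fuse with weight alpha,
    and return the ids of the top-k exemplars."""
    scored = [
        (alpha * cosine(text_q, tv) + (1 - alpha) * cosine(image_q, iv), eid)
        for eid, tv, iv in INDEX
    ]
    return [eid for _, eid in sorted(scored, reverse=True)[:k]]

print(retrieve([0.85, 0.2, 0.1], [0.9, 0.1, 0.0]))
```

A weighted sum of per-modality similarities is about the simplest fusion rule available, which matches the paper's framing of a minimal inference-time layer with no re-ranking model.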