Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

📅 2025-10-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses wound care in medical visual question answering (MedVQA) by proposing a lightweight multimodal retrieval-augmented generation (RAG) method that requires no model fine-tuning or complex re-ranking. Instead, it retrieves indexed text–image dual-modal examples and injects them as context to strengthen large language models' (LLMs) clinical reasoning and structured-output capabilities. The approach pairs instruction-tuned LLMs with a multimodal RAG framework to jointly generate free-text responses and standardized wound attributes (e.g., size, exudate, margins). Evaluation employs dBLEU, ROUGE, BERTScore, and LLM-based metrics; on the MEDIQA-WV 2025 benchmark, the method achieves a mean score of 41.37, ranking third among 19 teams (51 submissions) and demonstrating improvements in clinical relevance, factual accuracy, and adherence to structured output conventions.
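The retrieve-then-inject loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exemplar store, the bag-of-words similarity (a stand-in for a learned embedding), and the prompt template are all hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical exemplar index: past wound-care queries paired with
# reference answers in the structured-attribute style the task requires.
EXEMPLARS = [
    {"query": "How large is the wound and is there exudate?",
     "answer": "Size: 2x3 cm; exudate: moderate serous; margins: irregular."},
    {"query": "Are the wound margins well defined?",
     "answer": "Margins: well defined; size: 1x1 cm; exudate: none."},
]

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, standing in for a real text encoder."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(query: str, k: int = 1) -> str:
    """Retrieve the top-k exemplars and inject them into the LLM prompt."""
    ranked = sorted(EXEMPLARS, key=lambda e: bow_cosine(query, e["query"]),
                    reverse=True)
    context = "\n".join(f"Q: {e['query']}\nA: {e['answer']}"
                        for e in ranked[:k])
    return f"Examples:\n{context}\n\nQ: {query}\nA:"

prompt = build_prompt("What is the wound size and exudate level?")
print(prompt)
```

Because the retrieval layer sits entirely at inference time, the base LLM stays frozen; only the prompt changes as exemplars are added.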

๐Ÿ“ Abstract
Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.
Problem

Research questions and friction points this paper is trying to address.

Generating medical answers from clinical images and patient questions
Improving response quality through multimodal retrieval-augmented generation
Addressing wound-care visual question answering with a lightweight framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented generation framework with multimodal examples
Lightweight RAG using general-purpose LLMs without training
Simple indexing and fusion for clinical exemplar integration
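The "simple indexing and fusion" idea could look like the late-fusion sketch below. The feature vectors, case IDs, and fusion weight are invented for illustration; in practice the text and image vectors would come from separate encoders over the in-domain exemplars.

```python
def cosine(u, v):
    """Plain cosine similarity over two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical dual-modal index: (exemplar id, text vector, image vector).
INDEX = [
    ("case-01", [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    ("case-02", [0.1, 0.9, 0.2], [0.2, 0.7, 0.6]),
]

def retrieve(text_q, image_q, alpha=0.5, k=1):
    """Score each exemplar in both modalities, fuse with weight alpha,
    and return the ids of the top-k exemplars."""
    scored = [
        (alpha * cosine(text_q, tv) + (1 - alpha) * cosine(image_q, iv), eid)
        for eid, tv, iv in INDEX
    ]
    return [eid for _, eid in sorted(scored, reverse=True)[:k]]

print(retrieve([0.85, 0.2, 0.1], [0.9, 0.1, 0.0]))
```

A weighted sum of per-modality similarities is about the simplest fusion rule available, which matches the paper's framing of a minimal inference-time layer with no re-ranking model.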