🤖 AI Summary
This study addresses wound care in medical visual question answering (MedVQA) by proposing a lightweight multimodal retrieval-augmented generation (RAG) method that requires no model fine-tuning or complex re-ranking. Instead, it leverages indexed text–image dual-modal examples and context injection to enhance large language models' (LLMs) clinical reasoning and structured output capabilities. The approach integrates instruction-tuned LLMs with a multimodal RAG framework to jointly generate free-text responses and standardized wound attributes (e.g., size, exudate, margins). Evaluation employs dBLEU, ROUGE, BERTScore, and LLM-based metrics; on the MEDIQA-WV 2025 benchmark, the method achieves a mean score of 41.37, ranking third among 19 teams (51 submissions), demonstrating significant improvements in clinical relevance, factual accuracy, and adherence to structured output conventions.
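The retrieval-and-injection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exemplar schema, the `embedding` field, and all function names are hypothetical, and precomputed embedding vectors stand in for whatever text/image encoder the real system uses.

```python
import numpy as np

def build_index(examples):
    """Index exemplars (each a dict with a precomputed 'embedding' vector
    and a 'text' description) for nearest-neighbor retrieval."""
    vecs = np.stack([e["embedding"] for e in examples]).astype(float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    return vecs, examples

def retrieve(index, query_vec, k=2):
    """Return the top-k exemplars by cosine similarity to the query."""
    vecs, examples = index
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scores = vecs @ q
    top = np.argsort(-scores)[:k]
    return [examples[i] for i in top]

def inject_context(query_text, exemplars):
    """Context injection: prepend retrieved in-domain exemplars to the
    patient query before sending the prompt to the LLM."""
    blocks = [f"Example wound case: {e['text']}" for e in exemplars]
    return "\n".join(blocks) + f"\nPatient query: {query_text}"
```

Because retrieval is a single similarity lookup and injection is plain prompt concatenation, the layer adds no training and no re-ranking stage, which is what keeps the method lightweight.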
Abstract
Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.