LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
For radiology report generation, this work challenges the prevailing assumption that large multimodal models (LMMs) must be fine-tuned. We propose a zero-fine-tuning, text-driven retrieval-augmented generation (RAG) framework. Our method employs discrete, interpretable radiological labels as a vision–language bridge: first, a linear classifier (trained on ResNet-50 image features) predicts labels from input images; second, these labels serve as queries to retrieve semantically similar reports from a clinical corpus; third, a frozen, pretrained large language model (e.g., Llama-2) generates the final report—without any image input or joint vision–language training. This eliminates end-to-end optimization of visual encoders and LLMs. Experiments show our approach surpasses existing retrieval-based methods in both natural language and radiology-specific metrics, while matching the performance of fine-tuned multimodal models. Furthermore, we identify systematic inflation risks in current radiology report generation (RRG) evaluation protocols.
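The three-stage pipeline described above can be sketched in a few lines of Python. This is a toy illustration under loose assumptions, not the paper's released code: the labels, corpus, embedding, classifier weights, and similarity measure (label-set Jaccard overlap) are all hypothetical stand-ins, and the final prompt would in practice be passed to a frozen pretrained LLM.

```python
# Hypothetical sketch of a LaB-RAG-style pipeline. All names, data, and the
# Jaccard retrieval heuristic are illustrative assumptions, not the paper's code.

LABELS = ["Cardiomegaly", "Edema", "Pleural Effusion"]

# Toy report corpus, each entry indexed by its radiological labels.
CORPUS = [
    ({"Cardiomegaly"}, "Heart size is enlarged. Lungs are clear."),
    ({"Edema", "Pleural Effusion"}, "Interstitial edema with small effusions."),
    (set(), "No acute cardiopulmonary abnormality."),
]

def predict_labels(image_embedding, weights, threshold=0.0):
    """Stage 1: linear classifier over a frozen image embedding.
    Scores each label by a dot product; a positive score marks it present."""
    return {
        label
        for label, w in zip(LABELS, weights)
        if sum(x * wi for x, wi in zip(image_embedding, w)) > threshold
    }

def retrieve_reports(query_labels, corpus, k=1):
    """Stage 2: rank corpus reports by label-set overlap (Jaccard similarity)."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0
    ranked = sorted(corpus, key=lambda item: jaccard(query_labels, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query_labels, examples):
    """Stage 3: prompt a frozen, general-domain LLM; no image is ever shown."""
    context = "\n".join(f"- {r}" for r in examples)
    return (
        f"Example reports:\n{context}\n"
        f"Findings labels: {', '.join(sorted(query_labels)) or 'none'}.\n"
        "Write a radiology report consistent with these labels."
    )

# Toy usage: a 2-d "embedding" and hand-set per-label weight vectors.
weights = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
labels = predict_labels((0.9, -0.5), weights)   # only Cardiomegaly scores > 0
prompt = build_prompt(labels, retrieve_reports(labels, CORPUS, k=1))
```

Note that the generative model only ever sees text: the predicted labels and the retrieved example reports. That is the sense in which the labels act as a vision-language bridge without any joint fine-tuning.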

📝 Abstract
In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that these latent features ought to be high-dimensional vectors which require model fine-tuning to handle. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that leverages image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG), where the task is to generate a clinician's report detailing their observations from a set of radiological images, such as X-rays. We argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text-space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image feature encoder models, and without ever directly "showing" the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics compared with other retrieval-based RRG methods, while attaining competitive results compared to other fine-tuned vision-language RRG models. We further present results of our experiments with various components of LaB-RAG to better understand our method. Finally, we critique the use of a popular RRG metric, arguing it is possible to artificially inflate its results without true data leakage.
Problem

Research questions and friction points this paper is trying to address.

Improving radiology report generation without fine-tuning large models
Leveraging categorical labels to enhance retrieval-augmented generation
Transforming X-rays into text labels for general-domain LLM processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses classification models to convert X-rays into text labels
Combines derived labels with retrieval-augmented generation using LLMs
Achieves competitive results without task-specific generative model training
👥 Authors
Steven Song
University of Chicago
machine learning for healthcare
Anirudh Subramanyam
Center for Translational Data Science, University of Chicago
Irene Madejski
Center for Translational Data Science, University of Chicago
Robert L. Grossman
Department of Computer Science, University of Chicago