🤖 AI Summary
Radiology report generation faces two challenges: cross-modal alignment between medical images and text is difficult, and annotated data is extremely scarce. To address these, we propose UniCrossAdapter, a lightweight plug-and-play module that operates on a frozen CLIP backbone. It enables joint optimization of visual and linguistic representations via dual-path cross-modal injection and multi-layer cross-attention, supporting parameter-efficient domain adaptation in which only the adapters are trained. The method adopts an end-to-end generative architecture without fine-tuning the large model's parameters, substantially reducing training cost and overfitting risk. Evaluated on two mainstream public benchmarks, it outperforms existing state-of-the-art methods across all metrics while using significantly fewer trainable parameters, demonstrating the effectiveness and generality of the frozen-large-model-plus-structured-adapter paradigm for few-shot medical multimodal tasks.
📝 Abstract
Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data; datasets for this task are far smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while the base parameters are kept frozen. The adapters are distributed across both modalities and their interaction to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing the state of the art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at https://github.com/chauncey-tow/MRG-CLIP.
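The core recipe the abstract describes — small trainable adapters attached to a frozen backbone — can be sketched in a few lines of PyTorch. This is a minimal illustration of the general bottleneck-adapter pattern, not the authors' UniCrossAdapter; the residual bottleneck design, the near-identity initialization, and the dimensions are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        # Zero-init the up-projection so the adapter starts as the identity
        # and the frozen backbone's behavior is preserved at the first step.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """A frozen backbone block followed by a trainable adapter."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

def add_adapters(backbone_blocks, dim: int) -> nn.ModuleList:
    """Wrap each backbone block with an adapter and freeze the backbone,
    so only adapter weights receive gradients during fine-tuning."""
    adapted = nn.ModuleList(AdaptedBlock(b, dim) for b in backbone_blocks)
    for wrapped in adapted:
        for p in wrapped.block.parameters():
            p.requires_grad = False
    return adapted
```

In a real setup the wrapped blocks would be CLIP's vision and text transformer layers, and (per the abstract) adapters would also sit at the cross-modal interaction, but the freezing and training logic is the same.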