OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

📅 2026-06-11

🤖 AI Summary

This work addresses the limited generalization and reproducibility of existing medical vision-language models, which stems from the absence of large-scale open multimodal pretraining data. The authors present OpenMedQ, a medical vision-language model pretrained on a curated collection of 14 publicly available datasets (pathology, radiology, microscopy, and text-only clinical question answering) totaling approximately 3.35 million samples. Employing a contrastive learning framework, the model jointly optimizes an image encoder and a language model. On medical VQA, it reaches a BLEU-1 of 75.9 on PathVQA, surpassing Med-PaLM M variants of up to 562B parameters (roughly 80 times larger), and matches the best reported BLEU-1 of 64.5 on VQA-MED. Its vision encoder, transferred to eight unseen classification benchmarks under an identical downstream recipe, attains the highest average macro-F1 (0.757), ahead of BiomedCLIP, PMC-CLIP, PubMedCLIP, and a from-scratch baseline. The released code and interactive demo provide a reproducible open baseline for the community.

📝 Abstract

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757), ahead of BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and a publicly available interactive demo as a reproducible baseline for the community.
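
For readers unfamiliar with the two metrics quoted in the abstract, the sketch below shows one common way to compute them. It is not the authors' evaluation code: the whitespace tokenization, lowercasing, single-reference setup, and the function names `bleu1` and `macro_f1` are all assumptions made for illustration. Both functions return values in [0, 1]; the BLEU-1 figures above appear to be reported on a 0 to 100 scale.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """BLEU-1 for one answer: clipped unigram precision times the brevity penalty.

    Assumes a single reference and simple whitespace tokenization.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # A candidate token is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(bleu1("left lung", "left lung opacity"))
    print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Macro-F1 weights every class equally regardless of its frequency, which is why it is often preferred for imbalanced medical classification benchmarks. The abstract's "average macro-F1" would then be the mean of this score across the 8 benchmarks.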

Problem

Research questions and friction points this paper is trying to address.

medical vision-language model

open pretraining

medical VQA

multimodal learning

medical image classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

medical vision-language model

open pretraining

cross-domain generalization