🤖 AI Summary
This work addresses the limited reliability of existing medical vision-language models in high-stakes clinical settings, where reasoning must be grounded in both visual evidence and medical knowledge and a correct final answer alone is not sufficient. The authors introduce OpenMedReason, a large-scale open multimodal medical reasoning corpus of approximately 450,000 image-question-answer instances, along with the held-out evaluation benchmark OpenMedReason-Bench. The reasoning traces are derived primarily from curated, human-authored biomedical articles rather than synthetic chains of thought, and the benchmark supports fine-grained diagnostic assessment of models across perception, medical knowledge, and rationale. Training on the corpus with supervised fine-tuning and reinforcement-based alignment yields a 20% average improvement in VQA accuracy over the base model and comes within 4.2% of the strongest comparable-scale medical LVLMs. The gains are spread across all three axes, and the resulting reasoning traces are preferred over the base model's in 86.1% of pairwise comparisons.
📝 Abstract
High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated, human-authored biomedical articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought and covers diverse medical visual modalities, including radiological scans, microscopic images, visible-light photographs, and charts. We complement it with OpenMedReason-Bench, a held-out benchmark that supports fine-grained evaluation of LVLMs along three complementary capability axes (perception, medical knowledge, and rationale), enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is an effective training resource for both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: training with OpenMedReason improves perception, medical knowledge, and rationale jointly, and the trained model's reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at huggingface.co/datasets/neginb/OpenMedReason.
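To make the two kinds of reported numbers concrete, the following is a minimal sketch of how per-axis accuracy and a pairwise preference rate could be computed. It is not the authors' evaluation code. The record keys (`axis`, `correct`), the labels `"tuned"` and `"base"`, and both function names are assumptions made for illustration; only the three axis names come from the abstract.

```python
from collections import defaultdict

# Axis names taken from the abstract; everything else here is hypothetical.
AXES = ("perception", "medical_knowledge", "rationale")


def per_axis_accuracy(records):
    """Return accuracy per axis.

    `records` is an iterable of dicts with the assumed keys
    'axis' (one of AXES) and 'correct' (bool).
    Axes with no records are omitted from the result.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record["axis"]] += 1
        hits[record["axis"]] += int(record["correct"])
    return {axis: hits[axis] / totals[axis] for axis in AXES if totals[axis]}


def pairwise_preference_rate(judgments, preferred="tuned"):
    """Return the fraction of pairwise comparisons won by `preferred`.

    `judgments` is a list of winner labels, one per comparison
    (assumed here to be either 'tuned' or 'base').
    """
    if not judgments:
        raise ValueError("no judgments to aggregate")
    return sum(1 for winner in judgments if winner == preferred) / len(judgments)


# Toy data, invented for this example only.
toy_records = [
    {"axis": "perception", "correct": True},
    {"axis": "perception", "correct": True},
    {"axis": "medical_knowledge", "correct": True},
    {"axis": "medical_knowledge", "correct": False},
    {"axis": "rationale", "correct": False},
]
toy_judgments = ["tuned", "tuned", "base", "tuned"]

axis_scores = per_axis_accuracy(toy_records)
win_rate = pairwise_preference_rate(toy_judgments)
```

Reporting the axes separately, rather than a single pooled accuracy, is what lets an analysis check whether gains are concentrated in one capability or spread across all three.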