🤖 AI Summary
To address the global-local alignment bias caused by redundant information in medical vision-language pretraining, this paper proposes an anatomical region-level knowledge-grounding framework for fine-grained semantic alignment between chest X-ray images and radiology reports. The method introduces a Transformer-based grounding mechanism that explicitly aligns anatomical region-level visual features with domain-specific medical knowledge text embeddings, packaged as a grounded knowledge-enhancement module that jointly models region-level visual and textual representations, and adopts a multi-task joint pretraining paradigm to mitigate data bias. Evaluated on four downstream tasks (disease classification, lesion localization, report generation, and medical visual question answering), the framework achieves state-of-the-art or competitive performance, demonstrating that knowledge grounding improves image-report semantic consistency and the clinical expressiveness of the learned representations.
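At its core, the grounding mechanism described above amounts to cross-attention from region-level visual queries to medical-knowledge text keys and values. The PyTorch sketch below illustrates one way such a module could look; the class name `GroundedKnowledgeEnhancer`, the dimensions, and the layer counts are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GroundedKnowledgeEnhancer(nn.Module):
    """Illustrative grounding module: anatomical region features (queries)
    attend to medical-knowledge text embeddings (keys/values).
    Hypothetical sketch; not the architecture from the paper."""
    def __init__(self, dim=256, num_heads=8, num_layers=2):
        super().__init__()
        self.attn_layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, region_feats, knowledge_embeds):
        # region_feats:     (B, R, dim) visual features for R anatomical regions
        # knowledge_embeds: (B, K, dim) text embeddings for K knowledge entries
        x = region_feats
        for attn, norm in zip(self.attn_layers, self.norms):
            grounded, _ = attn(query=x, key=knowledge_embeds, value=knowledge_embeds)
            x = norm(x + grounded)  # residual + norm, transformer-style
        return x + self.ffn(x)      # knowledge-enhanced region features

# Usage with toy inputs: 12 detected regions, 20 knowledge entries
regions = torch.randn(2, 12, 256)
knowledge = torch.randn(2, 20, 256)
enhanced = GroundedKnowledgeEnhancer()(regions, knowledge)
print(enhanced.shape)  # torch.Size([2, 12, 256])
```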
📝 Abstract
Medical foundation models have the potential to revolutionize healthcare by providing robust and generalized representations of medical data. Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical images and text. However, current algorithms that exploit global and local alignment between medical images and text can be marred by redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-ray. In this framework, medical knowledge is grounded to the appropriate anatomical regions by a transformer-based grounded knowledge-enhanced module, which performs fine-grained alignment between the textual features of medical knowledge and the corresponding anatomical region-level visual features. The performance of GK-MVLP is competitive with or exceeds the state of the art on downstream image understanding tasks (chest X-ray disease classification and disease localization), a generative task (report generation), and a vision-language understanding task (medical visual question answering). Our results demonstrate the advantage of incorporating a grounding mechanism to remove biases and improve the alignment between chest X-ray images and radiology reports.
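The abstract does not state the pre-training objective. A common choice for this kind of fine-grained region-text alignment is a symmetric contrastive (InfoNCE) loss over matched region/knowledge pairs; the sketch below, continuing the assumptions above, shows one plausible form of such an objective, not GK-MVLP's actual loss.

```python
import torch
import torch.nn.functional as F

def region_knowledge_alignment_loss(region_feats, knowledge_embeds, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss: row i of each tensor is assumed
    to be a matched region/knowledge pair. Shapes: (N, dim)."""
    v = F.normalize(region_feats, dim=-1)
    t = F.normalize(knowledge_embeds, dim=-1)
    logits = v @ t.t() / temperature  # (N, N) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Cross-entropy in both directions: region-to-knowledge and knowledge-to-region
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```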