Grounded Knowledge-Enhanced Medical Vision-Language Pre-training for Chest X-Ray

📅 2024-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the global-local alignment bias caused by redundant information in medical vision-language pre-training, this paper proposes GK-MVLP, an anatomical region-level knowledge-grounding framework for fine-grained semantic alignment between chest X-ray images and radiology reports. A transformer-based grounded knowledge-enhanced module explicitly aligns anatomical region-level visual features with text embeddings of domain-specific medical knowledge, and a multi-task joint pre-training paradigm mitigates data bias. Evaluated on four downstream tasks—disease classification, lesion localization, report generation, and medical visual question answering—the framework achieves state-of-the-art or competitive performance, indicating that knowledge grounding improves image-report semantic consistency and the clinical quality of the learned representations.
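The paper releases no code here, but the core grounding idea—anatomical region features attending to medical-knowledge text embeddings—resembles standard cross-attention. Below is a minimal NumPy sketch of that generic mechanism, not the authors' implementation; the function name, dimensions, and randomly initialized projections are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(regions, knowledge, d_k=64, rng=None):
    """Generic single-head cross-attention: region features (queries)
    attend to knowledge text embeddings (keys/values).

    regions:   (num_regions, dim) anatomical region-level visual features
    knowledge: (num_entities, dim) medical-knowledge text embeddings
    Returns knowledge-enhanced region features of shape (num_regions, d_k).
    """
    rng = rng or np.random.default_rng(0)
    dim = regions.shape[1]
    # Random projections stand in for learned weight matrices.
    W_q = rng.standard_normal((dim, d_k)) / np.sqrt(dim)
    W_k = rng.standard_normal((dim, d_k)) / np.sqrt(dim)
    W_v = rng.standard_normal((dim, d_k)) / np.sqrt(dim)
    Q, K, V = regions @ W_q, knowledge @ W_k, knowledge @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (num_regions, num_entities)
    return attn @ V

# Example: 6 anatomical regions, 10 knowledge entries, 128-dim features.
rng = np.random.default_rng(1)
out = cross_attention(rng.standard_normal((6, 128)),
                      rng.standard_normal((10, 128)))
print(out.shape)  # (6, 64)
```

Each output row is a knowledge-weighted mixture for one anatomical region, which is the sense in which knowledge is "grounded" to regions.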

📝 Abstract
Medical foundation models have the potential to revolutionize healthcare by providing robust and generalized representations of medical data. Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical images and text. However, current algorithms that exploit global and local alignment between medical images and text can be marred by redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-ray. In this framework, medical knowledge is grounded to the appropriate anatomical regions by a transformer-based grounded knowledge-enhanced module that performs fine-grained alignment between textual features of medical knowledge and the corresponding anatomical region-level visual features. The performance of GK-MVLP was competitive with or exceeded the state of the art on downstream image understanding tasks (chest X-ray disease classification, disease localization), a generative task (report generation), and a vision-language understanding task (medical visual question answering). Our results demonstrate the advantage of incorporating a grounding mechanism to remove biases and improve the alignment between chest X-ray images and radiology reports.
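The fine-grained alignment the abstract describes is, in many vision-language pre-training methods, trained with a symmetric contrastive (InfoNCE-style) objective over matched feature pairs. The sketch below shows that generic objective applied to region/knowledge pairs; it is an assumption for illustration, not GK-MVLP's exact loss, and the function name and temperature value are hypothetical.

```python
import numpy as np

def fine_grained_alignment_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style alignment loss over matched pairs.

    region_feats, text_feats: (n, d) arrays where row i of each is a
    matched region/knowledge pair; all other rows serve as negatives.
    """
    # L2-normalize, then cosine-similarity logits scaled by temperature.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = r @ t.T / temperature  # (n, n)
    labels = np.arange(len(r))

    def xent(lg):
        # Cross-entropy with the diagonal (matched pair) as the target.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average of region-to-text and text-to-region directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Matched pairs (identical, mutually orthogonal features) give near-zero loss.
matched = np.eye(4)
print(fine_grained_alignment_loss(matched, matched))
```

Minimizing such a loss pulls each region's features toward the text embedding of its matched knowledge entry while pushing away the other entries, which is one common way to operationalize "fine-grained alignment".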
Problem

Research questions and friction points this paper is trying to address.

Enhancing medical vision-language pre-training
Reducing redundant information in medical data
Improving alignment between X-ray images and reports
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based grounded knowledge-enhanced module
Fine-grained alignment between text and anatomy
Competitive performance in medical tasks
Qiao Deng
CU Lab for AI in Radiology (CLAIR), Department of Imaging and Interventional Radiology
Zhongzhen Huang
Shanghai Jiao Tong University
Medical Image Analysis · Vision and Language
Yunqi Wang
CU Lab for AI in Radiology (CLAIR), Department of Imaging and Interventional Radiology
Zhichuan Wang
CU Lab for AI in Radiology (CLAIR), Department of Imaging and Interventional Radiology
Zhao Wang
Department of Computer Science and Engineering, The Chinese University of Hong Kong, HKSAR, China
Xiaofan Zhang
Shanghai Jiao Tong University, China, Shanghai AI Laboratory, China
Qi Dou
Department of Computer Science and Engineering, The Chinese University of Hong Kong, HKSAR, China
Yeung Yu Hui
China Unicom Global Limited
Edward S. Hui
CU Lab for AI in Radiology (CLAIR), Department of Imaging and Interventional Radiology, Department of Psychiatry