🤖 AI Summary
Current medical image–report alignment models exhibit limitations in fine-grained disease understanding and cross-modal semantic association, leading to diagnostic bias. To address this, we propose a novel framework for fine-grained image–report alignment. First, we design a disease-detail disentanglement module that explicitly separates lesion location, morphology, and pathological semantics. Second, we introduce a visual-attribute knowledge injection mechanism to incorporate domain-specific prior knowledge. Third, we construct a fine-grained annotation–guided semantic similarity matrix to enable zero-shot knowledge transfer to unseen diseases. Our method integrates large language model–based prompt engineering, multi-granularity contrastive alignment, and fine-grained disease semantic modeling. Evaluated on benchmarks including RSNA-Pneumonia and NIH ChestX-ray14, the framework achieves state-of-the-art performance in single-label, multi-label, and fine-grained classification tasks—yielding up to a 6.69% absolute accuracy improvement—while significantly enhancing model interpretability and generalization capability.
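The LLM-based extraction step can be pictured as prompting the model to emit structured disease triplets (location, morphology, disease) and parsing them from free-text reports. Below is a minimal illustrative sketch; the prompt template, delimiter format, and field names are assumptions for illustration, not the paper's actual prompts.

```python
# Hypothetical prompt template: asks an LLM to list one finding per line,
# with fields separated by "|". The exact wording is an assumption.
PROMPT_TEMPLATE = (
    "Extract every disease finding from the radiology report below.\n"
    "Output one finding per line in the form: location | morphology | disease\n"
    "Report: {report}"
)


def parse_findings(llm_output: str):
    """Parse 'location | morphology | disease' lines into structured dicts,
    skipping any line that does not have exactly three fields."""
    findings = []
    for line in llm_output.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            findings.append(
                {"location": parts[0], "morphology": parts[1], "disease": parts[2]}
            )
    return findings


# Example of parsing a (mocked) LLM response:
mock_response = (
    "left lower lobe | patchy opacity | pneumonia\n"
    "right apex | nodular | tuberculosis"
)
triplets = parse_findings(mock_response)
```

Decoupling reports into such triplets is what lets the rest of the pipeline align images against short, information-dense disease descriptions instead of long, noisy report text.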
📝 Abstract
Medical vision-language pretraining (VLP) that leverages naturally paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model that introduces medical image-specific knowledge through contrastive learning. Specifically: 1) An information extractor based on a large language model is proposed to decouple comprehensive disease details from reports; it excels at extracting disease details through flexible prompt engineering, effectively reducing text complexity while retaining rich information at minimal cost. 2) A knowledge injector is proposed to construct relationships between categories and visual attributes, which helps the model make judgments based on image features and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations is proposed, providing smoother, more informative labels and thus allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, with classification accuracy improved by up to 6.69%. The code is available at https://github.com/PerceptionComputingLab/MedFILIP.
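The fine-grained alignment idea in points 2)–3) can be sketched as a contrastive loss where the usual one-hot image-text targets are replaced by soft targets derived from a semantic similarity matrix. This is a minimal NumPy sketch under stated assumptions (cosine-similarity logits, row-normalized similarity targets, a hypothetical temperature of 0.07); it is not the authors' implementation.

```python
import numpy as np


def soft_contrastive_loss(img_emb, txt_emb, sim_matrix, temperature=0.07):
    """Image-to-text contrastive loss with soft targets.

    img_emb, txt_emb: (N, D) embedding matrices for a batch of N pairs.
    sim_matrix: (N, N) nonnegative semantic similarities between the i-th
    image's label and the j-th text's label; rows are normalized into
    soft target distributions (identity recovers the standard CLIP-style loss).
    """
    # L2-normalize so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    # Soft targets: each row of the similarity matrix becomes a distribution.
    targets = sim_matrix / sim_matrix.sum(axis=1, keepdims=True)

    # Numerically stable log-softmax over each row of logits.
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))

    # Cross-entropy between soft targets and predicted distribution.
    return float(-(targets * log_probs).sum(axis=1).mean())
```

With soft targets, a "pneumonia" image is not pushed maximally away from a "lung infection" description, which is what gives smoother supervision and better transfer to related, unseen categories.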