🤖 AI Summary
Weak semantic alignment and imprecise modeling of fine anatomical structures and pathological lesions hinder self-supervised representation learning for medical X-ray images. To address this, we propose the first contrastive self-supervised framework integrating Optimal Transport (OT) theory. Our method introduces: (1) a cross-viewpoint semantics infusion module (CV-SIM) to enhance anatomical consistency across multiple views; and (2) OT-constrained variance and covariance regularization to enforce dense semantic invariance. Evaluated on three public chest X-ray datasets, our approach consistently outperforms state-of-the-art methods across downstream tasks, including classification, object detection, and segmentation, demonstrating significant performance gains. These results validate the effectiveness and generalizability of OT-driven semantic alignment for medical image representation learning.
📝 Abstract
Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. Despite this potential, however, conventional SSL methods face limitations, including difficulty in achieving semantic alignment and in capturing subtle details. This leads to suboptimal representations that fail to accurately capture the underlying anatomical structures and pathological details. In response to these constraints, we introduce OPTiML, a novel SSL framework that employs optimal transport (OT) to capture dense semantic invariance and fine-grained details, thereby enhancing the overall effectiveness of SSL for medical image representation learning. The core idea is to integrate OT with a cross-viewpoint semantics infusion module (CV-SIM), which effectively captures the complex, fine-grained details inherent in medical images across different viewpoints. In addition to the CV-SIM module, OPTiML imposes variance and covariance regularizations within the OT framework to force the model to focus on clinically relevant information while discarding less informative features. Through these components, the proposed framework learns semantically rich representations that can be applied to various medical imaging tasks. To validate its effectiveness, we conduct experimental studies on three publicly available chest X-ray datasets. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.
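The abstract does not spell out the loss formulation, but the two ingredients it names (entropy-regularized OT alignment between views, plus variance/covariance regularization in the style of VICReg) can be illustrated in a minimal NumPy sketch. Everything below is an assumption on my part, not the paper's actual implementation: Sinkhorn iterations with uniform marginals, a cosine-distance cost between patch embeddings of two views, and standard variance/covariance penalties.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=50):
    """Entropy-regularized OT plan via Sinkhorn iterations, uniform marginals.
    (Hypothetical choice; the paper may use different marginals/solver.)"""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m      # uniform marginals
    K = np.exp(-cost / eps)                    # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]         # transport plan

def ot_alignment_loss(z1, z2, eps=0.1):
    """Transport cost aligning two views' embeddings under a cosine-distance cost."""
    z1n = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2n = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    cost = 1.0 - z1n @ z2n.T                   # cosine distance in [0, 2]
    plan = sinkhorn_plan(cost, eps)
    return float(np.sum(plan * cost))

def var_cov_regularizer(z, gamma=1.0, eps=1e-4):
    """VICReg-style terms: keep per-dimension std above gamma, decorrelate dims."""
    z = z - z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / d
    return float(var_loss + cov_loss)
```

A training step would then combine `ot_alignment_loss(z1, z2)` with `var_cov_regularizer` on each view's embeddings, weighted by coefficients that the paper would specify.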