🤖 AI Summary
This study investigates how to select self-supervised learning (SSL) methods based on the structural and noise characteristics of medical images so that the learned features are clinically relevant. We systematically compare joint embedding architectures (JEAs) and joint embedding predictive architectures (JEPAs) on ultrasound and histopathology images, establishing, for the first time, a principled framework that links SSL objectives to modality-specific properties such as localized signal versus global structure. Expert evaluation by radiologists and pathologists indicates that JEAs are better suited to modalities with localized signal, such as histopathology, whereas JEPAs excel in tasks requiring global structural understanding, such as liver ultrasound interpretation. Matching the SSL objective to the modality in this way improves the clinical relevance of the learned representations.
📝 Abstract
Though self-supervised learning (SSL) has demonstrated a remarkable ability to learn robust representations from unlabeled data, the choice of SSL strategy can lead to vastly different performance outcomes in specialized domains. Joint embedding architectures (JEAs) and joint embedding predictive architectures (JEPAs) have shown robustness to noise and strong semantic feature learning compared to pixel reconstruction-based SSL methods, leading to widespread adoption in medical imaging. However, no prior work has systematically investigated which SSL objective is better aligned with the spatial organization of clinically relevant signal. In this work, we empirically investigate how the choice of SSL method impacts the learned representations in medical imaging. We select two representative imaging modalities characterized by unique noise profiles: ultrasound and histopathology. When informative signal is spatially localized, as in histopathology, JEAs are more effective due to their view-invariance objective. In contrast, when diagnostically relevant information is globally structured, such as the macroscopic anatomy present in liver ultrasounds, JEPAs are the better choice. These differences are especially evident in the clinical relevance of the learned features, as independently validated by board-certified radiologists and pathologists. Together, our results provide a framework for matching SSL objectives to the structural and noise properties of medical imaging modalities.
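As a rough illustration of the two objective families contrasted in the abstract, the sketch below computes a view-invariance loss of the kind JEAs use and a latent-space prediction loss of the kind JEPAs use. It is not the authors' implementation: the function names, the cosine-based invariance loss, the mean-squared latent loss, and the array shapes are all assumptions chosen for illustration, and the encoders and predictor that would produce these embeddings are omitted.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def jea_invariance_loss(z_a, z_b):
    """Hypothetical JEA-style loss: mean (1 - cosine similarity) between
    embeddings of two augmented views of the same images.

    z_a, z_b: arrays of shape (batch, dim).
    The loss is low when both views map to the same direction in embedding space.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    return float(np.mean(1.0 - np.sum(z_a * z_b, axis=-1)))


def jepa_prediction_loss(predicted, target, target_mask):
    """Hypothetical JEPA-style loss: mean squared error in latent space,
    averaged only over the masked target patches.

    predicted, target: arrays of shape (batch, patches, dim).
    target_mask: boolean array of shape (batch, patches); True marks patches
    whose embeddings the predictor must infer from the visible context.
    """
    sq_err = np.mean((predicted - target) ** 2, axis=-1)  # (batch, patches)
    n_targets = max(int(np.sum(target_mask)), 1)          # avoid division by zero
    return float(np.sum(sq_err * target_mask) / n_targets)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Two noisy "views" of the same underlying embeddings (stand-ins for encoder outputs).
    base = rng.normal(size=(4, 16))
    view_a = base + 0.05 * rng.normal(size=base.shape)
    view_b = base + 0.05 * rng.normal(size=base.shape)
    print("JEA invariance loss:", jea_invariance_loss(view_a, view_b))

    # Patch-level target embeddings and an imperfect prediction of them.
    target = rng.normal(size=(4, 9, 16))
    predicted = target + 0.1 * rng.normal(size=target.shape)
    mask = rng.random(size=(4, 9)) < 0.5
    print("JEPA prediction loss:", jepa_prediction_loss(predicted, target, mask))
```

The invariance loss compares one embedding per image across augmentations, while the prediction loss is computed per patch against embeddings of regions hidden from the context. Under these assumptions, the first emphasizes whatever stays stable across views and the second rewards modeling how parts of an image relate to one another, which is consistent with the localized-versus-global distinction the abstract draws.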