🤖 AI Summary
Medical ultrasound images suffer from low signal-to-noise ratios and speckle noise, which limit the effectiveness of conventional self-supervised learning methods that reconstruct pixels. To address this challenge, this work proposes US-JEPA, a framework based on the Joint Embedding Predictive Architecture (JEPA) that uses a frozen, domain-specific teacher model to provide stable latent targets. By predicting masked latent representations against this static teacher, US-JEPA decouples student and teacher optimization and avoids the hyperparameter sensitivity and computational cost of online teachers updated via exponential moving average. The work also presents the first rigorous comparison of all publicly available ultrasound foundation models on UltraBench, a public benchmark spanning multiple organs and pathologies. Under linear probing on classification tasks, US-JEPA performs competitively with or better than existing domain-specific and general-purpose vision foundation models.
📝 Abstract
Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process. The low signal-to-noise ratio and stochastic speckle patterns hinder standard self-supervised learning methods relying on a pixel-level reconstruction objective. Joint-Embedding Predictive Architectures (JEPAs) address this drawback by predicting masked latent representations rather than raw pixels. However, standard approaches depend on hyperparameter-brittle and computationally expensive online teachers updated via exponential moving average. We propose US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective. By using a frozen, domain-specific teacher to provide stable latent targets, US-JEPA decouples student-teacher optimization and pushes the student to expand upon the semantic priors of the teacher. In addition, we provide the first rigorous comparison of all publicly available state-of-the-art ultrasound foundation models on UltraBench, a public dataset benchmark spanning multiple organs and pathological conditions. Under linear probing for diverse classification tasks, US-JEPA achieves performance competitive with or superior to domain-specific and universal vision foundation model baselines. Our results demonstrate that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.
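The core idea in the abstract, a student predicting the latents of masked patches while a frozen teacher supplies the targets, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes small linear maps in place of the real encoders, uses the mean of the visible patches as a stand-in for the predictor's context, and every name in it (`matvec`, `train_step`, `teacher_w`, `student_w`) is hypothetical.

```python
import random

def matvec(weights, vec):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def train_step(student_w, teacher_w, patches, masked_idx, lr=0.1):
    """One gradient step on the masked latent prediction loss.

    Only student_w is modified; teacher_w is read but never written,
    which is what "frozen teacher" means in this toy setting.
    """
    # The student sees only the visible (unmasked) patches.
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    context = [sum(col) / len(visible) for col in zip(*visible)]
    pred = matvec(student_w, context)

    # The frozen teacher encodes the masked patches to produce latent targets.
    targets = [matvec(teacher_w, patches[i]) for i in masked_idx]

    # Mean squared error in latent space, over masked positions only.
    count = len(targets) * len(pred)
    loss = sum((p - t) ** 2 for tgt in targets for p, t in zip(pred, tgt)) / count

    # Analytic gradient of the loss with respect to the student weights.
    for j, row in enumerate(student_w):
        err_j = sum(pred[j] - tgt[j] for tgt in targets)
        for k in range(len(row)):
            row[k] -= lr * 2.0 * err_j * context[k] / count
    return loss

rng = random.Random(0)
patch_dim, latent_dim, num_patches = 4, 3, 6  # toy sizes, chosen arbitrarily

# Assumption: random weights stand in for a pretrained, domain-specific teacher.
teacher_w = [[rng.uniform(-1, 1) for _ in range(patch_dim)] for _ in range(latent_dim)]
teacher_snapshot = [row[:] for row in teacher_w]
student_w = [[0.0] * patch_dim for _ in range(latent_dim)]

patches = [[rng.uniform(-1, 1) for _ in range(patch_dim)] for _ in range(num_patches)]
masked_idx = [1, 4]  # positions hidden from the student

losses = [train_step(student_w, teacher_w, patches, masked_idx) for _ in range(50)]
```

Because the teacher is never updated, there is no exponential moving average momentum to tune and no second set of weights to refresh at each step. In this sketch the targets therefore stay fixed throughout training, which loosely mirrors the decoupled student-teacher optimization the abstract describes.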