🤖 AI Summary
To address the infeasibility of centralized pretraining for gastrointestinal (GI) endoscopic foundation models—caused by stringent medical data privacy constraints—this paper proposes the first privacy-preserving federated self-supervised pretraining framework tailored to GI endoscopy. Methodologically, it integrates federated learning (FedAvg/FedProx) with self-supervised learning (contrastive learning and masked autoencoding) to enable distributed learning of general-purpose representations across heterogeneous multi-center datasets, without sharing raw images or task-specific labels. Crucially, it pioneers foundation model pretraining under federated settings and evaluates the resulting model on three downstream tasks: classification, detection, and segmentation. Experimental results demonstrate an average performance improvement of 12.3% across these tasks, significantly enhancing model generalizability and clinical applicability.
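The core federated mechanism described above can be illustrated with a minimal sketch of FedAvg-style weighted aggregation. This is a simplified, hypothetical illustration (function and parameter names are assumptions, not the paper's code): each hospital trains locally on unlabeled endoscopy images with a self-supervised loss, then only the updated weights are averaged, never the raw images.

```python
# Minimal FedAvg aggregation sketch (hypothetical names; the paper's
# actual implementation and framework may differ).
import numpy as np

def fedavg(client_weights, client_sizes):
    """Size-weighted average of client model parameters (FedAvg).

    client_weights: list of dicts mapping parameter name -> np.ndarray
    client_sizes:   number of local (unlabeled) images per client
    """
    total = sum(client_sizes)
    agg = {}
    for name in client_weights[0]:
        # Each client's contribution is proportional to its dataset size.
        agg[name] = sum(
            (n / total) * w[name]
            for w, n in zip(client_weights, client_sizes)
        )
    return agg

# Two simulated hospitals: local self-supervised training would produce
# these encoder weights; only the weights leave the hospital, not images.
clients = [
    {"encoder.w": np.ones((2, 2)) * 1.0},  # hospital A, 100 images
    {"encoder.w": np.ones((2, 2)) * 3.0},  # hospital B, 300 images
]
global_weights = fedavg(clients, client_sizes=[100, 300])
# Weighted mean: 0.25 * 1.0 + 0.75 * 3.0 = 2.5 per entry
```

FedProx, also explored in the paper, differs only in the local objective: it adds a proximal term penalizing drift from the global weights during each client's local training, which helps under heterogeneous (non-IID) data.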
📝 Abstract
Gastrointestinal (GI) endoscopy is essential for identifying GI tract abnormalities, detecting diseases in their early stages, and improving patient outcomes. Although deep learning has shown success in supporting GI diagnostics and decision-making, these models require curated datasets with labels that are expensive to acquire. Foundation models offer a promising solution by learning general-purpose representations, which can be finetuned for specific tasks, overcoming data scarcity. Developing foundation models for medical imaging holds significant potential, but the sensitive and protected nature of medical data presents unique challenges. Foundation model training typically requires extensive datasets, and while hospitals generate large volumes of data, privacy restrictions prevent direct data sharing, making foundation model training infeasible in most scenarios. In this work, we propose a federated learning (FL) framework for training foundation models for gastroendoscopy imaging, enabling data to remain within local hospital environments while contributing to a shared model. We explore several established FL algorithms, assessing their suitability for training foundation models without relying on task-specific labels, and conduct experiments in both homogeneous and heterogeneous settings. We evaluate the trained foundation model on three critical downstream tasks: classification, detection, and segmentation. The model achieves improved performance across all tasks, highlighting the effectiveness of our approach in a federated, privacy-preserving setting.