A Universal Metric of Dataset Similarity for Cross-silo Federated Learning

📅 2024-04-29
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
In cross-institutional federated learning, non-independent and identically distributed (Non-IID) data across clients degrade model performance, while existing distribution shift assessment methods often require data sharing or task-specific assumptions, compromising privacy and generality. To address this, we propose the first general-purpose, training-free, and privacy-preserving dataset similarity metric for federated learning. Our approach leverages geometric analysis in feature space and statistical distance modeling, integrating differential privacy–compatible kernel density estimation with nonparametric similarity inference. We theoretically establish its connection to federated training dynamics. Extensive experiments on synthetic, benchmark, and medical imaging datasets demonstrate strong correlation with final model performance (average Pearson *r* > 0.87) and over 90% reduction in computational overhead compared to baseline methods.

Technology Category

Application Category

📝 Abstract
Federated Learning is increasingly used in domains such as healthcare to facilitate collaborative model training without data-sharing. However, datasets located in different sites are often non-identically distributed, leading to degradation of model performance in FL. Most existing methods for assessing these distribution shifts are limited by being dataset or task-specific. Moreover, these metrics can only be calculated by exchanging data, a practice restricted in many FL scenarios. To address these challenges, we propose a novel metric for assessing dataset similarity. Our metric exhibits several desirable properties for FL: it is dataset-agnostic, is calculated in a privacy-preserving manner, and is computationally efficient, requiring no model training. In this paper, we first establish a theoretical connection between our metric and training dynamics in FL. Next, we extensively evaluate our metric on a range of datasets including synthetic, benchmark, and medical imaging datasets. We demonstrate that our metric shows a robust and interpretable relationship with model performance and can be calculated in privacy-preserving manner. As the first federated dataset similarity metric, we believe this metric can better facilitate successful collaborations between sites.
Problem

Research questions and friction points this paper is trying to address.

Measuring dataset similarity without data sharing in federated learning
Addressing performance degradation from non-identical data distributions
Providing privacy-preserving dataset-agnostic similarity assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes a universal dataset similarity metric
Calculates metric in privacy-preserving manner
Requires no model training for computation
🔎 Similar Papers
No similar papers found.
A
Ahmed Elhussein
Department of Biomedical Informatics, Columbia University, New York Genome Center
G
Gamze Gürsoy
Department of Biomedical Informatics, Columbia University, New York Genome Center, Department of Computer Science, Columbia University