🤖 AI Summary
Privacy preservation and computational efficiency remain conflicting objectives in multi-center genome-wide association studies (GWAS). Method: We propose PP-GWAS, a distributed stacked ridge regression framework based on randomized encoding that enables secure and scalable linear mixed model (LMM) fitting under an all-but-one semi-honest adversary model. The approach uses randomized encoding to decouple sensitive covariance computations across sites, so that no raw genotype data is exchanged. Contribution/Results: Compared to state-of-the-art methods, our solution achieves a 2× speedup and reduces memory consumption by 35%. It scales empirically to cohorts of ten thousand individuals and millions of SNPs, offering a distributed LMM framework that combines strong privacy, high statistical accuracy, and scalability for real-world multi-center genomic analysis.
📝 Abstract
Genome-wide association studies (GWAS) are pivotal in understanding the genetic underpinnings of complex traits and diseases. Collaborative, multi-site GWAS aim to enhance statistical power but face obstacles due to the sensitive nature of genomic data sharing. Current state-of-the-art methods protect privacy with computationally expensive techniques such as Secure Multi-Party Computation and Homomorphic Encryption. In this context, we present a novel algorithm, PP-GWAS, designed to improve upon existing standards in computational efficiency and scalability without sacrificing data privacy. The algorithm employs randomized encoding within a distributed architecture to perform stacked ridge regression on a Linear Mixed Model, ensuring rigorous analysis. Experimental evaluation on real-world and synthetic data indicates that PP-GWAS runs twice as fast as comparable state-of-the-art algorithms while using fewer computational resources, all while adhering to a robust security model that covers an all-but-one semi-honest adversary setting. We have assessed its performance on various datasets, emphasizing its potential to facilitate more efficient and private genomic analyses.
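To make the idea of randomized encoding for ridge regression more concrete, here is a minimal sketch of one well-known masking trick: multiplying the data by a random orthogonal matrix leaves the Gram matrix, and hence the ridge solution, unchanged. This is an illustration under stated assumptions, not the PP-GWAS protocol itself. The single Householder reflection, the toy dimensions, and all function names (`householder`, `ridge`, and so on) are hypothetical choices made for the example, and a single reflection would be far too weak a mask in practice.

```python
import random

def householder(n, rng):
    """Random Householder reflection H = I - 2 v v^T / (v^T v); H is orthogonal."""
    v = [rng.gauss(0, 1) for _ in range(n)]
    vv = sum(a * a for a in v)
    return [[(1.0 if i == j else 0.0) - 2 * v[i] * v[j] / vv for j in range(n)]
            for i in range(n)]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def ridge(X, y, lam):
    """Ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    Xt = [list(col) for col in zip(*X)]
    G = matmul(Xt, X)
    for i in range(len(G)):
        G[i][i] += lam
    rhs = [sum(a * b for a, b in zip(col, y)) for col in Xt]
    return solve(G, rhs)

rng = random.Random(0)
n, p, lam = 8, 3, 1.0                      # toy sizes, assumed for illustration
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

H = householder(n, rng)                    # secret mask held by the data owner
X_enc = matmul(H, X)                       # encoded genotype-like matrix
y_enc = [sum(h * yi for h, yi in zip(row, y)) for row in H]  # encoded phenotype

beta_plain = ridge(X, y, lam)
beta_enc = ridge(X_enc, y_enc, lam)        # computed without seeing X or y
max_diff = max(abs(a - b) for a, b in zip(beta_plain, beta_enc))
print(max_diff)
```

Because `H` is orthogonal, `(HX)^T (HX) = X^T X` and `(HX)^T (Hy) = X^T y`, so the party that solves the regression only ever sees the encoded data yet recovers the same coefficients up to floating-point error. A real multi-site protocol would need a much stronger encoding and a way to coordinate it across sites, which this sketch does not attempt.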