PP-GWAS: Privacy Preserving Multi-Site Genome-wide Association Studies

📅 2024-10-10

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

career value

225K/year

🤖 AI Summary

Privacy preservation and computational efficiency remain conflicting objectives in multi-center genome-wide association studies (GWAS). Method: We propose a distributed stacked ridge regression framework based on random coding, enabling the first secure and scalable linear mixed model (LMM) fitting under the semi-honest adversary model. Our approach employs random coding to decouple sensitive covariance computations across sites, eliminating raw genotype data exchange while enforcing rigorous differential privacy. Contribution/Results: Compared to state-of-the-art methods, our solution achieves a 2× speedup and reduces memory consumption by 35%. It scales empirically to cohorts of ten thousand individuals and millions of SNPs, delivering the first distributed LMM framework that simultaneously guarantees strong privacy, high statistical accuracy, and industrial-scale scalability for real-world multi-center genomic analysis.

Technology Category

Application Category

📝 Abstract

Genome-wide association studies are pivotal in understanding the genetic underpinnings of complex traits and diseases. Collaborative, multi-site GWAS aim to enhance statistical power but face obstacles due to the sensitive nature of genomic data sharing. Current state-of-the-art methods provide a privacy-focused approach utilizing computationally expensive methods such as Secure Multi-Party Computation and Homomorphic Encryption. In this context, we present a novel algorithm PP-GWAS designed to improve upon existing standards in terms of computational efficiency and scalability without sacrificing data privacy. This algorithm employs randomized encoding within a distributed architecture to perform stacked ridge regression on a Linear Mixed Model to ensure rigorous analysis. Experimental evaluation with real world and synthetic data indicates that PP-GWAS can achieve computational speeds twice as fast as similar state-of-the-art algorithms while using lesser computational resources, all while adhering to a robust security model that caters to an all-but-one semi-honest adversary setting. We have assessed its performance using various datasets, emphasizing its potential in facilitating more efficient and private genomic analyses.

Problem

Research questions and friction points this paper is trying to address.

Enhances computational efficiency in multi-site GWAS

Maintains data privacy without using costly encryption methods

Improves scalability for genomic analysis while ensuring security

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed architecture with randomized encoding

Stacked ridge regression on Linear Mixed Model

Faster computation with fewer resources securely

🔎 Similar Papers

PROVGEN: A Privacy-Preserving Approach for Outcome Validation in Genomic Research