Deep Generative Models for Discrete Genotype Simulation

📅 2025-08-11
🤖 AI Summary
This study addresses privacy and access constraints in genomic data by proposing a deep generative modeling framework tailored to discrete genotypic data, supporting both unconditional and phenotype-conditioned generation. Methodologically, it adapts variational autoencoders (VAEs), diffusion models, and generative adversarial networks (GANs) to accommodate the discrete nature of genotype data, constituting the first systematic performance comparison of these architectures on full-chromosome-scale bovine and human genomic datasets. The framework integrates quantitative genetics priors with deep learning evaluation metrics to jointly preserve population genetic structure (e.g., linkage disequilibrium, allele frequency spectra) and genotype–phenotype associations. Experiments demonstrate substantial improvements over baselines across key metrics, including LD decay patterns, minor allele frequency distributions, and phenotypic prediction consistency. The implementation is publicly released, establishing the first reproducible, scalable benchmark platform and practical paradigm for genotype simulation.

📝 Abstract
Deep generative models open new avenues for simulating realistic genomic data while preserving privacy and addressing data accessibility constraints. While previous studies have primarily focused on generating gene expression or haplotype data, this study explores generating genotype data in both unconditioned and phenotype-conditioned settings, which is inherently more challenging due to the discrete nature of genotype data. In this work, we developed and evaluated commonly used generative models, including Variational Autoencoders (VAEs), Diffusion Models, and Generative Adversarial Networks (GANs), and proposed adaptations tailored to discrete genotype data. We conducted extensive experiments on large-scale datasets, including all chromosomes from cattle and multiple chromosomes from humans. Model performance was assessed using a well-established set of metrics drawn from both the deep learning and quantitative genetics literature. Our results show that these models can effectively capture genetic patterns and preserve genotype-phenotype associations. Our findings provide a comprehensive comparison of these models and offer practical guidelines for future research in genotype simulation. We have made our code publicly available at https://github.com/SihanXXX/DiscreteGenoGen.
Problem

Research questions and friction points this paper is trying to address.

Simulating discrete genotype data using deep generative models
Comparing VAEs, Diffusion Models, and GANs for genotype generation
Preserving genetic patterns and genotype-phenotype associations in simulations
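
To make "preserving genetic patterns" concrete, the sketch below compares two standard population-genetics summaries between a real and a synthetic genotype matrix: minor allele frequency (MAF) and pairwise linkage disequilibrium measured as squared correlation (r²). It assumes genotypes are coded as allele dosages 0/1/2 in a samples-by-SNPs array. The function names and the toy data are hypothetical illustrations and are not taken from the paper or its repository.

```python
import numpy as np

def allele_frequencies(genotypes):
    """Alternate-allele frequency per SNP for a (samples x SNPs) matrix coded 0/1/2."""
    # Each individual carries two alleles, so the mean dosage is divided by 2.
    return genotypes.mean(axis=0) / 2.0

def minor_allele_frequencies(genotypes):
    """Frequency of the rarer allele at each SNP."""
    freq = allele_frequencies(genotypes)
    return np.minimum(freq, 1.0 - freq)

def ld_r2(genotypes):
    """Pairwise LD as the squared Pearson correlation between SNP dosage columns."""
    # Monomorphic SNPs have zero variance, so their correlations come out as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(genotypes, rowvar=False)
    return corr ** 2

# Toy stand-ins for a real panel and a generated panel (assumed data, not from the paper).
rng = np.random.default_rng(0)
real = rng.binomial(2, 0.3, size=(200, 5))
synthetic = rng.binomial(2, 0.3, size=(200, 5))

# Mean absolute gaps: smaller values mean the synthetic panel tracks the real one more closely.
maf_gap = np.abs(minor_allele_frequencies(real) - minor_allele_frequencies(synthetic)).mean()
ld_gap = np.nanmean(np.abs(ld_r2(real) - ld_r2(synthetic)))
print(f"MAF gap: {maf_gap:.4f}, LD r2 gap: {ld_gap:.4f}")
```

On real chromosomes the full SNP-by-SNP r² matrix can be very large, so in practice such comparisons are often restricted to SNP pairs within a distance window.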
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapted VAEs, GANs, Diffusion Models for discrete genotypes
Generated unconditioned and phenotype-conditioned genotype data
Evaluated models using genetics and deep learning metrics
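
The summary does not spell out how the models were adapted to discrete genotypes. One generic way to handle discreteness, shown here purely as an assumption and not as the authors' method, is to treat each SNP as a three-class categorical variable: one-hot encode the dosages 0/1/2 as model input, and turn per-SNP class probabilities produced by a generator back into integer genotypes by argmax or by sampling. All names below are hypothetical.

```python
import numpy as np

N_CLASSES = 3  # genotype dosages 0, 1, 2

def one_hot_encode(genotypes):
    """Map an integer (samples x SNPs) matrix to a (samples x SNPs x 3) one-hot tensor."""
    return np.eye(N_CLASSES, dtype=np.float32)[genotypes]

def decode_argmax(probs):
    """Pick the most probable genotype class at each SNP."""
    return probs.argmax(axis=-1)

def decode_sample(probs, rng):
    """Draw one genotype per SNP from the per-SNP class probabilities (inverse-CDF sampling)."""
    cdf = probs.cumsum(axis=-1)
    u = rng.random(probs.shape[:-1] + (1,))
    # Count how many cumulative thresholds the uniform draw has passed.
    drawn = (u >= cdf).sum(axis=-1)
    # Guard against rounding that leaves the final CDF value slightly below 1.
    return np.minimum(drawn, N_CLASSES - 1)

# Round trip on a toy matrix: encoding then decoding recovers the input.
genotypes = np.array([[0, 1, 2], [2, 2, 0]])
encoded = one_hot_encode(genotypes)
assert np.array_equal(decode_argmax(encoded), genotypes)
```

Argmax decoding is deterministic and tends to under-represent rare genotypes, while sampling keeps that variability at the cost of noisier individual outputs; which trade-off suits a given model is an empirical question.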
Sihan Xie
Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France.
Thierry Tribout
Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France.
Didier Boichard
Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France.
Blaise Hanczar
Professor, Université Paris-Saclay (Univ. Evry)
Machine Learning, Bioinformatics
Julien Chiquet
Université Paris-Saclay, INRAE, AgroParisTech
Statistics, Machine Learning, Computational Biology
Eric Barrey
Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France.