Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows

📅 2025-11-19

📈 Citations: 0

✨ Influential: 0

career value

215K/year

🤖 AI Summary

Large-scale genomic workflows in precision medicine face challenges including substantial inter-chromosomal memory demand variability, high peak memory usage due to static resource allocation, I/O intensity, and frequent task failures. To address these, we propose a chromosome-level adaptive parallelization framework: (1) a novel memory prediction model integrating symbolic regression with interpolation-based bias correction to accurately estimate per-chromosome RAM consumption; (2) task packing formulated as a constrained knapsack problem for dynamic, resource-aware scheduling; and (3) a static processing-order optimization strategy to jointly minimize peak memory. Evaluated on real whole-genome sequencing (WGS) pipelines and large-scale simulations, our approach reduces out-of-memory failures by 92%, accelerates end-to-end execution by 1.8–3.2×, and improves resource utilization by over 40%, significantly enhancing analytical stability and scalability.

Technology Category

Application Category

📝 Abstract

Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.

Problem

Research questions and friction points this paper is trying to address.

Optimizing memory allocation for chromosome-level genomic workflows

Reducing task failures caused by excessive memory consumption

Improving resource utilization in precision medicine data processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Symbolic regression model estimates chromosome memory consumption

Dynamic scheduler predicts RAM usage via polynomial regression

Static scheduler optimizes chromosome order to minimize memory

🔎 Similar Papers

SequenceLab: A Comprehensive Benchmark of Computational Methods for Comparing Genomic Sequences