🤖 AI Summary
Large-scale genomic workflows in precision medicine face challenges including substantial inter-chromosomal memory demand variability, high peak memory usage due to static resource allocation, I/O intensity, and frequent task failures. To address these, we propose a chromosome-level adaptive parallelization framework: (1) a novel memory prediction model integrating symbolic regression with interpolation-based bias correction to accurately estimate per-chromosome RAM consumption; (2) task packing formulated as a constrained knapsack problem for dynamic, resource-aware scheduling; and (3) a static processing-order optimization strategy to jointly minimize peak memory. Evaluated on real whole-genome sequencing (WGS) pipelines and large-scale simulations, our approach reduces out-of-memory failures by 92%, accelerates end-to-end execution by 1.8–3.2×, and improves resource utilization by over 40%, significantly enhancing analytical stability and scalability.
📝 Abstract
Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.