🤖 AI Summary
This work addresses the computational challenge of simulating non-Markovian (renewal) epidemics on contact networks with millions of nodes, where age-dependent holding times (log-normal, Weibull, Erlang, and similar) force dense per-step updates that make the sparse event-queue strategies of standard CPU methods ineffective. The authors present FlashSpread, a GPU framework that fuses the per-step renewal pipeline (CSR traversal, numerically stable erfcx-based hazard evaluation, Bernoulli tau-leaping, state transition, and next-step infectivity write-back) into a single Triton kernel whose intermediates stay in registers. Block-scalar skips keep the kernel compatible with CUDA Graph capture, and a degree-aware CSR dispatch (thread / warp / edge-merge) preserves throughput on scale-free graphs. On an NVIDIA A100, the fused engine reaches 8.09 Giga-NUPS at N = 10^6 on a uniform-degree graph, 217× faster than optimised CPU tau-leaping at the same N; on a Barabási–Albert graph of the same size, the merge-based dispatch is 4.5× faster than the default kernel (0.45 to 2.0 Giga-NUPS). The framework scales to N = 10^8 on a single 40 GB A100, and a mixed-precision storage path extends the L2-reachable problem size by roughly 3×. Against an exact non-Markovian Gillespie reference, the bias is approximately 6% on peak infection and 7% on final attack rate, which the authors describe as within typical epidemiological parameter uncertainty.
📝 Abstract
Non-Markovian (renewal) epidemic simulation on multi-million-node contact networks is essential for realistic forecasting under general age-dependent holding-time distributions (log-normal, Weibull, Erlang, and similar), but the age-dependent hazard forces dense per-step updates that render the sparse event-queue strategies of standard CPU methods ineffective. We present FlashSpread, a GPU framework that consolidates the per-step renewal pipeline (CSR traversal, numerically stable erfcx-based hazard evaluation, Bernoulli tau-leaping, state transition, and next-step infectivity write-back) into a single fused Triton kernel whose intermediates never leave streaming-multiprocessor registers, with block-scalar skips that preserve CUDA Graph capture and a degree-aware CSR dispatch (thread / warp / edge-merge) that preserves peak throughput on scale-free graphs. On an NVIDIA A100 the fused CUDA-Graph engine reaches 8.09 Giga-NUPS at N = 10^6 on a uniform-degree graph, a 217x strict hardware speedup over optimised CPU tau-leaping at the same N; on a Barabási–Albert graph of the same size the merge-based dispatch recovers 4.5x (0.45 to 2.0 Giga-NUPS) over the default kernel. The framework scales to N = 10^8 on a single A100 (40 GB), with a mixed-precision storage path that extends the L2-reachable scale by roughly 3x and delivers a 1.15x throughput lift at the far bandwidth-bound end. Validation against an exact non-Markovian Gillespie reference shows a structural-bias floor of approximately 6% on peak infection and approximately 7% on final attack rate; this floor does not detectably decrease as the tolerance epsilon is reduced towards 0 across two decades, and it sits comfortably within typical epidemiological parameter uncertainty. Code: https://github.com/Shakeri-Lab/FlashSpread.
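To make the erfcx-based hazard and the Bernoulli tau-leaping step concrete, the following is a minimal pure-Python sketch, not the FlashSpread Triton kernel. It assumes a log-normal holding time with parameters `mu` and `sigma`, for which the hazard can be written as h(a) = sqrt(2/pi) / (a · sigma · erfcx(z/sqrt(2))) with z = (ln a − mu)/sigma, and it assumes a per-step firing probability of 1 − exp(−h(a)·dt). All function names, the series cutoffs, and the step rule are illustrative assumptions rather than details taken from the paper.

```python
import math
import random

SQRT_PI = math.sqrt(math.pi)


def erfcx(x: float) -> float:
    """Scaled complementary error function exp(x^2) * erfc(x) (illustrative)."""
    if x < -26.0:
        return math.inf  # exp(x^2) would overflow; the true value is astronomically large
    if x < 10.0:
        return math.exp(x * x) * math.erfc(x)  # direct product is safe in this range
    # Large x: erfc(x) alone underflows, so use the asymptotic series instead.
    t = 1.0 / (2.0 * x * x)
    return (1.0 - t + 3.0 * t * t - 15.0 * t ** 3) / (x * SQRT_PI)


def lognormal_hazard(age: float, mu: float, sigma: float) -> float:
    """Hazard pdf(age) / survival(age) of a log-normal, written via erfcx."""
    if age <= 0.0:
        return 0.0
    z = (math.log(age) - mu) / sigma
    # The exp(-z^2 / 2) factors of pdf and survival cancel analytically,
    # so no 0/0 appears deep in the upper tail.
    return math.sqrt(2.0 / math.pi) / (age * sigma * erfcx(z / math.sqrt(2.0)))


def bernoulli_step(ages, mu, sigma, dt, rng):
    """Advance every infected age by dt; return (surviving_ages, n_fired)."""
    survivors, fired = [], 0
    for a in ages:
        # Assumed firing probability over the step: 1 - exp(-h(a) * dt).
        p = -math.expm1(-lognormal_hazard(a, mu, sigma) * dt)
        if rng.random() < p:
            fired += 1
        else:
            survivors.append(a + dt)
    return survivors, fired


if __name__ == "__main__":
    rng = random.Random(0)
    ages, n_fired = bernoulli_step([1.0] * 1000, mu=0.0, sigma=1.0, dt=0.1, rng=rng)
    print(len(ages), n_fired)
```

The point of the erfcx form is that the naive ratio pdf/survival becomes 0/0 in floating point once the survival function underflows, whereas the scaled form stays finite. Where SciPy is available, `scipy.special.erfcx` could replace the hand-rolled helper above.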