pith. sign in

arxiv: 2604.22092 · v2 · submitted 2026-04-23 · 💻 cs.DC

FlashSpread: IO-Aware GPU Simulation of Non-Markovian Epidemic Dynamics via Kernel Fusion

Pith reviewed 2026-05-08 13:48 UTC · model grok-4.3

classification 💻 cs.DC
keywords GPU simulationnon-Markovian epidemickernel fusiontau-leapingcontact networkshigh-performance computingCUDA Graph
0
0 comments X

The pith

A single fused GPU kernel delivers 217x speedup for non-Markovian epidemic simulation on million-node networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Non-Markovian epidemic models require tracking the age of each infection to apply realistic recovery and transmission rates drawn from general distributions such as log-normal or Weibull. On large contact networks this forces an update at nearly every node in every time step, which defeats the sparse event queues that make CPU methods fast. The paper consolidates the entire pipeline of network traversal, hazard evaluation, probabilistic state jumps, and result write-back into one GPU kernel that keeps all working data inside fast registers and uses degree-aware scheduling to maintain high throughput on both uniform and scale-free graphs. On an A100 GPU this produces 8.09 billion node updates per second at one million nodes and scales to one hundred million nodes while introducing only a fixed structural bias of roughly 6-7 percent in peak prevalence and final attack rate. The bias stays flat even when the numerical tolerance is tightened by two orders of magnitude.

Core claim

The fused CUDA-Graph engine, built around a single Triton kernel that performs CSR traversal, numerically stable erfcx hazard evaluation, Bernoulli tau-leaping, state transition, and infectivity write-back without leaving streaming-multiprocessor registers, reaches 8.09 Giga-NUPS at N=10^6 on uniform-degree graphs and 2.0 Giga-NUPS on Barabasi-Albert graphs, delivering a 217x hardware speedup over optimized CPU tau-leaping at the same size while scaling to N=10^8 on a single 40 GB GPU with a mixed-precision path that adds a further 1.15x at the bandwidth limit.

What carries the argument

The single fused Triton kernel with block-scalar skips for CUDA Graph capture and degree-aware CSR dispatch (thread, warp, or edge-merge) that keeps all intermediates inside registers.

Load-bearing premise

The observed 6-7 percent structural bias floor stays acceptable and does not grow when the method is applied to new holding-time distributions, different network topologies, or substantially larger problem sizes.

What would settle it

Run an exact non-Markovian Gillespie simulation on a 10^5-node network with the same parameters and holding-time distribution, then measure whether the peak infection and final attack rate differ from the fused-kernel results by more than the reported 6-7 percent floor.

Figures

Figures reproduced from arXiv: 2604.22092 by Aram Vajdi, Behnaz Moradi-Jamei, Ehsan Ardjmand, Heman Shakeri.

Figure 1
Figure 1. Figure 1: System overview. Every kernel reads from a shared GPU-resident state block (left). The Markovian engine (top) branches at runtime between a global rebuild (Control Mode, FlashNeighbor) and a sparse atomic update (Inertial Mode) depending on whether the number of per-step events is ≪ N; both modes converge on a common tau-leap Bernoulli sampler. The Renewal engine (bottom) is a single fused Triton kernel th… view at source ↗
Figure 2
Figure 2. Figure 2: SEIR model with mixed dynamics. The S→E transition is edge-mediated and Markovian; the E→I and I→R transitions are nodal and non-Markovian, with age-dependent hazard functions h(τi). 3 Problem formulation Consider a contact network G = (V, E,W) with N = |V| nodes and L interaction layers. Each node i occupies one of M compartments. The system state is: (X(t), τ (t)) ∈ {1, . . . , M} N × R N ≥0 (1) where X(… view at source ↗
Figure 3
Figure 3. Figure 3: Practitioner decision tree. A single top-level question (“is the hazard age-dependent?”) selects the engine; a single scalar (ρ = Dmax/Davg, inspected once at construction) selects the CSR traversal strategy within the Renewal engine. Thresholds (ρw, ρm) = (4, 50) are calibrated in Appendix B. Both engines share the FlashNeighbor kernel for computing inducer influence (Eq. 3). A naive approach requires mat… view at source ↗
Figure 4
Figure 4. Figure 4: Per-thread control flow inside Algorithm 3: five coalesced HBM reads on entry, a two-way branch selected by block-scalar reductions over the current compartments in the thread block, register-only work (Bernoulli sample, transition, age reset, next-step infectivity), and one coalesced HBM write on exit. The branch is block-level (a scalar decision for the whole warp/block), not per-lane: blocks with no E o… view at source ↗
Figure 5
Figure 5. Figure 5: Active-node compaction: NUPS(t) stratified into ten temporal windows of the TF = 50 renewal benchmark, 5 replicate trials (mean ± 1 std shaded). Left: ER d=8, N=106 , where the epidemic only reaches ≈ 15% attack rate within TF=50; compaction is ∼3% slower than baseline during the takeoff and peak windows (refresh overhead exceeds marginal skip gain) and crosses over to a small lift only in the last tempora… view at source ↗
Figure 6
Figure 6. Figure 6: Renewal engine throughput (SEIR, log-normal, d=8) on Erdős–Rényi networks. Panel (a) shows the full scaling range N ∈ [102 , 108 ] with six curves: Fused CG b=50 (red, headline), Fused CG b=50 with mixed-precision storage (pink dashed; Section 5.7), unfused CUDA-Graph b=50, an 8-core CPU tau-leaping baseline (derived from NUPS via the shared 7.24 × 10−4 events-per-NUPS ratio calibrated at N = 106 ; a prope… view at source ↗
Figure 7
Figure 7. Figure 7: Fidelity validation. Fused GPU kernel (solid), fused CUDA Graph (dashed), and exact non￾Markovian Gillespie (dotted) on the validation benchmark: Erdős–Rényi graph with N = 103 nodes and average degree d = 8; β = 0.25; log-normal E → I with mean 5.0 d and median 4.0 d; log-normal I → R with mean 7.5 d and median 5.0 d; 100 independent Monte Carlo runs. The residual gap visible against the exact reference i… view at source ↗
Figure 8
Figure 8. Figure 8: Roofline analysis on NVIDIA A100 (peak 19.5 TFLOPS FP32 / 2039 GB/s HBM, ridge = 19500/2039 ≈ 9.56 FLOPs/byte). In panel (a), kernel fusion shifts the operating point upward by 4×; CUDA Graph batching adds 2.8×. Enabling mixed-precision storage on top of Fused CG (Section 5.7) shifts the point up-and-right by the ∼32% per-step byte reduction: the AI grows from 0.42 to ≈ 0.62 (same FLOPs, fewer bytes), and … view at source ↗
Figure 9
Figure 9. Figure 9: FlashSpread Markovian engine (blue dashed) versus exact Doob–Gillespie (red solid, with 25–75% quantile band) on ER d = 8, N = 103 , β = 0.25, recovery rate 0.15. Left: SIS (endemic steady state). Right: SIR (single epidemic wave). 100 independent Monte Carlo runs per curve. The tau-leaping curve tracks the exact reference across the full trajectory on both models, confirming that the Markovian engine’s ε-… view at source ↗
Figure 10
Figure 10. Figure 10: Ensemble-mean SEIR trajectories overlaid on the exact non-Markovian Gillespie reference (red solid, with 10–90% Monte Carlo quantile band), plus four tau-leaping tolerances, on the validation benchmark (Erdős–Rényi graph with N = 103 nodes and average degree d = 8; β = 2/d = 0.25; log-normal E→I and I→R with the parameters of Section 6; 100 independent Monte Carlo trajectories per curve). The exact refere… view at source ↗
Figure 11
Figure 11. Figure 11: Error decomposition for Bernoulli tau-leaping, with bootstrap 95% confidence intervals over the 100-run ensemble. Solid markers (with error bars) are per-run absolute errors against the exact non-Markovian Gillespie reference produced by the companion MATLAB simulator. Dashed markers are self-consistency trajectory-level errors against the finest ε = 0.005 tau-leaping ensemble. The solid curves’ CIs overl… view at source ↗
Figure 12
Figure 12. Figure 12: Monte Carlo ensemble-mean I(t)/N for the non-Markovian SEIR model (log-normal E→I and I→R) across three graph topologies (rows) and three sizes (columns). Each panel overlays an exact non￾Markovian Gillespie reference (red, with 25–75% inter-quartile band shaded; implemented via our companion CPU simulator, cited in the repository README) and tau-leaping at ε ∈ {0.005, 0.03, 0.1} (the ε = 0.005 curve is t… view at source ↗
Figure 13
Figure 13. Figure 13: Wall-clock–fidelity Pareto. Each point is one ε value (color: log10 ε, labelled). The vertical axis is max(|∆Ipeak|, |∆Rfinal|) — the larger of the two per-run absolute errors against the exact Gillespie reference — chosen so that no metric can hide a worse metric. Error bars on both axes are 95% bootstrap CIs (1,000 resamples over the 100-run ensemble): horizontal bars for wall-clock, vertical bars for t… view at source ↗
Figure 14
Figure 14. Figure 14: CSR layout used throughout FlashSpread. Left: a small undirected contact graph; node 2 (red) is the focus. Right: the three CSR arrays. row_ptr[i..i+1) delimits node i’s neighbor slice in col_ind and weights. Reading node 2’s neighbors therefore costs two coalesced scalar loads (pointer endpoints) plus a contiguous burst of D2 int/float pairs. Incoming and outgoing CSRs are stored separately: the incoming… view at source ↗
Figure 15
Figure 15. Figure 15: shows the data flow for a single warp. Global memory (HBM) row_ptr0 2 3 6 8 9 · · · col_ind1 3 0 1 3 4 0 2 1 · · · weightsw0 w1 w2 w3 w4 w5 w6 w7 w8 · · · infectivityι0 ι1 ι2 ι3 ι4 ι5 ι6 ι7 ι8 · · · One block of 128 threads (4 warps); each thread = 1 node thread 0 thread 1 thread 2 thread 3 · · · p0 ∈ register p1 ∈ register p2 ∈ register p3 ∈ register · · · loop over row_ptr[1..2) Single HBM write per thr… view at source ↗
Figure 16
Figure 16. Figure 16: GPU memory-hierarchy flow, before and after fusion. (a) The legacy PyTorch/Triton pipeline launches five kernels per step; each kernel reads from HBM and writes its O(N) intermediate (infectivity, pressure, rates, event probabilities, event mask) back to HBM before the next kernel starts. Total traffic is ∼64N bytes per step. (b) The fused kernel keeps all of these intermediates in per-thread registers; o… view at source ↗
read the original abstract

Non-Markovian (renewal) epidemic simulation on multi-million-node contact networks is essential for realistic forecasting under general age-dependent holding-time distributions (log-normal, Weibull, Erlang, and similar), but the age-dependent hazard forces dense per-step updates that render the sparse event-queue strategies of standard CPU methods ineffective. We present FlashSpread, a GPU framework that consolidates the per-step renewal pipeline (CSR traversal, numerically stable erfcx-based hazard evaluation, Bernoulli tau-leaping, state transition, and next-step infectivity write-back) into a single fused Triton kernel whose intermediates never leave streaming-multiprocessor registers, with block-scalar skips that preserve CUDA Graph capture and a degree-aware CSR dispatch (thread / warp / edge-merge) that keeps the peak throughput on scale-free graphs. On an NVIDIA A100 the fused CUDA-Graph engine reaches 8.09 Giga-NUPS at N = 10^6 on a uniform-degree graph, a 217x strict hardware speedup over optimised CPU tau-leaping at the same N; on a Barabasi-Albert graph of the same size the merge-based dispatch recovers 4.5x (0.45 to 2.0 Giga-NUPS) over the default kernel, and the framework scales to N = 10^8 on a single A100 (40 GB), with a mixed-precision storage path that extends the L2-reachable scale by roughly 3x and delivers a 1.15x throughput lift at the far bandwidth-bound end. Validation against an exact non-Markovian Gillespie reference shows a structural-bias floor of approximately 6% on peak infection and approximately 7% on final attack rate that does not detectably decrease as epsilon nears 0 across two decades of tolerance, comfortably within typical epidemiological parameter uncertainty. Code: https://github.com/Shakeri-Lab/FlashSpread.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FlashSpread, a GPU framework for non-Markovian epidemic simulation on large contact networks. It consolidates the renewal pipeline (CSR traversal, erfcx hazard evaluation, Bernoulli tau-leaping, state updates) into a single fused Triton kernel with block-scalar skips and degree-aware dispatch, reporting 8.09 Giga-NUPS at N=10^6 on uniform graphs (217x over CPU tau-leaping), 4.5x recovery on Barabasi-Albert graphs via merge dispatch, scaling to N=10^8 with mixed-precision, and a structural bias floor of ~6% on peak infection and ~7% on final attack rate versus exact Gillespie that is independent of epsilon.

Significance. If the performance numbers and bias characterization hold after verification, the work enables practical large-scale non-Markovian epidemic modeling on commodity GPUs, with clear hardware speedups and open code that supports reproducibility. The kernel-fusion approach and quantified scaling limits are potentially impactful for the field.

major comments (2)
  1. [Abstract] Abstract (validation paragraph): The claim that the 6-7% bias floor on peak infection and final attack rate is structural to Bernoulli tau-leaping (and independent of epsilon) is load-bearing for the correctness argument, yet the manuscript provides no direct comparison of the fused CUDA-Graph kernel (including erfcx, mixed-precision storage, warp-level merging, and register-resident intermediates) against a CPU reference executing identical algorithmic steps at full precision. Without this isolation, the non-vanishing bias could originate from implementation artifacts rather than the discretization itself.
  2. [Abstract] Abstract (scaling and mixed-precision paragraph): The mixed-precision storage path is reported to extend L2-reachable scale by ~3x and deliver 1.15x throughput at the bandwidth-bound end, but the manuscript does not quantify how this path affects the reported bias floor at N=10^8 or whether the 6-7% floor remains stable under the reduced-precision intermediates used at that scale.
minor comments (2)
  1. [Abstract] The abstract mentions 'block-scalar skips that preserve CUDA Graph capture' but does not define the skip criterion or its interaction with the degree-aware CSR dispatch; a brief definition or pseudocode would improve clarity.
  2. [Abstract] The throughput figures (8.09 Giga-NUPS, 0.45 to 2.0 Giga-NUPS) would benefit from explicit statement of the graph generation parameters (e.g., exact degree distribution for the uniform case) to aid reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and valuable suggestions. We address the two major comments point-by-point below. We will incorporate additional validation experiments to strengthen the bias analysis as requested.

read point-by-point responses
  1. Referee: [Abstract] Abstract (validation paragraph): The claim that the 6-7% bias floor on peak infection and final attack rate is structural to Bernoulli tau-leaping (and independent of epsilon) is load-bearing for the correctness argument, yet the manuscript provides no direct comparison of the fused CUDA-Graph kernel (including erfcx, mixed-precision storage, warp-level merging, and register-resident intermediates) against a CPU reference executing identical algorithmic steps at full precision. Without this isolation, the non-vanishing bias could originate from implementation artifacts rather than the discretization itself.

    Authors: We agree that a direct isolation of the bias source is necessary to support the claim that it is structural to Bernoulli tau-leaping. In the revised manuscript, we will add a comparison of the fused kernel against a CPU implementation that executes the exact same steps (CSR traversal, erfcx hazard, Bernoulli tau-leaping, state updates) at full double precision. This will allow us to confirm that the ~6-7% floor persists independently of the GPU-specific optimizations such as register residency and mixed-precision storage. We expect this to validate the structural nature, but will report the results transparently. revision: yes

  2. Referee: [Abstract] Abstract (scaling and mixed-precision paragraph): The mixed-precision storage path is reported to extend L2-reachable scale by ~3x and deliver 1.15x throughput at the bandwidth-bound end, but the manuscript does not quantify how this path affects the reported bias floor at N=10^8 or whether the 6-7% floor remains stable under the reduced-precision intermediates used at that scale.

    Authors: We acknowledge the importance of assessing the impact of mixed-precision on the bias at large scales. In the revision, we will extend the validation to include bias measurements at N=10^8 using the mixed-precision storage path. Where direct comparison to full-precision is feasible (e.g., at intermediate scales), we will quantify any deviation. Given that the bias floor appears independent of epsilon and arises from the tau-leaping discretization, we anticipate stability, but will provide explicit data or note limitations if full-precision at N=10^8 is prohibitive due to memory. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on direct empirical benchmarks and external reference validation.

full rationale

The manuscript presents a GPU kernel implementation for epidemic simulation, with central claims consisting of measured throughput (Giga-NUPS), hardware speedups, and a reported structural bias floor obtained by direct comparison to an independent non-Markovian Gillespie reference simulator. No equations, fitted parameters, or self-citations are used to derive performance or correctness results; the bias is stated as an observed floor that persists as epsilon approaches zero, and all numbers are tied to shipped code and hardware runs rather than any reduction to inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and implementation paper; the central performance claims rest on empirical measurements rather than mathematical axioms or free parameters. No free parameters, domain axioms, or invented entities are introduced beyond standard GPU programming assumptions.

pith-pipeline@v0.9.0 · 5671 in / 1294 out tokens · 37873 ms · 2026-05-08T13:48:24.265410+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

  1. [1]

    Incubation period of 2019 novel coronavirus (2019-ncov) infections among travellers from wuhan, china, 20–28 january 2020.Eurosurveillance, 25(5):2000062,

    Jantien A Backer, Don Klinkenberg, and Jacco Wallinga. Incubation period of 2019 novel coronavirus (2019-ncov) infections among travellers from wuhan, china, 20–28 january 2020.Eurosurveillance, 25(5):2000062,

  2. [2]

    The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application

    Stephen A Lauer, Kyra H Grantz, Qifang Bi, Forrest K Jones, Qulu Zheng, Hannah R Meredith, Andrew S Azman, Nicholas G Reich, and Justin Lessler. The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Annals of Internal Medicine, 172(9):577–582,

  3. [3]

    Eon (epidemics on networks): a fast, flexible python package for simulation, analytic approximation, and analysis of epidemics on networks.arXiv preprint arXiv:2001.02436,

    Joel C Miller and Tony Ting. Eon (epidemics on networks): a fast, flexible python package for simulation, analytic approximation, and analysis of epidemics on networks.arXiv preprint arXiv:2001.02436,

  4. [4]

    Ndlib: a python library to model and analyze diffusion processes over complex networks

    Giulio Rossetti, Letizia Milli, and Salvatore Rinzivillo. Ndlib: a python library to model and analyze diffusion processes over complex networks. InCompanion Proceedings of the The Web Conference 2018, pages 183–186,

  5. [5]

    Speedup graph processing by graph ordering

    Hao Wei, Jeffrey Xu Yu, Can Lu, and Xuemin Lin. Speedup graph processing by graph ordering. InProceedings of the 2016 International Conference on Management of Data (SIGMOD), pages 1813–1828. ACM,

  6. [6]

    Epidemic simulation of large-scale social contact networks on GPU clusters

    Peng Zou, Yong-Sheng Lv, Xiao-Gang Deng, and Hang Ji. Epidemic simulation of large-scale social contact networks on GPU clusters. InProceedings of the 2013 Winter Simulation Conference, pages 1267–1278. IEEE,