Shende and Allen D
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Citing papers
- StageFrontier: Synchronization-Aware Stage Accounting for Distributed ML Training
  StageFrontier computes an exact additive accounting of exposed step time in distributed training by taking the frontier of per-rank coarse stage durations reported with unsynchronized CPU wall clocks.
- A performance portable fast Ewald summation for Stokes flow
  Portable Ewald summation algorithms for Stokes flow achieve ~8M particles/sec on H200 GPU with a novel P2G kernel providing 16x speedup and good multi-GPU scaling.