Chakra: Advancing performance benchmarking and co-design using standardized execution traces
5 Pith papers cite this work.
Fields: cs.DC (5)
Years: 2026 (5)
Citing papers
- StageFrontier: Synchronization-Aware Stage Accounting for Distributed ML Training
  StageFrontier computes an exact additive accounting of exposed step time in distributed training by taking the frontier of per-rank coarse stage durations reported with unsynchronized CPU wall clocks.
- A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM
  PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.
- MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
  Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.
- Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML
  Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.
- Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO
  StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.