Chakra: Advancing performance benchmarking and co-design using standardized execution traces
7 Pith papers cite this work.
Fields: cs.DC (7)
Years: 2026 (7)
citing papers explorer
- StageFrontier: Synchronization-Aware Stage Accounting for Distributed ML Training
  StageFrontier computes an exact additive accounting of exposed step time in distributed training by taking the frontier of per-rank coarse stage durations reported with unsynchronized CPU wall clocks.
- Simulating Unified Tensor Resharding in heterogeneous AI systems
  Xsim is a heterogeneity-aware simulator for distributed LLM training supporting load balancing, customized collectives, tensor resharding, and pluggable network simulation, reporting under 5% error in training time predictions.
- A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM
  PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.
- MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
  Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.
- ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling
  ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.
- Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML
  Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.
- Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO
  StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.