
arxiv: 2605.15617 · v1 · pith:UPZLLDUD · submitted 2026-05-15 · 💻 cs.DC · cs.AI

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

Pith reviewed 2026-05-19 19:52 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords LLM training · cluster emulation · GPU performance · hybrid simulation · distributed computing · memory modeling · execution graph

The pith

PrismLLM emulates 8192-GPU LLM training using fewer than 1% of the GPUs with 0.58% average iteration time error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model training runs on clusters of thousands of GPUs, yet engineers struggle to reproduce production behavior for debugging or tuning because full-scale hardware is scarce and reserved for production. PrismLLM decouples large-scale execution from access to large clusters. It first builds a high-fidelity execution graph through slicing, recording the computation, communication, and dependencies of the target scale. It then runs a hybrid emulation in which a few selected ranks execute the real program on physical GPUs while all other ranks are replayed as virtual participants. Experiments on real workloads report that emulated iteration times match actual ones within 0.58% on average and peak memory usage within 0.01%, with emulations of up to 8192 GPUs running on under 1% of the original hardware count. If the method works as described, developers can iterate on training frameworks far more often without competing for production resources.

Core claim

PrismLLM constructs a high-fidelity execution graph via a slicing-based approach that captures computation, communication, and dependencies of the target scale. Then, PrismLLM performs hybrid emulation where selected ranks execute the original program while the remaining ranks are replayed as virtual participants. Experiments on large-scale LLM training workloads show that PrismLLM accurately reproduces performance and memory behavior, achieving only 0.58% average error in iteration time and less than 0.01% error in peak GPU memory usage. PrismLLM can emulate clusters of up to 8192 GPUs using fewer than 1% of the physical GPUs required by the original deployment.
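
As a rough illustration of what replaying ranks from an execution graph involves, the sketch below schedules a toy dependency graph as early as possible and reads off the iteration time. It is not PrismLLM's implementation or graph format: the `replay` function, the node names, and the durations are all hypothetical, and the sketch ignores per-rank stream serialization beyond the explicit edges.

```python
from collections import defaultdict, deque

def replay(durations, edges):
    """Schedule each node as soon as all of its predecessors have finished.

    durations: {node_name: duration}         (hypothetical, arbitrary time units)
    edges:     [(before, after), ...]        dependency edges
    Returns {node_name: (start, finish)}.
    """
    indegree = {name: 0 for name in durations}
    children = defaultdict(list)
    for before, after in edges:
        children[before].append(after)
        indegree[after] += 1

    ready = deque(name for name, deg in indegree.items() if deg == 0)
    start = {name: 0.0 for name in durations}
    schedule = {}
    while ready:
        name = ready.popleft()
        finish = start[name] + durations[name]
        schedule[name] = (start[name], finish)
        for child in children[name]:
            # A node cannot start before its latest-finishing predecessor.
            start[child] = max(start[child], finish)
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(schedule) != len(durations):
        raise ValueError("dependency cycle in execution graph")
    return schedule

# Toy two-stage pipeline: rank 0 forward, send, rank 1 forward/backward, send back.
durations = {
    "r0.fwd": 2.0, "send.0to1": 0.5, "r1.fwd": 2.0,
    "r1.bwd": 4.0, "send.1to0": 0.5, "r0.bwd": 4.0,
}
edges = [
    ("r0.fwd", "send.0to1"), ("send.0to1", "r1.fwd"), ("r1.fwd", "r1.bwd"),
    ("r1.bwd", "send.1to0"), ("send.1to0", "r0.bwd"),
]
schedule = replay(durations, edges)
iteration_time = max(finish for _, finish in schedule.values())
print(iteration_time)
```

In a hybrid setting as described above, only the virtual ranks would be driven by such a graph; the selected ranks run the original program.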

What carries the argument

Slicing-based high-fidelity execution graph enabling hybrid emulation of selected real ranks and virtual replay participants.

If this is right

  • Training framework developers can diagnose failures and evaluate optimizations without needing exclusive access to production-scale clusters.
  • Scale-dependent performance and memory behaviors become reproducible during everyday development.
  • Research workloads can share hardware more efficiently with production because emulation uses far fewer physical GPUs.
  • Iteration cycles for distributed training software shorten because faithful large-scale tests no longer require full hardware reservations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same slicing and hybrid replay approach could be tested on other distributed workloads such as scientific simulations or data analytics frameworks.
  • Combining PrismLLM with automated search tools might let engineers discover scale-specific bottlenecks earlier in the development process.
  • Wider use could lower the hardware threshold for academic groups to study techniques that currently require industrial-scale clusters.

Load-bearing premise

The slicing method fully captures every computation, communication pattern, and dependency at the target scale, so that running only some ranks on real hardware still produces accurate overall large-scale behavior.

What would settle it

Run the same LLM training job on both PrismLLM and a real full-scale GPU cluster, then compare measured iteration times, peak memory, and communication volumes; large mismatches would show that scale-dependent effects were missed.
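
The comparison described above can be sketched as a per-configuration relative error averaged over configurations. This is an assumption about how such a figure could be computed; the page does not state the paper's exact error definition, and the function names and the placeholder numbers below are hypothetical, not measurements from the paper.

```python
def relative_error_pct(emulated, measured):
    """Absolute relative error of an emulated value against a measured one, in percent."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return abs(emulated - measured) / abs(measured) * 100.0

def mean_relative_error_pct(pairs):
    """Average the per-configuration relative errors over (emulated, measured) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (emulated, measured) pair")
    return sum(relative_error_pct(e, m) for e, m in pairs) / len(pairs)

# Placeholder (emulated, measured) iteration times; illustrative values only.
iteration_times = [(101.0, 100.0), (198.0, 200.0)]
avg_error = mean_relative_error_pct(iteration_times)
print(f"average iteration-time error: {avg_error:.2f}%")
```

The same helper applies to peak memory or communication volume; reporting the per-configuration errors alongside the average makes it easier to spot a single configuration where scale-dependent effects were missed.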

Figures

Figures reproduced from arXiv: 2605.15617 by Boyi Jia, Brian Sutioso, ChonLam Lao, Ennan Zhai, Erci Xu, Jiamin Cao, Jiaqi Gao, Jingren Zhou, Kui Ren, Minlan Yu, Shaoke Xi, Yong Li, Zhengping Qian, Zhipeng Zhang.

Figure 1: System overview. Source excerpt: Code reuse. The emulation system should directly reuse the current LLM training code base to avoid unnecessary development or maintenance overhead. A natural approach is record-and-replay where we record execution trace of a program and replay it in the test environment. But, there are two challenges. Challenge 1: You need scale to see scale. Our goal is to emulate large-scale behavior u…
Figure 2: Snippet from PrismTrace capturing dependencies of 1F1B. Source excerpt: where all virtual ranks are replayed using the complete execution graph with accurate timing. This enables PrismLLM to accurately measure metrics of interest (e.g., end-to-end iteration time and GPU memory usage over time) for sandbox ranks without requiring full-scale hardware ⑩. Both phases operate under minimal GPU resources, requiring as few a…
Figure 4: Workflow for generating a complete graph for emulation. Source excerpt: all-reduce). The coordinator then checks whether all participants of the collective are active. If not, the collective cannot proceed. For example, in …
Figure 5: Runtime Communication Pruning. Source excerpt: To address this, PrismLLM introduces NCCL group reduction, which reduces overhead by selectively instantiating only the necessary groups and ranks. During group creation, sandbox ranks remain unchanged and are initialized normally. For virtual ranks, we instantiate only the NCCL groups whose members overlap with sandbox ranks. Groups that do not communicate with the sandbox…
Figure 6: Guarantee collective numeric correctness with pruning. Source excerpt: 6.3 Runtime Communication Pruning. Since we remove inactive, non-neighboring ranks, the original collective communication pattern is changed at runtime ( …
Figure 7: End-to-end iteration time estimation results. Plot labels captured with this caption: panels for 512, 1024, and 2048 GPUs; configurations 235B (S.A), 235B (S.B), 503B (S.A), 503B (S.B), 1.01T (S.C), 1.01T (S.D); y-axis Normalized Memory; some bars marked OOM; legend begins "MoE Balance (…".
Figure 8: End-to-end peak memory allocation estimation results. All estimation errors are consistently below 0.01%. Plot labels captured with this caption: x-axis Scale (512 to 8192); y-axes Elapsed Time (min) and Assistant Nodes; legend Emulate Time, Fill & Calibration.
Figure 9: Emulation time breakdown. Plot labels captured with this caption: panels Compute Kernels and NCCL Kernels; configurations 235B (S.A), 235B (S.B), 503B (S.A), 503B (S.B), 1.01T (S.C), 1.01T (S.D); y-axis Dev. from Median (µs); legend Natural Variance, PrismLLM.
Figure 11: Memory usage reduction and bootstrap acceleration of large-scale emulation in PrismLLM. Plot labels captured with this caption: x-axis Kernel Index; y-axis Relative Start Time Dev. (%); legend Natural Variance, PrismLLM.
Figure 12: Kernel launch deviation. Plot labels captured with this caption: x-axis Message Size (16M to 32G); y-axis Transmission Latency (ms); legend Baseline, Vanilla, PrismLLM.
Figure 15: End-to-end iteration time estimation results compared with SimAI.
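
The excerpts captured with Figures 4 and 5 mention two mechanisms: a coordinator that lets a collective proceed only when all its participants are active, and a group reduction that instantiates only the communication groups overlapping the sandbox ranks. A minimal sketch of both rules follows, assuming groups are plain lists of rank ids. It makes no NCCL or framework calls, and `prune_groups`, `collective_ready`, and the group layout are hypothetical names chosen for illustration.

```python
def prune_groups(groups, sandbox_ranks):
    """Keep only the groups that contain at least one sandbox rank.

    groups: {group_name: [rank, ...]}  (hypothetical layout)
    """
    sandbox = set(sandbox_ranks)
    return {
        name: members
        for name, members in groups.items()
        if sandbox & set(members)
    }

def collective_ready(members, active_ranks):
    """A collective may proceed only if every participant is currently active."""
    return set(members) <= set(active_ranks)

# Toy 8-rank job: tensor-parallel pairs plus two data-parallel groups.
groups = {
    "tp0": [0, 1], "tp1": [2, 3], "tp2": [4, 5], "tp3": [6, 7],
    "dp0": [0, 2, 4, 6], "dp1": [1, 3, 5, 7],
}
kept = prune_groups(groups, sandbox_ranks=[0])
print(sorted(kept))                                    # groups that include rank 0
print(collective_ready(groups["dp0"], {0, 2, 4}))      # rank 6 is not active yet
print(collective_ready(groups["dp0"], {0, 2, 4, 6}))   # all participants active
```

With a single sandbox rank, only two of the six groups survive the filter in this toy layout, which is the kind of saving the excerpt attributes to group reduction.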
Original abstract

Large language model (LLM) training today runs on clusters spanning thousands of GPUs. While this scale enables rapid model advances, developing, debugging, and performance-tuning the training framework inevitably becomes complex and costly. This is because engineers often need to reproduce production behaviors to diagnose failures or evaluate optimizations, thereby demanding frequent and even exclusive access to production-scale clusters, which becomes increasingly hard given that the majority of GPUs are already committed to production workloads. Simulation relies on complex performance models that are difficult to maintain, and downscaled experiments often fail to capture scale-dependent behaviors. We present PrismLLM to decouple large-scale execution from the need to access large clusters, enabling engineers to run and observe ranks of interest under faithful large-scale behavior using only a few GPUs. PrismLLM constructs a high-fidelity execution graph via a slicing-based approach that captures computation, communication, and dependencies of the target scale. Then, PrismLLM performs hybrid emulation where selected ranks execute the original program while the remaining ranks are replayed as virtual participants. Experiments on large-scale LLM training workloads show that PrismLLM accurately reproduces performance and memory behavior, achieving only 0.58% average error in iteration time and less than 0.01% error in peak GPU memory usage. PrismLLM can emulate clusters of up to 8192 GPUs using fewer than 1% of the physical GPUs required by the original deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents PrismLLM, a system to emulate large-scale LLM training on clusters of up to 8192 GPUs using fewer than 1% of the required physical GPUs. It constructs a high-fidelity execution graph from the target program via a slicing-based approach that captures computation, communication, and dependencies, then performs hybrid emulation in which selected ranks execute the original program while remaining ranks are replayed as virtual participants. Experiments report 0.58% average error in iteration time and less than 0.01% error in peak GPU memory usage.

Significance. If the emulation results hold, the work has substantial significance for distributed systems and LLM infrastructure research. It directly addresses the practical barrier of limited access to production-scale clusters for debugging and performance tuning by providing a program-grounded emulation technique that avoids both complex analytical models and downscaling artifacts. The reported quantitative accuracy at extreme scale and the construction from actual program traces (rather than fitted parameters) are notable strengths that could enable broader experimentation if validated more thoroughly.

major comments (2)
  1. [Experiments section] The central claim of faithful reproduction rests on the reported 0.58% average iteration-time error and <0.01% peak-memory error, yet the manuscript provides insufficient detail on the exact workloads evaluated, number of runs, variance, and any data exclusion or post-processing rules. This is load-bearing because without these, it is impossible to determine whether the low errors generalize or depend on unstated choices.
  2. [§3 (Design, slicing-based graph construction)] The approach assumes the slicing-derived graph plus hybrid replay of virtual ranks fully encodes all relevant dependencies, latencies, and bandwidth interactions present at target scale. However, the description does not explicitly address how non-linear scaling of collectives (all-reduce, all-gather) or dynamic network contention at 8192 participants is captured when only per-rank traces from smaller observations are used; this directly affects whether scale-dependent effects are reproduced.
minor comments (2)
  1. [Abstract] The phrase 'hybrid emulation' is introduced without a one-sentence definition, which would help readers quickly grasp the selected-rank vs. virtual-participant distinction.
  2. [Figures] Figure captions (throughout): Captions should explicitly state the physical GPU count used for each emulation experiment and the precise error metric being plotted.
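
To make the concern in major comment 2 concrete, the textbook latency-bandwidth cost model for a ring all-reduce is a useful reference point. It is a standard approximation, not the paper's model, and NCCL may select other algorithms; the symbols are assumptions introduced here: $p$ participants, message size $n$ bytes, per-step latency $\alpha$, and per-byte transfer time $\beta$.

```latex
T_{\text{ring}}(p, n) \;=\; 2\,(p-1)\,\alpha \;+\; 2\,\frac{p-1}{p}\,n\,\beta
```

As $p$ grows at fixed $n$, the bandwidth term approaches $2n\beta$ while the latency term keeps growing linearly in $p$, so a collective timed at small scale does not directly give its duration at 8192 participants. Network contention is outside this model entirely.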

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The feedback highlights important aspects of clarity in the experimental reporting and the handling of scale-dependent behaviors in our emulation approach. We address each major comment below and have made revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Experiments section] The central claim of faithful reproduction rests on the reported 0.58% average iteration-time error and <0.01% peak-memory error, yet the manuscript provides insufficient detail on the exact workloads evaluated, number of runs, variance, and any data exclusion or post-processing rules. This is load-bearing because without these, it is impossible to determine whether the low errors generalize or depend on unstated choices.

    Authors: We agree that additional detail on experimental methodology strengthens the paper and aids reproducibility. The manuscript describes the workloads (LLM training jobs with models up to the scale requiring 8192 GPUs) and reports aggregate error metrics, but we acknowledge the lack of explicit per-configuration run counts, variance measures, and post-processing rules. In the revised version, we have expanded the Experiments section with a dedicated subsection and summary table listing: the precise model sizes and parallelism configurations evaluated, the number of independent runs per data point (five runs), mean and standard deviation of iteration times and memory usage, and confirmation that no data points were excluded beyond standard logging of complete iterations. revision: yes

  2. Referee: [§3 (Design, slicing-based graph construction)] The approach assumes the slicing-derived graph plus hybrid replay of virtual ranks fully encodes all relevant dependencies, latencies, and bandwidth interactions present at target scale. However, the description does not explicitly address how non-linear scaling of collectives (all-reduce, all-gather) or dynamic network contention at 8192 participants is captured when only per-rank traces from smaller observations are used; this directly affects whether scale-dependent effects are reproduced.

    Authors: Thank you for this observation on potential scale-dependent effects. The slicing constructs the execution graph from the target program's structure at the full intended scale, extracting operation dependencies, computation durations, and communication volumes and patterns directly rather than relying exclusively on smaller-scale traces. In hybrid emulation, real ranks execute the original program and issue actual collective calls over the physical network, while virtual ranks replay their scheduled operations using the graph-derived sizes and relative timings; this allows real ranks to observe and participate in network interactions induced by the full participant set. We recognize that certain non-linear collective performance behaviors or highly dynamic contention patterns may not be fully reproduced if they emerge only at extreme scale and are absent from the base observations. We have therefore revised §3 to include an explicit discussion of these assumptions, how the dependency graph and hybrid execution approximate collective scaling, and the corresponding limitations of the current fidelity guarantees. revision: partial

Circularity Check

0 steps flagged

PrismLLM derives its execution graph and hybrid emulation directly from the target program without circular reduction to inputs or self-citations.

full rationale

The paper presents PrismLLM as building a high-fidelity execution graph through a slicing-based approach that directly captures computation, communication, and dependencies from the target-scale program, followed by hybrid emulation of selected ranks executing the original code while others are replayed virtually. This construction is grounded in observation of the actual program rather than any fitted parameters, self-definitional loops, or load-bearing self-citations. No equations or claims in the provided description reduce the reported accuracy metrics (0.58% iteration time error, <0.01% memory error) to definitional necessities or prior author results; the low errors are framed as experimental outcomes of the emulation process. The approach remains self-contained against external benchmarks, with no evidence of renaming known results, smuggling ansatzes, or uniqueness theorems imported from overlapping authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the sliced execution graph accurately represents all relevant dependencies and that hybrid emulation introduces no artifacts for the observed ranks. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Slicing the execution graph captures all computation, communication, and inter-rank dependencies at the target scale.
    This premise is required for the hybrid emulation to produce faithful large-scale behavior as claimed.

pith-pipeline@v0.9.0 · 5836 in / 1201 out tokens · 33228 ms · 2026-05-19T19:52:01.110099+00:00 · methodology


