CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure
Pith reviewed 2026-05-08 05:04 UTC · model grok-4.3
The pith
Trace-based benchmarking for LLM infrastructure reveals that higher compute-communication overlap can coincide with longer training steps and that framework choice can create up to 3× gaps on identical hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CCL-Bench records an execution trace, a YAML workload card, and launch scripts for each contributed data point, then applies a community-extensible toolkit to compute fine-grained compute, memory, and communication efficiency metrics from the trace. This evidence shows that higher compute-communication overlap can coincide with longer training step time and reveal inefficient parallelization choices, that doubling TPU interconnect bandwidth yields substantially higher end-to-end step-time improvement than doubling GPU interconnect bandwidth on small and medium workloads, and that the best-tuned configuration on one training framework can run up to 3× slower than the best-tuned configuration on a peer framework on identical hardware.
What carries the argument
The toolkit that processes execution traces to compute fine-grained compute, memory, and communication efficiency metrics from reusable trace packages.
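The paper does not spell out the metric formulas here, but one common way a compute-communication overlap figure can be derived from a kernel-level trace is an interval-union calculation over compute and communication events. The sketch below is illustrative only, assumes a simplified, hypothetical event schema (`ts`, `dur`, `cat`), and is not the CCL-Bench toolkit.

```python
# Minimal sketch (not the paper's toolkit): estimating step time and
# compute-communication overlap from a Kineto/chrome-trace-style event list.
# The event fields ("ts", "dur", "cat") and category names are hypothetical.

def _merge(intervals):
    """Merge overlapping [start, end) intervals and return the merged list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def _intersection(a, b):
    """Total length of the intersection of two merged interval lists."""
    total, i, j = 0.0, 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        total += max(0.0, hi - lo)
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def overlap_metrics(events):
    """events: iterable of dicts with 'ts', 'dur' (microseconds) and 'cat'
    in {'compute', 'comm'}; returns (step_time_us, overlap_fraction)."""
    compute = _merge([(e["ts"], e["ts"] + e["dur"]) for e in events if e["cat"] == "compute"])
    comm = _merge([(e["ts"], e["ts"] + e["dur"]) for e in events if e["cat"] == "comm"])
    spans = compute + comm
    step_time = max(end for _, end in spans) - min(start for start, _ in spans)
    comm_time = sum(end - start for start, end in comm)
    overlap = _intersection(compute, comm) / comm_time if comm_time else 0.0
    return step_time, overlap
```

Note that overlap here is normalized by total communication time, which is exactly why, as the first surfaced claim notes, a high ratio by itself does not guarantee a shorter step.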
If this is right
- Higher compute-communication overlap can indicate inefficient parallelization when it coincides with longer step times (see the toy sketch after this list).
- Doubling interconnect bandwidth produces larger step-time gains on TPUs than on GPUs for small and medium workloads.
- Training frameworks can differ by up to 3× in end-to-end performance even after best tuning on identical hardware.
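A toy illustration of the first point (numbers invented for this sketch, not taken from the paper): because the overlap fraction normalizes by total communication time, a parallelization plan that generates more communication can hide a larger share of it behind compute and still finish the step later.

```python
# Toy numbers (not from the paper) illustrating how a plan can show a
# *higher* overlap fraction yet a *longer* step: the overlap ratio hides
# how much total communication the plan generates.

def step_time_us(compute_us, comm_us, overlap_frac):
    """Exposed communication is the part not hidden behind compute."""
    exposed_comm = comm_us * (1.0 - overlap_frac)
    return compute_us + exposed_comm

plan_a = step_time_us(compute_us=80_000, comm_us=10_000, overlap_frac=0.50)  # 85,000 us
plan_b = step_time_us(compute_us=80_000, comm_us=60_000, overlap_frac=0.90)  # 86,000 us

print(f"plan A: overlap 50%, step {plan_a:,.0f} us")
print(f"plan B: overlap 90%, step {plan_b:,.0f} us  <- higher overlap, slower step")
```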
Where Pith is reading between the lines
- Infrastructure teams could adopt trace collection as a standard practice to diagnose parallelization issues before scaling to larger workloads.
- Comparisons across hardware generations might become more reliable if benchmarks always ship the underlying traces rather than derived numbers alone.
- Automated tools that suggest parallelism plans from trace patterns could reduce reliance on manual tuning.
Load-bearing premise
The collected execution traces and the toolkit's metric computations faithfully represent true performance characteristics without bias from tracing overhead or workload selection.
What would settle it
A replication on the same workloads using an independent tracing method that finds the three reported performance relationships do not hold or reverse.
Original abstract
Evaluative claims about LLM infrastructure -- "workload X is fastest on hardware Y with software Z" -- depend on a complex configuration space spanning hardware accelerators, interconnect bandwidth, software frameworks, parallelism plans, and communication libraries. Current infrastructure evaluation benchmarks publish a small set of end-to-end numbers that do not explain why one configuration outperforms another. We present CCL-Bench, a trace-based benchmark that addresses the limitations of existing benchmarks by recording reusable evidence for every ML workload. Each contributed data point in CCL-Bench packages an execution trace, a YAML workload card, and the launch scripts. We have developed a community-extensible toolkit to compute fine-grained compute, memory, and communication efficiency metrics from this evidence. Using CCL-Bench, we surface three claims that summary-statistic benchmarks cannot support: (i) higher compute-communication overlap can coincide with longer training step time and reveal inefficient parallelization choices, (ii) doubling TPU interconnect bandwidth yields a much higher end-to-end improvement in step time than doubling GPU interconnect bandwidth on small and medium workloads, and (iii) the best-tuned configuration on one training framework can run up to 3× slower than the best-tuned configuration on a peer framework on identical hardware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CCL-Bench 1.0, a trace-based benchmark for LLM infrastructure evaluation. Each data point consists of an execution trace, a YAML workload card, and launch scripts. A community-extensible toolkit computes fine-grained compute, memory, and communication efficiency metrics from these traces. The authors use the benchmark to surface three claims not supportable by summary-statistic benchmarks: (i) higher compute-communication overlap can coincide with longer training step time and indicate inefficient parallelization, (ii) doubling TPU interconnect bandwidth yields substantially higher step-time improvement than doubling GPU interconnect bandwidth on small and medium workloads, and (iii) the best-tuned configuration on one training framework can be up to 3× slower than the best-tuned configuration on a peer framework on identical hardware.
Significance. If the traces are made publicly available, the toolkit is shown to be extensible, and the metrics are validated, CCL-Bench could meaningfully advance reproducible evaluation of distributed LLM training by exposing why one configuration outperforms another rather than reporting only end-to-end numbers. The packaging of reusable evidence (trace + card + script) directly addresses a documented limitation of existing infrastructure benchmarks.
major comments (2)
- [Abstract] Abstract: the three claims are presented as observations enabled by CCL-Bench, yet no workloads, configurations, quantitative results, or error bars are supplied to support them. This absence prevents assessment of whether the evidence actually sustains the claims, especially claim (iii) on the 3× slowdown.
- [Benchmark Design] Benchmark design and toolkit description: the manuscript provides no details on how the fine-grained metrics are computed from traces, no validation against ground-truth performance counters, and no analysis of tracing overhead or workload-selection bias. These omissions directly affect the weakest assumption that the collected traces faithfully represent true performance characteristics.
minor comments (1)
- [Abstract] Abstract: the LaTeX fragment "3$\times$" should render consistently as "3×" in the final PDF; check for similar formatting issues in any tables or figures that report speedups.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below and describe the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: the three claims are presented as observations enabled by CCL-Bench, yet no workloads, configurations, quantitative results, or error bars are supplied to support them. This absence prevents assessment of whether the evidence actually sustains the claims, especially claim (iii) on the 3× slowdown.
Authors: We agree that the abstract, in its current form, presents the three claims at a high level without the supporting quantitative details. The revised manuscript will incorporate concise quantitative summaries directly into the abstract (e.g., the specific workloads, hardware configurations, measured step-time deltas with error bars for claim (ii), and the exact frameworks, hardware, and workload yielding the 3× difference for claim (iii)). The full supporting data, including all workloads, configurations, and error bars, will also be expanded and clearly cross-referenced in the evaluation section. revision: yes
- Referee: [Benchmark Design] Benchmark design and toolkit description: the manuscript provides no details on how the fine-grained metrics are computed from traces, no validation against ground-truth performance counters, and no analysis of tracing overhead or workload-selection bias. These omissions directly affect the weakest assumption that the collected traces faithfully represent true performance characteristics.
Authors: We acknowledge that the current manuscript lacks these details. The revised version will add a dedicated subsection describing the exact formulas, algorithms, and trace-parsing logic used to compute each fine-grained metric (compute, memory, and communication efficiency). We will also include a validation subsection that compares toolkit-derived metrics against ground-truth hardware performance counters, report measured tracing overhead as a percentage of step time on representative workloads, and add an analysis of workload-selection criteria together with a discussion of potential biases and how they were mitigated. revision: yes
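One generic way the promised overhead measurement could be done, shown here as a hedged sketch rather than the authors' procedure: time a fixed number of training steps with and without the profiler attached and report the relative difference. The `step_fn` callable, the step count, and the PyTorch/CUDA setup are assumptions of this sketch.

```python
# Minimal sketch (assumption: not the paper's methodology) of quantifying
# tracing overhead as a percentage of step time: time N training steps with
# and without the profiler attached and compare the medians.
import statistics
import time

import torch
from torch.profiler import ProfilerActivity, profile

def median_step_time(step_fn, steps=20):
    """step_fn() runs one full training step; returns the median wall time in seconds."""
    times = []
    for _ in range(steps):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        step_fn()
        torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

def tracing_overhead_pct(step_fn, steps=20):
    """Tracing overhead as a percentage of the untraced step time."""
    baseline = median_step_time(step_fn, steps)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]):
        traced = median_step_time(step_fn, steps)
    return 100.0 * (traced - baseline) / baseline
```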
Circularity Check
No significant circularity: empirical benchmark with no derivations or self-referential predictions
full rationale
The paper introduces CCL-Bench as a trace-collection and metric-computation toolkit for LLM infrastructure evaluation. Its central claims are direct observations computed from collected execution traces on specific workloads and hardware configurations; no equations, fitted parameters, or predictions are derived from prior results within the paper. The three surfaced claims (overlap vs. step time, TPU vs. GPU bandwidth scaling, framework tuning differences) are presented as empirical findings enabled by the benchmark rather than as universally quantified theorems. No self-citation chains, ansatzes, or uniqueness theorems are invoked to justify the methodology or results. The work is self-contained as a reusable artifact whose validity rests on external reproducibility of the traces and scripts, not on internal reduction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Execution traces accurately record all relevant compute, memory, and communication events without significant overhead or omission.