Instant GPU Efficiency Visibility at Fleet Scale

Collin Neale; Connor Pedersen; Dong H. Ahn; Michel Migdal; Nik Konyuchenko

arxiv: 2605.20799 · v1 · pith:2FQL2UO6new · submitted 2026-05-20 · 💻 cs.DC · cs.LG

Instant GPU Efficiency Visibility at Fleet Scale

Connor Pedersen , Dong H. Ahn , Michel Migdal , Collin Neale , Nik Konyuchenko This is my paper

Pith reviewed 2026-05-21 02:38 UTC · model grok-4.3

classification 💻 cs.DC cs.LG

keywords GPU efficiencyOverall FLOP Utilizationhardware performance countersmodel FLOPs utilizationAI trainingfleet monitoringtensor corestile quantization

0 comments

The pith

Two on-chip GPU counters predict application MFU to within two percentage points after tile-quantization correction

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Overall FLOP Utilization as a hardware-level metric for GPU efficiency in AI workloads that requires no application changes. It derives the value from just Tensor Pipe Activity and SM clock frequency counters on the GPU. Controlled tests on H100 and GB200 across FP16, TF32, FP8 and NVFP4 show that correcting for tile quantization brings the estimate within two percentage points of true model FLOPs utilization. When run against 608 production training jobs the metric correlates at r equals 0.78 and has already flagged framework FLOP miscalculations plus a 2.5 times efficiency regression in live fleets. This setup makes continuous efficiency tracking feasible at data-center scale without per-job overhead.

Core claim

Overall FLOP Utilization (OFU) is obtained directly from the product of two on-chip counters, Tensor Pipe Activity and SM clock frequency, yielding a precision-agnostic proxy for total FLOP throughput. After a tile-quantization correction the resulting value matches application-reported model FLOPs utilization to within two percentage points on GEMM workloads spanning multiple precisions and GPU generations. Applied to hundreds of real training jobs OFU surfaces both misreported FLOPs in frameworks and measurable utilization shifts during mixed-precision pretraining.

What carries the argument

Overall FLOP Utilization (OFU), the normalized product of Tensor Pipe Activity and SM clock frequency that serves as a hardware proxy for aggregate floating-point throughput without per-application calibration

If this is right

OFU can be collected fleet-wide without modifying or instrumenting any running applications
The metric has already identified a 2.5 times efficiency regression in deployed production systems
It reveals framework-level errors in reported FLOP counts on hundreds of real jobs
OFU tracks utilization changes that occur when switching precisions during pretraining
Correlation of 0.78 with measured MFU on 608 production jobs supports its role as a complement to application-level monitoring

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Fleet schedulers could use live OFU readings to pause or migrate inefficient jobs automatically
The same counter-based approach might transfer to other accelerators if comparable activity and frequency signals are exposed
Long-term fleet data from OFU could expose systematic efficiency differences across model families or software stacks
Adding further counters might shrink the remaining error for workloads that under-count non-tensor operations

Load-bearing premise

The two on-chip counters provide a sufficient and unbiased proxy for total FLOP throughput across all workload types and precisions without requiring per-application calibration

What would settle it

A workload where the tile-quantization-corrected OFU deviates by more than two percentage points from independently measured application MFU, such as a non-GEMM task or an untested precision or GPU generation

Figures

Figures reproduced from arXiv: 2605.20799 by Collin Neale, Connor Pedersen, Dong H. Ahn, Michel Migdal, Nik Konyuchenko.

**Figure 3.** Figure 3: Throughput speedup over TF32 on GB200. • CUTLASS 2/3: Open-source GEMM templates; CUTLASS 2 lacks Cooperative Grid Array (CGA) support and is typically selected for small or poorly aligned matrices. Modern kernels (nvJet, XMMA, CUTLASS 3) use CGAs [20], which group (CM, CN ) thread blocks into clusters that share distributed shared memory across SMs. This introduces a two-level tiling hierarchy: at the f… view at source ↗

**Figure 2.** Figure 2: Tile-quantization overhead in GEMM execution [27]. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 4.** Figure 4: OFU prediction error. 0 10 20 30 40 50 60 70 App MFU (%) 0 10 20 30 40 50 60 70 OFU (%) r = 0.53 (all) r = 0.78 (excl. case studies) Other (526) Case study jobs (82) y = x [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: App MFU vs. OFU for 608 production training jobs. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: OFU before and after removing debug overhead, [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: OFU and MFU relative to mean MFU on a pretraining [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

read the original abstract

We present Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric for AI workloads on HPC systems, derived from two on-chip performance counters: Tensor Pipe Activity and SM clock frequency. OFU requires no application instrumentation and works across GPU generations and numeric precisions. We characterize five properties of the OFU approximation -- tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting -- through controlled GEMM experiments on H100 and GB200 across FP16, TF32, FP8, and NVFP4. After tile-quantization correction, OFU predicts application-level MFU to within <=2 percentage points. Against 608 production training jobs, OFU achieves r = 0.78 correlation with application-level MFU and surfaces two framework-level FLOPs miscalculations. Deployed across large-scale GPU fleets, OFU has detected a 2.5x efficiency regression and tracked precision-dependent utilization changes in mixed-precision pretraining. Our evaluation and operational experience suggest OFU is a practical, deployment-ready complement to application-level MFU for continuous fleet-wide efficiency monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric derived solely from Tensor Pipe Activity and SM clock frequency counters. It characterizes five properties of the approximation (tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting) via controlled GEMM experiments on H100 and GB200 across FP16/TF32/FP8/NVFP4, applies a tile-quantization correction, and claims that the corrected OFU predicts application-level MFU to within <=2 percentage points. On 608 production training jobs, OFU achieves r=0.78 correlation with application MFU, detects two framework-level FLOP miscalculations, and in fleet deployment has identified a 2.5x efficiency regression while tracking precision-dependent changes.

Significance. If the accuracy and generalization claims hold, OFU would provide a valuable, low-overhead complement to application-level MFU for continuous fleet-scale efficiency monitoring in large AI/HPC systems, requiring no instrumentation and spanning GPU generations and precisions. The authors merit explicit credit for deriving the metric directly from raw hardware counters without fitting to target MFU values, for the controlled multi-precision GEMM characterization of its limitations, and for the concrete operational evidence of detecting real regressions and miscalculations in production.

major comments (2)

[Abstract] Abstract: The load-bearing claim that 'After tile-quantization correction, OFU predicts application-level MFU to within <=2 percentage points' immediately follows the GEMM characterization but precedes the separate statement of r=0.78 on 608 production jobs. The manuscript must explicitly state on which workloads (GEMM kernels only, or full training applications) and with what error statistics (e.g., MAE, 95th-percentile deviation, or per-precision breakdown) this bound was measured, because non-tensor undercounting and mixed kernels in arbitrary workloads could exceed 2pp and undermine the generalization to fleet-scale use.
[Production evaluation section] Production evaluation section: The reported r=0.78 correlation on 608 jobs is presented as validation, yet without the distribution of absolute per-job errors or relation to MFU variance across the fleet it is impossible to determine whether typical deviations remain within the claimed 2pp or are substantially larger. This directly affects whether the metric supports the central goal of reliable, instant efficiency visibility.

minor comments (2)

[Definition section] The explicit formula combining Tensor Pipe Activity and SM clock frequency into OFU should be stated as an equation in the definition section to aid reproducibility and clarity.
[GEMM characterization figures] Figures in the GEMM characterization should include error bars or sample sizes for the measured properties across precisions to allow readers to assess variability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading and constructive feedback on the clarity of our accuracy claims. We address each major comment point by point below.

read point-by-point responses

Referee: [Abstract] Abstract: The load-bearing claim that 'After tile-quantization correction, OFU predicts application-level MFU to within <=2 percentage points' immediately follows the GEMM characterization but precedes the separate statement of r=0.78 on 608 production jobs. The manuscript must explicitly state on which workloads (GEMM kernels only, or full training applications) and with what error statistics (e.g., MAE, 95th-percentile deviation, or per-precision breakdown) this bound was measured, because non-tensor undercounting and mixed kernels in arbitrary workloads could exceed 2pp and undermine the generalization to fleet-scale use.

Authors: We agree the abstract phrasing creates ambiguity about the source of the <=2pp bound. This bound was measured as the maximum observed deviation between corrected OFU and known utilization on the controlled GEMM kernels across FP16/TF32/FP8/NVFP4 on H100 and GB200. For full applications we rely on the separate r=0.78 correlation result. We will revise the abstract to explicitly attribute the 2pp bound to the GEMM characterization, state the error metric used (maximum deviation after tile correction), and add a per-precision breakdown of errors to the evaluation section. Non-tensor undercounting is already characterized as one of the five properties and its effect is reflected in the production correlation. revision: yes
Referee: [Production evaluation section] Production evaluation section: The reported r=0.78 correlation on 608 jobs is presented as validation, yet without the distribution of absolute per-job errors or relation to MFU variance across the fleet it is impossible to determine whether typical deviations remain within the claimed 2pp or are substantially larger. This directly affects whether the metric supports the central goal of reliable, instant efficiency visibility.

Authors: We acknowledge that absolute error statistics would help readers assess whether deviations stay within the 2pp range in practice. The r=0.78 correlation on 608 diverse production jobs quantifies a strong linear relationship across the observed range of fleet MFU values. We will add a short discussion in the production evaluation section explaining how this correlation, combined with the fleet MFU variance, is consistent with typical deviations remaining small. However, the full distribution of absolute per-job errors is not pre-computed from the raw data. revision: partial

standing simulated objections not resolved

Distribution of absolute per-job errors for the 608 production jobs

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from hardware counters

full rationale

OFU is defined directly from two raw on-chip counters (Tensor Pipe Activity and SM clock frequency) with no fitting to target MFU values. The five properties are characterized via independent controlled GEMM experiments on H100/GB200 across precisions; the tile-quantization correction is derived from those experiments and the <=2pp claim follows as a reported outcome. The r=0.78 correlation is measured separately on 608 production training jobs and presented as validation, not as part of the metric definition. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the abstract or described chain. The approach remains externally falsifiable against application-level MFU.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that the two named counters capture tensor throughput accurately enough for the approximation to hold after a single tile-quantization correction. No free parameters are explicitly fitted in the abstract description, and no new physical entities are postulated.

axioms (1)

domain assumption Tensor Pipe Activity and SM clock frequency counters are available and stable across GPU generations and precisions.
Invoked when stating that OFU works across generations and numeric precisions without instrumentation.

invented entities (1)

Overall FLOP Utilization (OFU) no independent evidence
purpose: Hardware-level efficiency metric derived from two counters.
New named quantity introduced to approximate MFU; no independent falsifiable prediction beyond the reported correlation is given.

pith-pipeline@v0.9.0 · 5742 in / 1211 out tokens · 27640 ms · 2026-05-21T02:38:42.692592+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric... derived from two on-chip performance counters: Tensor Pipe Activity and SM clock frequency.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

After tile-quantization correction, OFU predicts application-level MFU to within ≤2 percentage points.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 6 internal anchors

[1]

Microsoft fy2025 q4 earnings,

Microsoft, “Microsoft fy2025 q4 earnings,” Microsoft Investor Relations,

work page
[2]

Available: https://www.microsoft.com/en-us/investor/eve nts/fy-2025/earnings-fy-2025-q4

[Online]. Available: https://www.microsoft.com/en-us/investor/eve nts/fy-2025/earnings-fy-2025-q4

work page 2025
[3]

Alphabet announces fourth quarter and fiscal year 2025 results,

Alphabet Inc., “Alphabet announces fourth quarter and fiscal year 2025 results,” February 2026. [Online]. Available: https: //www.sec.gov/Archives/edgar/data/1652044/000165204426000012/goo gexhibit991q42025.htm

work page arXiv 2025
[4]

Meta reports fourth quarter and full year 2025 results,

Meta Platforms, “Meta reports fourth quarter and full year 2025 results,” January 2026. [Online]. Available: https://investor.atmeta.com/investor-n ews/press-release-details/2026/Meta-Reports-Fourth-Quarter-and-Ful l-Year-2025-Results/default.aspx

work page 2025
[5]

Citigroup forecasts big tech’s ai spending to cross $2.8 trillion by 2029,

Reuters, “Citigroup forecasts big tech’s ai spending to cross $2.8 trillion by 2029,” September 2025. [Online]. Available: https://www.reuters.com/world/china/citigroup-forecasts-big-techs-ai-s pending-cross-28-trillion-by-2029-2025-09-30/

work page 2029
[6]

Pytorch profiler,

PyTorch Contributors, “Pytorch profiler,” PyTorch Documentation, 2024. [Online]. Available: https://pytorch.org/tutorials/recipes/recipes/profiler r ecipe.html

work page 2024
[7]

Deepspeed flops profiler,

Microsoft DeepSpeed Team, “Deepspeed flops profiler,” DeepSpeed Documentation, 2024. [Online]. Available: https://www.deepspeed.ai/tut orials/flops-profiler/

work page 2024
[8]

Hpctoolkit: Tools for performance analysis of optimized parallel programs,

L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor- Crummey, and N. R. Tallent, “Hpctoolkit: Tools for performance analysis of optimized parallel programs,” inConcurrency and Computation: Practice and Experience, vol. 22, no. 6, 2010, pp. 685–701. [Online]. Available: https://doi.org/10.1002/cpe.1553

work page doi:10.1002/cpe.1553 2010
[9]

Nvidia/megatron-lm: Ongoing research training transformer models at scale,

NVIDIA, “Nvidia/megatron-lm: Ongoing research training transformer models at scale,” GitHub README, 2025. [Online]. Available: https://github.com/NVIDIA/Megatron-LM

work page 2025
[10]

Pytorch lightning,

Lightning AI, “Pytorch lightning,” GitHub repository, 2024. [Online]. Available: https://github.com/Lightning-AI/pytorch-lightning

work page 2024
[11]

Nvidia nemo: A toolkit for building ai models,

NVIDIA, “Nvidia nemo: A toolkit for building ai models,” GitHub repository, 2024. [Online]. Available: https://github.com/NVIDIA/NeMo

work page 2024
[12]

PaLM: Scaling Language Modeling with Pathways

A. Chowdheryet al., “Palm: Scaling language modeling with pathways,”arXiv:2204.02311, 2022. [Online]. Available: https: //doi.org/10.48550/arXiv.2204.02311

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.02311 2022
[13]

Nvidia dcgm exporter,

NVIDIA, “Nvidia dcgm exporter,” GitHub repository, 2025. [Online]. Available: https://github.com/NVIDIA/dcgm-exporter

work page 2025
[14]

nv-one-logger: Nvidia’s distributed metrics logging system,

——, “nv-one-logger: Nvidia’s distributed metrics logging system,” GitHub repository, 2024. [Online]. Available: https://github.com/NVIDI A/nv-one-logger

work page 2024
[15]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, “Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model,”arXiv preprint arXiv:2405.04434,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

[Online]. Available: https://doi.org/10.48550/arXiv.2405.04434

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434
[17]

Nvidia tesla v100 gpu architecture,

NVIDIA, “Nvidia tesla v100 gpu architecture,” NVIDIA, Tech. Rep. WP-08608-001 v1.1, 2017. [Online]. Available: https://images.nvidia.co m/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

work page 2017
[18]

Nvidia a100 tensor core gpu architecture,

——, “Nvidia a100 tensor core gpu architecture,” NVIDIA, Tech. Rep.,

work page
[19]

Available: https://images.nvidia.com/aem-dam/en-zz/So lutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

[Online]. Available: https://images.nvidia.com/aem-dam/en-zz/So lutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

work page
[20]

Nvidia h100 tensor core gpu architecture,

——, “Nvidia h100 tensor core gpu architecture,” NVIDIA, Tech. Rep., 2022, whitepaper. [Online]. Available: https://www.advancedclustering.c om/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf

work page 2022
[21]

Nvidia tensor core evolution: From volta to blackwell,

SemiAnalysis, “Nvidia tensor core evolution: From volta to blackwell,” SemiAnalysis Newsletter, June 2025. [Online]. Available: https: //newsletter.semianalysis.com/p/nvidia-tensor-core-evolution-from-vol ta-to-blackwell

work page 2025
[22]

Accelerating ai training with nvidia tf32 tensor cores,

D. Stosic and P. Micikevicius, “Accelerating ai training with nvidia tf32 tensor cores,” NVIDIA Technical Blog, January 2021. [Online]. Available: https://developer.nvidia.com/blog/accelerating-ai-training-wit h-tf32-tensor-cores/

work page 2021
[23]

Nvidia h100 tensor core gpu architecture,

NVIDIA, “Nvidia h100 tensor core gpu architecture,” NVIDIA, Tech. Rep., 2022, whitepaper. [Online]. Available: https://www.advancedcluste ring.com/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf

work page 2022
[24]

Floating-point 8: An introduction to efficient, lower-precision ai training,

K. Sevegnani, G. Fiameni, U. Uppal, S. Perez, and A. Pilzer, “Floating-point 8: An introduction to efficient, lower-precision ai training,” NVIDIA Technical Blog, June 2025. [Online]. Available: https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-e fficient-lower-precision-ai-training/

work page 2025
[25]

Cutlass tutorial: Fast matrix-multiplication with wgmma on nvidia hopper gpus,

Colfax Research, “Cutlass tutorial: Fast matrix-multiplication with wgmma on nvidia hopper gpus,” Colfax Research Blog, August 2024. [Online]. Available: https://research.colfax-intl.com/cutlass-tutorial-wgm ma-hopper/

work page 2024
[26]

Introducing nvfp4 for efficient and accurate low- precision inference,

E. Alvarez, O. Almog, E. Chung, S. Layton, D. Stosic, R. Krashinsky, and K. Aubrey, “Introducing nvfp4 for efficient and accurate low- precision inference,” NVIDIA Technical Blog, June 2025. [Online]. Available: https://developer.nvidia.com/blog/introducing-nvfp4-for-effic ient-and-accurate-low-precision-inference/

work page 2025
[27]

Nvidia rtx blackwell gpu architecture,

NVIDIA, “Nvidia rtx blackwell gpu architecture,” NVIDIA, Tech. Rep., 2025, architecture Whitepaper. [Online]. Available: https: //images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-b lackwell-gpu-architecture.pdf

work page 2025
[28]

Pretraining large language models with NVFP4

P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, N. Mellempudi, S. Oberman, M. Shoeybi, M. Renber, K. Siu, and H. Wu, “Pretraining large language models with NVFP4,”arXiv preprint arXiv:2509.25149,

work page arXiv
[29]

Pretraining large language models with NVFP4

[Online]. Available: https://doi.org/10.48550/arXiv.2509.25149

work page doi:10.48550/arxiv.2509.25149
[30]

Data movement is all you need: A case study on optimizing transformers,

A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, and T. Hoefler, “Data movement is all you need: A case study on optimizing transformers,” in Proceedings of Machine Learning and Systems (MLSys), vol. 3, 2021, pp. 711–732. [Online]. Available: https://doi.org/10.48550/arXiv.2007.00072

work page doi:10.48550/arxiv.2007.00072 2021
[31]

Matrix multiplication background user’s guide,

NVIDIA, “Matrix multiplication background user’s guide,” NVIDIA Deep Learning Performance Documentation, 2024. [Online]. Available: https://docs.nvidia.com/deeplearning/performance/dl-performance-matri x-multiplication/index.html

work page 2024
[32]

cublas library user’s guide,

——, “cublas library user’s guide,” CUDA Toolkit Documentation, 2025. [Online]. Available: https://docs.nvidia.com/cuda/cublas/index.html

work page 2025
[33]

nvmatmulheuristics,

——, “nvmatmulheuristics,” 2025. [Online]. Available: https://developer. nvidia.com/blog/improving-gemm-kernel-auto-tuning-efficiency-on-nvi dia-gpus-with-heuristics-and-cutlass-4-2/

work page 2025
[34]

Improving gemm kernel auto-tuning efficiency on nvidia gpus with heuristics and cutlass 4.2,

H. Zhao, D. Yan, A. Wang, A. Kerr, and M. Yan, “Improving gemm kernel auto-tuning efficiency on nvidia gpus with heuristics and cutlass 4.2,” NVIDIA Technical Blog, January 2025. [Online]. Available: https://developer.nvidia.com/blog/improving-gemm-kernel-auto-tunin g-efficiency-on-nvidia-gpus-with-heuristics-and-cutlass-4-2/

work page 2025
[35]

Nvidia h100 tensor core gpu,

NVIDIA, “Nvidia h100 tensor core gpu,” NVIDIA Data Center GPU Product Page, 2024. [Online]. Available: https://www.nvidia.com/en-us/ data-center/h100/

work page 2024
[36]

Nvidia gb200 tensor core gpu,

——, “Nvidia gb200 tensor core gpu,” NVIDIA Data Center GPU Product Page, 2024. [Online]. Available: https://www.nvidia.com/en-us/ data-center/gb200-nvl72/

work page 2024
[37]

Kuaishou

V . Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,”arXiv preprint arXiv:2205.05198, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.05198

work page doi:10.48550/arxiv.2205.05198 2022
[38]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2001.08361

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2001.08361 2001
[39]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,”arXiv preprint arXiv:2312.00752, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2312.00752

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.00752 2024
[40]

Nvidia osmo: Cloud-native orchestration platform for physical ai,

NVIDIA, “Nvidia osmo: Cloud-native orchestration platform for physical ai,” Product page, 2025. [Online]. Available: https: //us-west-2-aws.osmo.nvidia.com/

work page 2025
[41]

Training Deep Nets with Sublinear Memory Cost

T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,”arXiv preprint arXiv:1604.06174, 2016. [Online]. Available: https://doi.org/10.48550/arXiv.1604.06174

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1604.06174 2016
[42]

Benchmarking flops utilization on tpu v4,

J. Bradbury, Q. Zhang, and A. Selvan, “Benchmarking flops utilization on tpu v4,” Google Cloud (whitepaper), 2023. [Online]. Available: https://services.google.com/fh/files/blogs/tpu v4 benchmarking.pdf

work page 2023
[43]

Megascale: Scaling large language model training to more than 10,000 gpus,

Z. Jiang, H. Linet al., “Megascale: Scaling large language model training to more than 10,000 gpus,” inNSDI 2024, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.15627

work page doi:10.48550/arxiv.2402.15627 2024
[44]

MLPerf training benchmark,

P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y . Wei, P. Bailis, V . Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, A. Ike, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, G. Ma, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V . J. Reddi, T. Robie, T. St. John, T. Tabbe...

work page doi:10.48550/arxiv.1910.01500 2020
[45]

Datacenter-scale analysis and optimization of GPU machine learning workloads,

C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y . Jia, B. Jia, T. Leesatapornwongsa, H. Li, Y . Lianget al., “Datacenter-scale analysis and optimization of GPU machine learning workloads,”IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 11, pp. 2766–2780, 2021. [Online]. Available: https://doi.o...

work page doi:10.1109/tpds.2021.3084540 2021
[46]

Roofline: An insightful visual performance model for multicore architectures,

S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,”Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009. [Online]. Available: https://doi.org/10.1145/1498765.1498785

work page doi:10.1145/1498765.1498785 2009

[1] [1]

Microsoft fy2025 q4 earnings,

Microsoft, “Microsoft fy2025 q4 earnings,” Microsoft Investor Relations,

work page

[2] [2]

Available: https://www.microsoft.com/en-us/investor/eve nts/fy-2025/earnings-fy-2025-q4

[Online]. Available: https://www.microsoft.com/en-us/investor/eve nts/fy-2025/earnings-fy-2025-q4

work page 2025

[3] [3]

Alphabet announces fourth quarter and fiscal year 2025 results,

Alphabet Inc., “Alphabet announces fourth quarter and fiscal year 2025 results,” February 2026. [Online]. Available: https: //www.sec.gov/Archives/edgar/data/1652044/000165204426000012/goo gexhibit991q42025.htm

work page arXiv 2025

[4] [4]

Meta reports fourth quarter and full year 2025 results,

Meta Platforms, “Meta reports fourth quarter and full year 2025 results,” January 2026. [Online]. Available: https://investor.atmeta.com/investor-n ews/press-release-details/2026/Meta-Reports-Fourth-Quarter-and-Ful l-Year-2025-Results/default.aspx

work page 2025

[5] [5]

Citigroup forecasts big tech’s ai spending to cross $2.8 trillion by 2029,

Reuters, “Citigroup forecasts big tech’s ai spending to cross $2.8 trillion by 2029,” September 2025. [Online]. Available: https://www.reuters.com/world/china/citigroup-forecasts-big-techs-ai-s pending-cross-28-trillion-by-2029-2025-09-30/

work page 2029

[6] [6]

Pytorch profiler,

PyTorch Contributors, “Pytorch profiler,” PyTorch Documentation, 2024. [Online]. Available: https://pytorch.org/tutorials/recipes/recipes/profiler r ecipe.html

work page 2024

[7] [7]

Deepspeed flops profiler,

Microsoft DeepSpeed Team, “Deepspeed flops profiler,” DeepSpeed Documentation, 2024. [Online]. Available: https://www.deepspeed.ai/tut orials/flops-profiler/

work page 2024

[8] [8]

Hpctoolkit: Tools for performance analysis of optimized parallel programs,

L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor- Crummey, and N. R. Tallent, “Hpctoolkit: Tools for performance analysis of optimized parallel programs,” inConcurrency and Computation: Practice and Experience, vol. 22, no. 6, 2010, pp. 685–701. [Online]. Available: https://doi.org/10.1002/cpe.1553

work page doi:10.1002/cpe.1553 2010

[9] [9]

Nvidia/megatron-lm: Ongoing research training transformer models at scale,

NVIDIA, “Nvidia/megatron-lm: Ongoing research training transformer models at scale,” GitHub README, 2025. [Online]. Available: https://github.com/NVIDIA/Megatron-LM

work page 2025

[10] [10]

Pytorch lightning,

Lightning AI, “Pytorch lightning,” GitHub repository, 2024. [Online]. Available: https://github.com/Lightning-AI/pytorch-lightning

work page 2024

[11] [11]

Nvidia nemo: A toolkit for building ai models,

NVIDIA, “Nvidia nemo: A toolkit for building ai models,” GitHub repository, 2024. [Online]. Available: https://github.com/NVIDIA/NeMo

work page 2024

[12] [12]

PaLM: Scaling Language Modeling with Pathways

A. Chowdheryet al., “Palm: Scaling language modeling with pathways,”arXiv:2204.02311, 2022. [Online]. Available: https: //doi.org/10.48550/arXiv.2204.02311

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.02311 2022

[13] [13]

Nvidia dcgm exporter,

NVIDIA, “Nvidia dcgm exporter,” GitHub repository, 2025. [Online]. Available: https://github.com/NVIDIA/dcgm-exporter

work page 2025

[14] [14]

nv-one-logger: Nvidia’s distributed metrics logging system,

——, “nv-one-logger: Nvidia’s distributed metrics logging system,” GitHub repository, 2024. [Online]. Available: https://github.com/NVIDI A/nv-one-logger

work page 2024

[15] [15]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, “Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model,”arXiv preprint arXiv:2405.04434,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

[Online]. Available: https://doi.org/10.48550/arXiv.2405.04434

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434

[17] [17]

Nvidia tesla v100 gpu architecture,

NVIDIA, “Nvidia tesla v100 gpu architecture,” NVIDIA, Tech. Rep. WP-08608-001 v1.1, 2017. [Online]. Available: https://images.nvidia.co m/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

work page 2017

[18] [18]

Nvidia a100 tensor core gpu architecture,

——, “Nvidia a100 tensor core gpu architecture,” NVIDIA, Tech. Rep.,

work page

[19] [19]

Available: https://images.nvidia.com/aem-dam/en-zz/So lutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

[Online]. Available: https://images.nvidia.com/aem-dam/en-zz/So lutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

work page

[20] [20]

Nvidia h100 tensor core gpu architecture,

——, “Nvidia h100 tensor core gpu architecture,” NVIDIA, Tech. Rep., 2022, whitepaper. [Online]. Available: https://www.advancedclustering.c om/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf

work page 2022

[21] [21]

Nvidia tensor core evolution: From volta to blackwell,

SemiAnalysis, “Nvidia tensor core evolution: From volta to blackwell,” SemiAnalysis Newsletter, June 2025. [Online]. Available: https: //newsletter.semianalysis.com/p/nvidia-tensor-core-evolution-from-vol ta-to-blackwell

work page 2025

[22] [22]

Accelerating ai training with nvidia tf32 tensor cores,

D. Stosic and P. Micikevicius, “Accelerating ai training with nvidia tf32 tensor cores,” NVIDIA Technical Blog, January 2021. [Online]. Available: https://developer.nvidia.com/blog/accelerating-ai-training-wit h-tf32-tensor-cores/

work page 2021

[23] [23]

Nvidia h100 tensor core gpu architecture,

NVIDIA, “Nvidia h100 tensor core gpu architecture,” NVIDIA, Tech. Rep., 2022, whitepaper. [Online]. Available: https://www.advancedcluste ring.com/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf

work page 2022

[24] [24]

Floating-point 8: An introduction to efficient, lower-precision ai training,

K. Sevegnani, G. Fiameni, U. Uppal, S. Perez, and A. Pilzer, “Floating-point 8: An introduction to efficient, lower-precision ai training,” NVIDIA Technical Blog, June 2025. [Online]. Available: https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-e fficient-lower-precision-ai-training/

work page 2025

[25] [25]

Cutlass tutorial: Fast matrix-multiplication with wgmma on nvidia hopper gpus,

Colfax Research, “Cutlass tutorial: Fast matrix-multiplication with wgmma on nvidia hopper gpus,” Colfax Research Blog, August 2024. [Online]. Available: https://research.colfax-intl.com/cutlass-tutorial-wgm ma-hopper/

work page 2024

[26] [26]

Introducing nvfp4 for efficient and accurate low- precision inference,

E. Alvarez, O. Almog, E. Chung, S. Layton, D. Stosic, R. Krashinsky, and K. Aubrey, “Introducing nvfp4 for efficient and accurate low- precision inference,” NVIDIA Technical Blog, June 2025. [Online]. Available: https://developer.nvidia.com/blog/introducing-nvfp4-for-effic ient-and-accurate-low-precision-inference/

work page 2025

[27] [27]

Nvidia rtx blackwell gpu architecture,

NVIDIA, “Nvidia rtx blackwell gpu architecture,” NVIDIA, Tech. Rep., 2025, architecture Whitepaper. [Online]. Available: https: //images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-b lackwell-gpu-architecture.pdf

work page 2025

[28] [28]

Pretraining large language models with NVFP4

P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, N. Mellempudi, S. Oberman, M. Shoeybi, M. Renber, K. Siu, and H. Wu, “Pretraining large language models with NVFP4,”arXiv preprint arXiv:2509.25149,

work page arXiv

[29] [29]

Pretraining large language models with NVFP4

[Online]. Available: https://doi.org/10.48550/arXiv.2509.25149

work page doi:10.48550/arxiv.2509.25149

[30] [30]

Data movement is all you need: A case study on optimizing transformers,

A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, and T. Hoefler, “Data movement is all you need: A case study on optimizing transformers,” in Proceedings of Machine Learning and Systems (MLSys), vol. 3, 2021, pp. 711–732. [Online]. Available: https://doi.org/10.48550/arXiv.2007.00072

work page doi:10.48550/arxiv.2007.00072 2021

[31] [31]

Matrix multiplication background user’s guide,

NVIDIA, “Matrix multiplication background user’s guide,” NVIDIA Deep Learning Performance Documentation, 2024. [Online]. Available: https://docs.nvidia.com/deeplearning/performance/dl-performance-matri x-multiplication/index.html

work page 2024

[32] [32]

cublas library user’s guide,

——, “cublas library user’s guide,” CUDA Toolkit Documentation, 2025. [Online]. Available: https://docs.nvidia.com/cuda/cublas/index.html

work page 2025

[33] [33]

nvmatmulheuristics,

——, “nvmatmulheuristics,” 2025. [Online]. Available: https://developer. nvidia.com/blog/improving-gemm-kernel-auto-tuning-efficiency-on-nvi dia-gpus-with-heuristics-and-cutlass-4-2/

work page 2025

[34] [34]

Improving gemm kernel auto-tuning efficiency on nvidia gpus with heuristics and cutlass 4.2,

H. Zhao, D. Yan, A. Wang, A. Kerr, and M. Yan, “Improving gemm kernel auto-tuning efficiency on nvidia gpus with heuristics and cutlass 4.2,” NVIDIA Technical Blog, January 2025. [Online]. Available: https://developer.nvidia.com/blog/improving-gemm-kernel-auto-tunin g-efficiency-on-nvidia-gpus-with-heuristics-and-cutlass-4-2/

work page 2025

[35] [35]

Nvidia h100 tensor core gpu,

NVIDIA, “Nvidia h100 tensor core gpu,” NVIDIA Data Center GPU Product Page, 2024. [Online]. Available: https://www.nvidia.com/en-us/ data-center/h100/

work page 2024

[36] [36]

Nvidia gb200 tensor core gpu,

——, “Nvidia gb200 tensor core gpu,” NVIDIA Data Center GPU Product Page, 2024. [Online]. Available: https://www.nvidia.com/en-us/ data-center/gb200-nvl72/

work page 2024

[37] [37]

Kuaishou

V . Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,”arXiv preprint arXiv:2205.05198, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.05198

work page doi:10.48550/arxiv.2205.05198 2022

[38] [38]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2001.08361

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2001.08361 2001

[39] [39]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,”arXiv preprint arXiv:2312.00752, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2312.00752

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.00752 2024

[40] [40]

Nvidia osmo: Cloud-native orchestration platform for physical ai,

NVIDIA, “Nvidia osmo: Cloud-native orchestration platform for physical ai,” Product page, 2025. [Online]. Available: https: //us-west-2-aws.osmo.nvidia.com/

work page 2025

[41] [41]

Training Deep Nets with Sublinear Memory Cost

T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,”arXiv preprint arXiv:1604.06174, 2016. [Online]. Available: https://doi.org/10.48550/arXiv.1604.06174

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1604.06174 2016

[42] [42]

Benchmarking flops utilization on tpu v4,

J. Bradbury, Q. Zhang, and A. Selvan, “Benchmarking flops utilization on tpu v4,” Google Cloud (whitepaper), 2023. [Online]. Available: https://services.google.com/fh/files/blogs/tpu v4 benchmarking.pdf

work page 2023

[43] [43]

Megascale: Scaling large language model training to more than 10,000 gpus,

Z. Jiang, H. Linet al., “Megascale: Scaling large language model training to more than 10,000 gpus,” inNSDI 2024, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.15627

work page doi:10.48550/arxiv.2402.15627 2024

[44] [44]

MLPerf training benchmark,

P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y . Wei, P. Bailis, V . Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, A. Ike, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, G. Ma, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V . J. Reddi, T. Robie, T. St. John, T. Tabbe...

work page doi:10.48550/arxiv.1910.01500 2020

[45] [45]

Datacenter-scale analysis and optimization of GPU machine learning workloads,

C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y . Jia, B. Jia, T. Leesatapornwongsa, H. Li, Y . Lianget al., “Datacenter-scale analysis and optimization of GPU machine learning workloads,”IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 11, pp. 2766–2780, 2021. [Online]. Available: https://doi.o...

work page doi:10.1109/tpds.2021.3084540 2021

[46] [46]

Roofline: An insightful visual performance model for multicore architectures,

S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,”Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009. [Online]. Available: https://doi.org/10.1145/1498765.1498785

work page doi:10.1145/1498765.1498785 2009