
arxiv: 2605.20799 · v1 · pith:2FQL2UO6new · submitted 2026-05-20 · 💻 cs.DC · cs.LG

Instant GPU Efficiency Visibility at Fleet Scale

Pith reviewed 2026-05-21 02:38 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords GPU efficiency · Overall FLOP Utilization · hardware performance counters · model FLOPs utilization · AI training · fleet monitoring · tensor cores · tile quantization

The pith

Two on-chip GPU counters predict application MFU to within two percentage points on controlled GEMM workloads after tile-quantization correction

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Overall FLOP Utilization (OFU) as a hardware-level metric for GPU efficiency in AI workloads that requires no application changes. It derives the value from just two GPU counters, Tensor Pipe Activity and SM clock frequency. Controlled GEMM tests on H100 and GB200 across FP16, TF32, FP8 and NVFP4 show that correcting for tile quantization brings the estimate within two percentage points of application-level model FLOPs utilization (MFU). Across 608 production training jobs the metric correlates with application-level MFU at r = 0.78, and it has already flagged two framework-level FLOP miscalculations plus a 2.5x efficiency regression in live fleets. This setup makes continuous efficiency tracking feasible at data-center scale without per-job overhead.

Core claim

Overall FLOP Utilization (OFU) is obtained directly from the product of two on-chip counters, Tensor Pipe Activity and SM clock frequency, yielding a precision-agnostic proxy for total FLOP throughput. After a tile-quantization correction, the resulting value matches application-reported model FLOPs utilization to within two percentage points on GEMM workloads spanning multiple precisions and GPU generations. Applied to hundreds of real training jobs, OFU surfaces both misreported FLOPs in frameworks and measurable utilization shifts during mixed-precision pretraining.
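
The paper's exact formula is not reproduced on this page. As a minimal sketch, assuming OFU is the Tensor Pipe Activity (a fraction between 0 and 1 of cycles in which the tensor pipe is busy) scaled by the ratio of the sampled SM clock to the clock at which peak FLOPs are rated, the computation might look like this. The function name, the argument names and the reference-clock normalization are illustrative assumptions, not the authors' definition.

```python
def overall_flop_utilization(tensor_pipe_activity: float,
                             sm_clock_mhz: float,
                             reference_clock_mhz: float) -> float:
    """Hypothetical OFU sketch: tensor-pipe busy fraction times relative clock.

    tensor_pipe_activity: fraction of cycles the tensor pipe is active (0..1).
    sm_clock_mhz:         sampled SM clock frequency.
    reference_clock_mhz:  clock at which peak FLOPs are rated (assumed normalizer).
    """
    if not 0.0 <= tensor_pipe_activity <= 1.0:
        raise ValueError("tensor_pipe_activity must be in [0, 1]")
    if sm_clock_mhz < 0 or reference_clock_mhz <= 0:
        raise ValueError("clock frequencies must be non-negative / positive")
    return tensor_pipe_activity * (sm_clock_mhz / reference_clock_mhz)


# Made-up sample values for illustration, not measurements from the paper.
ofu = overall_flop_utilization(0.5, 1500.0, 2000.0)
print(f"OFU = {ofu:.1%}")
```

Under these assumptions a tensor pipe that is busy half the time at three quarters of the reference clock yields an OFU of 37.5%. No precision-specific peak FLOP figure enters the calculation, which is consistent with the precision-agnostic claim.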

What carries the argument

Overall FLOP Utilization (OFU), the normalized product of Tensor Pipe Activity and SM clock frequency that serves as a hardware proxy for aggregate floating-point throughput without per-application calibration

If this is right

  • OFU can be collected fleet-wide without modifying or instrumenting any running applications
  • The metric has already identified a 2.5x efficiency regression in deployed production systems
  • Compared against application MFU on 608 production jobs, it surfaced two framework-level FLOP miscalculations
  • OFU tracks utilization changes that occur when switching precisions during pretraining
  • Correlation of 0.78 with measured MFU on 608 production jobs supports its role as a complement to application-level monitoring

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fleet schedulers could use live OFU readings to pause or migrate inefficient jobs automatically
  • The same counter-based approach might transfer to other accelerators if comparable activity and frequency signals are exposed
  • Long-term fleet data from OFU could expose systematic efficiency differences across model families or software stacks
  • Adding further counters might shrink the remaining error for workloads whose non-tensor operations OFU undercounts
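
To make the scheduler idea concrete, here is a minimal sketch of how a fleet monitor could flag a job whose OFU has dropped sharply against its own baseline. The function name, the median-based comparison and the default threshold are hypothetical choices for illustration; the paper does not describe its detection logic on this page.

```python
from statistics import median


def flag_regression(baseline_ofu, recent_ofu, min_ratio=2.0):
    """Return (flagged, ratio) where ratio = median(baseline) / median(recent).

    A ratio of 2.5 means the job now runs at 1/2.5 of its earlier OFU.
    min_ratio is an assumed alerting threshold, not a value from the paper.
    """
    if not baseline_ofu or not recent_ofu:
        raise ValueError("both sample windows must be non-empty")
    base = median(baseline_ofu)
    now = median(recent_ofu)
    if now <= 0:
        # Tensor pipe idle: treat as an unbounded regression.
        return True, float("inf")
    ratio = base / now
    return ratio >= min_ratio, ratio


# Synthetic OFU samples (fractions), invented for illustration.
flagged, ratio = flag_regression([0.40, 0.42, 0.41], [0.16, 0.17, 0.16])
print(flagged, round(ratio, 2))
```

Medians are used so that a single noisy clock sample does not trigger an alert; a production system would likely also need a minimum window length and per-workload baselines.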

Load-bearing premise

The two on-chip counters provide a sufficient and unbiased proxy for total FLOP throughput across all workload types and precisions without requiring per-application calibration

What would settle it

A workload where the tile-quantization-corrected OFU deviates by more than two percentage points from independently measured application MFU, such as a non-GEMM task or an untested precision or GPU generation
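
The check described above can be sketched as follows. Tile quantization means a GEMM output of size M x N is computed in whole tiles, so the hardware does work on padded rows and columns that the application does not count as useful FLOPs. This sketch assumes the correction simply divides raw OFU by the padded-to-useful work ratio and ignores padding along K; the tile sizes, function names and sample values are illustrative assumptions rather than the paper's procedure.

```python
import math


def tile_quantization_overhead(m: int, n: int, tile_m: int, tile_n: int) -> float:
    """Ratio of work done on whole tiles to useful work for an M x N output."""
    padded_m = math.ceil(m / tile_m) * tile_m
    padded_n = math.ceil(n / tile_n) * tile_n
    return (padded_m * padded_n) / (m * n)


def corrected_ofu(raw_ofu: float, overhead: float) -> float:
    """Assumed correction: remove the share of activity spent on padding."""
    return raw_ofu / overhead


def within_tolerance(ofu: float, app_mfu: float, tolerance_pp: float = 2.0) -> bool:
    """Compare two utilization fractions in percentage points."""
    return abs(ofu - app_mfu) * 100.0 <= tolerance_pp


# A 257 x 257 output with 128 x 128 tiles needs a 3 x 3 tile grid (384 x 384).
overhead = tile_quantization_overhead(257, 257, 128, 128)
print(f"overhead = {overhead:.3f}")

# Synthetic example: raw OFU of 0.60 with a 1.2x overhead corrects to 0.50.
print(within_tolerance(corrected_ofu(0.60, 1.2), 0.51))
```

A dimension that is an exact multiple of the tile size gives an overhead of 1.0, so the correction only matters for poorly aligned shapes.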

Figures

Figures reproduced from arXiv: 2605.20799 by Collin Neale, Connor Pedersen, Dong H. Ahn, Michel Migdal, Nik Konyuchenko.

Figure 1: FLOP overhead for square matrices due to tiling and cuBLAS kernel selection.
Figure 3: Throughput speedup over TF32 on GB200.
  • CUTLASS 2/3: Open-source GEMM templates; CUTLASS 2 lacks Cooperative Grid Array (CGA) support and is typically selected for small or poorly aligned matrices. Modern kernels (nvJet, XMMA, CUTLASS 3) use CGAs [20], which group (CM, CN) thread blocks into clusters that share distributed shared memory across SMs. This introduces a two-level tiling hierarchy: at the f…
Figure 2: Tile-quantization overhead in GEMM execution [27].
Figure 4: OFU prediction error. [Scatter-plot residue, apparently belonging to Figure 5: x-axis App MFU (%), y-axis OFU (%), both 0 to 70; r = 0.53 (all jobs), r = 0.78 (excluding case studies); legend: Other (526), Case study jobs (82), y = x reference line.]
Figure 5: App MFU vs. OFU for 608 production training jobs.
Figure 6: OFU before and after removing debug overhead,
Figure 7: OFU and MFU relative to mean MFU on a pretraining
Original abstract

We present Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric for AI workloads on HPC systems, derived from two on-chip performance counters: Tensor Pipe Activity and SM clock frequency. OFU requires no application instrumentation and works across GPU generations and numeric precisions. We characterize five properties of the OFU approximation -- tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting -- through controlled GEMM experiments on H100 and GB200 across FP16, TF32, FP8, and NVFP4. After tile-quantization correction, OFU predicts application-level MFU to within <=2 percentage points. Against 608 production training jobs, OFU achieves r = 0.78 correlation with application-level MFU and surfaces two framework-level FLOPs miscalculations. Deployed across large-scale GPU fleets, OFU has detected a 2.5x efficiency regression and tracked precision-dependent utilization changes in mixed-precision pretraining. Our evaluation and operational experience suggest OFU is a practical, deployment-ready complement to application-level MFU for continuous fleet-wide efficiency monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric derived solely from Tensor Pipe Activity and SM clock frequency counters. It characterizes five properties of the approximation (tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting) via controlled GEMM experiments on H100 and GB200 across FP16/TF32/FP8/NVFP4, applies a tile-quantization correction, and claims that the corrected OFU predicts application-level MFU to within <=2 percentage points. On 608 production training jobs, OFU achieves r=0.78 correlation with application MFU, detects two framework-level FLOP miscalculations, and in fleet deployment has identified a 2.5x efficiency regression while tracking precision-dependent changes.

Significance. If the accuracy and generalization claims hold, OFU would provide a valuable, low-overhead complement to application-level MFU for continuous fleet-scale efficiency monitoring in large AI/HPC systems, requiring no instrumentation and spanning GPU generations and precisions. The authors merit explicit credit for deriving the metric directly from raw hardware counters without fitting to target MFU values, for the controlled multi-precision GEMM characterization of its limitations, and for the concrete operational evidence of detecting real regressions and miscalculations in production.

major comments (2)
  1. [Abstract] The load-bearing claim that 'After tile-quantization correction, OFU predicts application-level MFU to within <=2 percentage points' immediately follows the GEMM characterization but precedes the separate statement of r=0.78 on 608 production jobs. The manuscript must explicitly state on which workloads (GEMM kernels only, or full training applications) and with what error statistics (e.g., MAE, 95th-percentile deviation, or per-precision breakdown) this bound was measured, because non-tensor undercounting and mixed kernels in arbitrary workloads could exceed 2pp and undermine the generalization to fleet-scale use.
  2. [Production evaluation section] The reported r=0.78 correlation on 608 jobs is presented as validation, yet without the distribution of absolute per-job errors or their relation to MFU variance across the fleet it is impossible to determine whether typical deviations remain within the claimed 2pp or are substantially larger. This directly affects whether the metric supports the central goal of reliable, instant efficiency visibility.
minor comments (2)
  1. [Definition section] The explicit formula combining Tensor Pipe Activity and SM clock frequency into OFU should be stated as an equation in the definition section to aid reproducibility and clarity.
  2. [GEMM characterization figures] Figures in the GEMM characterization should include error bars or sample sizes for the measured properties across precisions to allow readers to assess variability.
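
The second major comment turns on the fact that a correlation coefficient says nothing about the size of absolute errors. A minimal sketch with synthetic numbers (invented for illustration, not taken from the paper) shows a predictor that correlates perfectly with MFU yet misses it by 8 percentage points on every job; the helper names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def abs_error_summary(pred, truth):
    """Return (mean absolute error, nearest-rank 95th-percentile error)."""
    errs = sorted(abs(p - t) for p, t in zip(pred, truth))
    mae = sum(errs) / len(errs)
    idx = max(0, math.ceil(0.95 * len(errs)) - 1)
    return mae, errs[idx]


# Synthetic per-job values in percent.
app_mfu = [20.0, 30.0, 40.0, 50.0, 60.0]
ofu = [m + 8.0 for m in app_mfu]  # constant 8 pp offset

r = pearson_r(ofu, app_mfu)
mae, p95 = abs_error_summary(ofu, app_mfu)
print(f"r = {r:.2f}, MAE = {mae:.1f} pp, p95 = {p95:.1f} pp")
```

Reporting MAE and a high percentile alongside r, as the referee requests, would separate systematic offsets from scatter.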

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading and constructive feedback on the clarity of our accuracy claims. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract] The load-bearing claim that 'After tile-quantization correction, OFU predicts application-level MFU to within <=2 percentage points' immediately follows the GEMM characterization but precedes the separate statement of r=0.78 on 608 production jobs. The manuscript must explicitly state on which workloads (GEMM kernels only, or full training applications) and with what error statistics (e.g., MAE, 95th-percentile deviation, or per-precision breakdown) this bound was measured, because non-tensor undercounting and mixed kernels in arbitrary workloads could exceed 2pp and undermine the generalization to fleet-scale use.

    Authors: We agree the abstract phrasing creates ambiguity about the source of the <=2pp bound. This bound was measured as the maximum observed deviation between corrected OFU and known utilization on the controlled GEMM kernels across FP16/TF32/FP8/NVFP4 on H100 and GB200. For full applications we rely on the separate r=0.78 correlation result. We will revise the abstract to explicitly attribute the 2pp bound to the GEMM characterization, state the error metric used (maximum deviation after tile correction), and add a per-precision breakdown of errors to the evaluation section. Non-tensor undercounting is already characterized as one of the five properties and its effect is reflected in the production correlation. revision: yes

  2. Referee: [Production evaluation section] The reported r=0.78 correlation on 608 jobs is presented as validation, yet without the distribution of absolute per-job errors or their relation to MFU variance across the fleet it is impossible to determine whether typical deviations remain within the claimed 2pp or are substantially larger. This directly affects whether the metric supports the central goal of reliable, instant efficiency visibility.

    Authors: We acknowledge that absolute error statistics would help readers assess whether deviations stay within the 2pp range in practice. The r=0.78 correlation on 608 diverse production jobs quantifies a strong linear relationship across the observed range of fleet MFU values. We will add a short discussion in the production evaluation section explaining how this correlation, combined with the fleet MFU variance, is consistent with typical deviations remaining small. However, the full distribution of absolute per-job errors is not pre-computed from the raw data. revision: partial

standing simulated objections not resolved
  • Distribution of absolute per-job errors for the 608 production jobs

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from hardware counters

full rationale

OFU is defined directly from two raw on-chip counters (Tensor Pipe Activity and SM clock frequency) with no fitting to target MFU values. The five properties are characterized via independent controlled GEMM experiments on H100/GB200 across precisions; the tile-quantization correction is derived from those experiments and the <=2pp claim follows as a reported outcome. The r=0.78 correlation is measured separately on 608 production training jobs and presented as validation, not as part of the metric definition. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the abstract or described chain. The approach remains externally falsifiable against application-level MFU.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the two named counters capture tensor throughput accurately enough for the approximation to hold after a single tile-quantization correction. No free parameters are explicitly fitted in the abstract description, and no new physical entities are postulated.

axioms (1)
  • [domain assumption] Tensor Pipe Activity and SM clock frequency counters are available and stable across GPU generations and precisions.
    Invoked when stating that OFU works across generations and numeric precisions without instrumentation.
invented entities (1)
  • Overall FLOP Utilization (OFU) [no independent evidence]
    purpose: Hardware-level efficiency metric derived from two counters.
    New named quantity introduced to approximate MFU; no independent falsifiable prediction beyond the reported correlation is given.

pith-pipeline@v0.9.0 · 5742 in / 1211 out tokens · 27640 ms · 2026-05-21T02:38:42.692592+00:00 · methodology



Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 6 internal anchors

  1. [1]

    Microsoft fy2025 q4 earnings,

    Microsoft, “Microsoft fy2025 q4 earnings,” Microsoft Investor Relations,

  2. [2]

    Available: https://www.microsoft.com/en-us/investor/events/fy-2025/earnings-fy-2025-q4

    [Online]. Available: https://www.microsoft.com/en-us/investor/events/fy-2025/earnings-fy-2025-q4

  3. [3]

    Alphabet announces fourth quarter and fiscal year 2025 results,

    Alphabet Inc., “Alphabet announces fourth quarter and fiscal year 2025 results,” February 2026. [Online]. Available: https://www.sec.gov/Archives/edgar/data/1652044/000165204426000012/googexhibit991q42025.htm

  4. [4]

    Meta reports fourth quarter and full year 2025 results,

    Meta Platforms, “Meta reports fourth quarter and full year 2025 results,” January 2026. [Online]. Available: https://investor.atmeta.com/investor-news/press-release-details/2026/Meta-Reports-Fourth-Quarter-and-Full-Year-2025-Results/default.aspx

  5. [5]

    Citigroup forecasts big tech’s ai spending to cross $2.8 trillion by 2029,

    Reuters, “Citigroup forecasts big tech’s ai spending to cross $2.8 trillion by 2029,” September 2025. [Online]. Available: https://www.reuters.com/world/china/citigroup-forecasts-big-techs-ai-spending-cross-28-trillion-by-2029-2025-09-30/

  6. [6]

    Pytorch profiler,

    PyTorch Contributors, “Pytorch profiler,” PyTorch Documentation, 2024. [Online]. Available: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html

  7. [7]

    Deepspeed flops profiler,

    Microsoft DeepSpeed Team, “Deepspeed flops profiler,” DeepSpeed Documentation, 2024. [Online]. Available: https://www.deepspeed.ai/tutorials/flops-profiler/

  8. [8]

    Hpctoolkit: Tools for performance analysis of optimized parallel programs,

    L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, “Hpctoolkit: Tools for performance analysis of optimized parallel programs,” in Concurrency and Computation: Practice and Experience, vol. 22, no. 6, 2010, pp. 685–701. [Online]. Available: https://doi.org/10.1002/cpe.1553

  9. [9]

    Nvidia/megatron-lm: Ongoing research training transformer models at scale,

    NVIDIA, “Nvidia/megatron-lm: Ongoing research training transformer models at scale,” GitHub README, 2025. [Online]. Available: https://github.com/NVIDIA/Megatron-LM

  10. [10]

    Pytorch lightning,

    Lightning AI, “Pytorch lightning,” GitHub repository, 2024. [Online]. Available: https://github.com/Lightning-AI/pytorch-lightning

  11. [11]

    Nvidia nemo: A toolkit for building ai models,

    NVIDIA, “Nvidia nemo: A toolkit for building ai models,” GitHub repository, 2024. [Online]. Available: https://github.com/NVIDIA/NeMo

  12. [12]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery et al., “Palm: Scaling language modeling with pathways,” arXiv:2204.02311, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2204.02311

  13. [13]

    Nvidia dcgm exporter,

    NVIDIA, “Nvidia dcgm exporter,” GitHub repository, 2025. [Online]. Available: https://github.com/NVIDIA/dcgm-exporter

  14. [14]

    nv-one-logger: Nvidia’s distributed metrics logging system,

    NVIDIA, “nv-one-logger: Nvidia’s distributed metrics logging system,” GitHub repository, 2024. [Online]. Available: https://github.com/NVIDIA/nv-one-logger

  15. [15]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI, “Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model,” arXiv preprint arXiv:2405.04434,

  16. [16]
  17. [17]

    Nvidia tesla v100 gpu architecture,

    NVIDIA, “Nvidia tesla v100 gpu architecture,” NVIDIA, Tech. Rep. WP-08608-001 v1.1, 2017. [Online]. Available: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

  18. [18]

    Nvidia a100 tensor core gpu architecture,

    ——, “Nvidia a100 tensor core gpu architecture,” NVIDIA, Tech. Rep.,

  19. [19]

    Available: https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

    [Online]. Available: https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

  20. [20]

    Nvidia h100 tensor core gpu architecture,

    NVIDIA, “Nvidia h100 tensor core gpu architecture,” NVIDIA, Tech. Rep., 2022, whitepaper. [Online]. Available: https://www.advancedclustering.com/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf

  21. [21]

    Nvidia tensor core evolution: From volta to blackwell,

    SemiAnalysis, “Nvidia tensor core evolution: From volta to blackwell,” SemiAnalysis Newsletter, June 2025. [Online]. Available: https://newsletter.semianalysis.com/p/nvidia-tensor-core-evolution-from-volta-to-blackwell

  22. [22]

    Accelerating ai training with nvidia tf32 tensor cores,

    D. Stosic and P. Micikevicius, “Accelerating ai training with nvidia tf32 tensor cores,” NVIDIA Technical Blog, January 2021. [Online]. Available: https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/

  23. [23]

    Nvidia h100 tensor core gpu architecture,

    NVIDIA, “Nvidia h100 tensor core gpu architecture,” NVIDIA, Tech. Rep., 2022, whitepaper. [Online]. Available: https://www.advancedclustering.com/wp-content/uploads/2022/03/gtc22-whitepaper-hopper.pdf

  24. [24]

    Floating-point 8: An introduction to efficient, lower-precision ai training,

    K. Sevegnani, G. Fiameni, U. Uppal, S. Perez, and A. Pilzer, “Floating-point 8: An introduction to efficient, lower-precision ai training,” NVIDIA Technical Blog, June 2025. [Online]. Available: https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/

  25. [25]

    Cutlass tutorial: Fast matrix-multiplication with wgmma on nvidia hopper gpus,

    Colfax Research, “Cutlass tutorial: Fast matrix-multiplication with wgmma on nvidia hopper gpus,” Colfax Research Blog, August 2024. [Online]. Available: https://research.colfax-intl.com/cutlass-tutorial-wgmma-hopper/

  26. [26]

    Introducing nvfp4 for efficient and accurate low-precision inference,

    E. Alvarez, O. Almog, E. Chung, S. Layton, D. Stosic, R. Krashinsky, and K. Aubrey, “Introducing nvfp4 for efficient and accurate low-precision inference,” NVIDIA Technical Blog, June 2025. [Online]. Available: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/

  27. [27]

    Nvidia rtx blackwell gpu architecture,

    NVIDIA, “Nvidia rtx blackwell gpu architecture,” NVIDIA, Tech. Rep., 2025, architecture whitepaper. [Online]. Available: https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf

  28. [28]

    Pretraining large language models with NVFP4

    P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, N. Mellempudi, S. Oberman, M. Shoeybi, M. Renber, K. Siu, and H. Wu, “Pretraining large language models with NVFP4,” arXiv preprint arXiv:2509.25149,

  29. [29]

    Pretraining large language models with NVFP4

    [Online]. Available: https://doi.org/10.48550/arXiv.2509.25149

  30. [30]

    Data movement is all you need: A case study on optimizing transformers,

    A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, and T. Hoefler, “Data movement is all you need: A case study on optimizing transformers,” in Proceedings of Machine Learning and Systems (MLSys), vol. 3, 2021, pp. 711–732. [Online]. Available: https://doi.org/10.48550/arXiv.2007.00072

  31. [31]

    Matrix multiplication background user’s guide,

    NVIDIA, “Matrix multiplication background user’s guide,” NVIDIA Deep Learning Performance Documentation, 2024. [Online]. Available: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

  32. [32]

    cublas library user’s guide,

    ——, “cublas library user’s guide,” CUDA Toolkit Documentation, 2025. [Online]. Available: https://docs.nvidia.com/cuda/cublas/index.html

  33. [33]

    nvmatmulheuristics,

    NVIDIA, “nvmatmulheuristics,” 2025. [Online]. Available: https://developer.nvidia.com/blog/improving-gemm-kernel-auto-tuning-efficiency-on-nvidia-gpus-with-heuristics-and-cutlass-4-2/

  34. [34]

    Improving gemm kernel auto-tuning efficiency on nvidia gpus with heuristics and cutlass 4.2,

    H. Zhao, D. Yan, A. Wang, A. Kerr, and M. Yan, “Improving gemm kernel auto-tuning efficiency on nvidia gpus with heuristics and cutlass 4.2,” NVIDIA Technical Blog, January 2025. [Online]. Available: https://developer.nvidia.com/blog/improving-gemm-kernel-auto-tuning-efficiency-on-nvidia-gpus-with-heuristics-and-cutlass-4-2/

  35. [35]

    Nvidia h100 tensor core gpu,

    NVIDIA, “Nvidia h100 tensor core gpu,” NVIDIA Data Center GPU Product Page, 2024. [Online]. Available: https://www.nvidia.com/en-us/data-center/h100/

  36. [36]

    Nvidia gb200 tensor core gpu,

    NVIDIA, “Nvidia gb200 tensor core gpu,” NVIDIA Data Center GPU Product Page, 2024. [Online]. Available: https://www.nvidia.com/en-us/data-center/gb200-nvl72/

  37. [37]

    Reducing activation recomputation in large transformer models,

    V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,” arXiv preprint arXiv:2205.05198, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.05198

  38. [38]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2001.08361

  39. [39]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2312.00752

  40. [40]

    Nvidia osmo: Cloud-native orchestration platform for physical ai,

    NVIDIA, “Nvidia osmo: Cloud-native orchestration platform for physical ai,” Product page, 2025. [Online]. Available: https://us-west-2-aws.osmo.nvidia.com/

  41. [41]

    Training Deep Nets with Sublinear Memory Cost

    T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,” arXiv preprint arXiv:1604.06174, 2016. [Online]. Available: https://doi.org/10.48550/arXiv.1604.06174

  42. [42]

    Benchmarking flops utilization on tpu v4,

    J. Bradbury, Q. Zhang, and A. Selvan, “Benchmarking flops utilization on tpu v4,” Google Cloud (whitepaper), 2023. [Online]. Available: https://services.google.com/fh/files/blogs/tpu v4 benchmarking.pdf

  43. [43]

    Megascale: Scaling large language model training to more than 10,000 gpus,

    Z. Jiang, H. Lin et al., “Megascale: Scaling large language model training to more than 10,000 gpus,” in NSDI 2024, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.15627

  44. [44]

    MLPerf training benchmark,

    P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, A. Ike, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, G. Ma, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V. J. Reddi, T. Robie, T. St. John, T. Tabbe...

  45. [45]

    Datacenter-scale analysis and optimization of GPU machine learning workloads,

    C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia, T. Leesatapornwongsa, H. Li, Y. Liang et al., “Datacenter-scale analysis and optimization of GPU machine learning workloads,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 11, pp. 2766–2780, 2021. [Online]. Available: https://doi.o...

  46. [46]

    Roofline: An insightful visual performance model for multicore architectures,

    S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009. [Online]. Available: https://doi.org/10.1145/1498765.1498785