
arxiv: 2604.11391 · v1 · submitted 2026-04-13 · 💻 cs.PF


Architectural Trade-offs in the Energy-Efficient Era: A Comparative Study of power-capping NVIDIA H100 and H200

Aditya Ujeniya, Georg Hager, Gerhard Wellein, Jan Eitzinger


Pith reviewed 2026-05-10 15:15 UTC · model grok-4.3

classification 💻 cs.PF
keywords GPU energy efficiency · power capping · NVIDIA H100 · NVIDIA H200 · memory bandwidth · compute-bound workloads · memory-bound workloads · Roofline model

The pith

Under power caps, the H100 GPU is slightly better for compute-bound workloads while the H200 excels for memory-bound applications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares the energy efficiency of NVIDIA H100 and H200 GPUs when total board power is capped. The two cards have nearly identical compute units, but the H200's faster HBM3e memory (versus HBM2e on the H100) changes how the power budget is split between the memory subsystem and the compute cores. Tests with a matrix-multiplication kernel (compute-bound) and a streaming-bandwidth kernel (memory-bound) show the H100 delivering better performance per watt on the former and the H200 on the latter. The study also fits regression models to memory power draw and flags cases where memory power exceeds the fitted limit. The distinction matters because power capping is now routine in large GPU clusters as a way to control energy costs.
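
The measurement setup this implies is simple to reproduce. The paper does not name its tooling, but NVML is the standard interface for both setting caps and sampling board power; a minimal sketch using the pynvml bindings follows, with the sampling window and 50 W step size chosen arbitrarily for illustration.

```python
# Minimal power-cap sweep using NVML via the pynvml bindings. Tooling, sampling
# window, and 50 W step size are assumptions for illustration, not the paper's
# setup. Setting a cap (nvmlDeviceSetPowerManagementLimit) requires root.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Valid cap range in milliwatts, as reported by the driver.
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)

def avg_power_w(duration_s=5.0, interval_s=0.1):
    """Average board power draw in watts over a sampling window."""
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
        time.sleep(interval_s)
    return sum(samples) / len(samples)

for cap_mw in range(min_mw, max_mw + 1, 50_000):      # sweep in 50 W steps
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, cap_mw)
    # ... launch DGEMM or the bandwidth kernel here, then sample ...
    print(f"cap={cap_mw / 1000:.0f} W  avg draw={avg_power_w():.1f} W")

pynvml.nvmlShutdown()
```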

Core claim

By isolating memory bandwidth as the main architectural variable between the H100 (HBM2e) and H200 (HBM3e), and applying power caps, the work finds that the H100 remains the slightly better choice for strictly compute-bound workloads across varying power caps, whereas the H200 demonstrates superior efficiency for memory-bound applications.

What carries the argument

Power capping that shifts the split of power between memory and streaming multiprocessors, combined with the Roofline extremes of DGEMM for compute-bound and TheBandwidthBenchmark for memory-bound workloads.
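
A back-of-envelope Roofline estimate shows why these two kernels bracket the design space: attainable performance is the minimum of the compute roof and arithmetic intensity times the bandwidth roof. The peak figures below are nominal, illustration-only values, not measurements from the paper. On the compute roof the two cards coincide, so any efficiency gap there must come from how the capped power budget is spent, which is exactly the paper's lever.

```python
# Roofline sketch: attainable performance = min(compute roof, AI * bandwidth roof).
# Peak figures are nominal, illustration-only values; consult vendor datasheets.
def attainable_gflops(ai_flop_per_byte, peak_gflops, peak_gbs):
    return min(peak_gflops, ai_flop_per_byte * peak_gbs)

H100 = dict(peak_gflops=50_000, peak_gbs=2_000)   # FP64 roof (assumed), HBM2e
H200 = dict(peak_gflops=50_000, peak_gbs=4_800)   # same SMs (assumed), HBM3e

# DGEMM: ~2N^3 flops over ~24N^2 bytes, AI ~ N/12 flop/byte -> compute-bound.
# Schönauer triad a[i] = b[i] + c[i]*d[i]: 2 flops per 32 bytes -> AI = 0.0625.
for name, gpu in [("H100", H100), ("H200", H200)]:
    for kernel, ai in [("DGEMM (AI~341)", 341.3), ("Triad (AI=0.0625)", 0.0625)]:
        print(f"{name} {kernel}: {attainable_gflops(ai, **gpu):,.0f} GFLOP/s")
```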

If this is right

  • GPU selection for energy efficiency under power limits should depend on whether a workload is limited by arithmetic or by memory access.
  • Higher memory bandwidth in the H200 improves efficiency specifically when memory power becomes the dominant consumer under caps.
  • Memory power consumption follows a predictable regression with occasional outliers that exceed the fitted limit (a toy fit of this kind is sketched after this list).
  • Efficiency rankings hold across the range of power caps tested for the two workload classes.
  • Workload-aware assignment of H100 and H200 cards can improve overall system energy use in power-constrained installations.
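
A sketch of the kind of fit and outlier screen the paper describes for memory power, under stated assumptions: the fit form (memory power against achieved bandwidth), the ordinary-least-squares procedure, the two-sigma residual threshold, and all data points are invented for illustration, not taken from the paper.

```python
# Illustrative OLS fit of memory power against achieved bandwidth, with a
# residual-based outlier screen. Fit form, two-sigma threshold, and all data
# points are invented for illustration; they are not the paper's numbers.
import numpy as np

bw_gbs = np.array([800, 1200, 1600, 2000, 2400, 2800, 3200, 3600])  # hypothetical
mem_w  = np.array([ 55,   70,   86,  101,  118,  133,  190,  165])  # hypothetical

slope, intercept = np.polyfit(bw_gbs, mem_w, 1)   # ordinary least squares, degree 1
resid = mem_w - (slope * bw_gbs + intercept)
r2 = 1.0 - resid.var() / mem_w.var()

# Flag points whose residual exceeds two standard deviations of the residuals.
outliers = np.flatnonzero(np.abs(resid) > 2 * resid.std())
print(f"P_mem ≈ {slope:.3f} W per GB/s + {intercept:.1f} W, R² = {r2:.3f}")
print("outlier indices:", outliers)  # here: the 190 W point at 3200 GB/s
```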

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Clusters running mixed job types could improve total efficiency by routing compute-heavy jobs to H100 nodes and memory-heavy jobs to H200 nodes.
  • The same isolation method could be applied to future GPU generations to quantify how memory technology changes affect power-limited efficiency.
  • Profiling tools that classify jobs by their position on the Roofline would help operators decide which card type to allocate; a toy classifier of this kind follows this list.
  • Extending the measurements to frequency scaling or other power-management knobs could reveal additional trade-offs.
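
Tying the first and third points together, a toy router under assumed inputs: per-job totals of floating-point operations and bytes moved (e.g., from hardware counters), and an illustrative ridge point near the FP64 balance of these cards. Nothing here comes from the paper itself.

```python
# Toy Roofline router. Assumed inputs: per-job totals of flops and bytes moved
# (e.g., from hardware counters); the ridge point is an illustrative FP64 value.
def route_job(total_flops, total_bytes, ridge_flop_per_byte=15.0):
    """Send compute-bound jobs to H100 nodes, memory-bound jobs to H200 nodes."""
    ai = total_flops / total_bytes            # arithmetic intensity, flop/byte
    return "H100 pool" if ai >= ridge_flop_per_byte else "H200 pool"

# DGEMM-like job, N=4096: AI ~ N/12 ~ 341 flop/byte -> compute-bound.
print(route_job(2 * 4096**3, 3 * 4096**2 * 8))    # -> H100 pool
# Streaming-triad-like job: AI = 0.0625 flop/byte -> memory-bound.
print(route_job(2 * 10**9, 32 * 10**9))           # -> H200 pool
```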

Load-bearing premise

That the selected benchmarks accurately isolate pure compute-bound and memory-bound behavior and that memory bandwidth is the dominant difference driving the efficiency results.

What would settle it

Running the same power-cap sweeps on a mixed workload such as sparse matrix-vector multiplication or a real application that sits between the two Roofline extremes to check whether the efficiency ranking between H100 and H200 stays the same or reverses.
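
A back-of-envelope estimate shows why CSR-format SpMV is a good probe here. Assuming FP64 values, 32-bit column indices, and the optimistic case that the input and output vectors move only once (all assumptions; the paper does not analyze SpMV):

```python
# Back-of-envelope arithmetic intensity for CSR SpMV (y = A x). Assumes FP64
# values, 32-bit column indices, and that x, y, and the row pointer move once;
# real traffic depends on the sparsity pattern and cache behavior.
def spmv_ai(nnz, nrows):
    flops = 2 * nnz                                    # multiply + add per nonzero
    bytes_moved = nnz * (8 + 4) + nrows * (8 + 8 + 4)  # values+colidx, x+y+rowptr
    return flops / bytes_moved

# A matrix with ~10 nonzeros per row:
print(f"AI ≈ {spmv_ai(nnz=10_000_000, nrows=1_000_000):.3f} flop/byte")  # ≈ 0.143
```

At roughly 0.14 flop/byte this is still memory-bound on either card, but its irregular access pattern stresses the memory subsystem differently from a streaming kernel, which is what makes it a useful tiebreaker.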

Figures

Figures reproduced from arXiv: 2604.11391 by Aditya Ujeniya, Georg Hager, Gerhard Wellein, Jan Eitzinger.

Figure 1. Effective bandwidth for different kernels on NVIDIA H100 and H200.
Figure 2. DGEMM: relationship between performance throughput and average …
Figure 3. DGEMM: split violin plots detailing average SM frequency and memory …
Figure 4. Relationship between performance throughput and power draw across …
Figure 5. Schönauer Triad performance metrics comparing the NVIDIA H100 (left …
Figure 6. DGEMM benchmark analysis: (a) Energy efficiency comparison between …
Figure 7. Schönauer Triad benchmark analysis: (a) Energy efficiency comparison …
read the original abstract

Modern NVIDIA GPUs like the H100 (HBM2e) and H200 (HBM3e) share similar compute characteristics but differ significantly in memory interface technology and bandwidth. By isolating memory bandwidth as a key variable, the power distribution between the memory and Streaming Multiprocessors (SM) changes notably between the two architectures. In the era of energy-efficient computing, analyzing how these hardware characteristics impact performance per watt is critical. This study investigates how the H100 and H200 manage memory power consumption at various power-cap levels. By a regression analysis, we study the memory power limit and uncover outliers consuming more memory power. To evaluate efficiency, we employ compute-bound (DGEMM) and memory-bound (TheBandwidthBenchmark) workloads, representing the two extremes of the Roofline model. Our observations indicate that across varying power caps, the H100 remains the slightly better choice for strictly compute-bound workloads, whereas the H200 demonstrates superior efficiency for memory-bound applications.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript compares power-capping behavior on NVIDIA H100 (HBM2e) and H200 (HBM3e) GPUs, which share similar compute characteristics but differ in memory bandwidth. It performs regression analysis on memory power limits to identify outliers, then evaluates energy efficiency using DGEMM (compute-bound) and TheBandwidthBenchmark (memory-bound) workloads that represent Roofline extremes. The central observation is that the H100 is slightly preferable for compute-bound workloads while the H200 shows superior efficiency for memory-bound applications across power-cap levels.

Significance. If the empirical results and methodology are fully documented, the work would provide practical guidance on GPU architecture selection for power-constrained HPC and data-center workloads, highlighting how memory-subsystem differences affect performance per watt. The choice of standard Roofline-aligned benchmarks is a positive aspect that grounds the comparison in established performance modeling.

major comments (3)
  1. [Abstract] The regression analysis on memory power limits and outlier detection is described, but no model equation, fitting procedure, coefficients, R² values, or quantitative results are supplied, so the power-distribution claims cannot be evaluated.
  2. [Abstract/Results] No tables, figures, or numerical values (performance per watt, power breakdowns, error bars, or statistical tests) are presented for the DGEMM and TheBandwidthBenchmark runs, leaving the central efficiency comparison unsupported.
  3. [Abstract] The claim that the workloads remain strictly compute-bound versus memory-bound under power caps (and that memory bandwidth is isolated) is not accompanied by any re-validation of arithmetic intensity or Roofline position at each cap level, so attribution of efficiency differences solely to HBM2e vs. HBM3e is not yet demonstrated.
minor comments (2)
  1. The benchmark name 'TheBandwidthBenchmark' should be clarified (is it a standard tool or custom code?) and any source or citation provided.
  2. [Title] Inconsistent capitalization ('power-capping'); standardize to 'Power-Capping' or 'power capping'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where additional documentation and validation will improve the clarity and rigor of the manuscript. We address each major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] The regression analysis on memory power limits and outlier detection is described, but no model equation, fitting procedure, coefficients, R² values, or quantitative results are supplied, so the power-distribution claims cannot be evaluated.

    Authors: We agree that the abstract provides only a high-level description. The full manuscript details the regression in Section 3, but to enable direct evaluation we will revise the abstract to include the linear regression model equation, the fitting procedure (ordinary least squares), the coefficients, R² values, and a brief summary of the quantitative outlier results. A supporting table with these statistics will also be added to the main text. revision: yes

  2. Referee: [Abstract/Results] No tables, figures, or numerical values (performance per watt, power breakdowns, error bars, or statistical tests) are presented for the DGEMM and TheBandwidthBenchmark runs, leaving the central efficiency comparison unsupported.

    Authors: The results section contains figures showing the efficiency trends, but we acknowledge the absence of explicit numerical tables and statistical details in both the abstract and main text. We will add a summary table of performance-per-watt values, power breakdowns, and error bars for representative power-cap levels, along with the results of statistical tests comparing the two architectures. Key numerical highlights will be incorporated into the revised abstract. revision: yes

  3. Referee: [Abstract] The claim that the workloads remain strictly compute-bound versus memory-bound under power caps (and that memory bandwidth is isolated) is not accompanied by any re-validation of arithmetic intensity or Roofline position at each cap level, so attribution of efficiency differences solely to HBM2e vs. HBM3e is not yet demonstrated.

    Authors: This observation is correct. Our original analysis relied on the established Roofline characteristics of the chosen benchmarks without re-measuring arithmetic intensity at every power-cap setting. We will add a dedicated validation subsection (or appendix) that reports arithmetic intensity and Roofline placement for both DGEMM and TheBandwidthBenchmark across the full range of power caps on each GPU. This will confirm that the workloads remain in their intended regimes and strengthen the attribution to memory-subsystem differences. revision: yes
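
The re-validation proposed in this third response is cheap to script: DGEMM's flop count and the triad's byte traffic are known analytically, so measured runtimes at each cap yield achieved GFLOP/s and GB/s directly, to be checked against the respective roofs. A sketch with placeholder runtimes (not the paper's data):

```python
# Regime re-validation sketch: DGEMM's flop count and the triad's byte traffic
# are known analytically, so runtimes at each cap give achieved rates directly.
# All runtimes below are placeholders, not measurements from the paper.
N = 8192                          # DGEMM matrix dimension (assumed)
L = 500_000_000                   # triad vector length (assumed)
dgemm_flops = 2 * N**3
triad_bytes = 4 * 8 * L           # a = b + c*d: three reads + one write, FP64

caps_w  = [300, 400, 500, 600, 700]
dgemm_s = [0.041, 0.031, 0.026, 0.023, 0.022]        # placeholder runtimes (s)
triad_s = [0.0095, 0.0090, 0.0087, 0.0086, 0.0086]   # placeholder runtimes (s)

for cap, td, tt in zip(caps_w, dgemm_s, triad_s):
    gflops = dgemm_flops / td / 1e9   # achieved compute rate
    gbs = triad_bytes / tt / 1e9      # achieved memory bandwidth
    # Each kernel stays in its regime if its achieved rate tracks its own roof
    # while remaining far below the opposite roof.
    print(f"{cap} W cap: DGEMM {gflops:,.0f} GFLOP/s, triad {gbs:,.0f} GB/s")
```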

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential reductions

full rationale

The paper presents direct experimental measurements of performance and power on H100 and H200 GPUs under power caps using DGEMM (compute-bound) and TheBandwidthBenchmark (memory-bound). It describes a regression analysis solely to identify outliers in memory power consumption, with no fitted parameters then reused as 'predictions' of efficiency ratios or other quantities. No equations, first-principles derivations, uniqueness theorems, or ansatzes appear; the central claim follows from the observed data points without any step that reduces by construction to prior inputs or self-citations. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new entities are introduced; the work relies on standard empirical benchmarking and regression applied to hardware power measurements.

pith-pipeline@v0.9.0 · 5486 in / 1123 out tokens · 39437 ms · 2026-05-10T15:15:03.085134+00:00 · methodology

