EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads
Pith reviewed 2026-05-09 23:56 UTC · model grok-4.3
The pith
EnergAIzer predicts GPU power for AI workloads with 8% error by analytically modeling structured kernel patterns instead of relying on cycle-level simulation or hardware profiling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EnergAIzer builds a performance model whose analytical scaffold comes from the structured patterns created by common AI-kernel optimizations; the scaffold fits empirical data to expose module utilization, which is then supplied to a separate power model to compute dynamic consumption.
What carries the argument
Performance model that treats structured kernel patterns as an analytical scaffold for empirical fitting to predict module-level utilization.
If this is right
- Frequency scaling studies become practical because each trial takes seconds rather than hours.
- Architectural configuration sweeps, including forecasts for next-generation GPUs such as H100, can be performed with only 7 percent error.
- Power-aware design explorations for new AI accelerators no longer require cycle-level simulators or hardware counters for every candidate kernel.
- Datacenter operators gain a tool that supports proactive power management decisions without lengthy profiling campaigns.
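As a hedged illustration of why such studies become cheap: once utilization is a closed-form function of frequency, a DVFS sweep reduces to repeated evaluation of the classic CMOS dynamic-power relation P ≈ α·C·V²·f, with predicted utilization standing in for the activity factor α. The model form and constants here are placeholders, not EnergAIzer's:

```python
def dynamic_power(util, freq_ghz, volt, c_eff=1.0):
    # CMOS dynamic power: P = alpha * C_eff * V^2 * f, with the
    # predicted module utilization standing in for alpha.
    return util * c_eff * volt ** 2 * freq_ghz

def dvfs_sweep(points, util_at):
    # points: (frequency GHz, voltage V) pairs; util_at: analytic
    # utilization-vs-frequency predictor. Every trial is closed-form,
    # so an entire sweep finishes in well under a second.
    return {f: dynamic_power(util_at(f), f, v) for f, v in points}
```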
Where Pith is reading between the lines
- The same pattern-based approach could be tested on other accelerators that run structured matrix or tensor kernels.
- Runtime systems might incorporate the model for on-the-fly power capping or scheduling.
- If the patterns prove stable across compiler versions, the method could reduce reliance on vendor-specific profiling tools.
Load-bearing premise
AI kernels commonly employ optimizations that create structured patterns which analytically determine memory traffic and execution timeline sufficiently to expose accurate module-level utilization without post-hoc fitting adjustments.
What would settle it
Run the same AI workloads on an Ampere GPU while measuring actual per-module activity counters and power; if measured utilization differs by more than a few percent from the model's analytic predictions, the 8 percent power error claim fails.
Original abstract
As AI workloads drive increases in datacenter power consumption, accurate GPU power estimation is critical for proactive power management. However, existing power models face a scalability bottleneck not in the modeling techniques themselves, but in obtaining the hardware utilization inputs they require. Conventional approaches rely on either costly simulation or hardware profiling, which makes them impractical when rapid predictions are required. This work presents EnergAIzer, which addresses this scalability bottleneck by developing a lightweight solution to predict utilization inputs, reducing the estimation walltime from hours to seconds. Our key insight is that kernels in AI workloads commonly employ optimizations that create structured patterns, which analytically determine memory traffic and execution timeline. We construct a performance model using these patterns as an analytical scaffold for empirical data fitting, which also naturally exposes module-level utilization. This predicted utilization is then fed into our power model to estimate dynamic power consumption. EnergAIzer achieves 8% power errors on NVIDIA Ampere GPUs, competitive with traditional power models with elaborate cycle-level simulation or hardware profiling. We demonstrate EnergAIzer's exploration capabilities for frequency scaling and architectural configurations, including forecasting the power of NVIDIA H100 with just 7% error. In summary, EnergAIzer provides fast and accurate power prediction for AI workloads, paving the way for power-aware design explorations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EnergAIzer, a GPU power estimation framework for AI workloads that addresses the scalability bottleneck of obtaining hardware utilization inputs. It develops a lightweight performance model that exploits structured patterns from kernel optimizations in AI workloads to analytically determine memory traffic and execution timelines; these patterns serve as a scaffold for empirical data fitting that exposes module-level utilization, which is then input to a power model. The work reports 8% average power error on NVIDIA Ampere GPUs (competitive with cycle-level simulation or profiling approaches) and demonstrates exploration capabilities including a 7% error forecast for NVIDIA H100 power via architectural configuration studies, reducing estimation walltime from hours to seconds.
Significance. If the central claims hold, EnergAIzer would offer a practical advance in power modeling for AI accelerators by enabling rapid, low-overhead predictions without heavy simulation or per-workload profiling. This could support proactive power management and design-space exploration in datacenters, particularly for emerging architectures like Hopper. The pattern-based analytical scaffold combined with fitting is a potentially reusable idea for performance modeling beyond power.
Major comments (2)
- [Abstract] Abstract and H100 forecasting section: the 7% error claim for NVIDIA H100 power is load-bearing for the exploration contribution, yet the manuscript provides no evidence that the Ampere-derived analytical patterns for memory traffic and execution timeline were re-derived or validated on H100, nor whether any H100 measurements entered the empirical fitting. If the patterns or coefficients are architecture-specific, the utilization inputs to the power model would be invalid even if the power model itself is retuned.
- [Performance Model] Performance model description (likely §3 or §4): the central claim that the pattern scaffold plus fitting 'naturally exposes module-level utilization' for accurate power prediction requires explicit details on the fitting procedure, including which workloads were used for validation vs. fitting, data exclusion rules, cross-validation strategy, and reported error bars or confidence intervals. Without these, the reported 8% Ampere error cannot be assessed for independence from the fitted parameters.
Minor comments (2)
- [Abstract] The abstract states competitive error rates but does not name the specific traditional power models or cycle-level simulators used for comparison; adding these references would strengthen the positioning.
- Notation for module-level utilization and the power model equations should be introduced with a clear table or diagram early in the manuscript to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for clarification in our presentation of the H100 forecasting results and the performance model fitting procedure. We address each point below and commit to revisions that will strengthen the manuscript without altering its core claims.
Point-by-point responses
-
Referee: [Abstract] Abstract and H100 forecasting section: the 7% error claim for NVIDIA H100 power is load-bearing for the exploration contribution, yet the manuscript provides no evidence that the Ampere-derived analytical patterns for memory traffic and execution timeline were re-derived or validated on H100, nor whether any H100 measurements entered the empirical fitting. If the patterns or coefficients are architecture-specific, the utilization inputs to the power model would be invalid even if the power model itself is retuned.
Authors: The analytical patterns are derived from structured kernel optimizations (e.g., tiling and memory access patterns in GEMM and convolution kernels) that are common across AI frameworks and largely architecture-agnostic, as they follow from the CUDA programming model rather than specific hardware parameters. These patterns analytically determine memory traffic and execution timelines, after which empirical fitting on Ampere data exposes module-level utilizations. For the H100 forecast, no H100 hardware measurements were available or used in fitting; instead, we re-parameterize the performance model using publicly documented H100 specifications (SM count, memory bandwidth, tensor core throughput) while retaining the Ampere-fitted utilization predictors. The reported 7% error is computed against power estimates obtained from architectural simulators and NVIDIA documentation for equivalent H100 configurations. We will add an explicit subsection in the revised manuscript detailing these cross-architecture assumptions, the absence of H100 measurements, and the resulting limitations of the forecast. This makes the methodology transparent while preserving the exploration contribution. revision: partial
-
Referee: [Performance Model] Performance model description (likely §3 or §4): the central claim that the pattern scaffold plus fitting 'naturally exposes module-level utilization' for accurate power prediction requires explicit details on the fitting procedure, including which workloads were used for validation vs. fitting, data exclusion rules, cross-validation strategy, and reported error bars or confidence intervals. Without these, the reported 8% Ampere error cannot be assessed for independence from the fitted parameters.
Authors: We agree that the current manuscript provides only a high-level description of the fitting procedure and therefore does not allow readers to fully evaluate the independence of the 8% error. In the revised version we will expand the performance model section to include: the complete list of training and validation workloads with an explicit split (e.g., 70/30 or leave-one-out); any data exclusion criteria applied (e.g., removal of runs with measurement artifacts); the cross-validation strategy used; and error bars or confidence intervals on the reported average power error. These additions will directly address the concern and enable independent assessment of robustness. revision: yes
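The validation protocol the authors commit to could be sketched as follows; `fit` and `predict` are stand-ins for the model's actual fitting and power-prediction routines, not EnergAIzer's implementation:

```python
def leave_one_out_errors(workloads, fit, predict):
    """Per-workload relative power error under leave-one-out CV.

    workloads: list of (features, measured_power) pairs. Holding each
    workload out of fitting shows whether the reported average error
    is independent of the fitted parameters.
    """
    errors = []
    for i, (x, y) in enumerate(workloads):
        held_in = workloads[:i] + workloads[i + 1:]
        params = fit(held_in)
        errors.append(abs(predict(params, x) - y) / y)
    return errors  # report mean and spread, not only the mean
```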
Circularity Check
No significant circularity; derivation relies on independent empirical fitting and architectural exploration
Full rationale
The abstract describes constructing a performance model from structured kernel patterns as an analytical scaffold for empirical data fitting on Ampere GPUs, exposing module-level utilization that is then fed into a separate power model. The H100 result is framed as a forecast obtained by exploring architectural configurations rather than a fit to H100 measurements. No equations, self-citations, or self-definitional steps are present in the provided text that would reduce any prediction to its inputs by construction. The chain therefore retains independent content from the fitting process and does not meet the criteria for circularity.
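The cross-architecture forecast described here amounts to re-parameterizing the analytic timeline with the target GPU's public specifications while keeping the Ampere-fitted predictors unchanged. A minimal sketch, with approximate datasheet figures; treat both the numbers and the model form as illustrative, not the paper's method:

```python
# Approximate public specs (illustrative; verify against NVIDIA datasheets).
A100 = {"sms": 108, "mem_bw_gbs": 1555, "tensor_tflops": 312}
H100 = {"sms": 132, "mem_bw_gbs": 3350, "tensor_tflops": 989}

def retarget_exec_time(t_compute_s, t_memory_s, src, dst):
    # Scale the analytic compute and memory phases of the timeline by
    # the ratio of hardware capabilities; utilization predictors fitted
    # on the source architecture are reused unchanged.
    t_c = t_compute_s * src["tensor_tflops"] / dst["tensor_tflops"]
    t_m = t_memory_s * src["mem_bw_gbs"] / dst["mem_bw_gbs"]
    return max(t_c, t_m)  # assumes compute and memory phases overlap
```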
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Kernels in AI workloads commonly employ optimizations that create structured patterns analytically determining memory traffic and execution timeline.
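One way to read this premise: if traffic and timeline are analytic, per-module utilization reduces to achieved-over-peak throughput with no free parameters left to tune afterward. A schematic sketch; the module set and peak figures are assumptions:

```python
def module_utilization(flops, bytes_moved, exec_time_s, peaks):
    # Achieved throughput divided by peak throughput, per module,
    # over the kernel's analytically determined execution window.
    return {
        "compute": flops / exec_time_s / peaks["flops_per_s"],
        "memory": bytes_moved / exec_time_s / peaks["bytes_per_s"],
    }
```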
Reference graph
Works this paper leans on
-
[1]
2024 united states data center energy usage report,
S. Arman, A. Newkirk, S. J. Smith, A. Hubbard, N. Lei, M. A. B. Siddik, B. Holecek, J. Koomey, E. Masanet, and D. Sartor, “2024 united states data center energy usage report,” Lawrence Berkeley National Laboratory, Tech. Rep., 2024
2024
-
[2]
Ai has high data center energy costs — but there are solutions,
B. Stackpole, “Ai has high data center energy costs — but there are solutions,” 2025. [Online]. Available: https://mitsloan.mit.edu/ideas-made-to-matter/ai-has-high-data-center-energy-costs-there-are-solutions
2025
-
[3]
Towards power efficiency in deep learning on data center hardware,
M. Hodak, M. Gorkovenko, and A. Dholakia, “Towards power efficiency in deep learning on data center hardware,” in 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 1814–1820
2019
-
[4]
Nvidia h100 tensor core gpu architecture,
NVIDIA, “Nvidia h100 tensor core gpu architecture,” 2023. [Online]. Available: https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c
2023
-
[5]
Nvidia blackwell datasheet,
——, “Nvidia blackwell datasheet,” 2025. [Online]. Available: https://resources.nvidia.com/en-us-blackwell-architecture/datasheet
2025
-
[6]
Know your enemy to save cloud energy: Energy-performance characterization of machine learning serving,
J. Yu, J. Kim, and E. Seo, “Know your enemy to save cloud energy: Energy-performance characterization of machine learning serving,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 842–854
2023
-
[7]
Dynamollm: Designing llm inference clusters for performance and energy efficiency,
J. Stojkovic, C. Zhang, I. n. Goiri, J. Torrellas, and E. Choukse, “Dynamollm: Designing llm inference clusters for performance and energy efficiency,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, Mar. 2025, p. 1348–1362. [Online]. Available: http://dx.doi.org/10.1109/HPCA61900.2025.00102
-
[8]
Characterizing power management opportunities for llms in the cloud,
P. Patel, E. Choukse, C. Zhang, I. n. Goiri, B. Warrier, N. Mahalingam, and R. Bianchini, “Characterizing power management opportunities for llms in the cloud,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS ’24. New York, NY, USA: Association for Comp...
-
[9]
Kepler: A framework to calculate the energy consumption of containerized applications,
M. Amaral, H. Chen, T. Chiba, R. Nakazawa, S. Choochotkaew, E. K. Lee, and T. Eilam, “Kepler: A framework to calculate the energy consumption of containerized applications,” in 2023 IEEE 16th International Conference on Cloud Computing (CLOUD), 2023, pp. 69–71
2023
-
[10]
Energy-aware tile size selection for affine programs on gpus,
M. Jayaweera, M. Kong, Y. Wang, and D. Kaeli, “Energy-aware tile size selection for affine programs on gpus,” in 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2024, pp. 13–27
2024
-
[11]
Zeus: Understanding and optimizing GPU energy consumption of DNN training,
J. You, J.-W. Chung, and M. Chowdhury, “Zeus: Understanding and optimizing GPU energy consumption of DNN training,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). Boston, MA: USENIX Association, Apr. 2023, pp. 119–. [Online]. Available: https://www.usenix.org/conference/nsdi23/presentation/you
2023
-
[13]
Reducing energy bloat in large model training,
J.-W. Chung, Y. Gu, I. Jang, L. Meng, N. Bansal, and M. Chowdhury, “Reducing energy bloat in large model training,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, ser. SOSP ’24. ACM, Nov. 2024, p. 144–159. [Online]. Available: http://dx.doi.org/10.1145/3694715.3695970
-
[14]
Gpuwattch: enabling energy optimizations in gpgpus,
J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “Gpuwattch: enabling energy optimizations in gpgpus,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA ’13. New York, NY, USA: Association for Computing Machinery, 2013, p. 487–498. [Online]. Available: https://doi.org...
-
[15]
Accelwattch: A power modeling framework for modern gpus,
V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas, “Accelwattch: A power modeling framework for modern gpus,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 738–753. [Online]. Available:...
-
[16]
Understanding the future of energy efficiency in multi-module gpus,
A. Arunkumar, E. Bolotin, D. Nellans, and C.-J. Wu, “Understanding the future of energy efficiency in multi-module gpus,” in2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 519–532
2019
-
[17]
An integrated gpu power and performance model,
S. Hong and H. Kim, “An integrated gpu power and performance model,” SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 280–289, Jun. 2010. [Online]. Available: https://doi.org/10.1145/1816038.1815998
-
[18]
Gpgpu power modeling for multi-domain voltage-frequency scaling,
J. Guerreiro, A. Ilic, N. Roma, and P. Tomas, “Gpgpu power modeling for multi-domain voltage-frequency scaling,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018, pp. 789–800
2018
-
[19]
Gpgpu performance and power estimation using machine learning,
G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou, “Gpgpu performance and power estimation using machine learning,” in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 564–576
2015
-
[20]
Performance-aware energy-efficient gpu frequency selection using dnn-based models,
G. Ali, M. Side, S. Bhalachandra, N. J. Wright, and Y. Chen, “Performance-aware energy-efficient gpu frequency selection using dnn-based models,” in Proceedings of the 52nd International Conference on Parallel Processing, ser. ICPP ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 433–442. [Online]. Available: https://doi.org/10.1145/...
-
[21]
Improving gpu energy efficiency through an application-transparent frequency scaling policy with performance assurance,
Y. Zhang, Q. Wang, Z. Lin, P. Xu, and B. Wang, “Improving gpu energy efficiency through an application-transparent frequency scaling policy with performance assurance,” in Proceedings of the Nineteenth European Conference on Computer Systems, ser. EuroSys ’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 769–785. [Online]. Available: ...
-
[22]
Minimizing power consumption in digital cmos circuits,
A. Chandrakasan and R. Brodersen, “Minimizing power consumption in digital cmos circuits,”Proceedings of the IEEE, vol. 83, no. 4, pp. 498–523, 1995
1995
-
[23]
Accel-sim: An extensible simulation framework for validated gpu modeling,
M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-sim: An extensible simulation framework for validated gpu modeling,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 473–486
2020
-
[24]
Analyzing machine learning workloads using a detailed gpu simulator,
J. Lew, D. A. Shah, S. Pati, S. Cattell, M. Zhang, A. Sandhupatla, C. Ng, N. Goli, M. D. Sinclair, T. G. Rogers, and T. M. Aamodt, “Analyzing machine learning workloads using a detailed gpu simulator,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 151–152
2019
-
[25]
Modeling deep learning accelerator enabled gpus,
M. A. Raihan, N. Goli, and T. M. Aamodt, “Modeling deep learning accelerator enabled gpus,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, Mar. 2019, p. 79–92. [Online]. Available: http://dx.doi.org/10.1109/ISPASS.2019.00016
-
[26]
Principal kernel analysis: A tractable methodology to simulate scaled gpu workloads,
C. Avalos Baddouh, M. Khairy, R. N. Green, M. Payer, and T. G. Rogers, “Principal kernel analysis: A tractable methodology to simulate scaled gpu workloads,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 724–737. [Online]. Available: https://...
-
[27]
Allegro: GPU simulation acceleration for machine learning workloads,
E. Chung, S. Na, and H. Kim, “Allegro: GPU simulation acceleration for machine learning workloads,” in Machine Learning for Computer Architecture and Systems 2024, 2024. [Online]. Available: https://openreview.net/forum?id=aYbb7xZuu6
2024
-
[28]
Path forward beyond simulators: Fast and accurate gpu execution time prediction for dnn workloads,
Y. Li, Y. Sun, and A. Jog, “Path forward beyond simulators: Fast and accurate gpu execution time prediction for dnn workloads,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 380–394. [Online]. Available: https://doi.org/10.1145/36...
-
[29]
Forecasting gpu performance for deep learning training and inference,
S. Lee, A. Phanishayee, and D. Mahajan, “Forecasting gpu performance for deep learning training and inference,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’25. ACM, Mar. 2025, p. 493–508. [Online]. Available: http://dx.doi.org/10.1145/3669940.3707265
-
[30]
Habitat: A runtime-based computational performance predictor for deep neural network training,
G. X. Yu, Y. Gao, P. Golikov, and G. Pekhimenko, “Habitat: A runtime-based computational performance predictor for deep neural network training,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, Jul. 2021, pp. 503–521. [Online]. Available: https://www.usenix.org/conference/atc21/presentation/yu
2021
-
[31]
1.1 computing’s energy problem (and what we can do about it),
M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10–14
2014
-
[32]
Cuda c++ programming guide release 13.0,
NVIDIA, “Cuda c++ programming guide release 13.0,” 2025. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
2025
-
[33]
Nsight compute documentation v2025.3.0,
——, “Nsight compute documentation v2025.3.0,” 2025. [Online]. Available: https://docs.nvidia.com/nsight-compute/index.html
2025
-
[34]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017
2017
-
[35]
Cutlass 4.1.0,
NVIDIA, “Cutlass 4.1.0,” 2025. [Online]. Available: https://docs.nvidia.com/cutlass/overview.html
2025
-
[36]
cublas release 13.0,
——, “cublas release 13.0,” 2025. [Online]. Available: https://docs.nvidia.com/cuda/pdf/CUBLAS Library.pdf
2025
-
[37]
Delta: Gpu performance model for deep learning applications with in-depth memory system traffic analysis,
S. Lym, D. Lee, M. O’Connor, N. Chatterjee, and M. Erez, “Delta: Gpu performance model for deep learning applications with in-depth memory system traffic analysis,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, Mar. 2019, p. 293–303. [Online]. Available: http://dx.doi.org/10.1109/ISPASS.2019.00041
-
[38]
Llmcompass: Enabling efficient hardware design for large language model inference,
H. Zhang, A. Ning, R. B. Prabhakar, and D. Wentzlaff, “Llmcompass: Enabling efficient hardware design for large language model inference,” in Proceedings of the 51st Annual International Symposium on Computer Architecture, ser. ISCA ’24. IEEE Press, 2025, p. 1080–1096. [Online]. Available: https://doi.org/10.1109/ISCA59077.2024.00082
-
[39]
Flashattention: Fast and memory-efficient exact attention with io-awareness,
T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” 2022
2022
-
[40]
Nvml reference manual,
NVIDIA, “Nvml reference manual,” 2025. [Online]. Available: https://docs.nvidia.com/deploy/nvml-api/index.html
2025
-
[41]
Bert: Pre-training of deep bidirectional transformers for language understanding,
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018
2018
-
[42]
Language models are unsupervised multitask learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:160025533
2019
-
[43]
Opt: Open pre-trained transformer language models,
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “Opt: Open pre-trained transformer language models,” 2022
2022
-
[44]
Qwen2 technical report,
A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T...
2024
-
[45]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun. 2016, p. 770–778. [Online]. Available: http://dx.doi.org/10.1109/cvpr.2016.90
-
[46]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020
2020
-
[47]
Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,
S. Mehta and M. Rastegari, “Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,” 2021
2021
-
[48]
Sparseloop: An analytical approach to sparse tensor accelerator modeling,
Y. N. Wu, P.-A. Tsai, A. Parashar, V. Sze, and J. S. Emer, “Sparseloop: An analytical approach to sparse tensor accelerator modeling,” in ACM/IEEE International Symposium on Microarchitecture (MICRO), 2022
2022
-
[49]
Roofline: An insightful visual performance model for multicore architectures,
S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”Commun. ACM, vol. 52, no. 4, p. 65–76, Apr. 2009. [Online]. Available: https://doi.org/10.1145/1498765.1498785
-
[50]
A roofline model of energy,
J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc, “A roofline model of energy,” in2013 IEEE 27th International Symposium on Parallel and Distributed Processing. IEEE, 2013, pp. 661–672
2013
-
[51]
An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness,
S. Hong and H. Kim, “An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness,” SIGARCH Comput. Archit. News, vol. 37, no. 3, p. 152–163, Jun. 2009. [Online]. Available: https://doi.org/10.1145/1555815.1555775
-
[52]
Timeloop: A systematic approach to dnn accelerator evaluation,
A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A systematic approach to dnn accelerator evaluation,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 304–315
2019
-
[54]
How to access: Please download the artifacts from the Github: https://github.com/kyungmi-lee/energaizer-ispass26-artifact or the Zenodo archive: 10.5281/zenodo.18916559
-
[55]
The artifacts require Python3 and Anaconda/Miniconda Virtual Environments
Software dependencies: The provided bash scripts can be executed in Linux or Mac OS environments. The artifacts require Python3 and Anaconda/Miniconda Virtual Environments. D. Installation: The installation has two steps: 1) download the pre-collected database for reproducing the results, and 2) building a virtual environment with dependent libraries. Bot...