EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads
Pith reviewed 2026-05-09 23:56 UTC · model grok-4.3
The pith
EnergAIzer predicts GPU power for AI workloads with 8% error by analytically modeling structured kernel patterns instead of relying on cycle-level simulation or hardware profiling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EnergAIzer builds a performance model whose analytical scaffold comes from the structured patterns created by common AI-kernel optimizations; the scaffold fits empirical data to expose module utilization, which is then supplied to a separate power model to compute dynamic consumption.
What carries the argument
Performance model that treats structured kernel patterns as an analytical scaffold for empirical fitting to predict module-level utilization.
If this is right
- Frequency scaling studies become practical because each trial takes seconds rather than hours.
- Architectural configuration sweeps, including forecasts for next-generation GPUs such as H100, can be performed with only 7 percent error.
- Power-aware design explorations for new AI accelerators no longer require cycle-level simulators or hardware counters for every candidate kernel.
- Datacenter operators gain a tool that supports proactive power management decisions without lengthy profiling campaigns.
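As a hedged illustration of why such studies become cheap: once utilization is a closed-form function of frequency, a DVFS sweep reduces to repeated evaluation of the classic CMOS dynamic-power relation P ≈ α·C·V²·f, with predicted utilization standing in for the activity factor α. The model form and constants here are placeholders, not EnergAIzer's:

```python
def dynamic_power(util, freq_ghz, volt, c_eff=1.0):
    # CMOS dynamic power: P = alpha * C_eff * V^2 * f, with the
    # predicted module utilization standing in for alpha.
    return util * c_eff * volt ** 2 * freq_ghz

def dvfs_sweep(points, util_at):
    # points: (frequency GHz, voltage V) pairs; util_at: analytic
    # utilization-vs-frequency predictor. Every trial is closed-form,
    # so an entire sweep finishes in well under a second.
    return {f: dynamic_power(util_at(f), f, v) for f, v in points}
```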
Where Pith is reading between the lines
- The same pattern-based approach could be tested on other accelerators that run structured matrix or tensor kernels.
- Runtime systems might incorporate the model for on-the-fly power capping or scheduling.
- If the patterns prove stable across compiler versions, the method could reduce reliance on vendor-specific profiling tools.
Load-bearing premise
AI kernels commonly employ optimizations that create structured patterns which analytically determine memory traffic and execution timeline sufficiently to expose accurate module-level utilization without post-hoc fitting adjustments.
What would settle it
Run the same AI workloads on an Ampere GPU while measuring actual per-module activity counters and power; if measured utilization differs by more than a few percent from the model's analytic predictions, the 8 percent power error claim fails.
Original abstract
As AI workloads drive increases in datacenter power consumption, accurate GPU power estimation is critical for proactive power management. However, existing power models face a scalability bottleneck not in the modeling techniques themselves, but in obtaining the hardware utilization inputs they require. Conventional approaches rely on either costly simulation or hardware profiling, which makes them impractical when rapid predictions are required. This work presents EnergAIzer, which addresses this scalability bottleneck by developing a lightweight solution to predict utilization inputs, reducing the estimation walltime from hours to seconds. Our key insight is that kernels in AI workloads commonly employ optimizations that create structured patterns, which analytically determine memory traffic and execution timeline. We construct a performance model using these patterns as an analytical scaffold for empirical data fitting, which also naturally exposes module-level utilization. This predicted utilization is then fed into our power model to estimate dynamic power consumption. EnergAIzer achieves 8% power errors on NVIDIA Ampere GPUs, competitive with traditional power models with elaborate cycle-level simulation or hardware profiling. We demonstrate EnergAIzer's exploration capabilities for frequency scaling and architectural configurations, including forecasting the power of NVIDIA H100 with just 7% error. In summary, EnergAIzer provides fast and accurate power prediction for AI workloads, paving the way for power-aware design explorations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EnergAIzer, a GPU power estimation framework for AI workloads that addresses the scalability bottleneck of obtaining hardware utilization inputs. It develops a lightweight performance model that exploits structured patterns from kernel optimizations in AI workloads to analytically determine memory traffic and execution timelines; these patterns serve as a scaffold for empirical data fitting that exposes module-level utilization, which is then input to a power model. The work reports 8% average power error on NVIDIA Ampere GPUs (competitive with cycle-level simulation or profiling approaches) and demonstrates exploration capabilities including a 7% error forecast for NVIDIA H100 power via architectural configuration studies, reducing estimation walltime from hours to seconds.
Significance. If the central claims hold, EnergAIzer would offer a practical advance in power modeling for AI accelerators by enabling rapid, low-overhead predictions without heavy simulation or per-workload profiling. This could support proactive power management and design-space exploration in datacenters, particularly for emerging architectures like Hopper. The pattern-based analytical scaffold combined with fitting is a potentially reusable idea for performance modeling beyond power.
Major comments (2)
- [Abstract] Abstract and H100 forecasting section: the 7% error claim for NVIDIA H100 power is load-bearing for the exploration contribution, yet the manuscript provides no evidence that the Ampere-derived analytical patterns for memory traffic and execution timeline were re-derived or validated on H100, nor whether any H100 measurements entered the empirical fitting. If the patterns or coefficients are architecture-specific, the utilization inputs to the power model would be invalid even if the power model itself is retuned.
- [Performance Model] Performance model description (likely §3 or §4): the central claim that the pattern scaffold plus fitting 'naturally exposes module-level utilization' for accurate power prediction requires explicit details on the fitting procedure, including which workloads were used for validation vs. fitting, data exclusion rules, cross-validation strategy, and reported error bars or confidence intervals. Without these, the reported 8% Ampere error cannot be assessed for independence from the fitted parameters.
Minor comments (2)
- [Abstract] The abstract states competitive error rates but does not name the specific traditional power models or cycle-level simulators used for comparison; adding these references would strengthen the positioning.
- Notation for module-level utilization and the power model equations should be introduced with a clear table or diagram early in the manuscript to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for clarification in our presentation of the H100 forecasting results and the performance model fitting procedure. We address each point below and commit to revisions that will strengthen the manuscript without altering its core claims.
Point-by-point responses
-
Referee: [Abstract] Abstract and H100 forecasting section: the 7% error claim for NVIDIA H100 power is load-bearing for the exploration contribution, yet the manuscript provides no evidence that the Ampere-derived analytical patterns for memory traffic and execution timeline were re-derived or validated on H100, nor whether any H100 measurements entered the empirical fitting. If the patterns or coefficients are architecture-specific, the utilization inputs to the power model would be invalid even if the power model itself is retuned.
Authors: The analytical patterns are derived from structured kernel optimizations (e.g., tiling and memory access patterns in GEMM and convolution kernels) that are common across AI frameworks and largely architecture-agnostic, as they follow from the CUDA programming model rather than specific hardware parameters. These patterns analytically determine memory traffic and execution timelines, after which empirical fitting on Ampere data exposes module-level utilizations. For the H100 forecast, no H100 hardware measurements were available or used in fitting; instead, we re-parameterize the performance model using publicly documented H100 specifications (SM count, memory bandwidth, tensor core throughput) while retaining the Ampere-fitted utilization predictors. The reported 7% error is computed against power estimates obtained from architectural simulators and NVIDIA documentation for equivalent H100 configurations. We will add an explicit subsection in the revised manuscript detailing these cross-architecture assumptions, the absence of H100 measurements, and the resulting limitations of the forecast. This makes the methodology transparent while preserving the exploration contribution. revision: partial
-
Referee: [Performance Model] Performance model description (likely §3 or §4): the central claim that the pattern scaffold plus fitting 'naturally exposes module-level utilization' for accurate power prediction requires explicit details on the fitting procedure, including which workloads were used for validation vs. fitting, data exclusion rules, cross-validation strategy, and reported error bars or confidence intervals. Without these, the reported 8% Ampere error cannot be assessed for independence from the fitted parameters.
Authors: We agree that the current manuscript provides only a high-level description of the fitting procedure and therefore does not allow readers to fully evaluate the independence of the 8% error. In the revised version we will expand the performance model section to include: the complete list of training and validation workloads with an explicit split (e.g., 70/30 or leave-one-out); any data exclusion criteria applied (e.g., removal of runs with measurement artifacts); the cross-validation strategy used; and error bars or confidence intervals on the reported average power error. These additions will directly address the concern and enable independent assessment of robustness. revision: yes
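The validation protocol the authors commit to could be sketched as follows; `fit` and `predict` are stand-ins for the model's actual fitting and power-prediction routines, not EnergAIzer's implementation:

```python
def leave_one_out_errors(workloads, fit, predict):
    """Per-workload relative power error under leave-one-out CV.

    workloads: list of (features, measured_power) pairs. Holding each
    workload out of fitting shows whether the reported average error
    is independent of the fitted parameters.
    """
    errors = []
    for i, (x, y) in enumerate(workloads):
        held_in = workloads[:i] + workloads[i + 1:]
        params = fit(held_in)
        errors.append(abs(predict(params, x) - y) / y)
    return errors  # report mean and spread, not only the mean
```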
Circularity Check
No significant circularity; derivation relies on independent empirical fitting and architectural exploration
Full rationale
The abstract describes constructing a performance model from structured kernel patterns as an analytical scaffold for empirical data fitting on Ampere GPUs, exposing module-level utilization that is then fed into a separate power model. The H100 result is framed as a forecast obtained by exploring architectural configurations rather than a fit to H100 measurements. No equations, self-citations, or self-definitional steps are present in the provided text that would reduce any prediction to its inputs by construction. The chain therefore retains independent content from the fitting process and does not meet the criteria for circularity.
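The cross-architecture forecast described here amounts to re-parameterizing the analytic timeline with the target GPU's public specifications while keeping the Ampere-fitted predictors unchanged. A minimal sketch, with approximate datasheet figures; treat both the numbers and the model form as illustrative, not the paper's method:

```python
# Approximate public specs (illustrative; verify against NVIDIA datasheets).
A100 = {"sms": 108, "mem_bw_gbs": 1555, "tensor_tflops": 312}
H100 = {"sms": 132, "mem_bw_gbs": 3350, "tensor_tflops": 989}

def retarget_exec_time(t_compute_s, t_memory_s, src, dst):
    # Scale the analytic compute and memory phases of the timeline by
    # the ratio of hardware capabilities; utilization predictors fitted
    # on the source architecture are reused unchanged.
    t_c = t_compute_s * src["tensor_tflops"] / dst["tensor_tflops"]
    t_m = t_memory_s * src["mem_bw_gbs"] / dst["mem_bw_gbs"]
    return max(t_c, t_m)  # assumes compute and memory phases overlap
```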
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Kernels in AI workloads commonly employ optimizations that create structured patterns analytically determining memory traffic and execution timeline.
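One way to read this premise: if traffic and timeline are analytic, per-module utilization reduces to achieved-over-peak throughput with no free parameters left to tune afterward. A schematic sketch; the module set and peak figures are assumptions:

```python
def module_utilization(flops, bytes_moved, exec_time_s, peaks):
    # Achieved throughput divided by peak throughput, per module,
    # over the kernel's analytically determined execution window.
    return {
        "compute": flops / exec_time_s / peaks["flops_per_s"],
        "memory": bytes_moved / exec_time_s / peaks["bytes_per_s"],
    }
```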
Reference graph
Works this paper leans on
-
[1]
2024 united states data center energy usage report,
S. Arman, A. Newkirk, S. J. Smith, A. Hubbard, N. Lei, M. A. B. Siddik, B. Holecek, J. Koomey, E. Masanet, and D. Sartor, “2024 united states data center energy usage report,” Lawrence Berkeley National Laboratory, Tech. Rep., 2024
2024
-
[2]
Ai has high data center energy costs — but there are solutions,
B. Stackpole, “Ai has high data center energy costs — but there are solutions,” 2025. [Online]. Available: https://mitsloan.mit.edu/ideas-made-to-matter/ai-has-high-data-center-energy-costs-there-are-solutions
2025
-
[3]
Towards power efficiency in deep learning on data center hardware,
M. Hodak, M. Gorkovenko, and A. Dholakia, “Towards power efficiency in deep learning on data center hardware,” in 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 1814–1820
2019
-
[4]
Nvidia h100 tensor core gpu architecture,
NVIDIA, “Nvidia h100 tensor core gpu architecture,” 2023. [Online]. Available: https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c
2023
-
[5]
Nvidia blackwell datasheet,
——, “Nvidia blackwell datasheet,” 2025. [Online]. Available: https://resources.nvidia.com/en-us-blackwell-architecture/datasheet
2025
-
[6]
Know your enemy to save cloud energy: Energy-performance characterization of machine learning serving,
J. Yu, J. Kim, and E. Seo, “Know your enemy to save cloud energy: Energy-performance characterization of machine learning serving,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 842–854
2023
-
[7]
Dynamollm: Designing llm inference clusters for performance and energy efficiency,
J. Stojkovic, C. Zhang, I. n. Goiri, J. Torrellas, and E. Choukse, “Dynamollm: Designing llm inference clusters for performance and energy efficiency,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, Mar. 2025, p. 1348–1362. [Online]. Available: http://dx.doi.org/10.1109/HPCA61900.2025.00102
-
[8]
Characterizing power management opportunities for llms in the cloud,
P. Patel, E. Choukse, C. Zhang, I. n. Goiri, B. Warrier, N. Mahalingam, and R. Bianchini, “Characterizing power management opportunities for llms in the cloud,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS ’24. New York, NY, USA: Association for Comp...
-
[9]
Kepler: A framework to calculate the energy consumption of containerized applications,
M. Amaral, H. Chen, T. Chiba, R. Nakazawa, S. Choochotkaew, E. K. Lee, and T. Eilam, “Kepler: A framework to calculate the energy consumption of containerized applications,” in 2023 IEEE 16th International Conference on Cloud Computing (CLOUD), 2023, pp. 69–71
2023
-
[10]
Energy-aware tile size selection for affine programs on gpus,
M. Jayaweera, M. Kong, Y. Wang, and D. Kaeli, “Energy-aware tile size selection for affine programs on gpus,” in 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2024, pp. 13–27
2024
-
[11]
Zeus: Understanding and optimizing GPU energy consumption of DNN training,
J. You, J.-W. Chung, and M. Chowdhury, “Zeus: Understanding and optimizing GPU energy consumption of DNN training,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). Boston, MA: USENIX Association, Apr. 2023, pp. 119–. [Online]. Available: https://www.usenix.org/conference/nsdi23/presentation/you
2023
-
[13]
Reducing energy bloat in large model training,
J.-W. Chung, Y. Gu, I. Jang, L. Meng, N. Bansal, and M. Chowdhury, “Reducing energy bloat in large model training,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, ser. SOSP ’24. ACM, Nov. 2024, p. 144–159. [Online]. Available: http://dx.doi.org/10.1145/3694715.3695970
-
[14]
Gpuwattch: enabling energy optimizations in gpgpus,
J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “Gpuwattch: enabling energy optimizations in gpgpus,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA ’13. New York, NY, USA: Association for Computing Machinery, 2013, p. 487–498. [Online]. Available: https://doi.org...
-
[15]
Accelwattch: A power modeling framework for modern gpus,
V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas, “Accelwattch: A power modeling framework for modern gpus,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 738–753. [Online]. Available:...
-
[16]
Understanding the future of energy efficiency in multi-module gpus,
A. Arunkumar, E. Bolotin, D. Nellans, and C.-J. Wu, “Understanding the future of energy efficiency in multi-module gpus,” in2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 519–532
2019
-
[17]
An integrated gpu power and performance model,
S. Hong and H. Kim, “An integrated gpu power and performance model,” SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 280–289, Jun. 2010. [Online]. Available: https://doi.org/10.1145/1816038.1815998
-
[18]
Gpgpu power modeling for multi-domain voltage-frequency scaling,
J. Guerreiro, A. Ilic, N. Roma, and P. Tomas, “Gpgpu power modeling for multi-domain voltage-frequency scaling,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018, pp. 789–800
2018
-
[19]
Gpgpu performance and power estimation using machine learning,
G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou, “Gpgpu performance and power estimation using machine learning,” in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 564–576
2015
-
[20]
Performance-aware energy-efficient gpu frequency selection using dnn-based models,
G. Ali, M. Side, S. Bhalachandra, N. J. Wright, and Y. Chen, “Performance-aware energy-efficient gpu frequency selection using dnn-based models,” in Proceedings of the 52nd International Conference on Parallel Processing, ser. ICPP ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 433–442. [Online]. Available: https://doi.org/10.1145/...
-
[21]
Improving gpu energy efficiency through an application-transparent frequency scaling policy with performance assurance,
Y. Zhang, Q. Wang, Z. Lin, P. Xu, and B. Wang, “Improving gpu energy efficiency through an application-transparent frequency scaling policy with performance assurance,” in Proceedings of the Nineteenth European Conference on Computer Systems, ser. EuroSys ’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 769–785. [Online]. Available: ...
-
[22]
Minimizing power consumption in digital cmos circuits,
A. Chandrakasan and R. Brodersen, “Minimizing power consumption in digital cmos circuits,”Proceedings of the IEEE, vol. 83, no. 4, pp. 498–523, 1995
1995
-
[23]
Accel-sim: An extensible simulation framework for validated gpu modeling,
M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-sim: An extensible simulation framework for validated gpu modeling,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 473–486
2020
-
[24]
Analyzing machine learning workloads using a detailed gpu simulator,
J. Lew, D. A. Shah, S. Pati, S. Cattell, M. Zhang, A. Sandhupatla, C. Ng, N. Goli, M. D. Sinclair, T. G. Rogers, and T. M. Aamodt, “Analyzing machine learning workloads using a detailed gpu simulator,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 151–152
2019
-
[25]
Modeling deep learning accelerator enabled gpus,
M. A. Raihan, N. Goli, and T. M. Aamodt, “Modeling deep learning accelerator enabled gpus,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, Mar. 2019, p. 79–92. [Online]. Available: http://dx.doi.org/10.1109/ISPASS.2019.00016
-
[26]
Principal kernel analysis: A tractable methodology to simulate scaled gpu workloads,
C. Avalos Baddouh, M. Khairy, R. N. Green, M. Payer, and T. G. Rogers, “Principal kernel analysis: A tractable methodology to simulate scaled gpu workloads,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 724–737. [Online]. Available: https://...
-
[27]
Allegro: GPU simulation acceleration for machine learning workloads,
E. Chung, S. Na, and H. Kim, “Allegro: GPU simulation acceleration for machine learning workloads,” in Machine Learning for Computer Architecture and Systems 2024, 2024. [Online]. Available: https://openreview.net/forum?id=aYbb7xZuu6
2024
-
[28]
Path forward beyond simulators: Fast and accurate gpu execution time prediction for dnn workloads,
Y. Li, Y. Sun, and A. Jog, “Path forward beyond simulators: Fast and accurate gpu execution time prediction for dnn workloads,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 380–394. [Online]. Available: https://doi.org/10.1145/36...
-
[29]
Forecasting gpu performance for deep learning training and inference,
S. Lee, A. Phanishayee, and D. Mahajan, “Forecasting gpu performance for deep learning training and inference,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’25. ACM, Mar. 2025, p. 493–508. [Online]. Available: http://dx.doi.org/10.1145/3669940.3707265
-
[30]
Habitat: A runtime-based computational performance predictor for deep neural network training,
G. X. Yu, Y. Gao, P. Golikov, and G. Pekhimenko, “Habitat: A runtime-based computational performance predictor for deep neural network training,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, Jul. 2021, pp. 503–521. [Online]. Available: https://www.usenix.org/conference/atc21/presentation/yu
2021
-
[31]
1.1 computing’s energy problem (and what we can do about it),
M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10–14
2014
-
[32]
Cuda c++ programming guide release 13.0,
NVIDIA, “Cuda c++ programming guide release 13.0,” 2025. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
2025
-
[33]
Nsight compute documentation v2025.3.0,
——, “Nsight compute documentation v2025.3.0,” 2025. [Online]. Available: https://docs.nvidia.com/nsight-compute/index.html
2025
-
[34]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017
2017
-
[35]
Cutlass 4.1.0,
NVIDIA, “Cutlass 4.1.0,” 2025. [Online]. Available: https://docs.nvidia.com/cutlass/overview.html
2025
-
[36]
cublas release 13.0,
——, “cublas release 13.0,” 2025. [Online]. Available: https://docs.nvidia.com/cuda/pdf/CUBLAS Library.pdf
2025
-
[37]
Delta: Gpu performance model for deep learning applications with in-depth memory system traffic analysis,
S. Lym, D. Lee, M. O’Connor, N. Chatterjee, and M. Erez, “Delta: Gpu performance model for deep learning applications with in-depth memory system traffic analysis,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, Mar. 2019, p. 293–303. [Online]. Available: http://dx.doi.org/10.1109/ISPASS.2019.00041
-
[38]
Llmcompass: Enabling efficient hardware design for large language model inference,
H. Zhang, A. Ning, R. B. Prabhakar, and D. Wentzlaff, “Llmcompass: Enabling efficient hardware design for large language model inference,” in Proceedings of the 51st Annual International Symposium on Computer Architecture, ser. ISCA ’24. IEEE Press, 2025, p. 1080–1096. [Online]. Available: https://doi.org/10.1109/ISCA59077.2024.00082
-
[39]
Flashattention: Fast and memory-efficient exact attention with io-awareness,
T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” 2022
2022
-
[40]
Nvml reference manual,
NVIDIA, “Nvml reference manual,” 2025. [Online]. Available: https://docs.nvidia.com/deploy/nvml-api/index.html
2025
-
[41]
Bert: Pre-training of deep bidirectional transformers for language understanding,
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018
2018
-
[42]
Language models are unsupervised multitask learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:160025533
2019
-
[43]
Opt: Open pre-trained transformer language models,
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “Opt: Open pre-trained transformer language models,” 2022
2022
-
[44]
Qwen2 technical report,
A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T...
2024
-
[45]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun. 2016, p. 770–778. [Online]. Available: http://dx.doi.org/10.1109/cvpr.2016.90
-
[46]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020
2020
-
[47]
Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,
S. Mehta and M. Rastegari, “Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,” 2021
2021
-
[48]
Sparseloop: An analytical approach to sparse tensor accelerator modeling,
Y. N. Wu, P.-A. Tsai, A. Parashar, V. Sze, and J. S. Emer, “Sparseloop: An analytical approach to sparse tensor accelerator modeling,” in ACM/IEEE International Symposium on Microarchitecture (MICRO), 2022
2022
-
[49]
Roofline: An insightful visual performance model for multicore architectures,
S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”Commun. ACM, vol. 52, no. 4, p. 65–76, Apr. 2009. [Online]. Available: https://doi.org/10.1145/1498765.1498785
-
[50]
A roofline model of energy,
J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc, “A roofline model of energy,” in2013 IEEE 27th International Symposium on Parallel and Distributed Processing. IEEE, 2013, pp. 661–672
2013
-
[51]
An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness,
S. Hong and H. Kim, “An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness,” SIGARCH Comput. Archit. News, vol. 37, no. 3, p. 152–163, Jun. 2009. [Online]. Available: https://doi.org/10.1145/1555815.1555775
-
[52]
Timeloop: A systematic approach to dnn accelerator evaluation,
A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A systematic approach to dnn accelerator evaluation,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 304–315
2019
-
[54]
How to access: Please download the artifacts from the Github: https://github.com/kyungmi-lee/energaizer-ispass26-artifact or the Zenodo archive: 10.5281/zenodo.18916559
-
[55]
The artifacts require Python3 and Anaconda/Miniconda Virtual Environments
Software dependencies: The provided bash scripts can be executed in Linux or Mac OS environments. The artifacts require Python3 and Anaconda/Miniconda Virtual Environments. D. Installation: The installation has two steps: 1) download the pre-collected database for reproducing the results, and 2) building a virtual environment with dependent libraries. Bot...