Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO
Pith reviewed 2026-05-10 14:58 UTC · model grok-4.3
The pith
StableHLO can serve as a single portable representation for predicting distributed ML workload performance across GPUs, TPUs, and varied simulators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study establishes a StableHLO-based simulation methodology that maps a single workload representation onto multiple performance models, spanning analytical, profiling-based, and simulator-driven predictors. Using this methodology, workloads are evaluated across GPUs and TPUs without requiring access to scaled-out physical systems, enabling systematic comparison across modeling fidelities. An empirical evaluation covering distributed GEMM kernels, ResNet, and large language model training workloads demonstrates that StableHLO preserves relative performance trends across architectures and fidelities, while exposing accuracy trade-offs and simulator limitations. Across evaluated scenarios, prediction errors remain within practical bounds for early-stage design exploration, and the methodology reveals fidelity-dependent limitations in existing GPU simulators.
What carries the argument
StableHLO dialect as a unified, portable representation of ML operations that feeds into multiple separate performance predictors.
If this is right
- Workloads can be compared across GPUs and TPUs using one description and without building full hardware clusters.
- Different prediction methods can be applied to the same StableHLO form to reveal accuracy differences.
- Simulator shortcomings become visible when the same workload runs through multiple modeling paths.
- Reusable workflows emerge for checking performance during early ML system design stages.
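To make the "one description, many predictors" idea concrete, the simplest predictor in the family is an analytical roofline-style estimate for a single GEMM. The sketch below is illustrative only; the peak-throughput and bandwidth numbers are invented placeholders, not the paper's calibrated hardware parameters.

```python
# Illustrative analytical (roofline-style) GEMM predictor.
# All hardware numbers below are hypothetical, for illustration only.
def gemm_time_roofline(m, n, k, peak_flops, mem_bw, bytes_per_elem=2):
    """Lower-bound time as the max of compute time and memory-traffic time."""
    flops = 2 * m * n * k                               # multiply-adds
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return max(flops / peak_flops, traffic / mem_bw)

# Hypothetical accelerator configs (FLOP/s, bytes/s).
gpu = dict(peak_flops=300e12, mem_bw=2.0e12)
tpu = dict(peak_flops=275e12, mem_bw=1.2e12)

t_gpu = gemm_time_roofline(8192, 8192, 8192, **gpu)
t_tpu = gemm_time_roofline(8192, 8192, 8192, **tpu)
```

At this size both estimates are compute-bound, so the predicted relative ordering tracks peak throughput; a higher-fidelity predictor applied to the same StableHLO workload could then be cross-checked against this baseline.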
Where Pith is reading between the lines
- Teams could iterate on hardware choices faster by testing many configurations from one workload file before any physical runs.
- Cross-checking outputs from different simulators against each other might expose and reduce individual tool biases.
- The method could extend naturally to additional accelerator types once their simulators accept StableHLO input.
- Adoption might lower the barrier for comparing new distributed training strategies across mixed hardware environments.
Load-bearing premise
Converting workloads into StableHLO keeps the performance traits that matter for predicting relative differences across architectures and simulators.
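One performance trait the representation must preserve is collective-communication cost. A standard alpha-beta estimate for a ring all-reduce, using invented link parameters (this is a textbook model, not the paper's), illustrates the kind of quantity a predictor derived from StableHLO collectives must reproduce.

```python
# Textbook cost model for a ring all-reduce (not the paper's model):
# a ring all-reduce moves 2*(n-1)/n of the buffer over each device's link.
def ring_allreduce_time(size_bytes, n_devices, link_bw, link_latency=0.0):
    """Alpha-beta estimate: per-step latency term plus bandwidth term."""
    steps = 2 * (n_devices - 1)                         # reduce-scatter + all-gather
    volume = 2 * (n_devices - 1) / n_devices * size_bytes
    return steps * link_latency + volume / link_bw

# Hypothetical setup: 1 GiB gradient buffer, 8 devices, 100 GB/s links, 5 us hops.
t = ring_allreduce_time(2**30, 8, 100e9, 5e-6)
```

If lowering to StableHLO dropped the collective's participant group or payload size, estimates like this one would diverge across architectures, which is exactly the failure mode the premise rules out.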
What would settle it
Direct measurements on physical multi-GPU and multi-TPU clusters that show the StableHLO-based models fail to match actual relative speedups or efficiencies for the tested workloads.
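The claim is about relative trends, so the decisive comparison is ordinal: does a predictor rank configurations the same way the hardware does? A minimal sketch of that check, with invented timings, is a pairwise rank-agreement test.

```python
# Minimal trend-preservation check: do a predictor and real measurements
# order every pair of configurations identically? Numbers are invented.
def same_ranking(pred, meas):
    """True iff every pair of configs is ordered the same by both series."""
    idx = range(len(pred))
    return all(
        (pred[i] < pred[j]) == (meas[i] < meas[j])
        for i in idx for j in idx if i < j
    )

# Hypothetical predicted vs. measured step times (seconds) for four configs.
predicted = [0.012, 0.019, 0.031, 0.055]
measured = [0.014, 0.022, 0.029, 0.060]

trend_preserved = same_ranking(predicted, measured)
```

A single inverted pair on a physical multi-GPU or multi-TPU cluster would be the kind of counterexample that unsettles the claim.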
read the original abstract
Predicting the performance of large-scale distributed machine learning (ML) workloads across multiple accelerator architectures remains a central challenge in ML system design. Existing GPU and TPU focused simulators are typically architecture-specific, while distributed training simulators rely on workload-specific analytical models or costly post-execution traces, limiting portability and cross-platform comparison. This work evaluates whether MLIR's StableHLO dialect can serve as a unified workload representation for cross-architecture and cross-fidelity performance modeling of distributed ML workloads. The study establishes a StableHLO-based simulation methodology that maps a single workload representation onto multiple performance models, spanning analytical, profiling-based, and simulator-driven predictors. Using this methodology, workloads are evaluated across GPUs and TPUs without requiring access to scaled-out physical systems, enabling systematic comparison across modeling fidelities. An empirical evaluation covering distributed GEMM kernels, ResNet, and large language model training workloads demonstrates that StableHLO preserves relative performance trends across architectures and fidelities, while exposing accuracy trade-offs and simulator limitations. Across evaluated scenarios, prediction errors remain within practical bounds for early-stage design exploration, and the methodology reveals fidelity-dependent limitations in existing GPU simulators. These results indicate that StableHLO provides a viable foundation for unified, distributed ML performance modeling across accelerator architectures and simulators, supporting reusable evaluation workflows and cross-validation throughout the ML system design process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates MLIR's StableHLO dialect as a unified representation for cross-architecture performance modeling of distributed ML workloads. It maps workloads (distributed GEMM, ResNet, LLM training) to StableHLO and applies them to analytical, profiling-based, and simulator-driven predictors spanning GPUs and TPUs. Empirical results show preserved relative performance trends across architectures and modeling fidelities, with prediction errors within practical bounds for early design, while exposing limitations in existing GPU simulators. The authors conclude that StableHLO enables reusable evaluation workflows and cross-validation in ML system design.
Significance. If substantiated, this provides a meaningful advance for portable performance modeling in distributed ML systems by reducing reliance on architecture-specific tools. The empirical breadth across workloads and fidelities, plus identification of simulator limitations, offers practical value for early-stage design exploration. The use of a standard dialect supports reproducibility and could standardize cross-architecture comparisons in the ML systems community.
major comments (2)
- [Methodology] The central claim that StableHLO mappings preserve relative performance trends without losing distributed execution details (communication patterns, memory hierarchies, scaling) is load-bearing for the viability conclusion. The methodology section must explicitly describe how StableHLO represents collectives and tensor layouts, with direct comparisons to native simulator inputs, to demonstrate that observed trend preservation is not an artifact of the selected workloads.
- [Empirical Evaluation] The abstract states that prediction errors remain within practical bounds, but the evaluation section must report concrete error metrics (e.g., MAPE or absolute relative error per workload), simulator configurations, data exclusion rules, and per-scenario results (including any tables comparing predictions to ground truth). Without these, the support for the 'practical bounds for early-stage design' claim cannot be verified.
minor comments (2)
- Define all performance predictor notations and fidelity levels in a dedicated table or glossary for clarity across the results.
- Include an appendix with full simulator versions, StableHLO dialect commit, and workload mapping scripts to support reproducibility.
Simulated Author's Rebuttal
Thank you for the opportunity to revise our manuscript based on the referee's insightful comments. We address each major comment below and have made revisions to improve clarity and substantiation of our claims.
read point-by-point responses
- Referee: [Methodology] The central claim that StableHLO mappings preserve relative performance trends without losing distributed execution details (communication patterns, memory hierarchies, scaling) is load-bearing for the viability conclusion. The methodology section must explicitly describe how StableHLO represents collectives and tensor layouts, with direct comparisons to native simulator inputs, to demonstrate that observed trend preservation is not an artifact of the selected workloads.
Authors: We agree with this assessment and have revised the Methodology section to provide a detailed description of StableHLO's representation of collectives and tensor layouts. Specifically, we now include explanations of how operations like AllReduce and AllGather are mapped, along with tensor sharding and layout specifications. We also added direct comparisons between StableHLO-derived inputs and native simulator configurations for the GPU and TPU models, using the distributed GEMM and ResNet workloads as examples. These additions confirm that the trend preservation is not an artifact but stems from faithful representation of distributed aspects. revision: yes
- Referee: [Empirical Evaluation] The abstract states that prediction errors remain within practical bounds, but the evaluation section must report concrete error metrics (e.g., MAPE or absolute relative error per workload), simulator configurations, data exclusion rules, and per-scenario results (including any tables comparing predictions to ground truth). Without these, the support for the 'practical bounds for early-stage design' claim cannot be verified.
Authors: We have updated the Empirical Evaluation section to include the requested details. Concrete metrics such as Mean Absolute Percentage Error (MAPE) are now reported for each workload (distributed GEMM, ResNet, LLM training) across architectures and modeling fidelities. Simulator configurations, including versions and key parameters, are specified, along with data exclusion rules (e.g., outlier handling). New tables have been added showing per-scenario prediction vs. ground truth comparisons. These revisions provide verifiable support for the practical bounds claim. revision: yes
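For concreteness, the MAPE metric the referee requests is simple to state in code. The sketch below uses invented prediction and ground-truth values, not results from the paper.

```python
# Sketch of the Mean Absolute Percentage Error (MAPE) metric the referee
# asks for; the prediction and ground-truth values are invented placeholders.
def mape(predicted, measured):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(
        abs(p - m) / abs(m) for p, m in zip(predicted, measured)
    ) / len(measured)

# Hypothetical per-scenario predicted vs. measured step times (ms).
pred = [12.1, 19.4, 30.8]
true = [11.0, 20.0, 33.0]

err = mape(pred, true)  # about 6.6% with these numbers
```

Reporting this per workload, architecture, and fidelity level, as the revision promises, is what would make the 'practical bounds' claim checkable.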
Circularity Check
No circularity in empirical evaluation of StableHLO mappings
full rationale
The paper presents an empirical methodology that maps distributed ML workloads (GEMM, ResNet, LLM training) to StableHLO and evaluates preservation of relative performance trends across GPU/TPU simulators and analytical models. Claims rest on direct experimental comparisons of prediction errors and trend fidelity against external benchmarks, without any derivations, equations, fitted parameters redefined as predictions, or self-citation chains that reduce the central viability conclusion to its own inputs by construction. The evaluation is self-contained against measured data and cross-fidelity checks.
Reference graph
Works this paper leans on
- [1] J. Sevilla and E. Roldán, "Training compute of frontier AI models grows by 4-5x per year," 2024. Available: https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year (accessed 2024-12-03).
- [2] S. Yun, S. Park, H. Nam, Y. Lee, G. Lee, K. Kyung, S. Kim, N. S. Kim, J. Kim, H. Kim et al., "The new LLM bottleneck: A systems perspective on latent attention and mixture-of-experts," arXiv preprint arXiv:2507.15465, 2025.
- [3] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," arXiv preprint arXiv:1701.06538, 2017.
- [4] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," in First Conference on Language Modeling, 2024.
- [5] Z. Li, X. Wu, H. Du, F. Liu, H. Nghiem, and G. Shi, "A survey of state of the art large vision language models: Benchmark evaluations and challenges," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1587–1606.
- [6] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," Advances in Neural Information Processing Systems, vol. 35, pp. 16344–16359, 2022.
- [7] P. Tillet, H.-T. Kung, and D. Cox, "Triton: An intermediate language and compiler for tiled neural network computations," in Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019, pp. 10–19.
- [8] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
- [9] J. Khan, P. Fultz, A. Tamazov, D. Lowell, C. Liu, M. Melesse, M. Nandhimandalam, K. Nasyrov, I. Perminov, T. Shah et al., "MIOpen: An open source library for deep learning primitives," arXiv preprint arXiv:1910.00078, 2019.
- [10] J. Li, Z. Qin, Y. Mei, J. Cui, Y. Song, C. Chen, Y. Zhang, L. Du, X. Cheng, B. Jin et al., "oneDNN graph compiler: A hybrid approach for high-performance deep learning compilation," in 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2024, pp. 460–470.
- [11] A. Sabne, "XLA: Compiling machine learning for peak performance," 2020.
- [12] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski et al., "PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation," in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume ..., 2024.
- [13] NVIDIA, "NVIDIA Blackwell architecture technical overview," NVIDIA Corporation, Tech. Rep., March 2024. Available: https://resources.nvidia.com/en-us-blackwell-architecture (accessed December 10, 2025).
- [14] AMD, "Introducing AMD CDNA 3 architecture," Advanced Micro Devices, Inc., White Paper 2258402-A, 2023. Available: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf (accessed December 10, 2025).
- [15] Google Cloud, "Ironwood: The first Google TPU for the age of inference," https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/, 2025, updated April 23, 2025 (accessed December 10, 2025).
- [16] OpenXLA Community, "StableHLO specification," https://openxla.org/stablehlo/spec, 2023 (accessed 2025-09-24).
- [17] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, "Accel-Sim: An extensible simulation framework for validated GPU modeling," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 473–486.
- [18] D. Moolchandani, J. Kundu, F. Ruelens, P. Vrancx, T. Evenblij, and M. Perumkunnil, "AMPeD: An analytical model for performance in distributed training of transformers," in 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023, pp. 306–315.
- [19] M. Isaev, N. McDonald, L. Dennison, and R. Vuduc, "Calculon: A methodology and tool for high-level co-design of systems and large language models," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–14.
- [20] J. Kundu, W. Guo, A. BanaGozar, U. De Alwis, S. Sengupta, P. Gupta, and A. Mallik, "Performance modeling and workload analysis of distributed large language model training and inference," arXiv preprint arXiv:2407.14645, 2024.
- [21] G. Lu, R. Chen, Y. Wang, Y. Zhou, R. Zhang, Z. Hu, Y. Miao, Z. Cai, L. Li, J. Leng et al., "DistSim: A performance model of large-scale hybrid distributed DNN training," in Proceedings of the 20th ACM International Conference on Computing Frontiers, 2023, pp. 112–122.
- [22] N. Ardalani, S. Pal, and P. Gupta, "DeepFlow: A cross-stack pathfinding framework for distributed AI systems," ACM Transactions on Design Automation of Electronic Systems, vol. 29, no. 2, pp. 1–20, 2024.
- [23] W. Won, T. Heo, S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, "ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale," in 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023, pp. 283–294.
- [24] X. Wang, Q. Li, Y. Xu, G. Lu, D. Li, L. Chen, H. Zhou, L. Zheng, S. Zhang, Y. Zhu et al., "SimAI: Unifying architecture design and performance tuning for large-scale large language model training with scalability and precision," in 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), 2025, pp. 541–558.
- [25] F. Gui, K. Gao, L. Chen, D. Li, V. Liu, R. Zhang, H. Yang, and D. Xiong, "Accelerating design space exploration for LLM training systems with multi-experiment parallel simulation," in 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), 2025, pp. 473–488.
- [26] J. Bang, Y. Choi, M. Kim, Y. Kim, and M. Rhu, "vTrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training," in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, pp. 153–167.
- [27] S. Shen, T. Bonato, Z. Hu, P. Jordan, T. Chen, and T. Hoefler, "ATLAHS: An application-centric network simulator toolchain for AI, HPC, and distributed storage," arXiv preprint arXiv:2505.08936, 2025.
- [28] J. Duan, X. Li, P. Xu, X. Zhang, S. Yan, Y. Liang, and D. Lin, "Proteus: Simulating the performance of distributed DNN training," IEEE Transactions on Parallel and Distributed Systems, 2024.
- [29] H. Zhu, A. Phanishayee, and G. Pekhimenko, "Daydream: Accurately estimating the efficacy of optimizations for DNN training," in 2020 USENIX Annual Technical Conference (USENIX ATC 20), 2020, pp. 337–352.
- [30] Y. Feng, Y. Chen, K. Chen, J. Li, T. Wu, P. Cheng, C. Wu, W. Wang, T.-Y. Ho, and H. Xu, "Echo: Simulating distributed training at scale," arXiv preprint arXiv:2412.12487, 2024.
- [31] H. Zhang, A. Ning, R. B. Prabhakar, and D. Wentzlaff, "LLMCompass: Enabling efficient hardware design for large language model inference," in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024, pp. 1080–1096.
- [32] S. Sridharan, T. Heo, L. Feng, Z. Wang, M. Bergeron, W. Fu, S. Zheng, B. Coutinho, S. Rashidi, C. Man et al., "Chakra: Advancing performance benchmarking and co-design using standardized execution traces," arXiv preprint arXiv:2305.14516, 2023.
- [33] J. Cho, M. Kim, H. Choi, G. Heo, and J. Park, "LLMServingSim: A HW/SW co-simulation infrastructure for LLM inference serving at scale," arXiv preprint arXiv:2408.05499, 2024.
- [34] T. Hoefler, C. Siebert, and A. Lumsdaine, "Group operation assembly language - a flexible way to express collective communication," in 2009 International Conference on Parallel Processing, 2009, pp. 574–581.
- [35] K. Santhanam, S. Krishna, R. Tomioka, A. Fitzgibbon, and T. Harris, "DistIR: An intermediate representation for optimizing distributed neural networks," in Proceedings of the 1st Workshop on Machine Learning and Systems, 2021, pp. 15–23.
- [36] J. Reed, Z. DeVito, H. He, A. Ussery, and J. Ansel, "torch.fx: Practical program capture and transformation for deep learning in Python," Proceedings of Machine Learning and Systems, vol. 4, pp. 638–651, 2022.
- [37] S. Lee, A. Phanishayee, and D. Mahajan, "Forecasting GPU performance for deep learning training and inference," in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 493–508.
- [38] J. Bai, F. Lu, K. Zhang et al., "ONNX: Open Neural Network Exchange," https://github.com/onnx/onnx, 2019.
[39]
Onnxim: A fast, cycle-level multi-core npu simulator,
H. Ham, W. Yang, Y . Shin, O. Woo, G. Heo, S. Lee, J. Park, and G. Kim, “Onnxim: A fast, cycle-level multi-core npu simulator,”arXiv preprint arXiv:2406.08051, 2024, available at https://arxiv.org/abs/2406.08051
-
[40]
Zigzag: A memory-centric rapid dnn accelerator design space exploration framework,
L. Mei, P. Houshmand, V . Jain, S. Giraldo, and M. Verhelst, “Zigzag: A memory-centric rapid dnn accelerator design space exploration framework,”arXiv preprint arXiv:2007.11360, 2020
-
[41]
Xla architecture and high-level optimizer (hlo),
OpenXLA Contributors, “Xla architecture and high-level optimizer (hlo),” https://openxla.org/xla/architecture, 2024, accessed: 2025-05-14
2024
-
[42]
Stablehlo: A portable, stable, and versioned ir for ml workloads,
OpenXLA Community, “Stablehlo: A portable, stable, and versioned ir for ml workloads,” https://github.com/openxla/stablehlo, 2023, accessed: 2025-09-24
2023
-
[43]
’sdy’ dialect — openxla project,
“’sdy’ dialect — openxla project,” https://openxla.org/shardy/sdy dialect, accessed: 2025-12-10
2025
-
[44]
Mlir: A compiler infrastructure for the end of moore’s law,
C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, A. Pienaar, R. Riddle, and T. Shpeisman, “Mlir: A compiler infrastructure for the end of moore’s law,”Proceedings of the IEEE, 2021
2021
-
[45]
Maxtext: A simple, performant llm library in jax,
AI-Hypercomputer, “Maxtext: A simple, performant llm library in jax,” https://github.com/AI-Hypercomputer/maxtext, 2025, accessed: 2025-09- 30
2025
- [46] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "The structural simulation toolkit," SIGMETRICS Perform. Eval. Rev., vol. 38, no. 4, pp. 37–42, Mar. 2011. Available: https://doi.org/10.1145/1964218.1964225
- [47] Google Research, Brain Team, "JAX: Composable transformations of Python+NumPy programs," https://github.com/google/jax, 2018 (accessed 2025-09-30).
- [48] J. Heek, A. Levskaya, A. Oliver, M. Ritter, B. Rondepierre, A. Steiner, and M. van Zee, "Flax: A neural network library and ecosystem for JAX," 2024. Available: http://github.com/google/flax
- [49] OpenXLA Contributors, "xprof: A profiling and performance analysis tool for machine learning," GitHub repository, Apache-2.0 license. Available: https://github.com/openxla/xprof
- [50] A. G. Lewis, J. Beall, M. Ganahl, M. Hauru, S. B. Mallick, and G. Vidal, "Large-scale distributed linear algebra with tensor processing units," Proceedings of the National Academy of Sciences, vol. 119, no. 33, p. e2122762119, 2022.
- [51] C. Yang, "The current and future of roofline," 2019. Available: https://www.nersc.gov/assets/Uploads/Talk-LBNL-BrownBagSeminar-2019.pdf
- [52] M. Choudhary, C. Kjellqvist, J. Ma, and L. W. Wills, "Cocossim: A cycle-accurate simulator for heterogeneous systolic array architectures," in 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2025, pp. 174–185.
- [53] N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, D. A. Patterson et al., "A domain-specific supercomputer for training deep neural networks," Communications of the ACM, vol. 63, no. 7, pp. 67–78, 2020.
- [54] J. Kim, W. J. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," ACM SIGARCH Computer Architecture News, vol. 36, no. 3, pp. 77–88, 2008.
- [55] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
- [56] R. Raj, S. Banerjee, N. Chandra, Z. Wan, J. Tong, A. Samajdar, and T. Krishna, "SCALE-Sim v3: A modular cycle-accurate systolic accelerator simulator for end-to-end system analysis," arXiv preprint arXiv:2504.15377, 2025. Available: https://arxiv.org/abs/2504.15377
- [57] L. Mei, P. Houshmand, V. Jain, S. Giraldo, and M. Verhelst, "ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators," IEEE Transactions on Computers, vol. 70, no. 8, pp. 1160–1174, 2021.
- [58] The IREE Authors, B. Vanik, and S. Laurenzo, "IREE: An MLIR-based compiler and runtime for machine learning models," 2019. Available: https://github.com/iree-org/iree
- [59] W. Yang, Y. Shin, O. Woo, G. Park, H. Ham, J. Kang, J. Park, and G. Kim, "PyTorchSim: A comprehensive, fast, and accurate NPU simulation framework," in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO '25), New York, NY, USA: Association for Computing Machinery, 2025, pp. 1363–1380. Available: https://doi...
- [60] How to access: The artifact for results reproduction is available at https://doi.org/10.5281/zenodo.18874691. The source code is planned to be open-sourced on GitHub at https://github.com/imec-int/hespas.
- [61] Hardware dependencies: The core simulation framework can be run on any Linux-based system. To reproduce the profiling-based estimator results, matching GPU and TPU hardware is required as specified in Table IV and Figure 5.
- [62] Software dependencies: The core simulation framework is written in Python 3.10+. For network simulation, it uses ASTRA-sim and therefore depends on two repositories: ASTRA-sim [23] for network modeling, and Chakra [32] as the trace format input for ASTRA-sim. The following systolic array estimators are utilized as compute estimators in the TPU ...
- [63] Data sets: The simulation framework takes two types of inputs: system configuration files and StableHLO ML workloads. All datasets are included in the release package. Installation: The framework is released as a self-contained package. Upon extraction, the root directory contains a single bash script that installs all required dependencies and executes t...