On Similarity of Computational Kernels in our Codes and Proxies
Pith reviewed 2026-05-11 01:05 UTC · model grok-4.3
The pith
Performance similarity metrics based on hardware usage patterns correctly match a kernel from the Kripke proxy application to one in the RAJA Performance Suite.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By defining two broad categories of kernels that share performance traits, and by computing pairwise similarity scores from hardware usage data, the authors show that a kernel in Kripke aligns with a kernel in the RAJA Performance Suite on both CPU and GPU platforms, validating the metrics as a way to assess how well benchmarks represent full codes.
What carries the argument
Pairwise performance similarity metrics derived from hardware usage patterns, which categorize kernels into two groups exhibiting comparable performance behavior.
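The paper does not reproduce its metric formula here, but the idea can be sketched as a vector comparison: each kernel is summarized by a vector of hardware-usage measurements, and pairwise similarity is a distance or angle between those vectors. The sketch below uses cosine similarity as one plausible instantiation; the counter values and kernel names (`LTIMES`, `DAXPY`) are illustrative assumptions, not data from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two hardware-usage vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match(target, candidates):
    """Return the name of the candidate kernel most similar to the target."""
    return max(candidates, key=lambda name: cosine_similarity(target, candidates[name]))

# Hypothetical usage vectors, e.g. (FLOP intensity, memory bandwidth, cache hit rate)
kripke_kernel = [0.8, 0.6, 0.9]
raja_kernels = {
    "DAXPY":  [0.2, 0.9, 0.5],
    "LTIMES": [0.82, 0.58, 0.88],
}
print(best_match(kripke_kernel, raja_kernels))  # → LTIMES
```

The same comparison runs unchanged on CPU and GPU data as long as both platforms expose a comparable set of usage measurements, which is consistent with the paper's dual-platform claim.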
If this is right
- Comparison of codes and their proxy representations no longer requires labor-intensive manual review.
- Benchmark suites can be checked for coverage of real application performance traits on emerging hardware.
- Hardware designers and code developers gain a scalable way to evaluate how well proxies capture production code behavior.
- The approach works for both CPU-only and GPU-accelerated systems using the same metrics.
Where Pith is reading between the lines
- The method could be extended to build libraries of equivalent kernels across many more proxy applications.
- If validated further, it would allow quicker screening of new architectures by running only matched proxy kernels instead of full codes.
- Adding memory hierarchy or communication pattern data to the metrics might strengthen matches for communication-heavy kernels.
Load-bearing premise
Hardware usage patterns alone, without full runtime profiling or application-specific details, are enough to decide whether two kernels will show equivalent performance.
What would settle it
A case where two kernels receive a high similarity score from the metrics yet display clearly different run times, scaling behavior, or resource bottlenecks when executed on identical hardware.
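That falsification test is mechanical to run once runtime data exists. The sketch below, under assumed thresholds and illustrative kernel names (nothing here comes from the paper), flags any pair the metric scores as highly similar whose measured runtimes nonetheless diverge:

```python
def find_counterexamples(pairs, sim_threshold=0.9, runtime_ratio_limit=1.5):
    """Return pairs with a high similarity score but dissimilar measured runtimes.

    Each entry is (name_a, name_b, similarity, runtime_a, runtime_b).
    Thresholds are illustrative assumptions, not values from the paper.
    """
    counterexamples = []
    for name_a, name_b, similarity, runtime_a, runtime_b in pairs:
        ratio = max(runtime_a, runtime_b) / min(runtime_a, runtime_b)
        if similarity >= sim_threshold and ratio > runtime_ratio_limit:
            counterexamples.append((name_a, name_b))
    return counterexamples

measurements = [
    ("kripke.LTimes", "raja.LTIMES", 0.97, 1.02, 1.05),  # consistent pair
    ("kripke.Sweep",  "raja.DAXPY",  0.95, 1.00, 3.40),  # would falsify the claim
]
print(find_counterexamples(measurements))  # → [('kripke.Sweep', 'raja.DAXPY')]
```

An empty result over a broad sample of kernel pairs would support the load-bearing premise; a single flagged pair would refute it.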
Original abstract
As high-performance computing (HPC) systems rapidly evolve, with increasing on-node parallelism and widespread use of accelerators, understanding how the code maps to hardware is essential for reaching optimal performance. Benchmarks are commonly used for early assessment of emerging architectures (as well as for informing the design of future hardware), but it is often unknown how well the benchmarks represent the performance characteristics of simulation codes. Existing methods for evaluating how well our benchmarks represent our HPC codes are manual, labor intensive, and challenging to scale to many benchmarks. In this paper, we propose performance similarity metrics based on how the code uses the compute hardware. We define and characterize two broad categories of kernels that exhibit similar performance characteristics. We evaluate the pairwise similarity metrics on kernels in the Kripke proxy application and the RAJA Performance Suite, using both a CPU-only system and a GPU-accelerated system. We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite. Our proposed similarity metrics enable assessment of the similarity of computational kernels in our codes and the proxy applications we use to represent the codes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes hardware-usage-based performance similarity metrics to assess how well proxy applications represent the characteristics of HPC simulation codes. It defines and characterizes two broad categories of kernels exhibiting similar performance behavior, evaluates pairwise similarity on kernels from the Kripke proxy application and the RAJA Performance Suite on both CPU-only and GPU-accelerated systems, and claims validation by correctly matching one Kripke kernel to a RAJA kernel.
Significance. If the metrics can be shown to identify performance-equivalent kernels via an independent ground-truth criterion (rather than circular use of the same usage vectors), the approach would provide a scalable alternative to manual proxy validation in HPC, which is a recognized bottleneck. The dual-platform evaluation (CPU and GPU) and use of established suites (Kripke, RAJA) are positive elements that would strengthen applicability if quantitative results were supplied.
Major comments (2)
- [Abstract] The central validation claim ('We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite') supplies no quantitative similarity scores, no definition or formula for the metrics, no error analysis, and no description of the independent correctness criterion (e.g., expert labeling or measured runtime/scaling parity). This renders the primary result unverifiable from the text.
- [Evaluation section] Evaluation (presumed §4–5): No side-by-side runtime, scaling, or application-level outcome data are presented to confirm that the matched kernels behave equivalently under load. Without such external evidence, the claim that hardware-usage patterns alone suffice to establish performance equivalence rests on an untested assumption and cannot support the paper's conclusions.
Minor comments (1)
- [Abstract] The abstract and introduction would benefit from explicit definitions or equations for the two proposed similarity metrics and the two kernel categories before the evaluation is described.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment below and describe the revisions we will incorporate.
Point-by-point responses
Referee: [Abstract] The central validation claim ('We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite') supplies no quantitative similarity scores, no definition or formula for the metrics, no error analysis, and no description of the independent correctness criterion (e.g., expert labeling or measured runtime/scaling parity). This renders the primary result unverifiable from the text.
Authors: We agree the abstract is high-level and omits specifics. The similarity metrics are defined via hardware-usage vector comparisons (detailed with formulas in Section 3), and the validation consists of the metrics producing the highest pairwise score for one specific Kripke-RAJA kernel pair on both CPU and GPU platforms. To improve verifiability, we will revise the abstract to include representative quantitative similarity scores, a brief statement of the metric formulation, and clarification that correctness follows from consistent high-similarity matches across platforms. revision: yes
Referee: [Evaluation section] Evaluation (presumed §4–5): No side-by-side runtime, scaling, or application-level outcome data are presented to confirm that the matched kernels behave equivalently under load. Without such external evidence, the claim that hardware-usage patterns alone suffice to establish performance equivalence rests on an untested assumption and cannot support the paper's conclusions.
Authors: The evaluation section reports the pairwise similarity scores derived from hardware-usage vectors and identifies the matching kernel pair on the basis of those scores. We acknowledge that direct runtime and scaling comparisons would constitute stronger external corroboration. We will add such measurements for the matched kernels (and a few non-matched controls) on both the CPU-only and GPU-accelerated systems in the revised evaluation section. revision: yes
Circularity Check
No significant circularity; metrics defined independently of validation data
Full rationale
The paper defines similarity metrics from hardware usage patterns and states a validation result for kernel matching between Kripke and RAJA without any equations, derivations, or reductions shown. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central claim rests on independent metric definitions and an external-style validation statement rather than any construction that equates outputs to inputs by definition.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Computational kernels can be grouped into two broad categories that exhibit similar performance characteristics based on hardware usage.