pith. machine review for the scientific record.

arxiv: 2605.06968 · v1 · submitted 2026-05-07 · 💻 cs.DC

Recognition: no theorem link

On Similarity of Computational Kernels in our Codes and Proxies

Michael McKinsey, Stephanie Brink, Olga Pearce

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:05 UTC · model grok-4.3

classification 💻 cs.DC
keywords performance similarity metrics · computational kernels · proxy applications · HPC benchmarks · hardware usage patterns · Kripke · RAJA Performance Suite

The pith

Performance similarity metrics based on hardware usage patterns correctly match a kernel from the Kripke proxy application to one in the RAJA Performance Suite.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces metrics that compare computational kernels by how they use hardware resources such as compute units and memory bandwidth. These metrics are tested on kernels from the Kripke proxy and the RAJA Performance Suite across both CPU-only and GPU systems. The evaluation confirms that the metrics identify matching kernels without needing manual inspection of full code behavior. A reader would care because current ways to check if benchmarks represent real simulation codes are slow and do not scale as HPC hardware adds more parallelism and accelerators. If the metrics hold, developers could automatically judge how well proxy apps stand in for production codes when assessing new machines.

Core claim

By defining two broad categories of kernels that share performance traits and computing pairwise similarity scores from hardware usage data, the authors show that one kernel in Kripke aligns with a kernel in RAJA on both CPU and GPU platforms, validating the metrics for assessing how well benchmarks represent full codes.

What carries the argument

Pairwise performance similarity metrics derived from hardware usage patterns, which categorize kernels into two groups exhibiting comparable performance behavior.
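
A minimal sketch of that matching step, assuming cosine similarity over normalized hardware-usage vectors; the metric formulation, the counter choices, and all values here are assumptions for illustration, not the authors' definitions (LTIMES is the kernel pair highlighted in Figure 10):

    import numpy as np

    # Hypothetical per-kernel hardware-usage vectors; each component is a
    # normalized utilization figure (e.g., FLOP rate, DRAM bandwidth, cache
    # hit rate). Values are invented for illustration.
    kernels = {
        "kripke_LTIMES":   np.array([0.42, 0.81, 0.67]),
        "rajaperf_LTIMES": np.array([0.40, 0.85, 0.70]),
        "rajaperf_OTHER":  np.array([0.10, 0.95, 0.30]),
    }

    def similarity(a, b):
        # Cosine similarity: 1.0 means identical usage direction.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Score every candidate against the Kripke kernel and keep the best match.
    target = "kripke_LTIMES"
    scores = {name: similarity(kernels[target], vec)
              for name, vec in kernels.items() if name != target}
    best = max(scores, key=scores.get)
    print(f"best match for {target}: {best} (score {scores[best]:.3f})")

Any vector-space similarity (Euclidean, correlation) would slot in the same way; the load-bearing choice is which hardware counters populate the vectors.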

If this is right

  • Comparison of codes and their proxy representations no longer requires labor-intensive manual review.
  • Benchmark suites can be checked for coverage of real application performance traits on emerging hardware.
  • Hardware designers and code developers gain a scalable way to evaluate how well proxies capture production code behavior.
  • The approach works for both CPU-only and GPU-accelerated systems using the same metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be extended to build libraries of equivalent kernels across many more proxy applications.
  • If validated further, it would allow quicker screening of new architectures by running only matched proxy kernels instead of full codes.
  • Adding memory hierarchy or communication pattern data to the metrics might strengthen matches for communication-heavy kernels.

Load-bearing premise

Hardware usage patterns alone, without full runtime profiling or application-specific details, are enough to decide whether two kernels will show equivalent performance.

What would settle it

A case where two kernels receive a high similarity score from the metrics yet display clearly different run times, scaling behavior, or resource bottlenecks when executed on identical hardware.
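
A sketch of how that test could be run, assuming a similarity cutoff and a runtime-ratio tolerance; both thresholds and all numbers below are hypothetical:

    # Flag kernel pairs whose usage-based similarity is high but whose
    # measured runtimes on the same hardware diverge.
    HIGH_SIMILARITY = 0.95   # assumed cutoff for "the metrics call these equivalent"
    MAX_RUNTIME_RATIO = 1.5  # assumed tolerance before runtimes count as "clearly different"

    # (pair, similarity score, runtime A in seconds, runtime B in seconds)
    observations = [
        ("kripke_LTIMES vs rajaperf_LTIMES", 0.97, 1.02, 1.10),
        ("kernel_X vs kernel_Y",             0.96, 1.00, 2.40),  # would settle it
    ]

    for pair, sim, t_a, t_b in observations:
        ratio = max(t_a, t_b) / min(t_a, t_b)
        if sim >= HIGH_SIMILARITY and ratio > MAX_RUNTIME_RATIO:
            print(f"counterexample: {pair} (similarity {sim}, runtime ratio {ratio:.1f}x)")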

Figures

Figures reproduced from arXiv: 2605.06968 by Michael McKinsey, Stephanie Brink, Olga Pearce.

Figure 2: GPU roofline analysis of RAJA Performance Suite
Figure 3: Minimum memory size for stable performance metrics
Figure 4: Agglomerative clustering using top-down metrics
Figure 5: K-means clustering using top-down metrics
Figure 6: Agglomerative clustering using ncu metrics
Figure 7: K-means clustering using ncu metrics
Figure 8: Agglomerative clustering using top-down and ncu metrics
Figure 9: K-means clustering using top-down and ncu metrics
Figure 10: LTIMES kernel runtime: Kripke implementation vs. RAJA Performance Suite implementations
Figure 11: Comparison of agglomerative and k-means clustering using different metrics and selection criteria
Figure 12: Reference CPU and GPU metric values for all kernels
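
Figures 4 through 9 partition kernels with k-means and agglomerative clustering over top-down and ncu metric vectors. A minimal sketch of that style of analysis, assuming standardized feature vectors and synthetic data in place of the paper's measurements:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # Synthetic stand-ins: memory-bound kernels (high bandwidth use, low
    # FLOP rate) vs compute-bound kernels (the reverse). Sizes echo the
    # observation that the memory-bound cluster holds the most kernels.
    memory_bound = rng.normal([0.9, 0.2], 0.05, size=(12, 2))
    compute_bound = rng.normal([0.3, 0.8], 0.05, size=(5, 2))
    X = StandardScaler().fit_transform(np.vstack([memory_bound, compute_bound]))

    # Two clusters, matching the paper's two broad kernel categories.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    ag = AgglomerativeClustering(n_clusters=2).fit_predict(X)

    # Agreement up to label swap: a quick check that the partition is not
    # an artifact of the choice of clustering algorithm.
    agree = max(np.mean(km == ag), np.mean(km == 1 - ag))
    print(f"k-means / agglomerative agreement: {agree:.0%}")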
read the original abstract

As high-performance computing (HPC) systems rapidly evolve, with increasing on-node parallelism and widespread use of accelerators, understanding how the code maps to hardware is essential for reaching optimal performance. Benchmarks are commonly used for early assessment of emerging architectures (as well as for informing the design of future hardware), but it is often unknown how well the benchmarks represent the performance characteristics of simulation codes. Existing methods for evaluating how well our benchmarks represent our HPC codes are manual, labor intensive, and challenging to scale to many benchmarks. In this paper, we propose performance similarity metrics based on how the code uses the compute hardware. We define and characterize two broad categories of kernels that exhibit similar performance characteristics. We evaluate the pairwise similarity metrics on kernels in the Kripke proxy application and the RAJA Performance Suite, using both a CPU-only system and a GPU-accelerated system. We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite. Our proposed similarity metrics enable assessment of the similarity of computational kernels in our codes and the proxy applications we use to represent the codes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes hardware-usage-based performance similarity metrics to assess how well proxy applications represent the characteristics of HPC simulation codes. It defines and characterizes two broad categories of kernels exhibiting similar performance behavior, evaluates pairwise similarity on kernels from the Kripke proxy application and the RAJA Performance Suite on both CPU-only and GPU-accelerated systems, and claims validation by correctly matching one Kripke kernel to a RAJA kernel.

Significance. If the metrics can be shown to identify performance-equivalent kernels via an independent ground-truth criterion (rather than circular use of the same usage vectors), the approach would provide a scalable alternative to manual proxy validation in HPC, which is a recognized bottleneck. The dual-platform evaluation (CPU and GPU) and use of established suites (Kripke, RAJA) are positive elements that would strengthen applicability if quantitative results were supplied.

major comments (2)
  1. [Abstract] The central validation claim ('We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite') supplies no quantitative similarity scores, no definition or formula for the metrics, no error analysis, and no description of the independent correctness criterion (e.g., expert labeling or measured runtime/scaling parity). This renders the primary result unverifiable from the text.
  2. [Evaluation, presumed §4–5] No side-by-side runtime, scaling, or application-level outcome data are presented to confirm that the matched kernels behave equivalently under load. Without such external evidence, the claim that hardware-usage patterns alone suffice to establish performance equivalence rests on an untested assumption and cannot support the paper's conclusions.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from explicit definitions or equations for the two proposed similarity metrics and the two kernel categories before the evaluation is described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Abstract] The central validation claim ('We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite') supplies no quantitative similarity scores, no definition or formula for the metrics, no error analysis, and no description of the independent correctness criterion (e.g., expert labeling or measured runtime/scaling parity). This renders the primary result unverifiable from the text.

    Authors: We agree the abstract is high-level and omits specifics. The similarity metrics are defined via hardware-usage vector comparisons (detailed with formulas in Section 3), and the validation consists of the metrics producing the highest pairwise score for one specific Kripke-RAJA kernel pair on both CPU and GPU platforms. To improve verifiability, we will revise the abstract to include representative quantitative similarity scores, a brief statement of the metric formulation, and clarification that correctness follows from consistent high-similarity matches across platforms. revision: yes

  2. Referee: [Evaluation, presumed §4–5] No side-by-side runtime, scaling, or application-level outcome data are presented to confirm that the matched kernels behave equivalently under load. Without such external evidence, the claim that hardware-usage patterns alone suffice to establish performance equivalence rests on an untested assumption and cannot support the paper's conclusions.

    Authors: The evaluation section reports the pairwise similarity scores derived from hardware-usage vectors and identifies the matching kernel pair on the basis of those scores. We acknowledge that direct runtime and scaling comparisons would constitute stronger external corroboration. We will add such measurements for the matched kernels (and a few non-matched controls) on both the CPU-only and GPU-accelerated systems in the revised evaluation section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; metrics defined independently of validation data

full rationale

The paper defines similarity metrics from hardware usage patterns and states a validation result for kernel matching between Kripke and RAJA without any equations, derivations, or reductions shown. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central claim rests on independent metric definitions and an external-style validation statement rather than any construction that equates outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no explicit free parameters or invented entities; the two kernel categories appear as a domain assumption introduced to support the metrics.

axioms (1)
  • domain assumption Computational kernels can be grouped into two broad categories that exhibit similar performance characteristics based on hardware usage.
    Stated in abstract as the foundation for defining similarity metrics.
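
For a worked illustration of the roofline-style reasoning behind this axiom (cf. Figure 2), assuming illustrative peak rates rather than the paper's measured machine balance:

    # A kernel whose arithmetic intensity falls below the machine balance is
    # memory-bound; otherwise it is compute-bound. Machine numbers are
    # illustrative, not measurements from the paper.
    PEAK_FLOPS = 1.0e13      # peak FLOP/s of a hypothetical device
    PEAK_BANDWIDTH = 1.0e12  # peak DRAM bytes/s
    machine_balance = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOP per byte at the ridge

    def categorize(flops, bytes_moved):
        # Classify a kernel by arithmetic intensity (FLOP per byte of DRAM traffic).
        intensity = flops / bytes_moved
        return "compute-bound" if intensity >= machine_balance else "memory-bound"

    print(categorize(flops=2.0e9, bytes_moved=8.0e9))   # intensity 0.25 -> memory-bound
    print(categorize(flops=6.4e10, bytes_moved=1.0e9))  # intensity 64   -> compute-bound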

pith-pipeline@v0.9.0 · 5494 in / 1140 out tokens · 36994 ms · 2026-05-11T01:05:11.101968+00:00 · methodology

discussion (0)

