pith. machine review for the scientific record.

arxiv: 2605.06968 · v1 · submitted 2026-05-07 · 💻 cs.DC

Recognition: no theorem link

On Similarity of Computational Kernels in our Codes and Proxies

Michael McKinsey, Stephanie Brink, Olga Pearce

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:05 UTC · model grok-4.3

classification 💻 cs.DC
keywords performance similarity metrics · computational kernels · proxy applications · HPC benchmarks · hardware usage patterns · Kripke · RAJA Performance Suite

The pith

Performance similarity metrics based on hardware usage patterns correctly match a kernel from the Kripke proxy application to one in the RAJA Performance Suite.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces metrics that compare computational kernels by how they use hardware resources such as compute units and memory bandwidth. These metrics are tested on kernels from the Kripke proxy and the RAJA Performance Suite across both CPU-only and GPU systems. The evaluation confirms that the metrics identify matching kernels without needing manual inspection of full code behavior. A reader would care because current ways to check if benchmarks represent real simulation codes are slow and do not scale as HPC hardware adds more parallelism and accelerators. If the metrics hold, developers could automatically judge how well proxy apps stand in for production codes when assessing new machines.

Core claim

By defining two broad categories of kernels that share performance traits and computing pairwise similarity scores from hardware usage data, the authors show that one kernel in Kripke aligns with a kernel in RAJA on both CPU and GPU platforms, validating the metrics for assessing how well benchmarks represent full codes.

What carries the argument

Pairwise performance similarity metrics derived from hardware usage patterns, which categorize kernels into two groups exhibiting comparable performance behavior.
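
A minimal sketch of that matching step, assuming cosine similarity over normalized hardware-usage vectors; the metric formulation, the counter choices, and all values here are assumptions for illustration, not the authors' definitions (LTIMES is the kernel pair highlighted in Figure 10):

    import numpy as np

    # Hypothetical per-kernel hardware-usage vectors; each component is a
    # normalized utilization figure (e.g., FLOP rate, DRAM bandwidth, cache
    # hit rate). Values are invented for illustration.
    kernels = {
        "kripke_LTIMES":   np.array([0.42, 0.81, 0.67]),
        "rajaperf_LTIMES": np.array([0.40, 0.85, 0.70]),
        "rajaperf_OTHER":  np.array([0.10, 0.95, 0.30]),
    }

    def similarity(a, b):
        # Cosine similarity: 1.0 means identical usage direction.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Score every candidate against the Kripke kernel and keep the best match.
    target = "kripke_LTIMES"
    scores = {name: similarity(kernels[target], vec)
              for name, vec in kernels.items() if name != target}
    best = max(scores, key=scores.get)
    print(f"best match for {target}: {best} (score {scores[best]:.3f})")

Any vector-space similarity (Euclidean, correlation) would slot in the same way; the load-bearing choice is which hardware counters populate the vectors.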

If this is right

  • Comparison of codes and their proxy representations no longer requires labor-intensive manual review.
  • Benchmark suites can be checked for coverage of real application performance traits on emerging hardware.
  • Hardware designers and code developers gain a scalable way to evaluate how well proxies capture production code behavior.
  • The approach works for both CPU-only and GPU-accelerated systems using the same metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be extended to build libraries of equivalent kernels across many more proxy applications.
  • If validated further, it would allow quicker screening of new architectures by running only matched proxy kernels instead of full codes.
  • Adding memory hierarchy or communication pattern data to the metrics might strengthen matches for communication-heavy kernels.

Load-bearing premise

Hardware usage patterns alone, without full runtime profiling or application-specific details, are enough to decide whether two kernels will show equivalent performance.

What would settle it

A case where two kernels receive a high similarity score from the metrics yet display clearly different run times, scaling behavior, or resource bottlenecks when executed on identical hardware.
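
A sketch of how that test could be run, assuming a similarity cutoff and a runtime-ratio tolerance; both thresholds and all numbers below are hypothetical:

    # Flag kernel pairs whose usage-based similarity is high but whose
    # measured runtimes on the same hardware diverge.
    HIGH_SIMILARITY = 0.95   # assumed cutoff for "the metrics call these equivalent"
    MAX_RUNTIME_RATIO = 1.5  # assumed tolerance before runtimes count as "clearly different"

    # (pair, similarity score, runtime A in seconds, runtime B in seconds)
    observations = [
        ("kripke_LTIMES vs rajaperf_LTIMES", 0.97, 1.02, 1.10),
        ("kernel_X vs kernel_Y",             0.96, 1.00, 2.40),  # would settle it
    ]

    for pair, sim, t_a, t_b in observations:
        ratio = max(t_a, t_b) / min(t_a, t_b)
        if sim >= HIGH_SIMILARITY and ratio > MAX_RUNTIME_RATIO:
            print(f"counterexample: {pair} (similarity {sim}, runtime ratio {ratio:.1f}x)")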

Figures

Figures reproduced from arXiv: 2605.06968 by Michael McKinsey, Stephanie Brink, Olga Pearce.

Figure 2: GPU roofline analysis of RAJA Performance Suite
Figure 3: Minimum memory size for stable performance metrics
Figure 4: Agglomerative clustering using top-down metrics
Figure 5: K-means clustering using top-down metrics
Figure 6: Agglomerative clustering using ncu metrics
Figure 7: K-means clustering using ncu metrics
Figure 8: Agglomerative clustering using top-down and ncu metrics
Figure 9: K-means clustering using top-down and ncu metrics
Figure 10: LTIMES kernel runtime: Kripke implementation vs. RAJA Performance Suite implementations
Figure 11: Comparison of agglomerative and k-means clustering using different metrics and selection criteria
Figure 12: Reference CPU and GPU metric values for all kernels
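
Figures 4 through 9 partition kernels with k-means and agglomerative clustering over top-down and ncu metric vectors. A minimal sketch of that style of analysis, assuming standardized feature vectors and synthetic data in place of the paper's measurements:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # Synthetic stand-ins: memory-bound kernels (high bandwidth use, low
    # FLOP rate) vs compute-bound kernels (the reverse). Sizes echo the
    # observation that the memory-bound cluster holds the most kernels.
    memory_bound = rng.normal([0.9, 0.2], 0.05, size=(12, 2))
    compute_bound = rng.normal([0.3, 0.8], 0.05, size=(5, 2))
    X = StandardScaler().fit_transform(np.vstack([memory_bound, compute_bound]))

    # Two clusters, matching the paper's two broad kernel categories.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    ag = AgglomerativeClustering(n_clusters=2).fit_predict(X)

    # Agreement up to label swap: a quick check that the partition is not
    # an artifact of the choice of clustering algorithm.
    agree = max(np.mean(km == ag), np.mean(km == 1 - ag))
    print(f"k-means / agglomerative agreement: {agree:.0%}")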
read the original abstract

As high-performance computing (HPC) systems rapidly evolve, with increasing on-node parallelism and widespread use of accelerators, understanding how the code maps to hardware is essential for reaching optimal performance. Benchmarks are commonly used for early assessment of emerging architectures (as well as for informing the design of future hardware), but it is often unknown how well the benchmarks represent the performance characteristics of simulation codes. Existing methods for evaluating how well our benchmarks represent our HPC codes are manual, labor intensive, and challenging to scale to many benchmarks. In this paper, we propose performance similarity metrics based on how the code uses the compute hardware. We define and characterize two broad categories of kernels that exhibit similar performance characteristics. We evaluate the pairwise similarity metrics on kernels in the Kripke proxy application and the RAJA Performance Suite, using both a CPU-only system and a GPU-accelerated system. We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite. Our proposed similarity metrics enable assessment of the similarity of computational kernels in our codes and the proxy applications we use to represent the codes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes hardware-usage-based performance similarity metrics to assess how well proxy applications represent the characteristics of HPC simulation codes. It defines and characterizes two broad categories of kernels exhibiting similar performance behavior, evaluates pairwise similarity on kernels from the Kripke proxy application and the RAJA Performance Suite on both CPU-only and GPU-accelerated systems, and claims validation by correctly matching one Kripke kernel to a RAJA kernel.

Significance. If the metrics can be shown to identify performance-equivalent kernels via an independent ground-truth criterion (rather than circular use of the same usage vectors), the approach would provide a scalable alternative to manual proxy validation in HPC, which is a recognized bottleneck. The dual-platform evaluation (CPU and GPU) and use of established suites (Kripke, RAJA) are positive elements that would strengthen applicability if quantitative results were supplied.

major comments (2)
  1. [Abstract] The central validation claim ('We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite') supplies no quantitative similarity scores, no definition or formula for the metrics, no error analysis, and no description of the independent correctness criterion (e.g., expert labeling or measured runtime/scaling parity). This renders the primary result unverifiable from the text.
  2. [Evaluation, presumed §4–5] No side-by-side runtime, scaling, or application-level outcome data are presented to confirm that the matched kernels behave equivalently under load. Without such external evidence, the claim that hardware-usage patterns alone suffice to establish performance equivalence rests on an untested assumption and cannot support the paper's conclusions.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from explicit definitions or equations for the two proposed similarity metrics and the two kernel categories before the evaluation is described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Abstract] The central validation claim ('We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite') supplies no quantitative similarity scores, no definition or formula for the metrics, no error analysis, and no description of the independent correctness criterion (e.g., expert labeling or measured runtime/scaling parity). This renders the primary result unverifiable from the text.

    Authors: We agree the abstract is high-level and omits specifics. The similarity metrics are defined via hardware-usage vector comparisons (detailed with formulas in Section 3), and the validation consists of the metrics producing the highest pairwise score for one specific Kripke-RAJA kernel pair on both CPU and GPU platforms. To improve verifiability, we will revise the abstract to include representative quantitative similarity scores, a brief statement of the metric formulation, and clarification that correctness follows from consistent high-similarity matches across platforms. revision: yes

  2. Referee: [Evaluation, presumed §4–5] No side-by-side runtime, scaling, or application-level outcome data are presented to confirm that the matched kernels behave equivalently under load. Without such external evidence, the claim that hardware-usage patterns alone suffice to establish performance equivalence rests on an untested assumption and cannot support the paper's conclusions.

    Authors: The evaluation section reports the pairwise similarity scores derived from hardware-usage vectors and identifies the matching kernel pair on the basis of those scores. We acknowledge that direct runtime and scaling comparisons would constitute stronger external corroboration. We will add such measurements for the matched kernels (and a few non-matched controls) on both the CPU-only and GPU-accelerated systems in the revised evaluation section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; metrics defined independently of validation data

full rationale

The paper defines similarity metrics from hardware usage patterns and states a validation result for kernel matching between Kripke and RAJA without any equations, derivations, or reductions shown. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central claim rests on independent metric definitions and an external-style validation statement rather than any construction that equates outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no explicit free parameters or invented entities; the two kernel categories appear as a domain assumption introduced to support the metrics.

axioms (1)
  • domain assumption Computational kernels can be grouped into two broad categories that exhibit similar performance characteristics based on hardware usage.
    Stated in abstract as the foundation for defining similarity metrics.
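
For a worked illustration of the roofline-style reasoning behind this axiom (cf. Figure 2), assuming illustrative peak rates rather than the paper's measured machine balance:

    # A kernel whose arithmetic intensity falls below the machine balance is
    # memory-bound; otherwise it is compute-bound. Machine numbers are
    # illustrative, not measurements from the paper.
    PEAK_FLOPS = 1.0e13      # peak FLOP/s of a hypothetical device
    PEAK_BANDWIDTH = 1.0e12  # peak DRAM bytes/s
    machine_balance = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOP per byte at the ridge

    def categorize(flops, bytes_moved):
        # Classify a kernel by arithmetic intensity (FLOP per byte of DRAM traffic).
        intensity = flops / bytes_moved
        return "compute-bound" if intensity >= machine_balance else "memory-bound"

    print(categorize(flops=2.0e9, bytes_moved=8.0e9))   # intensity 0.25 -> memory-bound
    print(categorize(flops=6.4e10, bytes_moved=1.0e9))  # intensity 64   -> compute-bound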

pith-pipeline@v0.9.0 · 5494 in / 1140 out tokens · 36994 ms · 2026-05-11T01:05:11.101968+00:00 · methodology

discussion (0)

