pith. machine review for the scientific record.

arxiv: 2605.03713 · v2 · submitted 2026-05-05 · 💻 cs.AR · cs.PF


SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison


Pith reviewed 2026-05-08 18:48 UTC · model grok-4.3

classification 💻 cs.AR cs.PF
keywords SPEC CPU2026 · benchmark characterization · representativeness analysis · CPU performance evaluation · instruction cache stress · workload subsets · cross-suite comparison · microarchitectural metrics

The pith

SPEC CPU2026 increases instruction volume, memory footprint, and instruction-cache pressure compared to SPEC CPU2017, while compact subsets preserve most of its behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper provides the first detailed characterization of the SPEC CPU2026 benchmark suite across nine diverse processor platforms. It reveals that this new suite demands more instructions and memory than its 2017 predecessor and places greater stress on the instruction cache. The authors demonstrate that selecting just 4-5 representative workloads per category can capture 96.4 to 99.9 percent of the full suite's microarchitectural behavior. This allows for more efficient yet faithful CPU performance evaluations. By comparing it to other suites, they position SPEC CPU2026 as a general-purpose benchmark that bridges toward real-world datacenter workloads.

Core claim

We find that, compared to SPEC CPU2017, SPEC CPU2026 increases instruction volume and memory footprint, and shifts pressure toward emerging bottlenecks, most notably higher instruction-cache stress. Using clustering-based representativeness analysis, we identify that compact subsets of 4-5 workloads per group preserve 96.4-99.9% of full-suite behavior, substantially reducing evaluation costs without sacrificing fidelity. SPEC CPU2026 remains a general-purpose suite with complementary characteristics to MLPerf and DCPerf, yet moves closer to real-world CPU behavior than prior generations.

What carries the argument

Clustering-based representativeness analysis on microarchitectural metrics measured across nine platforms, which identifies minimal workload subsets and supports direct cross-suite comparisons.
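The clustering step described above can be sketched in a few lines: standardize per-workload microarchitectural metrics, build a hierarchical (Ward) dendrogram, cut it at a fixed linkage distance, and keep the workload closest to each cluster centroid. This is a minimal illustration, not the authors' code; the workload names, metric values, and the cut distance `t=5.0` are placeholder assumptions.

```python
# Sketch of clustering-based representative-subset selection.
# Workload names and metric values are hypothetical placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
workloads = [f"wl{i}" for i in range(12)]   # placeholder workload names
metrics = rng.normal(size=(12, 6))          # e.g. IPC, L1i MPKI, branch MPKI, ...

# Standardize each metric so no single counter dominates the distance.
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)

# Ward linkage on the standardized vectors, cut at a fixed linkage distance
# (the paper cuts its dendrograms similarly; the threshold here is assumed).
Z = linkage(z, method="ward")
labels = fcluster(Z, t=5.0, criterion="distance")

# Representative = the workload closest to its cluster centroid.
subset = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    centroid = z[members].mean(axis=0)
    best = members[np.argmin(np.linalg.norm(z[members] - centroid, axis=1))]
    subset.append(workloads[best])

print(sorted(subset))
```

With real counter data, the number of clusters (and hence the 4-5 workload subset size the paper reports) falls out of the chosen cut distance rather than being fixed a priori.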

Load-bearing premise

The nine chosen platforms adequately represent the diversity of modern CPU microarchitectures and the clustering analysis on microarchitectural metrics accurately identifies subsets that preserve all relevant behaviors without missing critical interactions or outliers.

What would settle it

A new processor platform outside the nine shows substantially different ranking or bottleneck behavior when running the proposed compact subsets versus the full suite, or a previously unmeasured metric like energy per instruction deviates sharply from the reported trends.

Figures

Figures reproduced from arXiv: 2605.03713 by Andrew Jacob, Lizy K. John, Neeraja J. Yadwadkar, RuiHao Li.

Figure 1: Key performance metric comparison between …
Figure 2: Dendrogram showing similarity between SPEC CPU26 workloads; shorter linkage distance indicates higher similarity.
Figure 3: Principal Component (PC) comparison across four benchmark suites. The top 8 PCs capture 84% of total variability.
Figure 4: Dendrogram showing DCPerf and MLPerf have …
Figure 5: Comparison of key metrics between SPEC CPU26, SPEC CPU17, MLPerf, and DCPerf. SPEC generally offers broader coverage than those domain-specific benchmark suites (geo-mean comparisons for each suite).
Figure 6: Performance impact of alternative memory allocators with/without THP across …
Figure 7: RSS for CPU2026/CPU2017 Rate (per-workload data …)
Figure 8: IPC and DIMM bandwidth across prefetcher configurations. Higher IPC comes from better DIMM bandwidth utilization.
Figure 9: Performance distributions across different compilers for …
Figure 10: SPEC CPU17 vs. CPU26 instruction count across compilers; CPU26 is more compiler-sensitive (especially 706.stockfish_r).
Figure 11: Per-copy retired-instruction distributions across nine machines …
Figure 12: Scaling distributions (1–40 threads) for …
Figure 13: Using the RRR mode to run 709.cactus_r and 749.fo…
Figure 14: Detailed per-workload performance of different allocators on …
Figure 15: Detailed per-workload performance of different allocators on …
Figure 16: Detailed per-workload performance of different compilers on …
Figure 17: Detailed per-workload performance of different compilers on …
Figure 18: Detailed per-workload performance of Speed workload scaling on …
Figure 19: Detailed per-workload performance of Speed workload scaling on …
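Figure 3's headline number (the top 8 PCs capturing 84% of total variability) comes from a standard PCA over standardized per-workload metric vectors. A minimal sketch of that computation follows, using random placeholder data rather than the paper's measurements; the 60×20 shape is an assumption for illustration.

```python
# Sketch of the PCA underlying a Figure 3-style cross-suite comparison:
# standardize metrics, take the SVD, and report cumulative explained
# variance of the leading components. Data is a random placeholder.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))   # 60 workloads x 20 metrics (hypothetical)

Xz = (X - X.mean(axis=0)) / X.std(axis=0)
# PCA via SVD; singular values come back in descending order.
_, s, _ = np.linalg.svd(Xz, full_matrices=False)
explained = s**2 / np.sum(s**2)   # per-component variance fraction
top8 = explained[:8].sum()
print(f"top 8 PCs explain {top8:.1%} of variance")
```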
read the original abstract

Specialized accelerators dominate AI workloads, but CPUs remain critical for orchestrating these accelerators and running datacenter services. As a result, CPU performance increasingly shapes end-to-end system efficiency, making it necessary for benchmarks to reflect modern workloads and bottlenecks. However, it remains unclear how emerging CPU benchmark suites reflect these shifts. To address this, we present the first comprehensive characterization of SPEC CPU2026 across nine platforms spanning recent Intel, AMD, Ampere, and Nvidia processors. We find that, compared to SPEC CPU2017, SPEC CPU2026 increases instruction volume and memory footprint, and shifts pressure toward emerging bottlenecks, most notably higher instruction-cache stress. We next examine whether the full suite is necessary for architectural evaluation. Using clustering-based representativeness analysis, we identify that compact subsets of 4-5 workloads per group preserve 96.4-99.9% of full-suite behavior, substantially reducing evaluation costs without sacrificing fidelity. To better position SPEC CPU2026, we compare it against SPEC CPU2017, DCPerf, and MLPerf using cross-suite microarchitectural metrics. SPEC CPU2026 remains a general-purpose suite with complementary characteristics: it is less vector-intensive than MLPerf and has lower frontend pressure than DCPerf, yet moves closer to real-world CPU behavior than prior SPEC CPU generations. Finally, we show that SPEC CPU2026 supports practical architectural studies beyond aggregate scores through case studies on page sizes and allocators, prefetching, compiler optimizations, ISA sensitivity, and many-core scaling. The new round-robin stagger mode generates proxy workloads that approximate DCPerf, reducing the IPC gap to 13.7%. Overall, SPEC CPU2026 sets a new foundation for rigorous and cost-effective CPU evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript provides the first comprehensive characterization of SPEC CPU2026 across nine platforms spanning recent Intel, AMD, Ampere, and Nvidia processors. Compared to SPEC CPU2017, it reports increased instruction volume and memory footprint with a shift toward higher instruction-cache stress. Clustering analysis identifies compact subsets of 4-5 workloads per group that preserve 96.4-99.9% of full-suite behavior. Cross-suite comparisons with SPEC CPU2017, DCPerf, and MLPerf position SPEC CPU2026 as a general-purpose suite with complementary characteristics (less vector-intensive than MLPerf, lower frontend pressure than DCPerf). Case studies demonstrate utility for studies on page sizes, allocators, prefetching, compiler optimizations, ISA sensitivity, and many-core scaling, including a round-robin stagger mode that generates proxy workloads approximating DCPerf, reducing the IPC gap to 13.7%.

Significance. If the empirical results hold, this work is significant for the computer architecture community by updating benchmark understanding for modern CPU workloads that orchestrate accelerators and run datacenter services. The multi-platform measurements provide directional evidence on workload shifts, and the clustering-derived subsets could substantially lower evaluation costs while maintaining high fidelity. The cross-suite positioning and practical case studies offer actionable guidance for benchmark selection and architectural studies. Strengths include direct execution measurements on standard benchmarks across diverse vendors and the introduction of a new stagger mode for proxy workloads.

major comments (2)
  1. [§3 and Abstract] Platform selection: The characterization relies on nine platforms (Intel, AMD, Ampere, Nvidia) to support claims of increased instruction volume, memory footprint, and higher i-cache stress versus CPU2017. However, this selection may under-sample axes such as vector width variations or novel cache coherence protocols, which is load-bearing for the generalizability of the 'shifts pressure toward emerging bottlenecks' claim; the paper should add explicit discussion of platform limitations and sensitivity tests.
  2. [Abstract and §4] Clustering-based representativeness analysis: The central cost-effectiveness claim rests on subsets of 4-5 workloads preserving 96.4-99.9% of full-suite behavior via clustering on microarchitectural metrics. The abstract presents these fidelity numbers without error bars, statistical details, or full methodology (e.g., chosen metrics, algorithm parameters, or outlier validation), leaving it unclear whether all relevant interactions are captured; this requires expansion to verify the representativeness result.
minor comments (2)
  1. [Abstract] Include the exact list of nine platforms and any confidence intervals or methodology summary for the 96.4-99.9% figures to improve clarity and allow immediate assessment of the fidelity claims.
  2. [Cross-suite comparison] Specify the precise set of microarchitectural metrics used for positioning against DCPerf and MLPerf to enhance reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive major comments. We address each point below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§3 and Abstract] Platform selection: The characterization relies on nine platforms (Intel, AMD, Ampere, Nvidia) to support claims of increased instruction volume, memory footprint, and higher i-cache stress versus CPU2017. However, this selection may under-sample axes such as vector width variations or novel cache coherence protocols, which is load-bearing for the generalizability of the 'shifts pressure toward emerging bottlenecks' claim; the paper should add explicit discussion of platform limitations and sensitivity tests.

    Authors: We agree that an explicit discussion of platform limitations is warranted to better support the generalizability of our claims regarding workload shifts. In the revised version, we will add a new subsection in §3 that details the rationale for selecting the nine platforms (spanning recent Intel, AMD, Ampere, and Nvidia processors) while acknowledging that they do not exhaustively cover all microarchitectural axes, such as every possible vector width variation or novel cache coherence protocols. We will also incorporate sensitivity tests by reporting how the key trends (increased instruction volume, memory footprint, and i-cache stress) hold consistently across the available platforms, thereby providing directional evidence without overstating universality. revision: yes

  2. Referee: [Abstract and §4] Clustering-based representativeness analysis: The central cost-effectiveness claim rests on subsets of 4-5 workloads preserving 96.4-99.9% of full-suite behavior via clustering on microarchitectural metrics. The abstract presents these fidelity numbers without error bars, statistical details, or full methodology (e.g., chosen metrics, algorithm parameters, or outlier validation), leaving it unclear whether all relevant interactions are captured; this requires expansion to verify the representativeness result.

    Authors: We will expand §4 with the requested details to make the clustering analysis fully transparent and verifiable. Specifically, we will include: the complete list of microarchitectural metrics used for clustering, the exact algorithm and parameters (e.g., number of clusters and distance metric), outlier validation steps, and statistical measures such as error bars or variance on the reported 96.4-99.9% fidelity values. The abstract will be lightly revised to note that full methodological details and statistical support appear in §4. This expansion will clarify how the 4-5 workload subsets capture relevant interactions while preserving high fidelity to the full suite. revision: yes

Circularity Check

0 steps flagged

Empirical characterization with no circular derivations or self-referential reductions

full rationale

The paper's core claims rest on direct empirical measurements: running SPEC CPU2026 workloads on nine external platforms (Intel, AMD, Ampere, Nvidia) to quantify instruction volume, memory footprint, and i-cache stress relative to SPEC CPU2017, followed by standard clustering on microarchitectural counters to identify compact subsets whose measured behavior is then validated against the full suite (yielding the 96.4-99.9% preservation figures). Cross-suite positioning against DCPerf and MLPerf uses the same external metrics without any fitted parameters renamed as predictions, self-citations invoked for uniqueness theorems, or ansatzes smuggled from prior author work. No derivation step reduces by construction to its own inputs; the representativeness analysis is falsifiable via the reported metric comparisons and remains independent of the paper's conclusions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The work relies on empirical execution of existing SPEC workloads and standard clustering; one data-driven choice of subset size (4-5) is presented as preserving high fidelity. No new entities postulated.

free parameters (1)
  • subset size per group
    Selected as 4-5 workloads to achieve 96.4-99.9% preservation; appears chosen based on clustering results rather than a priori.
axioms (1)
  • domain assumption: SPEC CPU workloads and the chosen microarchitectural metrics adequately capture real-world CPU bottlenecks in datacenter and AI-orchestration scenarios
    Invoked to justify relevance of the suite and the clustering analysis.
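One simple way to score how well a compact subset "preserves" full-suite behavior is a geometric-mean ratio between the subset and the full suite on a given metric. The paper's exact fidelity definition may differ; the per-workload IPC values and subset indices below are hypothetical.

```python
# Hedged sketch of a subset-fidelity score: ratio of subset geomean to
# full-suite geomean on one metric. Values are hypothetical, and this is
# an assumed definition, not necessarily the paper's.
import numpy as np

full_ipc = np.array([1.2, 0.9, 1.5, 1.1, 0.8, 1.3, 1.0, 1.4])  # per-workload IPC
subset_idx = [0, 2, 4, 6]                                       # a 4-workload subset

def geomean(x):
    # Geometric mean via log-space averaging (requires positive values).
    return float(np.exp(np.log(x).mean()))

g_sub = geomean(full_ipc[subset_idx])
g_full = geomean(full_ipc)
fidelity = min(g_sub, g_full) / max(g_sub, g_full)   # 1.0 = perfect preservation
print(f"fidelity: {fidelity:.1%}")
```

A figure in the high-90s (as here) would mirror the 96.4-99.9% range the paper reports, though the paper aggregates across many metrics and nine platforms rather than a single counter.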

pith-pipeline@v0.9.0 · 5629 in / 1653 out tokens · 51476 ms · 2026-05-08T18:48:02.965770+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

144 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1–12

  2. [2]

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. 2022. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. InSC22: International Conference for High Performance Computing, Networking, Storage and ...

  3. [3]

    Ampere. 2026. Ampere Processor Platforms. https://amperecomputing.com/pr oducts/processors

  4. [4]

    Ulf Andersson, Min Qiu, and Ziyang Zhang. 2006. Parallel power computation for photonic crystal devices.Methods and applications of analysis13, 2 (2006), 149–156

  5. [5]

    Georgia Antoniou, Davide Bartolini, Haris Volos, Marios Kleanthous, Zhe Wang, Kleovoulos Kalaitzidis, Tom Rollet, Ziwei Li, Onur Mutlu, Yiannakis Sazeides, and Jawad Haj Yahya. 2024. Agile C-states: a core C-state architecture for latency critical applications optimizing both transition and cold-start latency. ACM Transactions on Architecture and Code Opt...

  6. [6]

    ARM. 2026. The Arm ASTC Encoder, a compressor for the Adaptive Scalable Texture Compression data format. https://github.com/ARM-software/astc- encoder

  7. [7]

    ARM. 2026. The world’s most efficient agentic CPU. https://www.arm.com/pr oducts/cloud-datacenter/arm-agi-cpu

  8. [8]

    Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J Rossbach, and Onur Mutlu. 2018. Mosaic: En- abling application-transparent support for multiple page sizes in throughput processors.ACM SIGOPS Operating Systems Review52, 1 (2018), 27–44

  9. [9]

    Mohammad Bakhshalipour, Seyedali Tabaeiaghdaei, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2019. Evaluation of hardware data prefetchers on server processors.ACM Computing Surveys (CSUR)52, 3 (2019), 1–29

  10. [10]

    Jon Berndt. 2004. JSBSim: An open source flight dynamics model in C++. In AIAA modeling and simulation technologies conference and exhibit. 4923

  11. [11]

    Vaughn Betz and Jonathan Rose. 1997. VPR: A new packing, placement and routing tool for FPGA research. InInternational Workshop on Field Programmable Logic and Applications. Springer, 213–222

  12. [12]

    Ravi Bhargava and Kai Troester. 2024. AMD next-generation “Zen 4” core and 4th gen AMD EPYC server CPUs.IEEE Micro44, 3 (2024), 8–17

  13. [13]

    Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R

    Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator.ACM SIGARCH computer architecture news39, 2 (2011), 1–7

  14. [14]

    Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0

    F. Nisa Bostanci, Haocong Luo, Ataberk Olgun, Maria Makeenkova, Ger- aldo F. Oliveira, A. Giray Yaglikci, and Onur Mutlu. 2026. Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0. arXiv:2510.15744 [cs.AR] https://arxiv.org/abs/2510.15744

  15. [15]

    Robert Brayton and Alan Mishchenko. 2010. ABC: An academic industrial- strength verification tool. InInternational Conference on Computer Aided Verifi- cation. Springer, 24–40

  16. [16]

    Kistowski

    James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. 2018. SPEC CPU2017: Next-generation compute benchmark. InCompanion of the 2018 ACM/SPEC International Conference on Performance Engineering. 41–42

  17. [17]

    Benjamin Buchfink, Klaus Reuter, and Hajk-Georg Drost. 2021. Sensitive protein alignments at tree-of-life scale using DIAMOND.Nature methods18, 4 (2021), 366–368

  18. [18]

    Calin Cascaval, Evelyn Duesterwald, Peter F Sweeney, and Robert W Wisniewski

  19. [19]

    In14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05)

    Multiple page size modeling and optimization. In14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05). IEEE, 339–349

  20. [20]

    Hao Chen, Kim Laine, and Rachel Player. 2017. Simple encrypted arithmetic library-SEAL v2. 1. InInternational conference on financial cryptography and data security. Springer, 3–18

  21. [21]

    Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, and Zhiru Zhang. 2024. Understanding the potential of fpga- based spatial acceleration for large language model inference.ACM Transactions on Reconfigurable Technology and Systems18, 1 (2024), 1–29

  22. [22]

    Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. 2016. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks.IEEE journal of solid-state circuits52, 1 (2016), 127–138

  23. [23]

    Weiwei Chu, Xinfeng Xie, Jiecao Yu, Jie Wang, Amar Phanishayee, Chunqiang Tang, Yuchen Hao, Jianyu Huang, Mustafa Ozdal, Jun Wang, Vedanuj Goswami, Naman Goyal, Abhishek Kadian, Andrew Gu, Chris Cai, Feng Tian, Xiaodong Wang, Min Si, Pavan Balaji, Ching-Hsiang Chu, and Jongsoo Park. 2025. Scaling Llama 3 Training with Efficient Parallelism Strategies. InP...

  24. [24]

    Joel Coburn, Chunqiang Tang, Sameer Abu Asal, Neeraj Agrawal, Raviteja Chinta, Harish Dixit, Brian Dodds, Saritha Dwarakapuram, Amin Firoozshahian, Cao Gao, Kaustubh Gondkar, Tyler Graf, Junhan Hu, Jian Huang, Sterling Hughes, Adam Hutchin, Bhasker Jakka, Guoqiang Jerry Chen, Indu Kalyanara- man, Ashwin Kamath, Pankaj Kansal, Erum Kazi, Roman Levenstein, ...

  25. [25]

    Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient address translation for architectures with multiple page sizes.ACM SIGPLAN Notices52, 4 (2017), 435–448

  26. [26]

    Linker, Ronald M

    Cooper Downs, Jon A. Linker, Ronald M. Caplan, Emily I. Mason, Pete Riley, Ryder Davidson, Andres Reyes, Erika Palmerio, Roberto Lionello, James Turtle, Michal Ben-Nun, Miko M. Stulajter, Viacheslav S. Titov, Tibor Török, Lisa A. Upton, Raphael Attie, Bibhuti K. Jha, Charles N. Arge, Carl J. Henney, Gher- ardo Valori, Hanna Strecker, Daniele Calchetti, Di...

  27. [27]

    Pouya Esmaili-Dokht, Francesco Sgherzi, Valéria Soldera Girelli, Isaac Boix- aderas, Mariana Carmin, Alireza Monemi, Adrià Armejach, Estanislao Mercadal, Germán Llort, Petar Radojković, Miquel Moreto, Judit Giménez, Xavier Mar- torell, Eduard Ayguadé, Jesus Labarta, Emanuele Confalonieri, Rishabh Dubey, and Jason Adlard. 2024. A mess of memory system benc...

  28. [28]

    Mark Evers, Leslie Barnes, and Mike Clark. 2022. The AMD next-generation “Zen 3” core.IEEE Micro42, 3 (2022), 7–12

  29. [29]

    Facebook. 2026. Zstandard - Fast real-time compression algorithm. https: //github.com/facebook/zstd

  30. [30]

    Yinxiao Feng and Kaisheng Ma. 2022. Chiplet actuary: A quantitative cost model and multi-chiplet architecture exploration. InProceedings of the 59th ACM/IEEE Design Automation Conference. 121–126

  31. [31]

    Amin Firoozshahian, Joel Coburn, Roman Levenstein, Rakesh Nattoji, Ashwin Kamath, Olivia Wu, Gurdeepak Grewal, Harish Aepala, Bhasker Jakka, Bob Dreyer, Adam Hutchin, Utku Diril, Krishnakumar Nair, Ehsan K. Aredestani, Martin Schatz, Yuchen Hao, Rakesh Komuravelli, Kunming Ho, Sameer Abu Asal, Joe Shajrawi, Kevin Quinn, Nagesh Sreedhara, Pankaj Kansal, Wi...

  32. [32]

    Reinhardt, Adrian M

    Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, and Doug Burger. 2018. A configurable cloud-scale DNN processor fo...

  33. [33]

    Kevin P Gaffney, Martin Prammer, Larry Brasfield, D Richard Hipp, Dan Kennedy, and Jignesh M Patel. 2022. SQLite: past, present, and future.Proceed- ings of the VLDB Endowment15, 12 (2022)

  34. [34]

    Christophe Geuzaine and Jean-François Remacle. 2009. Gmsh: A 3-D finite element mesh generator with built-in pre-and post-processing facilities.Inter- national journal for numerical methods in engineering79, 11 (2009), 1309–1331

  35. [35]

    Marc-Oliver Gewaltig and Markus Diesmann. 2007. Nest (neural simulation tool).Scholarpedia2, 4 (2007), 1430

  36. [36]

    Abraham Gonzalez, Aasheesh Kolli, Samira Khan, Sihang Liu, Vidushi Dadu, Sagar Karandikar, Jichuan Chang, Krste Asanovic, and Parthasarathy Ran- ganathan. 2023. Profiling hyperscale big data processing. InProceedings of the 50th Annual International Symposium on Computer Architecture. 1–16

  37. [37]

    Tom Goodale, Gabrielle Allen, Gerd Lanfermann, Joan Massó, Thomas Radke, Edward Seidel, and John Shalf. 2002. The cactus framework and toolkit: design and applications: invited talk. InInternational conference on high performance computing for computational science. Springer, 197–227

  38. [38]

    Björn Gottschall, Silvio Campelo de Santana, and Magnus Jahre. 2023. Balancing accuracy and evaluation overhead in simulation point selection. In2023 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 43–53

  39. [39]

    Darryl Gove. 2007. CPU2006 working set size.ACM SIGARCH Computer Architecture News35, 1 (2007), 90–96

  40. [40]

    Danilo Guerrera, Rubén M Cabezón, Jean-Guillaume Piccinali, Aurélien Cavelan, Florina M Ciorba, David Imbert, Lucio Mayer, and Darren Reed. 2018. Towards a mini-app for smoothed particle hydrodynamics at exascale. In2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 607–614

  41. [41]

    Faruk Guvenilir and Yale N Patt. 2020. Tailored page sizes. In2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 900–912

  42. [42]

    Ranjan Hebbar SR and Aleksandar Milenković. 2019. SPEC CPU2017: Perfor- mance, event, and energy characterization on the core i7-8700K. InProceedings of the 2019 ACM/SPEC International Conference on Performance Engineering. 111–118

  43. [43]

    John L Henning. 2002. SPEC CPU2000: Measuring CPU performance in the new millennium.Computer33, 7 (2002), 28–35

  44. [44]

    John L Henning. 2006. SPEC CPU2006 benchmark descriptions.ACM SIGARCH Computer Architecture News34, 4 (2006), 1–17

  45. [45]

    Andrew Hamilton Hunter, Chris Kennelly, Paul Turner, Darryl Gove, Tipp Moseley, and Parthasarathy Ranganathan. 2021. Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator. In15th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI}21). 257–273

  46. [46]

    Mohsen Imani, Saransh Gupta, Yeseong Kim, and Tajana Rosing. 2019. Floatpim: In-memory acceleration of deep neural network training with high precision. InProceedings of the 46th International Symposium on Computer Architecture. 802–815

  47. [47]

    Koki Ishida, Ilkwon Byun, Ikki Nagaoka, Kosuke Fukumitsu, Masamitsu Tanaka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Jangwoo Kim, and Koji In- oue. 2020. SuperNPU: An extremely fast neural processing unit using supercon- ducting logic devices. In2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 58–72

  48. [48]

    Adam N Jacobvitz, Andrew D Hilton, and Daniel J Sorin. 2015. Multi-program benchmark definition. In2015 IEEE international symposium on performance analysis of systems and software (ISPASS). IEEE, 72–82

  49. [49]

    Akanksha Jain, Hannah Lin, Carlos Villavieja, Baris Kasikci, Chris Kennelly, Milad Hashemi, and Parthasarathy Ranganathan. 2024. Limoncello: Prefetchers for scale. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 577–590

  50. [50]

    Rishabh Jain, Scott Cheng, Vishwas Kalagi, Vrushabh Sanghavi, Samvit Kaul, Meena Arunachalam, Kiwan Maeng, Adwait Jog, Anand Sivasubramaniam, Mahmut Taylan Kandemir, and Chita R. Das. 2023. Optimizing cpu perfor- mance for recommendation systems at-scale. InProceedings of the 50th Annual International Symposium on Computer Architecture. 1–15

  51. [51]

    Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. 2025. Neo: Sav- ing gpu memory crisis with cpu offloading for online llm inference.Proceedings of Machine Learning and Systems7 (2025)

  52. [52]

    Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Clifford Young, Xiang Zhou, Zongwei Zhou, and David A Patterson. 2023. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. InProceedings of the 50th annual inter...

  53. [53]

    Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Da- ley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richar...

  54. [54]

    Jowi Morales. 2026. Are we staring down the barrel of an AI-driven CPU shortage? https://www.tomshardware.com/pc-components/cpus/cpus-are- cool-again-intel-and-amd-reporting-spikes-in-cpu-demand-due-to-agentic- ai-shortages-lisa-su-says-business-exceeded-expectations-while-intel-is- looking-at-long-term-agreements-with-potential-customers

  55. [55]

    Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. InProceedings of ACL 2018, system demonstrations. 116–121

  56. [56]

    Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Profiling a warehouse- scale computer. InProceedings of the 42nd annual international symposium on computer architecture. 158–169

  57. [57]

    Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, and Krste Asanović. 2018. FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud. In2018 ACM/IEEE 45th ...

  58. [58]

    Martin Kronbichler, Dmytro Sashko, and Peter Munch. 2023. Enhancing data lo- cality of the conjugate gradient method for high-order matrix-free finite-element implementations.The International Journal of High Performance Computing Applications37, 2 (2023), 61–81

  59. [59]

    Jaewon Kwon, Yongju Lee, Hongju Kal, Minjae Kim, Youngsok Kim, and Won Woo Ro. 2023. McCore: A Holistic Management of High-Performance Het- erogeneous Multicores. InProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 1044–1058

  60. [60]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles. 611–626

  61. [61]

    Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE, 75–86

  62. [62]

    Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012. When prefetching works, when it doesn’t, and why. ACM Transactions on Architecture and Code Optimization (TACO) 9, 1 (2012), 1–29

  63. [63]

    Taehyung Lee, Sumit Kumar Monga, Changwoo Min, and Young Ik Eom. 2023. Memtis: Efficient memory tiering with dynamic page classification and page size determination. In Proceedings of the 29th Symposium on Operating Systems Principles. 17–34

  64. [64]

    Daan Leijen, Benjamin Zorn, and Leonardo De Moura. 2019. Mimalloc: Free list sharding in action. In Asian Symposium on Programming Languages and Systems. Springer, 244–265

  65. [65]

    Ankur Limaye and Tosiron Adegbija. 2018. A workload characterization of the SPEC CPU2017 benchmark suite. In 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 149–158

  66. [66]

    Linux Kernel Community. 2026. Linux perf tool. https://perf.wiki.kernel.org/index.php/Main_Page

  67. [67]

    Qiuyun Llull, Songchun Fan, Seyed Majid Zahedi, and Benjamin C Lee. 2017. Cooper: Task colocation with cooperative games. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 421–432

  68. [68]

    Thomas J Macke and David A Case. 1998. Modeling unusual nucleic acid structures. ACS Publications

  69. [69]

    Daniel Marjamäki. 2013. Cppcheck: a tool for static C/C++ code analysis. URL: https://cppcheck.sourceforge.io (2013)

  70. [70]

    B. Maronga, S. Banzhaf, C. Burmeister, T. Esch, R. Forkel, D. Fröhlich, V. Fuka, K. F. Gehrke, J. Geletič, S. Giersch, T. Gronemeier, G. Groß, W. Heldens, A. Hellsten, F. Hoffmann, A. Inagaki, E. Kadasch, F. Kanani-Sühring, K. Ketelsen, B. A. Khan, C. Knigge, H. Knoop, P. Krč, M. Kurppa, H. Maamari, A. Matzarakis, M. Mauder, M. Pallasch, D. Pavlik, J. Pfa...

  71. [71]

    Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Na... John, Tsuguchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, and Matei Zaharia

  72. [72]

    Simon McIntosh-Smith, Matthew Martineau, Tom Deakin, Grzegorz Pawelczak, Wayne Gaudin, Paul Garrett, Wei Liu, Richard Smedley-Stevenson, and David Beckingsale. 2017. TeaLeaf: A mini-application to enable design-space explorations for iterative sparse linear solvers. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 842–849

  73. [73]

    Richard C Murphy, Kyle B Wheeler, Brian W Barrett, and James A Ang. 2010. Introducing the Graph 500. Cray Users Group (CUG) 19, 45-74 (2010), 22

  74. [74]

    Seonjin Na, Geonhwa Jeong, Byung H Ahn, Aaron Jezghani, Jeffrey Young, Christopher J Hughes, Tushar Krishna, and Hyesoon Kim. 2025. FlexInfer: Flexible LLM inference with CPU computations. Proceedings of Machine Learning and Systems 7 (2025)

  75. [75]

    Seonjin Na, Geonhwa Jeong, Byung Hoon Ahn, Jeffrey Young, Tushar Krishna, and Hyesoon Kim. 2024. Understanding performance implications of LLM inference on CPUs. In 2024 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 169–180

  76. [76]

    Samuel Naffziger, Noah Beck, Thomas Burd, Kevin Lepak, Gabriel H Loh, Mahesh Subramony, and Sean White. 2021. Pioneering chiplet technology and design for the AMD EPYC™ and Ryzen™ processor families: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 57–70

  77. [77]

    Arash Nasr-Esfahany, Mohammad Alizadeh, Victor Lee, Hanna Alam, Brett W. Coon, David Culler, Vidushi Dadu, Martin Dixon, Henry M. Levy, Santosh Pandey, Parthasarathy Ranganathan, and Amir Yazdanbakhsh. 2025. Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion. In Proceedings of the 52nd Annual International Symposi...

  78. [78]

    Nevine Nassif, Ashley O. Munch, Carleton L. Molnar, Gerald Pasdast, Sitaraman V. Lyer, Zibing Yang, Oscar Mendoza, Mark Huddart, Srikrishnan Venkataraman, Sireesha Kandula, Rafi Marom, Alexandra M. Kern, Bill Bowhill, David R. Mulvihill, Srikanth Nimmagadda, Varma Kalidindi, Jonathan Krause, Mohammad M. Haq, Roopali Sharma, and Kevin Duda. 2022. Sap...

  79. [79]

    Agustín Navarro-Torres, Jesús Alastruey-Benedé, Pablo Ibáñez-Marín, and Víctor Viñals-Yúfera. 2019. Memory hierarchy characterization of SPEC CPU2006 and SPEC CPU2017 on the Intel Xeon Skylake-SP. PLoS ONE 14, 8 (2019), e0220135

  80. [80]

    Nicholas Nethercote, Peter J Stuckey, Ralph Becket, Sebastian Brand, Gregory J Duck, and Guido Tack. 2007. MiniZinc: Towards a standard CP modelling language. In International Conference on Principles and Practice of Constraint Programming. Springer, 529–543
