arxiv: 2605.15222 · v1 · pith:AARY37WBnew · submitted 2026-05-13 · 💻 cs.SE · cs.CL · cs.PL

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Pith reviewed 2026-05-19 17:58 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.PL
keywords LLM code generation · performance optimization · benchmark · high-performance computing · parallelism · GPU programming · systems software · code efficiency

The pith

Current LLMs produce code that is functionally correct but far from expert-optimized on system-level performance tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PerfCodeBench to test how well large language models optimize code for high performance in systems programming. It shows that models can write working code yet leave large efficiency gaps, especially on tasks involving parallelism and GPU operations. This matters because real performance-critical systems depend on hardware-aware choices that go beyond basic correctness. The benchmark supplies executable checks, baselines, and reference solutions so both correctness and runtime can be measured directly. Results point to the need for evaluation that rewards efficient implementations rather than correct ones alone.

Core claim

PerfCodeBench consists of tasks that demand system-level implementation choices, hardware-aware optimizations, and handling of performance bottlenecks. When evaluated on a broad set of state-of-the-art LLMs, the generated code exhibits a clear gap relative to expert-optimized reference solutions, with the largest shortfalls appearing on parallelism and GPU tasks. Models also show inconsistent cross-language behavior and rarely match expert efficiency levels.

What carries the argument

PerfCodeBench, the executable benchmark that pairs each task with correctness verification, a baseline implementation, and a reference optimized solution to quantify runtime efficiency.
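A minimal sketch of how such a pairing could be scored, assuming a hypothetical `Task` record and helper names (`median_runtime`, `relative_metrics`, `score`) that are illustrative and not taken from the paper. It gates on correctness first, then expresses the candidate's runtime against both the baseline and the reference.

```python
import time
from dataclasses import dataclass
from statistics import median
from typing import Callable


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative, not the paper's schema.
    check: Callable[[Callable[[], object]], bool]  # True if an implementation is correct
    baseline: Callable[[], object]                 # unoptimized anchor
    reference: Callable[[], object]                # expert-optimized anchor


def median_runtime(fn, repeats=5):
    """Median wall-clock seconds over several runs, to damp timing noise."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def relative_metrics(t_baseline, t_reference, t_candidate):
    """Express a candidate's runtime against both anchors."""
    return {
        "speedup_over_baseline": t_baseline / t_candidate,
        "reference_speedup": t_baseline / t_reference,
        "slowdown_vs_reference": t_candidate / t_reference,
    }


def score(task, candidate, repeats=5):
    """Gate on correctness first, then measure efficiency."""
    if not task.check(candidate):
        return {"correct": False}
    metrics = relative_metrics(
        median_runtime(task.baseline, repeats),
        median_runtime(task.reference, repeats),
        median_runtime(candidate, repeats),
    )
    return {"correct": True, **metrics}
```

Under these assumptions, a `slowdown_vs_reference` near 1.0 would indicate expert-level efficiency, while a value well above 1.0 is the kind of gap the paper reports.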

If this is right

  • LLMs must improve specifically on parallelism and GPU operations to close the efficiency gap.
  • Code-generation evaluation should incorporate runtime performance metrics in addition to functional correctness.
  • Cross-language robustness remains a clear weakness that limits practical deployment.
  • Performance-aware benchmarks are required to steer future model development toward efficient systems software.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be used to create targeted fine-tuning datasets focused on optimization reasoning.
  • Integrating hardware simulation feedback into model training might narrow the observed gaps over time.
  • Similar task designs could be applied to other hardware platforms to test broader claims about model limitations.

Load-bearing premise

The selected tasks accurately capture the realistic system-level choices, hardware-aware optimizations, and performance bottlenecks that matter in practice.

What would settle it

An LLM that consistently produces code matching or beating the reference optimized solutions on the benchmark tasks across multiple runs would falsify the claimed performance gap.
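One way to make this criterion concrete, as a sketch only: the function names, the 5% tolerance, and the 90% consistency threshold below are assumptions chosen for illustration, not values from the paper. Runtimes are assumed to be collected per task over multiple runs.

```python
from statistics import median


def matches_reference(cand_times, ref_times, tolerance=0.05, min_fraction=0.9):
    """True if enough candidate runs land within tolerance of the reference median."""
    target = (1 + tolerance) * median(ref_times)
    hits = sum(t <= target for t in cand_times)
    return hits / len(cand_times) >= min_fraction


def gap_is_closed(candidate_runs, reference_runs, **kwargs):
    """Both arguments map task name -> list of runtimes in seconds.

    A task with no candidate runtimes (missing or incorrect output) counts as a miss.
    """
    return all(
        bool(candidate_runs.get(task))
        and matches_reference(candidate_runs[task], ref_times, **kwargs)
        for task, ref_times in reference_runs.items()
    )
```

A model for which `gap_is_closed` held across the whole task set would contradict the claimed gap; a single missed task is enough to keep the claim standing under this strict reading.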

Figures

Figures reproduced from arXiv: 2605.15222 by Hanyu Yang, Haochen Shi, Haoran Li, Huihao Jing, Shaojin Chen, Sirui Zhang, Wenbin Hu, Yangqiu Song.

Figure 1: Distribution of executable tasks in PerfCodeBench across programming languages and task ...
Figure 2: Additional findings from PerfCodeBench. Some models rarely succeed, but their successful ...
Figure 3: Heatmap of the per-language PerfCodeBench results. Rows correspond to evaluated models, ...
Original abstract

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. The tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task includes executable correctness checks, a baseline implementation, and a reference optimized solution. This allows us to evaluate both correctness and runtime-oriented efficiency. Our evaluation on a broad set of state-of-the-art LLMs shows a clear gap between model-generated code and expert-optimized implementations. The gap is especially large on tasks involving parallelism and GPU operations. Current models also show weaknesses in cross-language robustness and in consistently reaching expert-level efficiency. These results suggest that performance-aware evaluation are still needed. LLMs should move beyond generating merely correct code toward producing efficient systems software. We submit the benchmark data, evaluation infrastructure, and complete logs of all LLMs-generated code at https://anonymous.4open.science/r/perfcodebench-7CDE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. Tasks require system-level implementation choices, hardware-aware optimizations, and handling of performance bottlenecks; each includes executable correctness checks, a baseline implementation, and a reference optimized solution. Evaluation across state-of-the-art LLMs reports a clear gap versus expert-optimized code, largest on parallelism and GPU tasks, plus weaknesses in cross-language robustness and consistent efficiency.

Significance. If the tasks prove representative of real production bottlenecks, the benchmark would usefully demonstrate that current LLMs remain limited for performance-critical systems code and motivate performance-aware evaluation beyond functional correctness. Public release of the benchmark data, evaluation infrastructure, and complete LLM-generated code logs is a clear strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Abstract] Abstract: the central claim of a 'clear gap' that is 'especially large on tasks involving parallelism and GPU operations' depends on the tasks accurately encoding realistic implementation decisions and bottlenecks. No information is supplied on task provenance (real kernels vs. synthetic), the fraction of tasks targeting GPU/parallelism, the magnitude of reference-vs-baseline speedups, or any external validation that the chosen bottlenecks are representative rather than selected for effect.
  2. [Evaluation] Evaluation (throughout): the reported gaps lack visible statistical controls for LLM output variability (e.g., multiple samples per prompt, temperature settings, or confidence intervals on runtime metrics). Without these, selection effects cannot be ruled out and the headline performance gap remains difficult to interpret.
minor comments (2)
  1. [Abstract] Abstract: 'performance-aware evaluation are still needed' is grammatically incorrect; should read 'performance-aware evaluations are still needed' or 'performance-aware evaluation is still needed'.
  2. [Abstract] The anonymous repository link is provided but should be replaced with a permanent archive (e.g., Zenodo) before publication to ensure long-term accessibility of the logs and infrastructure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better support the paper's claims. We address each major point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of a 'clear gap' that is 'especially large on tasks involving parallelism and GPU operations' depends on the tasks accurately encoding realistic implementation decisions and bottlenecks. No information is supplied on task provenance (real kernels vs. synthetic), the fraction of tasks targeting GPU/parallelism, the magnitude of reference-vs-baseline speedups, or any external validation that the chosen bottlenecks are representative rather than selected for effect.

    Authors: We agree that additional details on task construction are required to substantiate the central claims. In the revised manuscript we will add a dedicated subsection on benchmark construction that describes task provenance (drawn from documented performance-critical kernels in the HPC literature), the fraction of tasks involving parallelism and GPU operations, the observed reference-versus-baseline speedups, and the selection rationale based on commonly reported systems bottlenecks.
    Revision: yes

  2. Referee: [Evaluation] Evaluation (throughout): the reported gaps lack visible statistical controls for LLM output variability (e.g., multiple samples per prompt, temperature settings, or confidence intervals on runtime metrics). Without these, selection effects cannot be ruled out and the headline performance gap remains difficult to interpret.

    Authors: We acknowledge the need for statistical controls. The revised evaluation section will report results aggregated over multiple samples per prompt, specify the temperature and sampling settings, and include standard deviations together with confidence intervals on the runtime metrics.
    Revision: yes
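The aggregation promised here could look roughly like the following sketch. It assumes one speedup value per sampled generation and uses a percentile bootstrap for the interval; the function names, the resample count, and the choice of bootstrap are illustrative assumptions, since neither the report nor the rebuttal specifies a method.

```python
import random
from statistics import mean, stdev


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def summarize(speedups):
    """Mean, standard deviation, and 95% interval over samples for one prompt."""
    return {
        "n": len(speedups),
        "mean": mean(speedups),
        "std": stdev(speedups) if len(speedups) > 1 else 0.0,
        "ci95": bootstrap_ci(speedups),
    }
```

Reporting the interval alongside the mean would let a reader judge whether a headline gap survives sampling variability, which is the concern raised in the second major comment.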

Circularity Check

0 steps flagged

No circularity: benchmark tasks and metrics are independently defined with public references.

Full rationale

The paper introduces PerfCodeBench as a new executable benchmark containing baseline implementations, reference optimized solutions, and correctness checks for each task. Performance gaps are measured directly by comparing LLM outputs against these external references on runtime efficiency, with no equations, fitted parameters, or derivations that reduce the reported gaps to self-defined quantities. No self-citations are invoked to justify task selection, uniqueness of the metric, or the emphasis on parallelism/GPU tasks. The claims rest on the provided task set and submitted public logs rather than any self-referential construction or renaming of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the benchmark tasks and reference solutions represent meaningful performance targets; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Executable correctness checks and runtime comparisons to reference solutions can reliably measure optimization quality.
    Invoked when claiming gaps between model code and expert implementations.

pith-pipeline@v0.9.0 · 5772 in / 1089 out tokens · 35531 ms · 2026-05-19T17:58:50.521493+00:00 · methodology

Appendix A: Data Sources

This appendix lists public sources used to build PerfCodeBench. These sources provide realistic systems workloads. They also provide executable benchmark designs and optimization motifs for task construction. The source pool ...