PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Hanyu Yang; Haochen Shi; Haoran Li; Huihao Jing; Shaojin Chen; Sirui Zhang; Wenbin Hu; Yangqiu Song

arxiv: 2605.15222 · v1 · pith:AARY37WBnew · submitted 2026-05-13 · 💻 cs.SE · cs.CL · cs.PL

Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

Pith reviewed 2026-05-19 17:58 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.PL

keywords: LLM code generation, performance optimization, benchmark, high-performance computing, parallelism, GPU programming, systems software, code efficiency

The pith

Current LLMs produce code that is functionally correct but far from expert-optimized on system-level performance tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PerfCodeBench to test how well large language models optimize code for high performance in systems programming. It shows that models can write working code yet leave large efficiency gaps, especially on tasks involving parallelism and GPU operations. This matters because real performance-critical systems depend on hardware-aware choices that go beyond basic correctness. The benchmark supplies executable checks, baselines, and reference solutions so both correctness and runtime can be measured directly. Results point to the need for evaluation that rewards efficient implementations rather than correct ones alone.

Core claim

PerfCodeBench consists of tasks that demand system-level implementation choices, hardware-aware optimizations, and handling of performance bottlenecks. When evaluated on a broad set of state-of-the-art LLMs, the generated code exhibits a clear gap relative to expert-optimized reference solutions, with the largest shortfalls appearing on parallelism and GPU tasks. Models also show inconsistent cross-language behavior and rarely match expert efficiency levels.

What carries the argument

PerfCodeBench, the executable benchmark that pairs each task with correctness verification, a baseline implementation, and a reference optimized solution to quantify runtime efficiency.
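
As a rough illustration of how such a three-part task could be scored, the sketch below gates on correctness first and only then compares runtimes. It is not the paper's harness: the function names (`best_time`, `evaluate`), the two reported ratios, and the toy sum-of-squares task are all assumptions made for this example.

```python
import time
from typing import Callable, Dict, Optional, Sequence


def best_time(fn: Callable[[Sequence[int]], int], data: Sequence[int], repeats: int = 5) -> float:
    """Return the fastest wall-clock time of fn(data) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse clocks.
    return max(best, 1e-9)


def evaluate(candidate, baseline, reference, data, repeats: int = 5) -> Dict[str, Optional[float]]:
    """Hypothetical scoring: correctness gate, then runtime ratios."""
    expected = reference(data)
    if candidate(data) != expected:
        # Incorrect code gets no efficiency score at all.
        return {"correct": False, "speedup_vs_baseline": None, "ratio_to_reference": None}
    t_base = best_time(baseline, data, repeats)
    t_ref = best_time(reference, data, repeats)
    t_cand = best_time(candidate, data, repeats)
    return {
        "correct": True,
        "speedup_vs_baseline": t_base / t_cand,  # > 1 means faster than the baseline
        "ratio_to_reference": t_ref / t_cand,    # >= 1 means at least as fast as the reference
    }


# Toy stand-ins for a task's baseline, reference, and a wrong submission.
def baseline_sum_squares(xs):
    total = 0
    for x in xs:
        total += x * x
    return total


def reference_sum_squares(xs):
    return sum(x * x for x in xs)


def wrong_sum_squares(xs):
    return sum(xs)
```

Calling `evaluate(reference_sum_squares, baseline_sum_squares, reference_sum_squares, list(range(2000)))` returns a dictionary with `correct` set to `True` and two positive ratios, while `wrong_sum_squares` is rejected before any timing happens. Taking the minimum over repeats is one common way to reduce timing noise; a real harness would also need to control for warm-up, CPU pinning, and GPU synchronization.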

If this is right

LLMs must improve specifically on parallelism and GPU operations to close the efficiency gap.
Code-generation evaluation should incorporate runtime performance metrics in addition to functional correctness.
Cross-language robustness remains a clear weakness that limits practical deployment.
Performance-aware benchmarks are required to steer future model development toward efficient systems software.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be used to create targeted fine-tuning datasets focused on optimization reasoning.
Integrating hardware simulation feedback into model training might narrow the observed gaps over time.
Similar task designs could be applied to other hardware platforms to test broader claims about model limitations.

Load-bearing premise

The selected tasks accurately capture the realistic system-level choices, hardware-aware optimizations, and performance bottlenecks that matter in practice.

What would settle it

An LLM that consistently produces code matching or beating the reference optimized solutions on the benchmark tasks across multiple runs would falsify the claimed performance gap.
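
One way to make this criterion operational is sketched below. The function name `matches_reference`, the 5% runtime tolerance, and the required fraction of runs are hypothetical choices for illustration; the paper does not specify them.

```python
from typing import Dict, List


def matches_reference(
    model_times: Dict[str, List[float]],
    reference_times: Dict[str, float],
    tolerance: float = 0.05,
    min_fraction: float = 1.0,
) -> bool:
    """Return True if, on every task, at least `min_fraction` of the model's
    correct runs finish within `tolerance` of the reference runtime."""
    for task, ref_time in reference_times.items():
        runs = model_times.get(task, [])
        if not runs:
            # A task with no correct run counts as a miss.
            return False
        hits = sum(1 for t in runs if t <= ref_time * (1.0 + tolerance))
        if hits / len(runs) < min_fraction:
            return False
    return True
```

With the default `min_fraction=1.0`, a single slow run on any task is enough to return `False`, which matches the "consistently" wording; lowering it relaxes the test to "most runs".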

Figures

Figures reproduced from arXiv: 2605.15222 by Hanyu Yang, Haochen Shi, Haoran Li, Huihao Jing, Shaojin Chen, Sirui Zhang, Wenbin Hu, Yangqiu Song.

**Figure 2.** Additional findings from PerfCodeBench. Some models rarely succeed, but their successful ...

**Figure 3.** Heatmap of the per-language PerfCodeBench results. Rows correspond to evaluated models, ...

Original abstract

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. The tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task includes executable correctness checks, a baseline implementation, and a reference optimized solution. This allows us to evaluate both correctness and runtime-oriented efficiency. Our evaluation on a broad set of state-of-the-art LLMs shows a clear gap between model-generated code and expert-optimized implementations. The gap is especially large on tasks involving parallelism and GPU operations. Current models also show weaknesses in cross-language robustness and in consistently reaching expert-level efficiency. These results suggest that performance-aware evaluation are still needed. LLMs should move beyond generating merely correct code toward producing efficient systems software. We submit the benchmark data, evaluation infrastructure, and complete logs of all LLMs-generated code at https://anonymous.4open.science/r/perfcodebench-7CDE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. Tasks require system-level implementation choices, hardware-aware optimizations, and handling of performance bottlenecks; each includes executable correctness checks, a baseline implementation, and a reference optimized solution. Evaluation across state-of-the-art LLMs reports a clear gap versus expert-optimized code, largest on parallelism and GPU tasks, plus weaknesses in cross-language robustness and consistent efficiency.

Significance. If the tasks prove representative of real production bottlenecks, the benchmark would usefully demonstrate that current LLMs remain limited for performance-critical systems code and motivate performance-aware evaluation beyond functional correctness. Public release of the benchmark data, evaluation infrastructure, and complete LLM-generated code logs is a clear strength that supports reproducibility and follow-on work.

major comments (2)

[Abstract] Abstract: the central claim of a 'clear gap' that is 'especially large on tasks involving parallelism and GPU operations' depends on the tasks accurately encoding realistic implementation decisions and bottlenecks. No information is supplied on task provenance (real kernels vs. synthetic), the fraction of tasks targeting GPU/parallelism, the magnitude of reference-vs-baseline speedups, or any external validation that the chosen bottlenecks are representative rather than selected for effect.
[Evaluation] Evaluation (throughout): the reported gaps lack visible statistical controls for LLM output variability (e.g., multiple samples per prompt, temperature settings, or confidence intervals on runtime metrics). Without these, selection effects cannot be ruled out and the headline performance gap remains difficult to interpret.

minor comments (2)

[Abstract] Abstract: 'performance-aware evaluation are still needed' is grammatically incorrect; should read 'performance-aware evaluations are still needed' or 'performance-aware evaluation is still needed'.
[Abstract] The anonymous repository link is provided but should be replaced with a permanent archive (e.g., Zenodo) before publication to ensure long-term accessibility of the logs and infrastructure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better support the paper's claims. We address each major point below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract: the central claim of a 'clear gap' that is 'especially large on tasks involving parallelism and GPU operations' depends on the tasks accurately encoding realistic implementation decisions and bottlenecks. No information is supplied on task provenance (real kernels vs. synthetic), the fraction of tasks targeting GPU/parallelism, the magnitude of reference-vs-baseline speedups, or any external validation that the chosen bottlenecks are representative rather than selected for effect.

Authors: We agree that additional details on task construction are required to substantiate the central claims. In the revised manuscript we will add a dedicated subsection on benchmark construction that describes task provenance (drawn from documented performance-critical kernels in the HPC literature), the fraction of tasks involving parallelism and GPU operations, the observed reference-versus-baseline speedups, and the selection rationale based on commonly reported systems bottlenecks. revision: yes
Referee: [Evaluation] Evaluation (throughout): the reported gaps lack visible statistical controls for LLM output variability (e.g., multiple samples per prompt, temperature settings, or confidence intervals on runtime metrics). Without these, selection effects cannot be ruled out and the headline performance gap remains difficult to interpret.

Authors: We acknowledge the need for statistical controls. The revised evaluation section will report results aggregated over multiple samples per prompt, specify the temperature and sampling settings, and include standard deviations together with confidence intervals on the runtime metrics. revision: yes
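
The kind of aggregation promised here can be sketched as follows. This is a minimal example under stated assumptions, not the authors' procedure: the function name `summarize`, the percentile bootstrap, the 2000 resamples, and the 95% level are all illustrative choices.

```python
import random
import statistics
from typing import List, Tuple


def summarize(
    speedups: List[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean, sample std, percentile-bootstrap CI of the mean)
    for the speedups measured across several samples of one prompt."""
    rng = random.Random(seed)  # Fixed seed keeps the interval reproducible.
    mean = statistics.fmean(speedups)
    std = statistics.stdev(speedups) if len(speedups) > 1 else 0.0
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(speedups, k=len(speedups)))
        for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return mean, std, (lo, hi)
```

For example, `summarize([1.0, 1.2, 0.9, 1.1, 1.3])` gives a mean of 1.1 and a sample standard deviation of about 0.158, together with an interval around the mean. A bootstrap is used here because runtime ratios are often skewed, so a normal-theory interval may be a poor fit with few samples.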

Circularity Check

0 steps flagged

No circularity: benchmark tasks and metrics are independently defined with public references.

The paper introduces PerfCodeBench as a new executable benchmark containing baseline implementations, reference optimized solutions, and correctness checks for each task. Performance gaps are measured directly by comparing LLM outputs against these external references on runtime efficiency, with no equations, fitted parameters, or derivations that reduce the reported gaps to self-defined quantities. No self-citations are invoked to justify task selection, uniqueness of the metric, or the emphasis on parallelism/GPU tasks. The claims rest on the provided task set and submitted public logs rather than any self-referential construction or renaming of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the benchmark tasks and reference solutions represent meaningful performance targets; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Executable correctness checks and runtime comparisons to reference solutions can reliably measure optimization quality.
Invoked when claiming gaps between model code and expert implementations.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 4 internal anchors

[1] Anthropic. Introducing Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5, 2025
[2] Anthropic. Claude Code Overview. https://code.claude.com/docs/en/overview, 2026
[3] Anthropic. Claude model overview. https://platform.claude.com/docs/en/about-claude/models/overview, 2026
[4] Islem Bouzenia and Michael Pradel. Understanding software engineering agents: A study of thought-action-result trajectories. In ASE, pages 2846–2857. IEEE, 2025
[5] ByteDance Seed. Seed2.0. https://seed.bytedance.com/en/seed2, 2026
[6] Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, and Fengzong Lian. Autocodebench: Large language models are automatic code benchmark generators. CoRR, abs/2508.09101, 2025
[7] DeepSeek. DeepSeek V4 Preview Release. https://api-docs.deepseek.com/news/news260424, 2026
[8] Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, and Ge Li. A survey on code generation with LLM-based agents. CoRR, abs/2508.00083, 2025
[9] Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu, Yuhao Qing, Dong Huang, Terry Yue Zhuo, Qian Liu, and See-Kiong Ng. CodeArena: A collective evaluation platform for LLM code generation. In ACL (3), pages 502–512. Association for Computational Linguistics, 2025
[10] Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, and Martin Vechev. Evaluating agents.md: Are repository-level context files helpful for coding agents?, 2026. URL https://arxiv.org/abs/2602.11988
[11] Google. Gemini 3.1 Pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026
[12] Google. Gemma 4: Byte for byte, the most capable open models. https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/, 2026
[13] Google AI for Developers. Gemini API models. https://ai.google.dev/gemini-api/docs/models, 2026
[14] Google AI for Developers. Gemma 4 model overview. https://ai.google.dev/gemma/docs/core, 2026
[15] Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie Zhang. EffiBench: Benchmarking the efficiency of automatically generated code. In NeurIPS, 2024
[16] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In ICLR. OpenReview.net, 2025
[17] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In ICLR. OpenReview.net, 2024
[18] Geoff Langdale and Daniel Lemire. Parsing gigabytes of JSON per second. arXiv preprint arXiv:1902.08318, 2019. URL https://arxiv.org/abs/1902.08318
[19] Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, and Yihong Dong. RealBench: A repo-level code generation benchmark aligned with real-world software development practices, 2026. URL https://arxiv.org/abs/2604.22659
[20] Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, and Scarlett Li. FEA-Bench: A benchmark for evaluating repository-level code generation for feature implementation. In ACL (1), pages 17160–17176. Association for Computational Linguistics, 2025
[21] Kaiyuan Liu, Youcheng Pan, Yang Xiang, Daojing He, Jing Li, Yexing Du, and Tianrun Gao. ProjectEval: A benchmark for programming agents automated evaluation on project-level code generation. In ACL (Findings), Findings of ACL, pages 20205–20221. Association for Computational Linguistics, 2025
[22] LZ4 Contributors. LZ4: Extremely fast compression algorithm. https://github.com/lz4/lz4, 2026
[23] Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025
[24] Meta Llama. Llama 4 models. https://www.llama.com/models/llama-4/, 2026
[25] Moonshot AI. Kimi K2.6 quickstart. https://platform.kimi.ai/docs/guide/kimi-k2-6-quickstart, 2026
[26] Moonshot AI. Kimi AI with K2.6: Better coding, smarter agents. https://www.kimi.com/en, 2026
[27] NVIDIA. CUB: Reusable software components for the CUDA programming model. https://github.com/NVIDIA/cub, 2026
[28] NVIDIA. CUDA Samples. https://github.com/NVIDIA/cuda-samples, 2026
[29] NVIDIA. Thrust: The C++ parallel algorithms library. https://github.com/NVIDIA/thrust, 2026
[30] OpenAI. Codex: AI Coding Partner from OpenAI. https://openai.com/codex/, 2026
[31] OpenAI. Introducing GPT-5.5. https://openai.com/index/introducing-gpt-5-5/, 2026
[32] OpenClaw. OpenClaw: Personal AI Assistant. https://openclaw.ai/, 2026
[33] OpenRouter. GPT-5 - api pricing and providers. https://openrouter.ai/openai/gpt-5, 2025
[34] OpenRouter. GPT-5.4 - api pricing and providers. https://openrouter.ai/openai/gpt-5.4, 2026
[35] OpenRouter. Seed-2.0-Mini - api pricing and providers. https://openrouter.ai/bytedance-seed/seed-2.0-mini, 2026
[36] Yun Peng, Jun Wan, Yichen Li, and Xiaoxue Ren. COFFE: A code efficiency benchmark for code generation. Proc. ACM Softw. Eng., 2(FSE):242–265, 2025
[37] Yuhao Qing, Boyu Zhu, Mingzhe Du, Zhijiang Guo, Terry Yue Zhuo, Qianru Zhang, Jie M. Zhang, Heming Cui, Siu-Ming Yiu, Dong Huang, See-Kiong Ng, and Luu Anh Tuan. EffiBench-X: A multi-language benchmark for measuring efficiency of LLM-generated code. CoRR, abs/2505.13004, 2025
[38] Qwen Team. Qwen3.6 model family. https://qwen.ai/, 2026
[39] Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all. https://qwen.ai/blog?id=qwen3.6-35b-a3b, 2026
[40] simdutf Contributors. simdutf: Unicode validation and transcoding at billions of characters per second. https://github.com/simdutf/simdutf, 2026
[41] Yongjian Tang and Thomas A. Runkler. LLM-based agentic systems for software engineering: Challenges and opportunities. CoRR, abs/2601.09822, 2026
[42] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, and et al. Openhands: An open platform for AI software developers as generalist agents. In IC...
[43] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents. CoRR, abs/2407.01489, 2024
[44] Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, and Lingming Zhang. Live-SWE-agent: Can software engineering agents self-evolve on the fly? CoRR, abs/2511.13646, 2025
[45] Kai Xu, YiWei Mao, XinYi Guan, and ZiLong Feng. Web-bench: A LLM code benchmark based on web standards and frameworks. CoRR, abs/2505.07473, 2025
[46] xxHash Contributors. xxHash: Extremely fast non-cryptographic hash algorithm. https://github.com/Cyan4973/xxHash, 2026
[47] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In NeurIPS, 2024
[48] John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. SWE-smith: Scaling data for software engineering agents. CoRR, abs/2504.21798, 2025
[49] yyjson Contributors. yyjson: A high performance JSON library written in ANSI C. https://github.com/ibireme/yyjson, 2026
[50] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement. In ISSTA, pages 1592–1604. ACM, 2024
[51] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, and et al. BigCodeBench: Benchmarking code generation with diverse function...
[52] Zstandard Contributors. Zstandard: Fast real-time compression algorithm. https://github.com/facebook/zstd, 2026

A Data Sources

This appendix lists public sources used to build PerfCodeBench. These sources provide realistic systems workloads. They also provide executable benchmark designs and optimization motifs for task construction. The source pool ...

[1] [1]

Introducing Claude Opus 4.5

Anthropic. Introducing Claude Opus 4.5. https://www.anthropic.com/news/ claude-opus-4-5, 2025

work page 2025

[2] [2]

Claude Code Overview

Anthropic. Claude Code Overview. https://code.claude.com/docs/en/overview, 2026

work page 2026

[3] [3]

Claude model overview

Anthropic. Claude model overview. https://platform.claude.com/docs/en/ about-claude/models/overview, 2026

work page 2026

[4] [4]

Understanding software engineering agents: A study of thought-action-result trajectories

Islem Bouzenia and Michael Pradel. Understanding software engineering agents: A study of thought-action-result trajectories. InASE, pages 2846–2857. IEEE, 2025

work page 2025

[5] [5]

Seed2.0.https://seed.bytedance.com/en/seed2, 2026

ByteDance Seed. Seed2.0.https://seed.bytedance.com/en/seed2, 2026

work page 2026

[6] [6]

Autocodebench: Large language models are automatic code benchmark generators.CoRR, abs/2508.09101, 2025

Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, and Fengzong Lian. Autocodebench: Large language models are automatic code benchmark generators.CoRR, abs/2508.09101, 2025

work page arXiv 2025

[7] [7]

DeepSeek V4 Preview Release

DeepSeek. DeepSeek V4 Preview Release. https://api-docs.deepseek.com/news/ news260424, 2026

work page 2026

[8] [8]

A Survey on Code Generation with LLM-based Agents

Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, and Ge Li. A survey on code generation with llm-based agents.CoRR, abs/2508.00083, 2025

work page internal anchor Pith review arXiv 2025

[9] [9]

CodeArena: A collective evaluation platform for LLM code generation

Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu, Yuhao Qing, Dong Huang, Terry Yue Zhuo, Qian Liu, and See-Kiong Ng. CodeArena: A collective evaluation platform for LLM code generation. InACL (3), pages 502–512. Association for Computational Linguistics, 2025

work page 2025

[10] [10]

Eval- uating agents.md: Are repository-level context files helpful for coding agents?, 2026

Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, and Martin Vechev. Eval- uating agents.md: Are repository-level context files helpful for coding agents?, 2026. URL https://arxiv.org/abs/2602.11988

work page arXiv 2026

[11] [11]

Gemini 3.1 Pro: A smarter model for your most complex tasks

Google. Gemini 3.1 Pro: A smarter model for your most complex tasks. https: //blog.google/innovation-and-ai/models-and-research/gemini-models/ gemini-3-1-pro/, 2026

work page 2026

[12] [12]

Gemma 4: Byte for byte, the most capable open models

Google. Gemma 4: Byte for byte, the most capable open models. https://blog.google/ innovation-and-ai/technology/developers-tools/gemma-4/, 2026

work page 2026

[13] [13]

Gemini API models

Google AI for Developers. Gemini API models. https://ai.google.dev/gemini-api/ docs/models, 2026

work page 2026

[14] [14]

Gemma 4 model overview

Google AI for Developers. Gemma 4 model overview. https://ai.google.dev/gemma/ docs/core, 2026

work page 2026

[15] [15]

EffiBench: Benchmarking the efficiency of automatically generated code

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie Zhang. EffiBench: Benchmarking the efficiency of automatically generated code. InNeurIPS, 2024

work page 2024

[16] [16]

LiveCodeBench: Holistic and con- tamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and con- tamination free evaluation of large language models for code. InICLR. OpenReview.net, 2025

work page 2025

[17] [17]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world github issues? InICLR. OpenReview.net, 2024

work page 2024

[18] [18]

Parsing gigabytes of JSON per second.arXiv preprint arXiv:1902.08318, 2019

Geoff Langdale and Daniel Lemire. Parsing gigabytes of JSON per second.arXiv preprint arXiv:1902.08318, 2019. URLhttps://arxiv.org/abs/1902.08318. 11

work page arXiv 1902

[19] [19]

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, and Yihong Dong. Realbench: A repo-level code generation benchmark aligned with real-world software development practices, 2026. URL https://arxiv.org/abs/2604.22659

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

FEA-Bench: A benchmark for evaluating repository-level code generation for feature implementation

Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, and Scarlett Li. FEA-Bench: A benchmark for evaluating repository-level code generation for feature implementation. InACL (1), pages 17160–17176. Association for Computational Linguistics, 2025

work page 2025

[21] [21]

ProjectEval: A benchmark for programming agents automated evaluation on project-level code generation

Kaiyuan Liu, Youcheng Pan, Yang Xiang, Daojing He, Jing Li, Yexing Du, and Tianrun Gao. ProjectEval: A benchmark for programming agents automated evaluation on project-level code generation. InACL (Findings), Findings of ACL, pages 20205–20221. Association for Computational Linguistics, 2025

work page 2025

[22] [22]

LZ4: Extremely fast compression algorithm

LZ4 Contributors. LZ4: Extremely fast compression algorithm. https://github.com/lz4/ lz4, 2026

work page 2026

[23] [23]

The Llama 4 herd: The beginning of a new era of natively multimodal AI

Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI. https: //ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025

work page 2025

[24] [24]

Llama 4 models.https://www.llama.com/models/llama-4/, 2026

Meta Llama. Llama 4 models.https://www.llama.com/models/llama-4/, 2026

work page 2026

[25] [25]

Kimi K2.6 quickstart

Moonshot AI. Kimi K2.6 quickstart. https://platform.kimi.ai/docs/guide/ kimi-k2-6-quickstart, 2026

work page 2026

[26] [26]

Kimi AI with K2.6: Better coding, smarter agents.https://www.kimi.com/en, 2026

Moonshot AI. Kimi AI with K2.6: Better coding, smarter agents.https://www.kimi.com/en, 2026

work page 2026

[27] [27]

CUB: Reusable software components for the CUDA programming model

NVIDIA. CUB: Reusable software components for the CUDA programming model. https: //github.com/NVIDIA/cub, 2026

work page 2026

[28] [28]

CUDA Samples.https://github.com/NVIDIA/cuda-samples, 2026

NVIDIA. CUDA Samples.https://github.com/NVIDIA/cuda-samples, 2026

work page 2026

[29] [29]

Thrust: The C++ parallel algorithms library

NVIDIA. Thrust: The C++ parallel algorithms library. https://github.com/NVIDIA/ thrust, 2026

work page 2026

[30] [30]

Codex: AI Coding Partner from OpenAI.https://openai.com/codex/, 2026

OpenAI. Codex: AI Coding Partner from OpenAI.https://openai.com/codex/, 2026

work page 2026

[31] [31]

Introducing GPT-5.5

OpenAI. Introducing GPT-5.5. https://openai.com/index/introducing-gpt-5-5/ , 2026

work page 2026

[32] OpenClaw. OpenClaw: Personal AI Assistant. https://openclaw.ai/, 2026.

[33] OpenRouter. GPT-5 - API pricing and providers. https://openrouter.ai/openai/gpt-5, 2025.

[34] OpenRouter. GPT-5.4 - API pricing and providers. https://openrouter.ai/openai/gpt-5.4, 2026.

[35] OpenRouter. Seed-2.0-Mini - API pricing and providers. https://openrouter.ai/bytedance-seed/seed-2.0-mini, 2026.

[36] Yun Peng, Jun Wan, Yichen Li, and Xiaoxue Ren. COFFE: A code efficiency benchmark for code generation. Proc. ACM Softw. Eng., 2(FSE):242–265, 2025.

[37] Yuhao Qing, Boyu Zhu, Mingzhe Du, Zhijiang Guo, Terry Yue Zhuo, Qianru Zhang, Jie M. Zhang, Heming Cui, Siu-Ming Yiu, Dong Huang, See-Kiong Ng, and Luu Anh Tuan. EffiBench-X: A multi-language benchmark for measuring efficiency of LLM-generated code. CoRR, abs/2505.13004, 2025.

[38] Qwen Team. Qwen3.6 model family. https://qwen.ai/, 2026.

[39] Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all. https://qwen.ai/blog?id=qwen3.6-35b-a3b, 2026.

[40] simdutf Contributors. simdutf: Unicode validation and transcoding at billions of characters per second. https://github.com/simdutf/simdutf, 2026.

[41] Yongjian Tang and Thomas A. Runkler. LLM-based agentic systems for software engineering: Challenges and opportunities. CoRR, abs/2601.09822, 2026.

[42] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, et al. OpenHands: An open platform for AI software developers as generalist agents. In IC...

[43] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents. CoRR, abs/2407.01489, 2024.

[44] Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, and Lingming Zhang. Live-SWE-agent: Can software engineering agents self-evolve on the fly? CoRR, abs/2511.13646, 2025.

[45] Kai Xu, YiWei Mao, XinYi Guan, and ZiLong Feng. Web-bench: A LLM code benchmark based on web standards and frameworks. CoRR, abs/2505.07473, 2025.

[46] xxHash Contributors. xxHash: Extremely fast non-cryptographic hash algorithm. https://github.com/Cyan4973/xxHash, 2026.

[47] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In NeurIPS, 2024.

[48] John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. SWE-smith: Scaling data for software engineering agents. CoRR, abs/2504.21798, 2025.

[49] yyjson Contributors. yyjson: A high performance JSON library written in ANSI C. https://github.com/ibireme/yyjson, 2026.

[50] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement. In ISSTA, pages 1592–1604. ACM, 2024.

[51] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, et al. BigCodeBench: Benchmarking code generation with diverse function...

[52] Zstandard Contributors. Zstandard: Fast real-time compression algorithm. https://github.com/facebook/zstd, 2026.

A Data Sources

This appendix lists public sources used to build PerfCodeBench. These sources provide realistic systems workloads. They also provide executable benchmark designs and optimization motifs for task construction. The source pool ...