
arxiv: 2606.04023 · v1 · pith:73DI4P3Nnew · submitted 2026-06-01 · 💻 cs.SE · cs.AI

CodegenBench: Can LLMs Write Efficient Code Across Architectures?

Pith reviewed 2026-06-28 13:36 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM code generation · high-performance computing · cross-architecture · BLAS benchmarks · parallel code · Sunway · Kunpeng

The pith

Large language models generate optimized parallel code for x86_64 but show sharp performance drops on Sunway and Kunpeng.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CodegenBench to test LLM ability to produce efficient code for high-performance computing across three architectures. It evaluates 106 standard BLAS routines on x86_64, Sunway, and Kunpeng systems, plus 20 specialized kernels adapted for each of the two supercomputing architectures (LeetSunway and LeetKunpeng). Results indicate that models succeed on widely documented x86_64 hardware yet degrade on the other two platforms, which have less public training material. The study also finds that LLMs perform best on moderate-difficulty tasks that need only short code.

Core claim

State-of-the-art LLMs can generate optimized code for ubiquitous architectures like x86_64, yet they exhibit significant performance degradation on domain-specific architectures with limited public documentation and training data. Analysis of implementation length and task complexity shows current LLMs work best on moderately difficult problems that need concise snippets.

What carries the argument

CodegenBench, a benchmark suite of 106 BLAS routines and 20 architecture-specific kernels that measures runtime performance of LLM-generated parallel code on x86_64, Sunway, and Kunpeng.

If this is right

  • LLMs remain most reliable when asked for concise implementations of moderate complexity.
  • Code quality declines as target architecture documentation becomes scarcer.
  • Open-sourced dataset and evaluation tools can support further work on LLM-driven HPC code generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures with sparse public data may need targeted retrieval or fine-tuning before LLMs can match x86_64 results.
  • The same benchmark approach could reveal similar limits when applied to other emerging or proprietary hardware.
  • Widespread use of LLMs for cross-platform optimization may be delayed until data scarcity is addressed.

Load-bearing premise

Performance gaps across architectures arise mainly from differences in the amount of public documentation and training data rather than from how the benchmark tasks were chosen or how speed was measured.

What would settle it

An experiment that applies identical BLAS and kernel tasks to all three architectures and finds comparable LLM code efficiency, or that shows the chosen tasks differ systematically in inherent difficulty.
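
One way to make this check concrete is to restrict the comparison to tasks that were run on every architecture, so that task selection cannot explain a gap. The sketch below is illustrative only: the record layout, the function name, and the toy records are hypothetical and are not taken from the paper or its evaluation tools.

```python
from collections import defaultdict


def pass_rate_on_shared_tasks(records):
    """Compare pass rates using only tasks present on every architecture.

    records: iterable of (architecture, task, passed) tuples (hypothetical layout).
    Returns (set of shared task names, {architecture: pass rate on those tasks}).
    The rate is None for every architecture when no task is shared.
    """
    by_arch = defaultdict(dict)
    for arch, task, passed in records:
        by_arch[arch][task] = bool(passed)
    if not by_arch:
        return set(), {}
    # Keep only the tasks that appear under every architecture.
    shared = set.intersection(*(set(tasks) for tasks in by_arch.values()))
    if not shared:
        return shared, {arch: None for arch in by_arch}
    rates = {
        arch: sum(tasks[t] for t in shared) / len(shared)
        for arch, tasks in by_arch.items()
    }
    return shared, rates


# Toy records invented for illustration; they are not results from the paper.
toy_records = [
    ("arch_A", "routine_1", True),
    ("arch_A", "routine_2", True),
    ("arch_B", "routine_1", True),
    ("arch_B", "routine_2", False),
    ("arch_B", "kernel_3", True),  # not run on arch_A, so it is excluded
]
shared, rates = pass_rate_on_shared_tasks(toy_records)
```

If a gap survives on the shared subset, task choice is a less likely explanation; it still says nothing about why the gap exists.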

Figures

Figures reproduced from arXiv: 2606.04023 by Bowen Wu, Haohuan Fu, Jie Li, Juepeng Zheng, Junqi Hu, Qinrui Zheng, Wenzhao Wu, Yutong Lu.

Figure 1: Comparison of our framework with existing paradigms, and its comprehensive performance
Figure 2: Overall statistic of CodegenBench.
Figure 3: Overall pipeline for CodegenBench. The pipeline automates evaluation.
Figure 4: Comparison of Pass@1, Pass@5, and Fast1@1 metrics across varying levels of BLAS routines. Generally, the complexity of BLAS routines increases as the Level increases.
Figure 5: Average tokens consumed to generate correct code. Claude series use fewer tokens to …
Figure 6: Code Length vs. Pass@1, in four scenarios. Pass@1 decreases as code length increases. We also notice that there is a gap between open-source models and closed-source models in BLAS/Kunpeng configuration.
Figure 7: Architecture for SW26010.
Figure 8: Architecture for Kunpeng.
Figure 9: Performance comparison distinguishing real and complex number arithmetic on …
Figure 10: Performance comparison distinguishing real and complex number arithmetic on …
Figure 11: Performance comparison distinguishing real and complex number arithmetic across …
Figure 12: ssyr2 case in BLAS/x86, both Opus and DeepSeek chose appropriate intrinsics and surpassed OpenBLAS.
Figure 13: dgemv case in BLAS/x86, both Opus and DeepSeek chose appropriate intrinsics and surpassed OpenBLAS.
Figure 14: attention case in LeetKunpeng, Opus chose appropriate intrinsics and surpassed the reference implementation.
Figure 15: zsyr2k case in BLAS/Kunpeng, Opus and Qwen provided correct results, but the execution efficiency was much worse than kblas under BLAS/Kunpeng setting.
Figure 16: zsymm case in BLAS/Kunpeng, Sonnet provided correct results, but the execution efficiency was worse than kblas under BLAS/Kunpeng setting.
Figure 17: csymm case in BLAS/Kunpeng, Opus provided correct results, but the execution efficiency was worse than kblas under BLAS/Kunpeng setting.
Figure 18: Compilation failure in the zgemm case. DeepSeek V4 Pro erroneously invoked svcmul_f64_z, an undocumented and non-existent intrinsic, resulting in an immediate compilation abort.
Figure 19: Compilation failure in the ssyrk case. DeepSeek V3.2 erroneously invoked _mm256_reduce_add_ps, an undocumented and non-existent intrinsic, resulting in an immediate compilation abort.
Figure 20: Compilation failure in the ssymv case. DeepSeek V4 Flash erroneously invoked _mm256_reduce_add_ps, an undocumented and non-existent intrinsic, resulting in an immediate compilation abort.
Figure 21: Compilation failure in the dsbmv case. Qwen 3.5 Plus erroneously invoked _mm256_reduce_add_pd, an undocumented and non-existent intrinsic, resulting in an immediate compilation abort.
Figure 22: Compilation failure in the dsymv case. Claude Opus 4.6 erroneously invoked svst1_scatter_index_f64, an undocumented and non-existent intrinsic, resulting in an immediate compilation abort.
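
The Pass@1 and Pass@5 metrics named in the Figure 4 caption are conventionally computed with the unbiased pass@k estimator from Chen et al. (2021, reference [17]): with n samples per task of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that standard estimator. Whether CodegenBench uses exactly this form is an assumption, since this page does not say; Fast1@1 is not defined in the text shown here and is deliberately not reconstructed. The function name is chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that pass the correctness check
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # Probability that a random size-k draw contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all tasks; with n = k = 1 it reduces to the plain fraction of tasks solved on the first attempt.
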
Original abstract

While large language models (LLMs) have been extensively evaluated on code generation tasks for general-purpose programming and GPU-accelerated environments (e.g., PyTorch, CUDA), their capabilities in CPU-oriented high-performance computing (HPC) across diverse architectures remain underexplored. To bridge this gap, we introduce CodegenBench, a comprehensive benchmark suite designed to evaluate the generation of efficient parallel code across three distinct hardware platforms: x86_64, Sunway, and Kunpeng. Our benchmark comprises 106 standard Basic Linear Algebra Subprograms (BLAS) routines establishing a fundamental baseline, alongside 20 specialized computational kernels adapted for each of the unique supercomputing architectures (LeetSunway and LeetKunpeng). Our extensive evaluation reveals that while state-of-the-art LLMs can generate optimized code for ubiquitous architectures like x86_64, they exhibit significant performance degradation on domain-specific architectures with limited public documentation and training data, highlighting critical limitations in cross-platform generalization. Furthermore, our analysis of factors influencing code quality such as implementation length and task complexity indicates that current LLMs are most effective for moderately difficult problems requiring concise code snippets. We open-source our dataset and automated evaluation infrastructure to facilitate future research in LLM-driven high-performance code generation. The resources are available at https://anonymous.4open.science/r/CodegenBench-EDE1/ and https://anonymous.4open.science/r/CodegenBenchDataset-2551.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces CodegenBench, a benchmark suite consisting of 106 standard BLAS routines (identical across platforms) plus 20 specialized kernels adapted for each of the supercomputing architectures (LeetSunway and LeetKunpeng), to evaluate LLMs' ability to generate efficient parallel code on x86_64, Sunway, and Kunpeng. The central claim is that state-of-the-art LLMs produce optimized code for ubiquitous architectures like x86_64 but exhibit significant performance degradation on domain-specific architectures with limited public documentation and training data; the authors also analyze factors such as implementation length and task complexity, and open-source the dataset and evaluation infrastructure.

Significance. If the central claim holds after addressing confounds, the work would usefully extend LLM code-generation evaluation into CPU-oriented HPC across heterogeneous supercomputing architectures, an area noted as underexplored. The open-sourcing of the dataset and automated evaluation infrastructure is a clear strength that supports reproducibility and follow-on research.

major comments (1)
  1. [benchmark design and evaluation sections describing the 20 specialized kernels (LeetSunway and LeetKunpeng)] The attribution of performance degradation primarily to limited public documentation and training data (abstract and benchmark description) is not supported without controls for task adaptation. The 20 specialized kernels are explicitly 'adapted for each' architecture, yet no evidence is provided that the adaptations preserve equivalent algorithmic complexity, optimization targets, or measurement definitions (e.g., absolute runtime vs. relative speedup, correctness thresholds) across platforms. This leaves open the possibility that observed gaps arise from platform-specific task difficulty rather than data volume.
minor comments (1)
  1. [abstract] The abstract states the main finding but provides no quantitative results, error bars, or evaluation protocol details (e.g., how efficiency is scored or statistical significance of degradation). While abstracts often summarize, the full paper should include these in the results section for the claim to be assessable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting a potential confound in our benchmark design. The concern regarding controls for task adaptation in the specialized kernels is well-taken and points to an area where additional clarification will strengthen the manuscript.

Point-by-point responses
  1. Referee: [benchmark design and evaluation sections describing the 20 specialized kernels (LeetSunway and LeetKunpeng)] The attribution of performance degradation primarily to limited public documentation and training data (abstract and benchmark description) is not supported without controls for task adaptation. The 20 specialized kernels are explicitly 'adapted for each' architecture, yet no evidence is provided that the adaptations preserve equivalent algorithmic complexity, optimization targets, or measurement definitions (e.g., absolute runtime vs. relative speedup, correctness thresholds) across platforms. This leaves open the possibility that observed gaps arise from platform-specific task difficulty rather than data volume.

    Authors: We agree this is a valid point that requires strengthening. The 106 BLAS routines are identical across platforms and serve as the primary controlled baseline for cross-architecture comparison. The 20 specialized kernels (LeetSunway and LeetKunpeng) were adapted to exercise architecture-unique features while targeting comparable problem sizes and computational patterns. However, the current manuscript does not include explicit quantitative controls (e.g., operation counts, memory access patterns, or parallelism metrics) demonstrating equivalence. In the revised manuscript we will add: (1) a table of per-kernel complexity metrics across the three architectures; (2) explicit statement that all evaluations use identical correctness thresholds and report normalized speedup relative to architecture-specific naive baselines; and (3) clarification in the abstract and benchmark description that the main performance-degradation claim rests on the identical BLAS subset. These additions will better isolate the effect of training-data availability. revision: yes
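
The reporting convention proposed in the rebuttal (identical correctness thresholds on every platform, speedup normalized to an architecture-specific naive baseline) can be sketched as follows. This is a minimal illustration, not the authors' evaluation harness: the function names, the tolerance values, and the use of a geometric mean for aggregation are all assumptions introduced here.

```python
import math
from statistics import geometric_mean

# Hypothetical tolerances; the source does not state the thresholds it uses.
REL_TOL = 1e-6
ABS_TOL = 1e-9


def is_correct(reference, candidate, rel_tol=REL_TOL, abs_tol=ABS_TOL):
    """Return True if every candidate value matches the reference within tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def normalized_speedup(baseline_seconds, candidate_seconds):
    """Speedup of generated code over the naive baseline timed on the same machine."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


def summarize(runs):
    """Aggregate (reference, candidate, baseline_seconds, candidate_seconds) tuples.

    Returns (fraction of correct runs, geometric-mean speedup over correct runs).
    The speedup is None when no run is correct.
    """
    speedups = [
        normalized_speedup(base, cand)
        for ref, out, base, cand in runs
        if is_correct(ref, out)
    ]
    fraction = len(speedups) / len(runs) if runs else 0.0
    return fraction, (geometric_mean(speedups) if speedups else None)
```

Because each speedup is a ratio against a baseline measured on the same hardware, values from different architectures become comparable without comparing absolute runtimes. The geometric mean is used here so that a 2x gain and a 2x loss cancel out.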

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential reductions

full rationale

The paper presents an empirical benchmark study (CodegenBench) consisting of 106 BLAS routines and 20 architecture-adapted kernels, followed by direct LLM evaluations on x86_64, Sunway, and Kunpeng. No equations, fitted parameters, or mathematical derivations are present that could reduce any result to prior inputs by construction. The central claim rests on new test data and measurements; the abstract and provided text contain no self-citations invoked as load-bearing uniqueness theorems or ansatzes. This is a standard non-circular empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the definition of the new benchmark tasks and the assumption that performance measurements reflect genuine cross-architecture generalization limits; no free parameters, invented entities, or non-standard axioms are visible in the abstract.

axioms (1)
  • domain assumption: Standard assumptions about what constitutes efficient parallel code in HPC benchmarking
    The evaluation implicitly relies on conventional metrics for code performance on the three architectures.

pith-pipeline@v0.9.1-grok · 5811 in / 1204 out tokens · 30460 ms · 2026-06-28T13:36:58.855843+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 20 canonical work pages · 14 internal anchors

  1. [1]

    Evaluating the performance of kunpeng 920 processors on modern hpc applications

    Ilya Afanasyev and Dmitry Lichmanov. Evaluating the performance of kunpeng 920 processors on modern hpc applications. In International Conference on Parallel Computing Technologies, pages 301–321. Springer, 2021

  2. [2]

    Qwen3.5, March 2026

    Alibaba. Qwen3.5, March 2026. URL https://qwen.ai/blog?id=qwen3.5. Accessed: 2026-05-06

  3. [3]

    Qwen3.6, March 2026

    Alibaba. Qwen3.6, March 2026. URL https://qwen.ai/blog?id=qwen3.6. Accessed: 2026-05-06

  4. [4]

    Qwen3.6-35b-a3b, March 2026

    Alibaba. Qwen3.6-35b-a3b, March 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b. Accessed: 2026-05-06

  5. [5]

    SantaCoder: don’t reach for the stars!

    Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988, 2023

  6. [6]

    Introducing claude opus 4.6, February 2026

    Anthropic. Introducing claude opus 4.6, February 2026. URL https://www.anthropic.com/news/claude-opus-4-6. Accessed: 2026-05-06

  7. [7]

    Introducing claude opus 4.7, March 2026

    Anthropic. Introducing claude opus 4.7, March 2026. URL https://www.anthropic.com/news/claude-opus-4-7. Accessed: 2026-05-06

  8. [8]

    Introducing claude sonnet 4.6, February 2026

    Anthropic. Introducing claude sonnet 4.6, February 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6. Accessed: 2026-05-06

  9. [9]

    Multi-lingual evaluation of code generation models

    Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. Multi-lingual evaluation of code generation models. arXiv preprint arXiv:2210.14868, 2022

  10. [10]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  11. [11]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  12. [12]

    Hosna: A dpc++ benchmark suite for heterogeneous architectures

    Najmeh Nazari Bavarsad, Hosein Mohammadi Makrani, Hossein Sayadi, Lawrence Landis, Setareh Rafatirad, and Houman Homayoun. Hosna: A dpc++ benchmark suite for heterogeneous architectures. In 2021 IEEE 39th International Conference on Computer Design (ICCD), pages 509–516. IEEE, 2021

  13. [13]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  14. [14]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

  15. [15]

    Data race detection using large language models

    Le Chen, Xianzhong Ding, Murali Emani, Tristan Vanderbruggen, Pei-Hung Lin, and Chunhua Liao. Data race detection using large language models. In Proceedings of the SC’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pages 215–223, 2023

  16. [16]

    Pcebench: A multi-dimensional benchmark for evaluating large language models in parallel code generation

    Le Chen, Nesreen Ahmed, Mihai Capotă, Ted Willke, Niranjan Hasabnis, and Ali Jannesari. Pcebench: A multi-dimensional benchmark for evaluating large language models in parallel code generation. In 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 546–557. IEEE, 2025

  17. [17]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  18. [18]

    Deepseek-v4-flash, April 2026

    DeepSeek AI. Deepseek-v4-flash, April 2026. URL https://api-docs.deepseek.com/news/news260424#deepseek-v4-flash. Accessed: 2026-05-06

  19. [19]

    Deepseek-v4-pro, April 2026

    DeepSeek AI. Deepseek-v4-pro, April 2026. URL https://api-docs.deepseek.com/news/news260424#deepseek-v4-pro. Accessed: 2026-05-06

  20. [20]

    An overview of the sparse basic linear algebra subprograms: The new standard from the blas technical forum

    Iain S Duff, Michael A Heroux, and Roldan Pozo. An overview of the sparse basic linear algebra subprograms: The new standard from the blas technical forum. ACM Transactions on Mathematical Software (TOMS), 28(2):239–267, 2002

  21. [21]

    The sunway taihulight supercomputer: system and applications

    Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, et al. The sunway taihulight supercomputer: system and applications. Science China Information Sciences, 59(7):072001, 2016

  22. [22]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yifan Wu, YK Li, et al. Deepseek-coder: when the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  23. [23]

    Effibench: Benchmarking the efficiency of automatically generated code

    Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie M Zhang. Effibench: Benchmarking the efficiency of automatically generated code. Advances in Neural Information Processing Systems, 37:11506–11544, 2024

  24. [24]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  25. [25]

    Ds-1000: A natural and reliable benchmark for data science code generation

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023

  26. [26]

    Tritonbench: Benchmarking large language model capabilities for generating triton operators

    Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, WangHaojie WangHaojie, Jianrong Wang, Xu Han, et al. Tritonbench: Benchmarking large language model capabilities for generating triton operators. In Findings of the Association for Computational Linguistics: ACL 2025, pages 23053–23066, 2025

  27. [27]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023

  28. [28]

    Competition-level code generation with alphacode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022

  29. [29]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

  30. [30]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  31. [31]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 21558–21572. Curran Ass...

  32. [32]

    Performance evaluation of general purpose large language models for basic linear algebra subprograms code generation

    Daichi Mukunoki, Shun-ichiro Hayashi, Tetsuya Hoshino, and Takahiro Katagiri. Performance evaluation of general purpose large language models for basic linear algebra subprograms code generation. arXiv preprint arXiv:2507.04697, 2025

  33. [33]

    Llm4vv: Developing llm-driven testsuite for compiler validation

    Christian Munley, Aaron Jarmusch, and Sunita Chandrasekaran. Llm4vv: Developing llm-driven testsuite for compiler validation. Future Generation Computer Systems, 160:1–13, 2024

  34. [34]

    Can large language models write parallel code?

    Daniel Nichols, Joshua H Davis, Zhaojun Xie, Arjun Rajaram, and Abhinav Bhatele. Can large language models write parallel code? In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pages 281–294, 2024

  35. [35]

    KernelBench: Can LLMs Write Efficient GPU Kernels?

    Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels? arXiv preprint arXiv:2502.10517, 2025

  36. [36]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  37. [37]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  38. [38]

    Ai governance and accountability: An analysis of anthropic’s claude

    Aman Priyanshu, Yash Maurya, and Zuofei Hong. Ai governance and accountability: An analysis of anthropic’s claude. arXiv preprint arXiv:2407.01557, 2024

  39. [39]

    How efficient is llm-generated code? a rigorous & high-standard benchmark

    Ruizhong Qiu, Weiliang Will Zeng, James Ezick, Christopher Lott, and Hanghang Tong. How efficient is llm-generated code? a rigorous & high-standard benchmark. arXiv preprint arXiv:2406.06647, 2024

  40. [40]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  41. [41]

    Evaluating llms for code generation in hri: A comparative study of chatgpt, gemini, and claude

    Andrei Sobo, Awes Mubarak, Almas Baimagambetov, and Nikolaos Polatidis. Evaluating llms for code generation in hri: A comparative study of chatgpt, gemini, and claude. Applied Artificial Intelligence, 39(1):2439610, 2025

  42. [42]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  43. [43]

    Heterobench: Multi-kernel benchmarks for heterogeneous systems

    Hongzheng Tian, Alok Mishra, Zhiheng Chen, Rolando P Hong Enriquez, Dejan Milojicic, Eitan Frachtenberg, and Sitao Huang. Heterobench: Multi-kernel benchmarks for heterogeneous systems. In Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering, pages 320–333, 2025

  44. [44]

    Comparing llama-2 and gpt-3 llms for hpc kernels generation

    Pedro Valero-Lara, Alexis Huante, Mustafa Al Lail, William F Godoy, Keita Teranishi, Prasanna Balaprakash, and Jeffrey S Vetter. Comparing llama-2 and gpt-3 llms for hpc kernels generation. In International Workshop on Languages and Compilers for Parallel Computing, pages 20–32. Springer, 2023

  45. [45]

    Multikernelbench: A multi-platform benchmark for kernel generation

    Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, and Tian Zhang. Multikernelbench: A multi-platform benchmark for kernel generation. arXiv e-prints, pp. arXiv–2507, 2025

  46. [46]

    Kunpeng 920: The first 7-nm chiplet-based 64-core arm soc for cloud services

    Jing Xia, Chuanning Cheng, Xiping Zhou, Yuxing Hu, and Peter Chun. Kunpeng 920: The first 7-nm chiplet-based 64-core arm soc for cloud services. IEEE Micro, 41(5):67–75, 2021

  47. [47]

    Jian Yang, Wei Zhang, Yibo Miao, Shanghaoran Quan, Zhenhe Wu, Qiyao Peng, Liqun Yang, Tianyu Liu, Zeyu Cui, Binyuan Hui, et al. Qwen2.5-xcoder: Multi-agent collaboration for multilingual code instruction tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13121–13131, 2025

  48. [48]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025

  49. [49]

    GLM-5: from Vibe Coding to Agentic Engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

  50. [50]

    Cudabench: Benchmarking llms for text-to-cuda generation

    Jiace Zhu, Wentao Chen, Qi Fan, Zhixing Ren, Junying Wu, Xing Zhe Chai, Chotiwit Rungrueangwutthinon, Yehan Ma, and An Zou. Cudabench: Benchmarking llms for text-to-cuda generation. arXiv preprint arXiv:2603.02236, 2026

  51. [51]

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931, 2024

  52. [52]

    DeepSeek v3.2

    You must pay attention to the const constraints of the parameters. You should not violate any of the read-only constraints specified in the function signature. 6. <complex> will be included for any complex-related function. In addition, you must include any C++ standard library header whose functions or features you use, for example, <algorithm...

  54. [54]

    You should not violate any of the read-only constraints specified in the function signature

    You must pay attention to the const constraints of the parameters. You should not violate any of the read-only constraints specified in the function signature. 4. For code that must be generated in multiple parts, generate each part separately and return the parts in <CODE1></CODE1>, <CODE2></CODE2>, ... tag pairs. For example, if you have ...

  55. [55]

    For any other functions that you want to define to support your implementation, you can use <HELPER></HELPER> tag pair to return the code, for example: <HELPER> //contents for assist functions </HELPER>
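The tag-pair convention described in the prompt fragments above (`<CODE1></CODE1>`, `<HELPER></HELPER>`) implies that a harness has to pull the tagged bodies back out of a model response. The following is a minimal sketch of that step, not the paper's tooling: the function name `extract_tag` and its signature are hypothetical, and it assumes each tag appears at most once and is not nested.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the paper): copies the text between
// <TAG> and </TAG> in `response` into `out`.
// Returns false if either the opening or the closing tag is missing.
// Assumes the tag occurs at most once and is not nested.
bool extract_tag(const std::string& response, const std::string& tag,
                 std::string& out) {
    assert(!tag.empty());
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";

    // Locate the opening tag.
    const std::string::size_type start = response.find(open);
    if (start == std::string::npos) return false;

    // Locate the closing tag, searching only after the opening tag.
    const std::string::size_type body = start + open.size();
    const std::string::size_type end = response.find(close, body);
    if (end == std::string::npos) return false;

    out = response.substr(body, end - body);
    return true;
}
```

Returning `false` on an unterminated tag lets a caller treat a malformed response as a failed generation instead of compiling a partial snippet.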

  56. [56]

    func(){ //contents }

    For any ARM SME code, note that the function must be annotated with __arm_locally_streaming to enter streaming mode and with __arm_new("za") to use the ZA register. The following lines are the context for code generation, which may include code snippets, API specifications, and other information. You can use this information to generate the target ...

  57. [57]

    Qwen 3.6 Plus

    You must pay attention to the const constraints of the parameters. You should not violate any of the read-only constraints specified in the function signature. 6. <complex> will be included for any complex-related function. In addition, you must include any C++ standard library header whose functions or features you use, for example, <algorithm...

  61. [61]

    use_upper_triangle\

    return;

    // Handle beta * Y
    if (beta == 0.0) {
        if (incY == 1) {
            int i = 0;
            for (; i + 3 < N; i += 4) {
                _mm256_storeu_pd(Y + i, _mm256_setzero_pd());
            }
            for (; i < N; i++) Y[i] = 0.0;
        } else {
            for (int i = 0; i < N; i++) Y[i * incY] = 0.0;
        }
    } else if (beta != 1.0) {
        if (incY == 1) {
            __m256d vbeta = _mm256_set1_pd(beta);
            int i...
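The fragment above is a vectorized version of the `beta * Y` pre-scaling step and is cut off mid-loop. A portable scalar reference is useful for checking such a kernel. The sketch below is an illustration, not code from the paper: the name `scale_y_reference` is hypothetical, and it assumes `N >= 0` and a positive stride `incY`, matching the fragment's `Y[i * incY]` indexing.

```cpp
#include <cassert>

// Hypothetical scalar reference (not from the paper) for y := beta * y
// over N logical elements stored with stride incY.
// Assumes N >= 0 and incY > 0, as the Y[i * incY] indexing in the
// fragment does; negative strides are not handled.
void scale_y_reference(int N, double beta, double* Y, int incY) {
    assert(N >= 0 && incY > 0);
    if (beta == 1.0) return;  // scaling by one leaves Y unchanged
    for (int i = 0; i < N; ++i) {
        // beta == 0 overwrites with zero instead of multiplying, so any
        // NaN or Inf already stored in Y is cleared, as in the fragment.
        Y[i * incY] = (beta == 0.0) ? 0.0 : beta * Y[i * incY];
    }
}
```

Comparing a vectorized kernel against this reference with `N` not a multiple of four and `incY` other than one exercises both the remainder loop and the strided path.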