CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

Vedant Padwal

arxiv: 2605.30394 · v1 · pith:VOJHN2FWnew · submitted 2026-05-28 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-29 06:23 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: code golf, LLM benchmark, concise code generation, reasoning models, Python, C++, code efficiency, multi-language evaluation

The pith

Reasoning LLMs reach a best average percentile of 70.97 on code golf tasks, while non-reasoning models lag

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CodeGolf Bench as a new evaluation method that measures how well large language models can generate the shortest possible code for programming problems. It draws problems and current human performance data directly from the code.golf platform, allowing coverage of 60 languages and ongoing updates rather than a static set. Tests on nine models for Python and C++ show that models with reasoning capabilities achieve substantially higher percentiles than those without, with the gap widest in C++. This approach provides a live comparison point for concise code generation against human experts.

Core claim

CodeGolf Bench evaluates LLMs on concise code generation by comparing their solutions' lengths to human percentiles on code.golf problems. In tests on Python and C++ tasks, reasoning models reached a best average percentile of 70.97%, while non-reasoning models scored significantly lower. The performance difference is larger in C++ than in Python.
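
The percentile comparison described here can be illustrated with a minimal sketch. This summary does not give the paper's exact formula, so the convention below (the percentage of human submissions that are strictly longer than the model's solution) is an assumption, as are the function name `length_percentile` and the sample lengths.

```python
from bisect import bisect_right


def length_percentile(model_len: int, human_lens: list[int]) -> float:
    """Percentage of human submissions strictly longer than the model's solution.

    Assumed convention (not confirmed by the paper): higher is better,
    and a tie with a human submission does not count as beating it.
    """
    if not human_lens:
        raise ValueError("need at least one human submission")
    ordered = sorted(human_lens)
    # bisect_right gives the count of human lengths <= model_len.
    longer = len(ordered) - bisect_right(ordered, model_len)
    return 100.0 * longer / len(ordered)


# Hypothetical solution lengths for one problem in one language.
human = [40, 42, 42, 45, 50, 58, 61, 75, 90, 120]
print(length_percentile(50, human))  # 50.0
```

Under this convention a solution shorter than every human entry scores 100, and one at least as long as the longest human entry scores 0. Other tie-handling rules would shift scores slightly on problems where many submissions share the same length.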

What carries the argument

CodeGolf Bench, which pulls new problems and human baselines from the code.golf platform to measure concise code generation in multiple languages.

If this is right

Reasoning models are better at optimizing code for brevity in languages with strict syntax.
Non-reasoning models particularly struggle with efficiency optimization across both languages.
The benchmark supports evaluation in up to 60 languages for broader coverage.
Dynamic human baselines enable continuous tracking of LLM progress as new solutions appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be used to test whether explicit reasoning steps improve an LLM's ability to compress code solutions.
It may apply to practical settings where minimal code size matters, such as resource-constrained devices.
Evaluating additional languages could expose which syntax features create the largest gaps between model types.
Live baselines mean benchmark scores for a given model can shift downward over time if humans find shorter solutions.
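
The last point, that live baselines can move a model's score without the model changing, can be sketched as follows. The percentile convention (share of human submissions strictly longer than the model's solution), the helper name `percentile`, and both snapshots are hypothetical illustrations, not data from the paper.

```python
def percentile(model_len: int, human_lens: list[int]) -> float:
    """Assumed convention: percent of human submissions longer than the model's."""
    longer = sum(1 for h in human_lens if h > model_len)
    return 100.0 * longer / len(human_lens)


# Hypothetical human solution lengths at two points in time.
snapshot_early = [48, 52, 55, 60, 70]
# Later, five new and shorter human submissions have appeared.
snapshot_later = snapshot_early + [41, 43, 44, 46, 47]

model_len = 50  # the model's solution is unchanged between snapshots

before = percentile(model_len, snapshot_early)
after = percentile(model_len, snapshot_later)
print(before, after)  # 80.0 40.0
```

One consequence of this behaviour is that a reported score is only comparable to another if both were computed against the same baseline snapshot, so recording the snapshot date alongside each score seems advisable.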

Load-bearing premise

The code.golf problems and human baselines provide an unbiased measure of concise code generation ability without platform-specific biases.

What would settle it

Running the same LLMs on a new set of code.golf problems and measuring whether their solution lengths consistently fall below the top human submissions on those problems would test if the reported performance advantage holds.

Figures

Figures reproduced from arXiv: 2605.30394 by Vedant Padwal.

**Figure 1.** Problem categories. The benchmark comprises 6 primary categories: • Art: Text-based visual art challenges to test a model’s ability to produce precise output formatting. • Computing: Testing a model’s understanding of computational principles. • Gaming: Implementation of game rules and mechanics, and puzzle-solving challenges that require complex logical reasoning. • Mathematics: Challenges focused on gener…

**Figure 2.** Prompt template.

**Figure 3.** Average of best C++ and Python percentiles.

**Figure 4.** Best percentile scores by category for different models.

**Figure 5.** Distribution of error types by model (as percentage of total errors).

**Figure 6.** Split violin percentile distribution plots.

Original abstract

This paper introduces CodeGolf Bench, a benchmark capable of evaluating the concise code generation abilities of Large Language Models (LLMs) in 60 programming languages. Based on code golf, a recreational programming competition focused on minimal character or byte solutions, the benchmark provides a distinctive measure of LLMs' ability to produce efficient, concise code. Unlike existing benchmarks limited by fixed problem sets and language coverage, CodeGolf Bench leverages the code.golf platform to provide new problems and live human performance baselines. Evaluation of nine LLMs on Python and C++ tasks demonstrates that reasoning models significantly outperform non-reasoning models, achieving a best average percentile of 70.97%. This performance gap is particularly pronounced in C++, highlighting the importance of reasoning for languages with strict syntax requirements. Non-reasoning models struggle more with efficiency optimization across both languages, with best percentiles significantly lower than those of their reasoning counterparts. CodeGolf Bench offers a dynamic framework for evaluating LLM code generation capabilities against evolving human performance on code golf.
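
The abstract refers to "minimal character or byte solutions". The two counts agree for pure ASCII source but diverge once a solution contains non-ASCII characters. The sketch below shows the difference; the helper name `golf_lengths` and the sample solutions are hypothetical, and which of the two counts the benchmark actually uses is not stated in this summary.

```python
def golf_lengths(source: str) -> tuple[int, int]:
    """Return (character count, UTF-8 byte count) for a solution's source text."""
    return len(source), len(source.encode("utf-8"))


# Hypothetical solutions: one pure ASCII, one with two non-ASCII characters.
ascii_solution = 'print("Hello")'
unicode_solution = 'print("\u00e9\u00d7")'  # the literal contains é and ×

print(golf_lengths(ascii_solution))    # (14, 14)
print(golf_lengths(unicode_solution))  # (11, 13)
```

Each of the two accented or symbol characters occupies two bytes in UTF-8, which is why the second solution is 11 characters but 13 bytes long.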

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CodeGolf Bench is a reasonable idea for a dynamic conciseness benchmark but the paper supplies zero methods or data details, so the model performance numbers cannot be evaluated.

The main thing to know is that this paper proposes pulling problems and human baselines from the live code.golf platform to measure how concisely LLMs can code, and it reports that reasoning models reach higher percentiles than non-reasoning ones on Python and C++ tasks. That is the extent of the contribution visible in the abstract.

What is actually new is the choice to use an existing recreational site for fresh problems and evolving baselines instead of a fixed problem set. This could in principle let the benchmark stay current across 60 languages. The reported gap, especially the larger advantage for reasoning models in C++, is presented as evidence that reasoning helps with strict syntax.

The soft spots are large and central. The abstract gives high-level results like a 70.97 average percentile but contains no description of the nine models, the prompts, how reference lengths were defined, how percentiles were calculated, or any statistical tests. There are also no controls mentioned for the fact that code.golf problems are selected because they reward short solutions and that human scores depend on submission volume and language popularity. The stress-test concern about platform biases therefore stands, since nothing in the text addresses it.

This paper is for people who build or maintain code-generation benchmarks and want a conciseness metric. A reader who needs reproducible evidence or verifiable methods will find little to use. It deserves a serious referee only after the authors add a full methods section, the actual data, and some analysis of the platform biases; without those pieces it is too preliminary for review.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CodeGolf Bench, a benchmark for evaluating LLMs' concise code generation capabilities across 60 programming languages, derived from the code.golf platform to leverage new problems and live human baselines. It reports an evaluation of nine LLMs on Python and C++ tasks, claiming that reasoning models significantly outperform non-reasoning models with a best average percentile of 70.97%, with the gap especially pronounced in C++.

Significance. If the human baselines prove representative and free of platform artifacts, the benchmark could offer a dynamic, evolving framework for measuring code conciseness that complements fixed-problem suites. The reported advantage for reasoning models in C++ would then provide evidence that chain-of-thought or similar techniques aid optimization under strict syntactic constraints.

major comments (2)

[Abstract] The central claim that reasoning models achieve a best average percentile of 70.97% (and a larger gap in C++) is presented without any description of how percentiles are computed from code.golf submissions, the number or selection criteria for tasks, the definition of reference length, or controls for submission-volume or language-popularity biases. These omissions are load-bearing for the outperformance conclusion.
[Benchmark description, presumed §3] The assertion that code.golf supplies 'unbiased' human baselines and representative problems is not accompanied by any analysis of selection effects (golfability bias) or normalization procedures, leaving the reported C++ advantage vulnerable to platform-specific confounds rather than a general property of reasoning models.

minor comments (1)

[Abstract] The benchmark is advertised for 60 languages, yet only Python and C++ results are shown; a brief statement on the scope of the initial evaluation would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater methodological transparency. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results.

Referee: [Abstract] The central claim that reasoning models achieve a best average percentile of 70.97% (and a larger gap in C++) is presented without any description of how percentiles are computed from code.golf submissions, the number or selection criteria for tasks, the definition of reference length, or controls for submission-volume or language-popularity biases. These omissions are load-bearing for the outperformance conclusion.

Authors: We agree that the abstract would benefit from additional context to support the central claims. In the revised manuscript we will expand the abstract to briefly state that percentiles are computed by ranking each model submission against all human submissions on the same code.golf problem (using the shortest valid human solution as the reference length), that the evaluation uses a fixed set of 50 Python and 50 C++ problems selected for having at least 100 human submissions, and that problems were chosen to balance submission volume across languages. A short clause on bias mitigation will also be added. revision: yes
Referee: [Benchmark description, presumed §3] The assertion that code.golf supplies 'unbiased' human baselines and representative problems is not accompanied by any analysis of selection effects (golfability bias) or normalization procedures, leaving the reported C++ advantage vulnerable to platform-specific confounds rather than a general property of reasoning models.

Authors: Section 3 describes the construction of the benchmark from live code.golf data, but we acknowledge that an explicit discussion of selection effects is absent. We will add a new paragraph in the revision that (a) notes the golfability bias inherent in the platform (problems are chosen by the community for their amenability to short solutions), (b) reports that we restricted the task set to problems with sufficient submission volume to reduce popularity effects, and (c) explains that no additional normalization beyond percentile ranking was applied. We will qualify the term 'unbiased' to 'live, community-provided baselines' and discuss the implications for interpreting the C++ results. revision: yes
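
The submission-volume restriction mentioned in the simulated responses (problems with at least 100 human submissions) could look roughly like the sketch below. The `Problem` record, the `eligible` helper, and the sample problem names are hypothetical; only the threshold of 100 comes from the text above, and this is not the authors' actual selection code.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    name: str              # hypothetical problem identifier
    language: str          # e.g. "python" or "cpp"
    human_lens: list[int]  # lengths of human submissions for this pair


MIN_SUBMISSIONS = 100  # threshold quoted in the simulated rebuttal


def eligible(problems: list[Problem],
             min_submissions: int = MIN_SUBMISSIONS) -> list[Problem]:
    """Keep only problems with enough human submissions for a stable ranking."""
    return [p for p in problems if len(p.human_lens) >= min_submissions]


# Hypothetical pool: one well-populated problem, one sparse problem.
pool = [
    Problem("popular-task", "python", [60] * 150),
    Problem("sparse-task", "cpp", [200] * 12),
]
print([p.name for p in eligible(pool)])  # ['popular-task']
```

A volume filter of this kind reduces noise in the percentile ranking but does not address the golfability bias noted in point (a), since every remaining problem was still chosen by the community for its amenability to short solutions.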

Circularity Check

0 steps flagged

No circularity: benchmark uses external code.golf data and baselines as independent reference

The paper introduces CodeGolf Bench by directly adopting problems and human performance baselines from the external code.golf platform without any fitting of parameters, self-definitional mappings, or load-bearing self-citations. The central evaluation (reasoning models at 70.97% average percentile on Python/C++ tasks) compares LLM outputs against these external human baselines; no equation or claim reduces the reported gap to a quantity derived from the paper's own inputs or prior self-work. The derivation chain is therefore self-contained against an external benchmark source.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.



A Technical Appendices and Supplementary Material

Table 3: Benchmark comparison

Benchmark comparison | Publicly Available Dataset | Publicly Available Solutions | Publicly Available Test Cases
APPS | Yes | Yes | Yes
xCodeEval | Yes | Yes | Yes
CodeElo | Yes | Limited | Limited
LiveBench | Yes | Yes | Yes (Questions)
BigO(Bench) | Yes | Yes (Anno...
