Recognition: no theorem link
Evaluating LLM-Generated Code: A Benchmark and Developer Study
Pith reviewed 2026-05-12 02:22 UTC · model grok-4.3
The pith
Developer reviews uncover production-readiness issues in LLM code that standard correctness benchmarks overlook.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A three-fold methodology that combines a dedicated correctness benchmark on a complex project, code-quality verification, and developer opinions gathered through structured code reviews provides a fuller picture of LLM-generated code than correctness-focused benchmarks alone. When applied to GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4, the reviews produced additional findings about whether the generated code reaches a production-ready state.
What carries the argument
Three-fold evaluation methodology integrating a custom correctness benchmark, code quality verification, and structured developer code-review surveys.
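One way to picture the three-fold record is a per-sample structure that carries all three signals side by side, so that "correct" and "production-ready" can be distinguished. A hypothetical sketch (the class, field names, and example values are ours, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class CodeEvaluation:
    """Hypothetical per-sample record combining the three evaluation axes."""
    model: str
    task_id: str
    tests_passed: int                # axis 1: correctness benchmark
    tests_total: int
    quality_issues: list = field(default_factory=list)   # axis 2: e.g. static-analysis findings
    review_themes: list = field(default_factory=list)    # axis 3: developer code-review feedback

    @property
    def correct(self) -> bool:
        return self.tests_passed == self.tests_total

    @property
    def production_ready(self) -> bool:
        # The paper's point in miniature: passing all tests is
        # necessary but not sufficient; quality and review
        # findings must also be clean.
        return self.correct and not self.quality_issues and not self.review_themes

sample = CodeEvaluation("GPT-4.1", "tree_of_life/level3",
                        tests_passed=12, tests_total=12,
                        review_themes=["missing input validation"])
print(sample.correct, sample.production_ready)  # True False
```

The design choice is deliberate: a benchmark that stores only `tests_passed` cannot represent the gap the developer reviews expose.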
Load-bearing premise
That feedback collected from developers in a structured review process gives reliable, generalizable information about production readiness that benchmarks miss.
What would settle it
Repeating the developer reviews on the same code samples with new reviewers and finding that they consistently identify no additional production-readiness problems beyond the benchmark results.
Original abstract
Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model. However, they primarily focus on measuring solution correctness, leaving other aspects, such as code quality and usability, behind. This paper aims to describe a custom three-fold evaluation methodology for code generated by Large Language Models that bridges this gap. The methodology includes a dedicated correctness benchmark based on a complex multi-level computer science project, code quality verification, and a survey of developers' opinions on generated code samples gathered through a structured code-review process. The proposed methodology's usage and usefulness are demonstrated by evaluating and comparing three general-purpose Large Language Models: GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4. The results show that reviews gathered from developers can yield many new findings, especially those related to the code being in a production-ready state, that would not be possible to obtain using the standard correctness-focused benchmark approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a three-fold methodology for evaluating LLM-generated code: (1) a correctness benchmark built on a complex multi-level computer science project, (2) code quality verification, and (3) structured developer reviews collected via a survey process. The authors apply the methodology to compare three general-purpose LLMs (GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4) and conclude that the developer reviews surface production-readiness insights (e.g., maintainability, deployment concerns) that standard correctness-only benchmarks miss.
Significance. If the concrete examples of additional findings hold, the work is significant for highlighting limitations of purely automated correctness benchmarks in code generation. The new benchmark on a complex project and the explicit inclusion of human developer feedback address a recognized gap in the field, potentially informing more holistic evaluation frameworks. The demonstration-style results provide practical evidence that developer input can reveal usability and production aspects not captured by pass@k or similar metrics.
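For contrast with the human-review axis, the pass@k family mentioned above reduces each model to a single correctness number. The standard unbiased estimator (introduced with HumanEval) is short enough to state exactly; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: samples that pass all tests
    k: evaluation budget
    Returns the probability that at least one of k drawn
    samples is correct: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # too few failures to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of them correct.
print(pass_at_k(10, 3, 1))  # 0.3
```

Nothing in this number says whether the passing sample is maintainable or deployable, which is exactly the gap the developer reviews target.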
Major comments (1)
- [Results] Results section: the claim that developer reviews 'yield many new findings' on production readiness rests on the presentation of specific examples; however, without reported inter-rater agreement, number of reviewers, or exclusion criteria, it is difficult to assess whether the additional insights are robust or idiosyncratic to the small set of reviewed samples.
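To make the robustness concern concrete: if two reviewers had labeled the same samples, inter-rater agreement could be quantified with, for example, Cohen's kappa. A minimal sketch with illustrative data (not from the paper):

```python
def cohens_kappa(a, b, labels):
    """Cohen's kappa for two raters over the same items.

    a, b: label lists from rater A and rater B (same length).
    labels: the label vocabulary.
    """
    assert len(a) == len(b) and a
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical reviewers judging four code samples as
# production-ready ("yes") or not ("no").
rater_a = ["yes", "yes", "no", "no"]
rater_b = ["yes", "no", "no", "no"]
print(cohens_kappa(rater_a, rater_b, ["yes", "no"]))  # 0.5
```

Reporting such a statistic (or explaining why the review design precludes it, as the rebuttal below does) would let readers judge how idiosyncratic the extra findings are.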
Minor comments (2)
- [Methodology] The methodology description would be strengthened by an appendix containing the exact survey instrument and code-review template used with developers.
- [Results] A table or figure summarizing per-model correctness scores, quality metrics, and review themes side by side would improve the readability of the comparative results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address the major comment below and will revise the manuscript to provide the requested methodological details.
Point-by-point responses
- Referee: [Results] Results section: the claim that developer reviews 'yield many new findings' on production readiness rests on the presentation of specific examples; however, without reported inter-rater agreement, number of reviewers, or exclusion criteria, it is difficult to assess whether the additional insights are robust or idiosyncratic to the small set of reviewed samples.
Authors: We agree that the current description of the developer review component lacks sufficient detail for readers to fully evaluate the robustness of the reported insights. In the revised manuscript, we will expand the relevant sections to report: the exact number of developers who participated in the structured code-review process, their professional backgrounds and selection criteria, any exclusion criteria applied to code samples or individual responses, and the protocol used to synthesize recurring themes from the qualitative feedback. We will also clarify that the reviews consisted of independent structured assessments rather than paired quantitative ratings, which is why inter-rater agreement metrics were not computed; instead, we will describe how common production-readiness concerns were identified across responses. These additions will make explicit that the examples are drawn from a defined process and are intended to illustrate gaps missed by correctness benchmarks, rather than to claim statistical generalizability.
Revision: yes
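The theme-synthesis step the authors promise to describe (surfacing production-readiness concerns that recur across independent reviews) can be approximated by a simple frequency count over tagged responses. The data, tags, and threshold here are illustrative, not taken from the study:

```python
from collections import Counter

# Hypothetical structured review responses: one list of
# tagged concerns per developer, collected independently.
reviews = [
    ["error handling", "logging", "naming"],
    ["error handling", "tests"],
    ["logging", "error handling"],
]

# Treat a concern as a recurring theme if at least half of
# the reviewers raised it independently.
counts = Counter(tag for review in reviews for tag in review)
threshold = len(reviews) / 2
themes = sorted(t for t, c in counts.items() if c >= threshold)
print(themes)  # ['error handling', 'logging']
```

Making the threshold and tagging protocol explicit is what turns anecdotal examples into a reproducible qualitative result.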
Circularity Check
No significant circularity detected
Full rationale
The paper presents an empirical three-fold methodology (correctness benchmark on a complex project, code quality checks, and structured developer reviews) to evaluate LLM-generated code and demonstrate that reviews surface production-readiness insights missed by standard benchmarks. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations appear in the described approach or results. The central claim is supported directly by the concrete findings from applying the methodology to GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4, without any reduction to definitional inputs or prior self-referential results. The study is self-contained as a proof-of-concept demonstration.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Developer opinions obtained through a structured code-review process provide reliable signals about production readiness that automated correctness and quality metrics miss.
Reference graph
Works this paper leans on
- [1] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Su...
- [2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732
- [3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
- [4] DeepSeek. 2025. The Temperature Parameter | DeepSeek API Docs. https://api-docs.deepseek.com/quick_start/parameter_settings
- [5] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating Large Language Models in Class-Level Code Generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE '24). Association for Computing Machinery, New York...
- [6]
- [7] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. arXiv:1808.09588 [cs.CL] https://arxiv.org/abs/1808.09588
- [8] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL] https://arxiv.org/abs/2310.06770
- [9] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE] https://arxiv.org/abs/2305.01210
- [10]
- [11] Bradley McDanel and Ed Novak. 2025. Designing LLM-Resistant Programming Assignments: Insights and Strategies for CS Educators. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1 (Pittsburgh, PA, USA) (SIGCSE TS 2025). Association for Computing Machinery, New York, NY, USA, 756–762. doi:10.1145/3641554.3701872
- [12] Tanha Miah and Hong Zhu. 2024. User Centric Evaluation of Code Generation Tools (Invited Paper). In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest). 109–119. doi:10.1109/AITest62860.2024.00022
- [13] SonarSource. 2025. Homepage | SonarQube Cloud | Sonar Documentation. https://docs.sonarsource.com/sonarqube-cloud
- [14] SonarSource. 2025. Software qualities | SonarQube Cloud Documentation. https://docs.sonarsource.com/sonarqube-cloud/digging-deeper/software-qualities/
- [15] Joanna Szych. 2026. Evaluating LLM-Generated Code: Benchmarking on complex assignment. https://github.com/AsiaSzych/Tree_of_Life/
- [16] Joanna Szych. 2026. Evaluating LLM-Generated Code: Developer Study. doi:10.5281/zenodo.18806359
- [17] Christopher Tralie. 2025. Building The Tree of Life from Scratch. http://nifty.stanford.edu/2025/tralie-phylogenetic-trees/
- [18] Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. Proc. ACM Softw. Eng. 2, ISSTA, Article ISSTA086 (June 2025), 23 pages. doi:10.1145/3728963
- [19] Wei Wang, Huilong Ning, Gaowei Zhang, Libo Liu, and Yi Wang. 2024. Rocks Coding, Not Development: A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks. Proc. ACM Softw. Eng. 1, FSE, Article 32 (July 2024), 23 pages. doi:10.1145/3643758
- [20] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE '24). Association for Computing Machin...
- [21] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association...
- [22] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685 [cs.CL] https://arxiv.org/abs/2306.05685
- [23] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2024. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. arXiv:2303.17568 [cs.LG] https://arxiv.org/abs/2303.17568
Discussion (0)