Can LLMs be Effective Code Contributors? A Study on Open-source Projects
Pith reviewed 2026-05-08 08:05 UTC · model grok-4.3
The pith
LLMs are not yet effective contributors to production open-source code, with success rates ranging from 0% to 60% on real commits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LLMs cannot yet serve as effective contributors to production code in sizable open-source projects. When the models were asked to reproduce 212 real commits spanning bug fixes and minor feature additions in projects such as FFmpeg and wolfSSL, success rates ranged from 0% to 60% depending on the project and model. The LLMs repeatedly generated syntactically incorrect code, code that failed static verification, or code that did not pass the project's test suite. They also had difficulty producing genuinely new code or handling contexts outside narrow size ranges, and many apparent successes stemmed from parroting changes seen during training.
What carries the argument
An evaluation framework that subjects LLM-generated patches to syntax validation, static verification, and full execution of the project's test suite.
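The framework itself is not reproduced in this summary, but its staged structure can be sketched. Below is a minimal, hypothetical sketch of such a gauntlet: a patch must clear syntax validation, static verification, and the project's test suite in order, and the first failing stage is recorded as the failure mode. The stage names and the example commands are this review's stand-ins, not the authors' implementation.

```python
import subprocess
from typing import Callable, Optional

def evaluate_patch(stages: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run verification stages in order; return the name of the first
    failing stage, or None if the patch survives all of them."""
    for name, check in stages:
        if not check():
            return name
    return None

def run_ok(cmd: list[str]) -> bool:
    """Run a command; success means exit code 0 (hypothetical helper)."""
    return subprocess.run(cmd, capture_output=True).returncode == 0
```

In use, real checks would be plugged in per project, e.g. `evaluate_patch([("syntax", lambda: run_ok(["gcc", "-fsyntax-only", "patched.c"])), ("tests", lambda: run_ok(["make", "check"]))])`; ordering the stages cheapest-first mirrors the paper's observation that many patches fail before the test suite is ever reached.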
If this is right
- LLMs need improved ability to handle project-specific context and codebases of varying sizes to become reliable contributors.
- Success in code contribution requires passing both static checks and full test suites, which current models frequently fail.
- Many LLM successes on existing changes appear to rely on memorization rather than generalization to novel tasks.
- Project-specific differences produce large swings in performance, implying that general-purpose models may need customization per codebase.
Where Pith is reading between the lines
- Future models that emphasize reasoning over pattern matching could raise success rates on previously unseen code changes.
- Hybrid systems that combine LLMs with project-specific documentation or commit history retrieval might overcome current context limitations.
- These results support continued use of LLMs for early drafting or suggestion, paired with mandatory human review before integration.
Load-bearing premise
The 212 selected commits from eight projects are representative of the code changes LLMs would be asked to make in real open-source or production settings.
What would settle it
Applying the same framework to a new collection of several hundred commits from additional projects and observing success rates above 70% with no evidence of training-data parroting would falsify the central claim.
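A replication's observed rate should clear the 70% bar after accounting for sampling error, not just in the point estimate. A minimal sketch using a one-sided Wilson score lower bound (the 70% threshold comes from the criterion above; the 95% confidence level and this particular test are this review's framing, not the authors'):

```python
from math import sqrt

def wilson_lower(successes: int, n: int, z: float = 1.645) -> float:
    """One-sided Wilson score lower bound for a binomial proportion
    (z = 1.645 gives roughly 95% one-sided confidence)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def clears_bar(successes: int, n: int, bar: float = 0.70) -> bool:
    """True only if the lower confidence bound exceeds the bar."""
    return wilson_lower(successes, n) > bar
```

For example, 240 successes out of 300 commits (80%) would clear the bar, while 150 out of 300 would not, even before checking for training-data parroting.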
read the original abstract
LLM-generated code is widely used, and the share of committed code produced by LLMs is expected to increase. However, we are not at a point where LLMs can be effective contributors to production code. We present an approach that exposes the shortcomings of LLM generation on such projects, and proposes recommendations; the targets of our study are sizable open-source projects, e.g., FFmpeg and wolfSSL. First, we developed a framework that uses verification and validation to evaluate a given LLM's suitability to fix or add features to an existing project. Second, we apply the framework to 212 commits (bug fixes and small feature improvements) in eight popular open-source projects and three LLMs: GPT-4o, Ministral 3, and Qwen3-Coder. The success rate varied from 0% to 60% depending on the project. The LLMs failed in a variety of ways, from generating syntactically incorrect code, to producing code that fails basic (static) verification, or validation via the project's test suite. In particular, the LLMs struggle with generating new code, handling contexts (function or file) outside a certain size range, and in many cases their success is due to parroting code changes they have been trained on.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical evaluation of three large language models (GPT-4o, Ministral 3, and Qwen3-Coder) on their ability to contribute code to eight open-source projects by replicating 212 historical commits involving bug fixes and small feature additions. Using a framework that applies verification and validation against project test suites, the study finds success rates between 0% and 60% depending on the project. LLMs frequently fail to produce syntactically correct code, pass static verification, or satisfy test suites, and successes are often attributed to the models reproducing changes present in their training data. The authors conclude that LLMs are not yet effective as contributors to production-level open-source code and offer recommendations based on observed shortcomings.
Significance. This study provides concrete, project-specific performance data on LLM code generation in realistic open-source contexts, which is valuable for understanding current capabilities and limitations. If the results are robust, they underscore the need for improved context handling and generalization in LLMs for software engineering tasks, potentially informing both tool development and developer practices in integrating AI assistance.
major comments (3)
- The criteria for selecting the 212 commits across the eight projects (including sizable ones like FFmpeg and wolfSSL) are not specified. The central claim that LLMs are not effective contributors rests on these success rates (0-60%) being indicative of real usage; without details on sampling by size, complexity, context length, or representativeness of typical contribution tasks (bug fixes vs. larger changes), the failure modes in syntax, verification, and validation could be artifacts of the chosen task distribution.
- The observation that 'in many cases their success is due to parroting code changes they have been trained on' is load-bearing for interpreting the results but lacks methodological detail on how this was assessed (e.g., via similarity checks, overlap metrics, or case studies). This weakens the distinction between genuine contribution capability and memorization.
- The framework description provides no specifics on exact prompting strategies, context window handling, or statistical controls for the three models. This is necessary to evaluate reproducibility of the reported success rates and failure categories, as well as to rule out post-hoc choices affecting the headline conclusion.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve clarity, reproducibility, and transparency.
read point-by-point responses
- Referee: The criteria for selecting the 212 commits across the eight projects (including sizable ones like FFmpeg and wolfSSL) are not specified. The central claim that LLMs are not effective contributors rests on these success rates (0-60%) being indicative of real usage; without details on sampling by size, complexity, context length, or representativeness of typical contribution tasks (bug fixes vs. larger changes), the failure modes in syntax, verification, and validation could be artifacts of the chosen task distribution.
Authors: We agree that the selection criteria were insufficiently detailed. In the revised manuscript we will add an explicit subsection under Experimental Setup describing the process: commits were drawn from the main branches of the eight projects, restricted to those whose messages and diffs indicated bug fixes or small feature additions, and chosen to span a range of project sizes and domains. We did not apply formal stratified sampling by complexity or context length; we will add a limitations paragraph acknowledging that the observed failure modes may partly reflect this task distribution and that broader sampling would strengthen generalizability claims. revision: yes
- Referee: The observation that 'in many cases their success is due to parroting code changes they have been trained on' is load-bearing for interpreting the results but lacks methodological detail on how this was assessed (e.g., via similarity checks, overlap metrics, or case studies). This weakens the distinction between genuine contribution capability and memorization.
Authors: We will expand the relevant results subsection to document the assessment procedure. Successes were flagged as potential memorization when the generated patch exhibited high token overlap and low edit distance with the original commit diff; we computed normalized Levenshtein distance and Jaccard similarity on tokenized code and performed manual review of borderline cases. The revised text will report the exact similarity thresholds applied, the proportion of successes meeting the criteria, and representative examples to allow readers to evaluate the distinction between memorization and novel generation. revision: yes
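The exact tokenizer and thresholds are not given here, but the two metrics named in the response are standard. A minimal sketch, assuming simple whitespace tokenization of the diffs (the study's actual tokenization and cutoffs are not specified):

```python
def levenshtein(a: list[str], b: list[str]) -> int:
    """Token-level edit distance via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ta != tb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: list[str], b: list[str]) -> float:
    """Edit distance scaled to [0, 1]; 0 means identical token sequences."""
    n = max(len(a), len(b))
    return levenshtein(a, b) / n if n else 0.0

def jaccard(a: list[str], b: list[str]) -> float:
    """Overlap of distinct tokens: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

A generated patch with near-zero normalized edit distance and near-one Jaccard similarity to the historical commit diff would be flagged as a memorization candidate for manual review, per the procedure the authors describe.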
- Referee: The framework description provides no specifics on exact prompting strategies, context window handling, or statistical controls for the three models. This is necessary to evaluate reproducibility of the reported success rates and failure categories, as well as to rule out post-hoc choices affecting the headline conclusion.
Authors: We will augment the Framework section with the precise prompting templates used for each task category, the rules applied for context truncation or summarization to respect each model's context window, and any statistical controls (e.g., number of independent generations per commit and aggregation methods). These additions will make the experimental protocol fully reproducible and demonstrate that prompting choices were fixed in advance rather than tuned post hoc. revision: yes
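The aggregation method for "number of independent generations per commit" is not stated in this summary; one standard choice is the unbiased pass@k estimator introduced by Chen et al. (arXiv:2107.03374), sketched here on the assumption that the study used something similar:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    samples, drawn without replacement from n independent generations
    of which c are correct, passes:  pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With, say, 10 generations per commit and 3 passing patches, pass@1 is 0.3; reporting per-project means of such per-commit estimates is one way the 0-60% success rates could be aggregated.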
Circularity Check
No circularity: purely empirical measurement against external benchmarks
full rationale
The paper reports an empirical study that applies a verification/validation framework to LLM-generated patches for 212 commits across eight open-source projects. Success rates (0-60%) and failure modes (syntax errors, static verification failures, test-suite failures, parroting) are measured directly against the projects' own test suites and analyzers. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains are present; the central claim rests on observable outcomes from external oracles rather than any reduction to the study's own inputs or prior self-referential results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The 212 commits represent typical bug fixes and small feature additions that LLMs would be asked to perform.
- domain assumption Project test suites and static verification provide a reliable signal of code correctness and maintainability.