pith. machine review for the scientific record.

arxiv: 2604.23340 · v1 · submitted 2026-04-25 · 💻 cs.SE

Recognition: unknown

Can LLMs be Effective Code Contributors? A Study on Open-source Projects

Chun Jie Chong, Iulian Neamtiu, Muyeed Ahmed, Zhihao (Zephyr) Yao

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:05 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM code generation · open source projects · code contribution · software engineering · LLM evaluation · bug fixes · test suite validation · training data parroting

The pith

LLMs are not yet effective contributors to production open-source code, with success rates ranging from 0% to 60% on real commits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether current LLMs can act as reliable code contributors by attempting to implement 212 actual bug fixes and small feature changes drawn from eight sizable open-source projects. The authors created a framework that subjects LLM outputs to syntax checks, static verification, and execution against the project's test suite. Across three models, success varied sharply by project, with frequent failures to produce valid code or pass tests. Where models succeeded, the changes often matched patterns already present in their training data rather than demonstrating fresh reasoning about the codebase. These results indicate that LLMs still require substantial human oversight before they can be trusted to modify complex real-world projects.

Core claim

The central claim is that LLMs cannot yet serve as effective contributors to production code in sizable open-source projects. When the models were asked to reproduce 212 real commits spanning bug fixes and minor feature additions in projects such as FFmpeg and wolfSSL, success rates ranged from 0% to 60% depending on the project and model. The LLMs repeatedly generated syntactically incorrect code, code that failed static verification, or code that did not pass the project's test suite. They also had difficulty producing genuinely new code or handling contexts outside narrow size ranges, and many apparent successes stemmed from parroting changes seen during training.

What carries the argument

An evaluation framework that subjects LLM-generated patches to syntax validation, static verification, and full execution of the project's test suite.

If this is right

  • LLMs need improved ability to handle project-specific context and codebases of varying sizes to become reliable contributors.
  • Success in code contribution requires passing both static checks and full test suites, which current models frequently fail.
  • Many LLM successes on existing changes appear to rely on memorization rather than generalization to novel tasks.
  • Project-specific differences produce large swings in performance, implying that general-purpose models may need customization per codebase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future models that emphasize reasoning over pattern matching could raise success rates on previously unseen code changes.
  • Hybrid systems that combine LLMs with project-specific documentation or commit history retrieval might overcome current context limitations.
  • These results support continued use of LLMs for early drafting or suggestion, paired with mandatory human review before integration.

Load-bearing premise

The 212 selected commits from eight projects are representative of the code changes LLMs would be asked to make in real open-source or production settings.

What would settle it

Applying the same framework to a new collection of several hundred commits from additional projects and observing success rates above 70% with no evidence of training-data parroting would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.23340 by Chun Jie Chong, Iulian Neamtiu, Muyeed Ahmed, Zhihao (Zephyr) Yao.

Figure 1: Human vs. GPT-4o fixes for an incorrect index bug.
Figure 2: LLM-generated code for fixing a use-after-free bug.
Figure 3: GPT-4o-generated code for fixing a null pointer dereference error.
Figure 4: LLM-generated code for implementing a size retrieval function.
Figure 5: Overview of our approach.
Figure 6: GPT-4o fixed a bug partially (human fix vs. GPT-4o fix; implementation details omitted).
Figure 7: GPT-4o deleted functional code unrelated to the task.
Figure 8: Success rate across file sizes (LOC) for ChatGPT, Ministral3, and Qwen3-Coder.
Figure 9: Success rate across function size (LOC).
read the original abstract

LLM-generated code is widely used, and the share of committed code produced by LLMs is expected to increase. However, we are not at a point where LLMs can be effective contributors to production code. We present an approach that exposes the shortcomings of LLM generation on such projects, and proposes recommendations; the targets of our study are sizable open-source projects, e.g., FFmpeg and wolfSSL. First, we developed a framework that uses verification and validation to evaluate a given LLM's suitability to fix or add features to an existing project. Second, we apply the framework to 212 commits (bug fixes and small feature improvements) in eight popular open-source projects and three LLMs: GPT-4o, Ministral3, and Qwen3-Coder. The success rate varied from 0% to 60% depending on the project. The LLMs failed in a variety of ways, from generating syntactically incorrect code, to producing code that fails basic (static) verification, or validation via the project's test suite. In particular, the LLMs struggle with generating new code, handling contexts (function or file) outside a certain size range, and in many cases their success is due to parroting code changes they have been trained on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript reports an empirical evaluation of three large language models (GPT-4o, Ministral 3, and Qwen3-Coder) on their ability to contribute code to eight open-source projects by replicating 212 historical commits involving bug fixes and small feature additions. Using a framework that applies verification and validation against project test suites, the study finds success rates between 0% and 60% depending on the project. LLMs frequently fail to produce syntactically correct code, pass static verification, or satisfy test suites, and successes are often attributed to the models reproducing changes present in their training data. The authors conclude that LLMs are not yet effective as contributors to production-level open-source code and offer recommendations based on observed shortcomings.

Significance. This study provides concrete, project-specific performance data on LLM code generation in realistic open-source contexts, which is valuable for understanding current capabilities and limitations. If the results are robust, they underscore the need for improved context handling and generalization in LLMs for software engineering tasks, potentially informing both tool development and developer practices in integrating AI assistance.

major comments (3)
  1. The criteria for selecting the 212 commits across the eight projects (including sizable ones like FFmpeg and wolfSSL) are not specified. The central claim that LLMs are not effective contributors rests on these success rates (0-60%) being indicative of real usage; without details on sampling by size, complexity, context length, or representativeness of typical contribution tasks (bug fixes vs. larger changes), the failure modes in syntax, verification, and validation could be artifacts of the chosen task distribution.
  2. The observation that 'in many cases their success is due to parroting code changes they have been trained on' is load-bearing for interpreting the results but lacks methodological detail on how this was assessed (e.g., via similarity checks, overlap metrics, or case studies). This weakens the distinction between genuine contribution capability and memorization.
  3. The framework description provides no specifics on exact prompting strategies, context window handling, or statistical controls for the three models. This is necessary to evaluate reproducibility of the reported success rates and failure categories, as well as to rule out post-hoc choices affecting the headline conclusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve clarity, reproducibility, and transparency.

read point-by-point responses
  1. Referee: The criteria for selecting the 212 commits across the eight projects (including sizable ones like FFmpeg and wolfSSL) are not specified. The central claim that LLMs are not effective contributors rests on these success rates (0-60%) being indicative of real usage; without details on sampling by size, complexity, context length, or representativeness of typical contribution tasks (bug fixes vs. larger changes), the failure modes in syntax, verification, and validation could be artifacts of the chosen task distribution.

    Authors: We agree that the selection criteria were insufficiently detailed. In the revised manuscript we will add an explicit subsection under Experimental Setup describing the process: commits were drawn from the main branches of the eight projects, restricted to those whose messages and diffs indicated bug fixes or small feature additions, and chosen to span a range of project sizes and domains. We did not apply formal stratified sampling by complexity or context length; we will add a limitations paragraph acknowledging that the observed failure modes may partly reflect this task distribution and that broader sampling would strengthen generalizability claims. revision: yes
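As a concrete illustration of the kind of filter this response describes, a heuristic classifier over commit messages and diff sizes might look like the following. The keyword lists and LOC caps are invented for illustration; the paper's actual criteria are precisely what the promised revision would specify.

```python
# Hypothetical keyword lists; a real study would document its exact terms.
BUGFIX_WORDS = {"fix", "bug", "crash", "leak", "overflow"}
FEATURE_WORDS = {"add", "support", "implement", "feature"}

def classify_commit(message, loc_changed, bugfix_max_loc=50, feature_max_loc=100):
    """Keep commits whose messages indicate a bug fix or a small feature
    addition and whose diffs stay small; everything else is excluded."""
    words = set(message.lower().split())
    if words & BUGFIX_WORDS and loc_changed <= bugfix_max_loc:
        return "bug-fix"
    if words & FEATURE_WORDS and loc_changed <= feature_max_loc:
        return "feature"
    return None  # excluded from the study set
```

A filter of this shape makes the selection reproducible, but, as the referee notes, it does not by itself guarantee the resulting sample is representative.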

  2. Referee: The observation that 'in many cases their success is due to parroting code changes they have been trained on' is load-bearing for interpreting the results but lacks methodological detail on how this was assessed (e.g., via similarity checks, overlap metrics, or case studies). This weakens the distinction between genuine contribution capability and memorization.

    Authors: We will expand the relevant results subsection to document the assessment procedure. Successes were flagged as potential memorization when the generated patch exhibited high token overlap and low edit distance with the original commit diff; we computed normalized Levenshtein distance and Jaccard similarity on tokenized code and performed manual review of borderline cases. The revised text will report the exact similarity thresholds applied, the proportion of successes meeting the criteria, and representative examples to allow readers to evaluate the distinction between memorization and novel generation. revision: yes
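The procedure described here could be sketched as follows, using tokenized edit distance and Jaccard similarity. The tokenizer and thresholds are placeholders, not the paper's actual values.

```python
import re

def tokenize(code):
    # Crude lexer: identifiers, numbers, and single punctuation tokens.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def levenshtein(a, b):
    # Dynamic-programming edit distance over token sequences.
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ta != tb))) # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a, b):
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return levenshtein(ta, tb) / max(len(ta), len(tb))

def jaccard(a, b):
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def looks_memorized(patch, original, dist_max=0.1, jac_min=0.9):
    # Thresholds are illustrative; the revision would report the real ones.
    return (normalized_levenshtein(patch, original) <= dist_max
            and jaccard(patch, original) >= jac_min)
```

A generated patch nearly identical to the historical diff under both metrics is flagged for manual review as a likely memorization case.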

  3. Referee: The framework description provides no specifics on exact prompting strategies, context window handling, or statistical controls for the three models. This is necessary to evaluate reproducibility of the reported success rates and failure categories, as well as to rule out post-hoc choices affecting the headline conclusion.

    Authors: We will augment the Framework section with the precise prompting templates used for each task category, the rules applied for context truncation or summarization to respect each model's context window, and any statistical controls (e.g., number of independent generations per commit and aggregation methods). These additions will make the experimental protocol fully reproducible and demonstrate that prompting choices were fixed in advance rather than tuned post hoc. revision: yes
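One common aggregation for k independent generations per commit is an any-of-k rule; whether the study used this rule or a stricter one is exactly what the promised protocol details would pin down.

```python
def success_rate(per_commit_outcomes):
    """Percentage of commits solved, counting a commit as solved if any
    of its k independent generations passed every gate (any-of-k).
    `per_commit_outcomes` is a list of per-commit boolean lists."""
    solved = sum(any(outcomes) for outcomes in per_commit_outcomes)
    return 100.0 * solved / len(per_commit_outcomes)
```

Reporting k alongside the rate matters: an any-of-k rate rises mechanically with more samples, so two studies with different k are not directly comparable.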

Circularity Check

0 steps flagged

No circularity: purely empirical measurement against external benchmarks

full rationale

The paper reports an empirical study that applies a verification/validation framework to LLM-generated patches for 212 commits across eight open-source projects. Success rates (0-60%) and failure modes (syntax errors, static verification failures, test-suite failures, parroting) are measured directly against the projects' own test suites and analyzers. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains are present; the central claim rests on observable outcomes from external oracles rather than any reduction to the study's own inputs or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the chosen commits and projects adequately sample the space of production code changes and that passing the project's own tests and static checks is a sufficient proxy for effective contribution.

axioms (2)
  • domain assumption The 212 commits represent typical bug fixes and small feature additions that LLMs would be asked to perform.
    Used as the evaluation targets for LLM suitability.
  • domain assumption Project test suites and static verification provide a reliable signal of code correctness and maintainability.
    Basis for labeling an LLM output as successful or failed.

pith-pipeline@v0.9.0 · 5534 in / 1270 out tokens · 59124 ms · 2026-05-08T08:05:32.457621+00:00 · methodology

discussion (0)

