pith. machine review for the scientific record. sign in

arxiv: 2604.15270 · v1 · submitted 2026-04-16 · 💻 cs.SE

Recognition: unknown

Enhancing Large Language Models with Retrieval Augmented Generation for Software Testing and Inspection Automation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 10:37 UTC · model grok-4.3

classification 💻 cs.SE
keywords Large Language ModelsRetrieval Augmented GenerationSoftware TestingCode InspectionTest Case GenerationSoftware VerificationHallucination ReductionAutomation
0
0 comments X

The pith

Retrieval-augmented generation improves large language model outputs for automated test case creation and code inspection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates automating two key verification activities in software development by using large language models to generate test cases and to inspect source code. To counter the tendency of these models to produce confident but wrong answers, the authors add a retrieval step that pulls relevant external information into the model's prompt. Experiments indicate that this addition generally yields better test cases and more useful inspection results than the models produce on their own. If the approach holds, software teams could spend less time on manual testing and review work while still catching more issues earlier in the development cycle. Readers would care because the method offers a concrete way to make current AI tools more dependable for safety-critical software quality tasks.

Core claim

Incorporating external context through a retrieval-augmented generation pipeline produces a generally positive effect on both test-case generation and code-inspection tasks performed by large language models, as shown in the authors' experiments, and thereby lowers overall project costs while raising the effectiveness and efficiency of verification and validation work.

What carries the argument

The retrieval-augmented generation pipeline, which supplies the large language model with supplementary external knowledge to reduce hallucinated outputs during test generation and code inspection.

If this is right

  • Large language models equipped with retrieval produce more reliable test cases than the same models without it.
  • Code inspection performed by large language models becomes more accurate once relevant external context is added.
  • Human testers and inspectors require less time to complete their work when assisted by the retrieval-augmented models.
  • The overall cost of verification and validation activities in a software project declines as a direct result of the improved automation.
  • Effectiveness of testing and inspection rises because the models generate fewer hallucinations that would otherwise require human correction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval technique could be applied to other software-engineering tasks that currently suffer from large-language-model hallucinations, such as requirements tracing or defect prediction.
  • Different choices of knowledge base, such as project-specific documentation or open-source code repositories, might produce stronger or weaker gains depending on the domain.
  • Integration of the retrieval pipeline into existing continuous-integration tools could allow teams to run automated checks on every code commit with minimal added setup.
  • Long-term use might reveal whether the models gradually improve their internal representations or whether performance plateaus once the external sources are exhausted.

Load-bearing premise

The selected external knowledge sources and the retrieval process itself must deliver accurate, relevant context without injecting new inaccuracies or biases that cancel out the intended gains.

What would settle it

A controlled comparison on the same set of code samples in which the retrieval-augmented model generates more incorrect test cases or inspection comments than the plain large-language-model baseline would disprove the claimed positive impact.

Figures

Figures reproduced from arXiv: 2604.15270 by Armin Moin, Nazanin Siavash, Zoe Fingleton.

Figure 1
Figure 1. Figure 1: Overview of the proposed RAG-based framework for automated code inspection and test case genera [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Prompt for code inspection 4.2 Automated Test Case Generation For the pipeline and knowledge base setup of the automated test case generation, we prepare our model using the same methodology as automated code inspection, creating and configuring our embedding model and vector storage. In our main processing loop, we use the same general framework as automated code inspection, with some systemic differences… view at source ↗
Figure 3
Figure 3. Figure 3: Prompt for test generation Following embedding, the pipeline provides the specified LLM with the program under test, the function name and description, and the knowledge base context (see [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Efficiency of Automated Test Generation [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Code Analysis Predictions using GPT-3.5-turbo, where (a) the [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Efficiency of Automated Code Inspection The evaluation results of the code analysis using GPT-3.5-Turbo and GPT-4.1 are presented in [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: GPT-3.5 Turbo: Mismatch Type Count Comparison (RAG vs Non-RAG) [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: GPT-4: Mismatch Type Count Comparison (RAG vs Non-RAG) [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: GPT-4o: Mismatch Type Count Comparison (RAG vs Non-RAG) [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Mismatch Type Count Comparison (GPT-3.5, GPT-4, GPT-4o) between RAG and Non-RAG [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
read the original abstract

In this paper, we focus on automating two of the widely used Verification and Validation (V&V) activities in the Software Development Lifecycle (SDLC): Software testing and software inspection (also known as review). Concerning the former, we concentrate on automated test case generation using Large Language Models (LLMs). For the latter, we enable inspection of the source code by LLMs. To address the known LLM hallucination problem, in which LLMs confidently produce incorrect outputs, we implement a Retrieval Augmented Generation (RAG) pipeline to integrate supplementary knowledge sources and provide additional context to the LLM. Our experimental results indicate that incorporating external context via the RAG pipeline has a generally positive impact on both test case generation and code inspection. This novel approach reduces the total project cost by saving human testers'/inspectors' time. It also improves the effectiveness and efficiency of these V&V activities, as evidenced by our experimental study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes integrating Retrieval Augmented Generation (RAG) with Large Language Models (LLMs) to automate two verification and validation activities: test case generation and source code inspection. By supplying external knowledge sources as context, the approach aims to mitigate LLM hallucinations. The central claim is that this RAG pipeline produces a generally positive impact on both activities, reduces total project cost by saving human effort, and improves effectiveness and efficiency, as shown by the authors' experimental study.

Significance. If the experimental claims were supported by reproducible quantitative evidence, the work could contribute to more reliable LLM-based automation of software V&V tasks, with potential cost and quality benefits in industrial settings. The absence of reported metrics, baselines, and statistical controls, however, prevents any assessment of whether the claimed benefits are real, reproducible, or attributable to RAG rather than prompt design or selection effects.

major comments (2)
  1. Abstract: The assertion that 'our experimental results indicate that incorporating external context via the RAG pipeline has a generally positive impact on both test case generation and code inspection' is presented without any supporting data. No metrics (e.g., branch coverage, mutation score, defect recall), baselines (plain LLM vs. RAG), sample sizes, project counts, or statistical tests are supplied, making it impossible to evaluate the central claim.
  2. Experimental evaluation section: The manuscript supplies none of the elements required to substantiate the reported positive impact, including explicit non-RAG baselines, objective and reproducible metrics, multiple projects or modules with reported sizes, or controls for LLM stochasticity (multiple runs, significance tests). The observed benefit could therefore be noise, selection bias, or prompt engineering rather than RAG.
minor comments (1)
  1. The description of the RAG pipeline would benefit from an explicit diagram or pseudocode showing the retrieval mechanism, knowledge sources, and how retrieved context is injected into the LLM prompt.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the experimental claims require stronger quantitative substantiation to be fully convincing and will revise the manuscript accordingly. Our point-by-point responses to the major comments follow.

read point-by-point responses
  1. Referee: Abstract: The assertion that 'our experimental results indicate that incorporating external context via the RAG pipeline has a generally positive impact on both test case generation and code inspection' is presented without any supporting data. No metrics (e.g., branch coverage, mutation score, defect recall), baselines (plain LLM vs. RAG), sample sizes, project counts, or statistical tests are supplied, making it impossible to evaluate the central claim.

    Authors: We acknowledge that the abstract summarizes the findings at a high level without concrete supporting figures. In the revised manuscript we will update the abstract to include key quantitative results from the experimental study, such as observed improvements in coverage and defect detection when using RAG versus the plain LLM baseline, together with the number of projects evaluated. revision: yes

  2. Referee: Experimental evaluation section: The manuscript supplies none of the elements required to substantiate the reported positive impact, including explicit non-RAG baselines, objective and reproducible metrics, multiple projects or modules with reported sizes, or controls for LLM stochasticity (multiple runs, significance tests). The observed benefit could therefore be noise, selection bias, or prompt engineering rather than RAG.

    Authors: This observation is correct for the current version of the paper. While the manuscript describes the RAG pipeline and reports generally positive outcomes, it does not yet contain the full set of quantitative controls requested. In the revision we will expand the experimental section to add: explicit comparisons against non-RAG LLM baselines, objective metrics (branch coverage for test generation and defect recall for inspection), results across multiple projects with their sizes reported, and repeated runs with statistical significance testing to isolate the contribution of the RAG component. revision: yes

Circularity Check

0 steps flagged

No derivation chain exists; experimental claim is not circular

full rationale

The manuscript describes an implementation of a RAG pipeline for LLM-based test generation and code inspection, followed by a qualitative statement that experiments show a 'generally positive impact.' No equations, parameters, uniqueness theorems, or derivations are present. The central claim rests on experimental outcomes rather than any self-referential reduction of a result to its own inputs or to a self-citation chain. Absence of reported baselines, metrics, or statistical tests affects evidence quality but does not create circularity under the defined patterns. The paper is self-contained against external benchmarks in the narrow sense that its (unquantified) claim does not logically collapse into a tautology or fitted input renamed as prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical objects, free parameters, or invented entities. It relies on the standard assumption that retrieved documents improve LLM output quality, which is treated as a domain assumption rather than derived.

pith-pipeline@v0.9.0 · 5461 in / 1009 out tokens · 28152 ms · 2026-05-10T10:37:41.535621+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

55 extracted references · 21 canonical work pages · 3 internal anchors

  1. [1]

    Sommerville,Software engineering

    I. Sommerville,Software engineering. Always learning, Boston Columbus Indianapolis New York San Francisco Hoboken Amsterdam Cape Town Dubai London: Pearson, tenth edition ed., 2016

  2. [2]

    Automating a complete software test process using llms: An automotive case study,

    S. Wang, Y. Yu, R. Feldt, and D. Parthasarathy, “Automating a complete software test process using llms: An automotive case study, ”arXiv preprint arXiv:2502.04008, 2025

  3. [3]

    Detecting code comment inconsistency using siamese recurrent network,

    F. Rabbi and M. S. Siddik, “Detecting code comment inconsistency using siamese recurrent network, ” inProceedings of the 28th international conference on program comprehension, pp. 371–375, 2020

  4. [4]

    Analyzing apis documentation and code to detect directive defects,

    Y. Zhou, R. Gu, T. Chen, Z. Huang, S. Panichella, and H. Gall, “Analyzing apis documentation and code to detect directive defects, ” in2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp. 27–37, IEEE, 2017

  5. [5]

    Detecting code comment inconsistencies using llm and program analysis,

    Y. Zhang, “Detecting code comment inconsistencies using llm and program analysis, ” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 683–685, 2024

  6. [6]

    Reference-based retrieval-augmented unit test generation,

    Z. Zhang, X. Liu, Y. Lin, X. Gao, H. Sun, and Y. Yuan, “Reference-based retrieval-augmented unit test generation, ”ACM Transactions on Software Engineering and Methodology, 2025

  7. [7]

    Deep scaling factor quantization network for large-scale image retrieval,

    Z. Deng, Z. Lai, Y. Ding, H. Kong, and X. Wu, “Deep scaling factor quantization network for large-scale image retrieval, ” inProceedings of the 2024 international conference on multimedia retrieval, pp. 851–859, 2024

  8. [8]

    A survey on rag meeting llms: Towards retrieval-augmented large language models,

    W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on rag meeting llms: Towards retrieval-augmented large language models, ” inProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp. 6491–6501, 2024

  9. [9]

    Generalization through memo- rization: Nearest neighbor language models

    U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis, “Generalization through memorization: Nearest neighbor language models, ”arXiv preprint arXiv:1911.00172, 2019

  10. [10]

    Leveraging passage retrieval with generative models for open domain question answering

    G. Izacard and E. Grave, “Leveraging passage retrieval with generative models for open domain question answering, ” arXiv preprint arXiv:2007.01282, 2020

  11. [11]

    Testeval: Benchmarking large language models for test case generation, 2025

    W. Wang, C. Yang, Z. Wang, Y. Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma, “TESTEVAL: Benchmarking Large Language Models for Test Case Generation, ” Feb. 2025. arXiv:2406.04531 [cs]

  12. [12]

    A Survey of Large Language Models,

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, and J.-R. Wen, “A Survey of Large Language Models, ”

  13. [13]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need, ”Advances in neural information processing systems, vol. 30, 2017

  14. [14]

    A Comprehensive Overview of Large Language Models,

    H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A Comprehensive Overview of Large Language Models, ”ACM Transactions on Intelligent Systems and Technology, June 2025. Publisher: Association for Computing Machinery (ACM)

  15. [15]

    Emerging trends: A gentle introduction to fine-tuning,

    K. W. Church, Z. Chen, and Y. Ma, “Emerging trends: A gentle introduction to fine-tuning, ”Natural Language Engineering, vol. 27, pp. 763–778, Nov. 2021

  16. [16]

    Bert: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding, ” inProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186, 2019

  17. [17]

    arXiv preprint arXiv:2408.13296 (2024)

    V. B. Parthasarathy, A. Zafar, A. Khan, and A. Shahid, “The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, ” Oct. 2024. arXiv:2408.13296 [cs]

  18. [18]

    When scaling meets LLM finetuning: The effect of data, model and finetuning method.arXiv preprint arXiv:2402.17193, 2024

    B. Zhang, Z. Liu, C. Cherry, and O. Firat, “When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method, ” Feb. 2024. arXiv:2402.17193 [cs]

  19. [19]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions,

    L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu, “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, ”ACM Transactions , Vol. 1, No. 1, Article . Publication date: April 2026. Enhancing Large Language Models with Retrieval Augmented Generation for Soft...

  20. [20]

    Reducing hallucination in structured outputs via Retrieval-Augmented Generation,

    P. Béchard and O. M. Ayala, “Reducing hallucination in structured outputs via Retrieval-Augmented Generation, ” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pp. 228–238, 2024. arXiv:2404.08189 [cs]

  21. [21]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, ” Apr. 2021. arXiv:2005.11401 [cs]

  22. [22]

    CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models,

    Y. Lyu, Z. Li, S. Niu, F. Xiong, B. Tang, W. Wang, H. Wu, H. Liu, T. Xu, and E. Chen, “CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models, ”ACM Transactions on Information Systems, vol. 43, pp. 1–32, Mar. 2025. Publisher: Association for Computing Machinery (ACM)

  23. [23]

    Barr, and Collin McMillan

    M. R. Parvez, W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Retrieval Augmented Code Generation and Summarization, ” Sept. 2021. arXiv:2108.11601 [cs]

  24. [24]

    Retrieval-augmented generation to improve math question- answering: Trade-offs between groundedness and human preference,

    Z. Levonian, C. Li, W. Zhu, A. Gade, O. Henkel, M.-E. Postle, and W. Xing, “Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference, ” Nov. 2023. arXiv:2310.03184 [cs]

  25. [25]

    Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation,

    X. Yang, M. Ye, Q. You, and F. Ma, “Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation, ” May 2021. arXiv:2106.06471 [cs]

  26. [26]

    Retrieval-augmented generation for ai-generated content: A survey.CoRR, abs/2402.19473, 2024

    P. Zhao, H. Zhang, Q. Yu, Z. Wang, Y. Geng, F. Fu, L. Yang, W. Zhang, J. Jiang, and B. Cui, “Retrieval-Augmented Generation for AI-Generated Content: A Survey, ” June 2024. arXiv:2402.19473 [cs]

  27. [27]

    Four eyes are better than two: On the impact of code reviews on software quality,

    G. Bavota and B. Russo, “Four eyes are better than two: On the impact of code reviews on software quality, ” in2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 81–90, IEEE, 2015

  28. [28]

    Design and code inspections to reduce errors in program development,

    M. E. Fagan, “Design and code inspections to reduce errors in program development, ”IBM Systems Journal, vol. 38, no. 2.3, pp. 258–287, 1999

  29. [29]

    Previously on... automating code review,

    R. Heumüller and F. Ortmeier, “Previously on... automating code review, ”arXiv preprint arXiv:2508.18003, 2025

  30. [30]

    Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study,

    Q. Guo, J. Cao, X. Xie, S. Liu, X. Li, B. Chen, and X. Peng, “Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study, ” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, (Lisbon Portugal), pp. 1–13, ACM, Feb. 2024

  31. [31]

    CoRRabs/2312.08477(2023)

    Y. Hao, W. Chen, Z. Zhou, and W. Cui, “E&V: Prompting Large Language Models to Perform Static Analysis by Pseudo-code Execution and Verification, ” Dec. 2023. arXiv:2312.08477 [cs]

  32. [32]

    Fine-tuning and prompt engineering for large language models-based code review automation,

    C. Pornprasit and C. Tantithamthavorn, “Fine-tuning and prompt engineering for large language models-based code review automation, ”Information and Software Technology, vol. 175, p. 107523, Nov. 2024

  33. [33]

    Automated code review in practice,

    U. Cihan, V. Haratian, A. Icoz, M. K. Gul, O. Devran, E. F. Bayendur, B. M. Ucar, and E. Tuzun, “Automated code review in practice, ” in2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 425–436, 2025

  34. [34]

    CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models,

    H. Y. Lin, C. Liu, H. Gao, P. Thongtanunam, and C. Treude, “CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models, ” Mar. 2025. arXiv:2503.16167

  35. [35]

    Improving the learning of code review successive tasks with cross-task knowledge distillation,

    O. Ben Sghaier and H. Sahraoui, “Improving the learning of code review successive tasks with cross-task knowledge distillation, ”Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1086–1106, 2024

  36. [36]

    Automating code review activities by large-scale pre-training,

    Z. Li, S. Lu, D. Guo, N. Duan, S. Jannu, G. Jenks, D. Majumder, J. Green, A. Svyatkovskiy, S. Fu,et al., “Automating code review activities by large-scale pre-training, ” inProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1035–1047, 2022

  37. [37]

    CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?,

    Y. Zhao, Z. Luo, Y. Tian, H. Lin, W. Yan, A. Li, and J. Ma, “CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?, ” Sept. 2024. arXiv:2408.10718 [cs]

  38. [38]

    Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions

    K. Sun, H. Kuang, S. Baltes, X. Zhou, H. Zhang, X. Ma, G. Rong, D. Shao, and C. Treude, “Does ai code review lead to code changes? a case study of github actions, ”arXiv preprint arXiv:2508.18771, 2025

  39. [39]

    Large language models for code analysis: Do {LLMs} really do their job?,

    C. Fang, N. Miao, S. Srivastav, J. Liu, R. Zhang, R. Fang, R. Tsang, N. Nazari, H. Wang, and H. Homayoun, “Large language models for code analysis: Do {LLMs} really do their job?, ” in33rd USENIX Security Symposium (USENIX Security 24), pp. 829–846, 2024

  40. [40]

    Automatic Generation of Floating-Point Test Data,

    W. Miller and D. Spooner, “Automatic Generation of Floating-Point Test Data, ”IEEE Transactions on Software Engineering, vol. SE-2, pp. 223–226, Sept. 1976

  41. [41]

    An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation,

    M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation, ”IEEE Transactions on Software Engineering, vol. 50, pp. 85–105, Jan. 2024

  42. [42]

    Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction,

    S. Kang, J. Yoon, and S. Yoo, “Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction, ” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 2312–2323, May

  43. [43]

    Chatunitest: A framework for llm-based test generation,

    Y. Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, “Chatunitest: A framework for llm-based test generation, ” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 572– 576, 2024. , Vol. 1, No. 1, Article . Publication date: April 2026. 18 Fingleton et al

  44. [44]

    Large language models as test case generators: Performance evaluation and enhancement,

    K. Li and Y. Yuan, “Large language models as test case generators: Performance evaluation and enhancement, ”arXiv preprint arXiv:2404.13340, 2024

  45. [45]

    Using Large Language Models to Generate JUnit Tests: An Empirical Study,

    M. L. Siddiq, J. C. Da Silva Santos, R. H. Tanvir, N. Ulfat, F. Al Rifat, and V. Carvalho Lopes, “Using Large Language Models to Generate JUnit Tests: An Empirical Study, ” inProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, (Salerno Italy), pp. 313–322, ACM, June 2024

  46. [46]

    arXiv preprint arXiv:2305.04207 , year=

    Z. Yuan, Y. Lou, M. Liu, S. Ding, K. Wang, Y. Chen, and X. Peng, “No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation, ” May 2024. arXiv:2305.04207 [cs]

  47. [47]

    Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks,

    H. Lee, S. Sharma, and B. Hu, “Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks, ” June 2024. arXiv:2406.15325 [cs]

  48. [48]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, “Program Synthesis with Large Language Models, ” Aug. 2021. arXiv:2108.07732 [cs]

  49. [49]

    Mbpp: Mostly basic python problems (google research github repository)

    Google Research, “Mbpp: Mostly basic python problems (google research github repository). ” https://github.com/google- research/google-research/tree/master/mbpp, 2025. Accessed: 2025-08-13

  50. [50]

    Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation,

    Z. Zhang, C. Wang, Y. Wang, E. Shi, Y. Ma, W. Zhong, J. Chen, M. Mao, and Z. Zheng, “Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation, ”Proceedings of the ACM on Software Engineering, vol. 2, no. ISSTA, pp. 481–503, 2025

  51. [51]

    sentence-transformers/all-minilm-l6-v2

    Hugging Face, “sentence-transformers/all-minilm-l6-v2. ” https://huggingface.co/sentence-transformers/all-MiniLM- L6-v2, jan 2024. Accessed: 2025-07-21

  52. [52]

    LLM4SoftwareTesting: Testeval,

    LLM4SoftwareTesting, “LLM4SoftwareTesting: Testeval, ” Jan. 2025

  53. [53]

    GPT-3.5 Turbo,

    OpenAI, “GPT-3.5 Turbo, ” 2024

  54. [54]

    HammingHQ: Bug in the code stack,

    HammingHQ, “HammingHQ: Bug in the code stack, ” June 2024

  55. [55]

    QAS Lab: Zoe fingleton, month = jul, howpublished = https://github.com/qas-lab/reu-zoe-fingleton, note = Accessed: 2025-07-31

    QAS Lab, “QAS Lab: Zoe fingleton, month = jul, howpublished = https://github.com/qas-lab/reu-zoe-fingleton, note = Accessed: 2025-07-31. ” , Vol. 1, No. 1, Article . Publication date: April 2026