pith. machine review for the scientific record.

arxiv: 2604.01799 · v2 · submitted 2026-04-02 · 💻 cs.SE

Recognition: 2 Lean theorem links

TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning

Guoqing Wang, Chengran Yang, Xiaoxuan Zhou, Zeyu Sun, Bo Wang, David Lo, Dan Hao

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:33 UTC · model grok-4.3

classification: 💻 cs.SE
keywords: test suite generation · large language models · reinforcement learning · greedy optimization · software testing · branch coverage · automated test generation · submodular optimization

The pith

TestDecision turns base LLMs into neural greedy experts for building test suites by exploiting the monotone submodularity of the coverage objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes test suite generation as a Markov Decision Process and shows that its objective function satisfies monotone submodularity. This property justifies relaxing the NP-hard global optimization into a sequence of greedy steps that maximize marginal coverage gain at each addition. The authors implement this insight in TestDecision through a greedy inference framework and a reinforcement-learning training pipeline that teaches the LLM to choose tests with high incremental value. A reader would care because the method delivers large gains in branch coverage, execution success, and bug detection while using only open-source models of modest size.
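To make the greedy step concrete, here is a minimal sketch of step-wise suite construction under stated assumptions: `generate_candidates` stands in for the LLM sampler and `covered_branches` for a coverage-instrumented test runner, both hypothetical names; this illustrates marginal-gain selection, not the authors' actual implementation.

```python
# Minimal sketch of step-wise greedy test-suite construction.
# `generate_candidates` (LLM sampler) and `covered_branches` (instrumented
# runner returning a set of branch ids) are hypothetical stand-ins.

def greedy_test_suite(focal_code, generate_candidates, covered_branches,
                      budget=10, samples_per_step=8):
    suite, covered = [], set()
    for _ in range(budget):
        # Sample candidates conditioned on what the suite already covers.
        candidates = generate_candidates(focal_code, covered, n=samples_per_step)
        # Keep the candidate with the largest marginal coverage gain.
        best, best_gain = None, 0
        for test in candidates:
            gain = len(covered_branches(focal_code, test) - covered)
            if gain > best_gain:
                best, best_gain = test, gain
        if best is None:  # no sampled candidate adds new coverage
            break
        suite.append(best)
        covered |= covered_branches(focal_code, best)
    return suite
```

The early stop is a simplification; a real loop would likely resample before giving up, since the sampler is stochastic.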

Core claim

By proving that test-suite construction exhibits monotone submodularity, the work reduces the problem to a tractable greedy procedure that an LLM can learn to follow. TestDecision therefore consists of an inference engine that builds suites step by step according to this greedy rule and an RL stage that fine-tunes the base model to maximize expected marginal coverage. On the ULT benchmark the resulting system raises branch coverage by 38–52 percent and execution pass rate by 298–559 percent over the same base models, reaches parity with a much larger proprietary model, and surfaces 58–95 percent more bugs.

What carries the argument

The monotone-submodular objective of the test-suite MDP, which licenses the step-wise greedy selection rule executed by the RL-trained LLM.
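For orientation, the standard definitions and the classical guarantee this premise invokes, stated here as background: the (1 − 1/e) bound is the Nemhauser–Wolsey–Fisher result for greedy maximization of a monotone submodular function under a cardinality constraint, not a new claim of the paper.

```latex
% f : 2^\Omega \to \mathbb{R}_{\ge 0} is the coverage objective over tests.
\begin{align*}
\text{monotone:}\quad & f(A) \le f(B) && \text{for all } A \subseteq B,\\
\text{submodular:}\quad & f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B)
  && \text{for all } A \subseteq B,\ x \notin B,\\
\text{greedy bound:}\quad & f\bigl(S_k^{\mathrm{greedy}}\bigr) \ge
  \Bigl(1 - \tfrac{1}{e}\Bigr) \max_{|S| \le k} f(S)
  && \text{(Nemhauser, Wolsey, Fisher, 1978).}
\end{align*}
```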

If this is right

  • Branch coverage rises 38.15–52.37 percent over base LLMs.
  • Execution pass rate rises 298.22–558.88 percent over base LLMs.
  • Performance matches a far larger proprietary model while using only a 7B open model.
  • The method detects 58.43–95.45 percent more bugs than the same base models.
  • Generalization improves on LiveCodeBench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same submodularity-plus-greedy pattern may transfer to other sequential construction tasks such as API composition or patch generation.
  • If submodularity fails on certain program domains, hybrid search that occasionally backtracks could restore performance.
  • Training cost could be reduced by distilling the greedy policy into a smaller non-LLM model once the marginal-gain predictor is learned.

Load-bearing premise

The coverage objective for test suites must be monotone submodular so that the greedy step-by-step rule remains near-optimal.
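For the deterministic special case the premise is easy to verify: if each test t covers a fixed branch set cov(t) and the objective is the size of the union, marginal gains can only shrink as the suite grows. A sketch, assuming deterministic coverage (the stochastic LLM setting is what the referee report below presses on):

```latex
% Coverage as a set-union objective: f(S) = |C(S)|, where
% C(S) = \bigcup_{t \in S} \mathrm{cov}(t).
\[
  f(A \cup \{x\}) - f(A) = \bigl|\mathrm{cov}(x) \setminus C(A)\bigr|,
\]
% and since A \subseteq B implies C(A) \subseteq C(B),
\[
  \bigl|\mathrm{cov}(x) \setminus C(A)\bigr|
  \;\ge\;
  \bigl|\mathrm{cov}(x) \setminus C(B)\bigr|
  = f(B \cup \{x\}) - f(B).
\]
% This is the diminishing-returns inequality; monotonicity is immediate
% because the union only grows as tests are added.
```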

What would settle it

An experiment on a new benchmark in which a non-greedy, globally optimal search produces suites with measurably higher final coverage than TestDecision's greedy construction.

Figures

Figures reproduced from arXiv: 2604.01799 by Bo Wang, Chengran Yang, Dan Hao, David Lo, Guoqing Wang, Xiaoxuan Zhou, Zeyu Sun.

Figure 1: Illustrative example of the sequential dependency. Test C is functionally valid, but its contribution …

Figure 2: Coverage growth with respect to test suite size …

Figure 3: The overview of TestDecision. The framework operates as an iterative generation loop. The LLM acts …

Figure 4: Step-wise performance trajectory. TestDecision exhibits a steeper growth curve compared to baselines.
Original abstract

With the rapid evolution of LLMs, automated software testing is witnessing a paradigm shift. While proprietary models like GPT-4o demonstrate impressive capabilities, their high deployment costs and data privacy concerns make open-source LLMs the practical imperative for many academic and industrial scenarios. Automated test generation has accordingly evolved toward iterative workflows that construct test suites with LLMs. When utilizing open-source LLMs, we empirically observe that they lack a suite-level perspective, suffering from structural myopia: they fail to generate new tests with large marginal gain given the current coverage status. In this paper, viewing generation as a sequence, we formalize test suite generation as an MDP and demonstrate that its objective exhibits monotone submodularity, which enables an effective relaxation of this NP-hard global optimization into a tractable step-wise greedy procedure. Guided by this insight, we propose TestDecision, which transforms LLMs into neural greedy experts. TestDecision consists of two synergistic components: (1) an inference framework that implements test suite construction following a step-wise greedy strategy; and (2) a reinforcement-learning training pipeline that equips the base LLM with the sequential test-generation ability to maximize marginal gain. Comprehensive evaluations on the ULT benchmark demonstrate that TestDecision significantly outperforms existing advanced methods: it improves branch coverage by 38.15–52.37% and execution pass rate by 298.22–558.88% over all base models, achieving performance on a 7B backbone comparable to the much larger proprietary LLM GPT-5.2. Furthermore, TestDecision finds 58.43–95.45% more bugs than vanilla base LLMs and exhibits superior generalization on LiveCodeBench, proving its capability to construct high-quality test suites.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes test suite generation as an MDP whose objective (marginal coverage and pass-rate gain) is claimed to be monotone submodular, permitting a greedy step-wise relaxation of the NP-hard problem. It introduces TestDecision, comprising a greedy inference framework and an RL training pipeline that equips base LLMs with sequential decision-making to maximize marginal gains. On the ULT benchmark, it reports 38.15–52.37% gains in branch coverage, 298.22–558.88% in execution pass rate, and 58.43–95.45% more bugs found versus base models, with 7B-scale performance comparable to GPT-5.2.

Significance. If the submodularity property is rigorously established and the empirical gains prove reproducible under controlled stochasticity, the work would provide a principled bridge between submodular optimization and RL for LLM-based testing, enabling smaller open-source models to approach proprietary performance without relying on scale alone.

major comments (2)
  1. [§3] MDP Formalization and Submodularity: The central justification for the greedy procedure rests on the claim that the coverage/pass-rate objective is monotone submodular, yet the manuscript supplies no explicit derivation or verification of the diminishing-returns inequality (f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B) for all A ⊆ B and x ∉ B; see the sketch after these comments). Given that LLM test generation is stochastic, the marginal gain is not a fixed set function; the inequality must be shown to hold in expectation or via a deterministic surrogate, otherwise the theoretical grounding for both the inference framework and the RL objective is undermined.
  2. [§5] Experimental Evaluation: The reported percentage improvements lack accompanying statistical tests, variance across seeds, or explicit data-exclusion rules. Without these, it is impossible to determine whether the 38–52% coverage and 298–558% pass-rate gains are robust or artifacts of particular prompt orderings or model sampling temperatures.
minor comments (2)
  1. [Abstract] The abstract refers to 'GPT-5.2'; clarify whether this is a typographical error for an existing model or a placeholder, and provide the exact model identifier used for comparison.
  2. [§3] Notation for the MDP state (current coverage set) and action (next test) should be introduced once and used consistently; several equations in §3 reuse symbols without redefinition.
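One inexpensive way to act on major comment 1 short of a full proof: an empirical spot-check of the diminishing-returns inequality over random nested suites, sketched below with a hypothetical `covered_branches` oracle. For deterministic coverage the check passes trivially; the informative runs are those where coverage is measured over flaky or stochastic executions.

```python
# Empirical spot-check of f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B) on random
# nested suites A ⊆ B. `covered_branches` is a hypothetical oracle mapping
# a test to its set of covered branch ids.
import random

def spot_check_submodularity(tests, covered_branches, trials=1000, seed=0):
    rng = random.Random(seed)
    cov = {t: covered_branches(t) for t in tests}

    def union(suite):
        return set().union(*(cov[t] for t in suite))

    for _ in range(trials):
        pool = rng.sample(tests, k=rng.randint(1, len(tests) - 1))
        cut = rng.randint(0, len(pool) - 1)
        A, B = set(pool[:cut]), set(pool)        # A ⊆ B by construction
        x = rng.choice([t for t in tests if t not in B])
        gain_A = len(cov[x] - union(A))
        gain_B = len(cov[x] - union(B))
        if gain_A < gain_B:                      # falsifies submodularity
            return False
    return True
```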

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the theoretical and empirical foundations of our work. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [§3] The manuscript supplies no explicit derivation or verification of the diminishing-returns inequality for monotone submodularity. Given stochastic LLM generation, the marginal gain is not a fixed set function; this must be shown in expectation or via a deterministic surrogate.

    Authors: We agree the derivation should be more explicit. Section 3.2 states that the coverage objective is monotone submodular and sketches the proof for the deterministic case. In revision we will expand this into a full step-by-step derivation of the inequality f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B) for A ⊆ B, then extend it to the stochastic setting by showing the inequality holds in expectation over the LLM sampling distribution. We will also introduce a deterministic surrogate (expected coverage under temperature sampling) to ground both the greedy inference and RL objective. revision: yes

  2. Referee: [§5] The reported percentage improvements lack statistical tests, variance across seeds, or explicit data-exclusion rules, making it impossible to assess robustness.

    Authors: We accept this criticism. The revised version will report means and standard deviations over at least five independent random seeds for all metrics. We will add paired statistical tests (Wilcoxon signed-rank) with p-values to confirm significance of the reported gains. We will also state the exact data-exclusion criteria (only environment-level execution failures are excluded; all model-generated tests are retained) and fix the sampling parameters (temperature 0.7, fixed prompt ordering) used throughout the experiments. revision: yes
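A sketch of the paired test the rebuttal commits to, using scipy's `wilcoxon` on per-seed branch-coverage pairs. The numbers are illustrative placeholders, not results from the paper; note that with only five pairs the smallest attainable two-sided p-value is 1/16, so more seeds buy more resolution.

```python
# Paired Wilcoxon signed-rank test over per-seed branch coverage.
# Values below are illustrative placeholders, not paper results.
from statistics import mean, stdev
from scipy.stats import wilcoxon

baseline = [41.2, 39.8, 42.5, 40.1, 43.0]   # base LLM, coverage % per seed
treated  = [58.7, 57.1, 60.2, 56.9, 59.4]   # TestDecision, same seeds

stat, p = wilcoxon(baseline, treated)
print(f"baseline {mean(baseline):.1f}±{stdev(baseline):.1f}  "
      f"TestDecision {mean(treated):.1f}±{stdev(treated):.1f}  p={p:.4f}")
```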

Circularity Check

0 steps flagged

No circularity; derivation chain is self-contained

Full rationale

The paper formalizes test suite generation as an MDP whose objective (marginal coverage and pass-rate gain) is asserted to exhibit monotone submodularity as an intrinsic property of the coverage function, permitting the greedy relaxation. This assertion is presented as a mathematical property of the chosen reward rather than a definition that tautologically equates the prediction to the input. The RL training pipeline then learns a policy to maximize that marginal gain, without the policy definition or the reported performance gains being forced by construction from the same fitted parameters. No self-citation chains, ansatz smuggling, or renaming of known results appear in the load-bearing steps. The derivation therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that test coverage forms a monotone submodular set function; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: the test suite generation objective exhibits monotone submodularity
    Invoked to justify the greedy relaxation of the NP-hard problem.

pith-pipeline@v0.9.0 · 5636 in / 1257 out tokens · 30599 ms · 2026-05-13T21:33:12.139386+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
