pith. machine review for the scientific record.

arxiv: 2604.05560 · v1 · submitted 2026-04-07 · 💻 cs.SE

Recognition: no theorem link

An Iterative Test-and-Repair Framework for Competitive Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:41 UTC · model grok-4.3

classification 💻 cs.SE
keywords code generation · competitive programming · LLM debugging · test generation · iterative repair · APPS benchmark · CodeContests · zero-shot evaluation

The pith

FixAudit lets a single 7B model outperform 32B baselines in competitive code generation by iteratively auditing its own candidates for bugs and repairing them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FixAudit to improve LLM performance on competitive programming tasks. It trains one model to play two roles: an Auditor that inspects a candidate program to generate tests targeting its specific flaws, and a Fixer that uses any failing test to edit the program. The process starts with one initial candidate and repeats the audit-repair loop until a strong solution emerges. This contrasts with earlier methods that generate tests without seeing the code and always create new candidates from scratch. The result reported is that the 7B model using FixAudit exceeds the average results of a 32B model in the same family on zero-shot evaluation across standard benchmarks.
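To make the loop concrete, here is a minimal sketch of the inference-time cycle as this review describes it. The function names and interfaces are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Tuple

def fixaudit_loop(
    spec: str,
    candidate: str,
    audit: Callable[[str, str], Tuple[str, str]],  # (spec, code) -> proposed (input, expected output)
    fix: Callable[[str, str, str, str], str],      # (spec, code, input, expected) -> repaired code
    execute: Callable[[str, str], str],            # (code, input) -> actual output
    max_iters: int = 8,
) -> str:
    """Hypothetical sketch of FixAudit's audit-repair cycle on one candidate."""
    for _ in range(max_iters):
        x, y = audit(spec, candidate)           # Auditor reads the code, proposes a complete test
        if execute(candidate, x) == y:          # the test fails to expose a bug
            return candidate                    # accept the candidate for this round
        candidate = fix(spec, candidate, x, y)  # Fixer edits the same program, no regeneration
    return candidate
```

Note the contrast with description-only test generation: because `audit` receives the candidate code, it can target the specific implementation rather than generic properties of the problem.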

Core claim

FixAudit approaches competitive code generation from a new perspective: starting from a single initial candidate, it iteratively improves the candidate through a targeted test-and-repair debugging cycle. The framework trains one shared model with two specialized roles through four stages: the Fixer, which repairs the current candidate based on a failing test, and the Auditor, which reads the candidate code to generate new tests that expose its remaining bugs. We evaluate FixAudit on three benchmarks: APPS, CodeContests, and xCodeEval. Applied to a 7B model, the framework surpasses the average performance of the larger 32B baseline within the same model family under the zero-shot setting.

What carries the argument

The iterative Auditor-Fixer cycle on a shared model, where the Auditor inspects candidate code to produce exposing tests and the Fixer edits the code using those failures.

If this is right

  • A 7B model using the iterative cycle can exceed the average zero-shot performance of a 32B model in the same family on APPS, CodeContests, and xCodeEval.
  • Average Pass@1 improves by 35.1% to 36.8% over strong baselines built on the same 7B base model.
  • Average AvgPassRatio improves by 7.1% to 24.5% over those baselines (both metrics are sketched just after this list).
  • The model learns to repair existing candidates instead of regenerating every solution from scratch each time.
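For readers unfamiliar with the two metrics above, a minimal sketch of how they are conventionally computed; the paper's exact definitions are not reproduced on this page, so treat the details as assumptions.

```python
from typing import List

def pass_at_1(solved: List[bool]) -> float:
    """Fraction of problems whose single submitted program passes every hidden test."""
    return sum(solved) / len(solved)

def avg_pass_ratio(ratios: List[float]) -> float:
    """Mean, over problems, of the fraction of hidden tests each program passes."""
    return sum(ratios) / len(ratios)

# Invented example: per-problem hidden-test pass fractions for four problems.
ratios = [1.0, 0.6, 1.0, 0.25]
print(pass_at_1([r == 1.0 for r in ratios]))  # 0.5
print(avg_pass_ratio(ratios))                 # 0.7125
```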

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach trades model size for inference-time iteration, suggesting smaller models may suffice when self-audit loops are available.
  • The same Auditor-Fixer pattern could be tested on non-code generation tasks that benefit from self-correction, such as mathematical proof writing.
  • Because the Auditor sees the code, the generated tests may cover edge cases that human-written or description-only tests routinely miss.

Load-bearing premise

The Auditor can read the candidate code and produce tests that expose implementation-specific bugs missed by ordinary test suites, and the repair step then yields genuine fixes rather than overfitting to the model's own tests.
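One direct probe of this premise (a sketch under assumed interfaces, not a protocol from the paper): measure the gap between a candidate's pass rate on the model's own generated tests and its pass rate on an independent hidden suite. A repaired candidate that satisfies its own tests but not the hidden suite is overfitting rather than being fixed.

```python
from typing import Callable, List, Tuple

Test = Tuple[str, str]  # (input, expected output)

def self_test_gap(
    candidate: str,
    own_tests: List[Test],               # tests the Auditor generated for this candidate
    hidden_tests: List[Test],            # independent benchmark suite
    execute: Callable[[str, str], str],  # (code, input) -> output
) -> float:
    """Pass rate on self-generated tests minus pass rate on hidden tests.

    Values near zero suggest the repairs generalize; a large positive gap
    suggests the loop rewards patterns that only satisfy its own tests.
    """
    def rate(tests: List[Test]) -> float:
        return sum(execute(candidate, x) == y for x, y in tests) / len(tests)

    return rate(own_tests) - rate(hidden_tests)
```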

What would settle it

Running the 7B FixAudit model on a fresh competitive-programming benchmark and finding no gain, or a loss, in Pass@1 or AvgPassRatio relative to the 32B baseline from the same family.

Figures

Figures reproduced from arXiv: 2604.05560 by He Ye, Lingfeng Bao, Lingxiao Tang, Muyang Ye, Xiaoxue Ren, Zhaoyang Chu, Zhongxin Liu.

Figure 1
Figure 1. The motivation example: the candidate computes only the maximum middle gap, (a_i − a_{i−1})/2.0. Because this partial logic is sufficient to pass all provided public tests, the candidate appears to be correct. However, a severe logic flaw is hidden in this code: it completely forgets to check the distance from the start of the street to the first lantern (a_0), as illustrated by the adversarial test case at the bottom of the figure. view at source ↗
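To make the figure's example concrete, here is a hedged reconstruction of the described bug; the problem resembles the classic street-lanterns task, and all names and values are illustrative assumptions, not code from the paper.

```python
def min_radius_buggy(street_len: float, lanterns: list[float]) -> float:
    """Figure 1's candidate: only considers the maximum gap BETWEEN lanterns."""
    a = sorted(lanterns)
    return max(((a[i] - a[i - 1]) / 2.0 for i in range(1, len(a))), default=0.0)

def min_radius_fixed(street_len: float, lanterns: list[float]) -> float:
    """Repaired logic: also covers both ends of the street."""
    a = sorted(lanterns)
    mid = max(((a[i] - a[i - 1]) / 2.0 for i in range(1, len(a))), default=0.0)
    return max(a[0], street_len - a[-1], mid)

# Adversarial test in the spirit of Figure 1: the first lantern is far from
# the start of the street, so the middle-gap-only candidate underestimates.
print(min_radius_buggy(10.0, [8.0, 9.0]))  # 0.5 (wrong: start of street unlit)
print(min_radius_fixed(10.0, [8.0, 9.0]))  # 8.0
```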
Figure 2
Figure 2. Overview of FixAudit. Unlike CURE, where the Tester generates inputs based only on the problem description, the Auditor in FixAudit reads the candidate code. Given the specification S and the Fixer's repaired candidate P′, the Auditor must propose a complete test case, containing both an input x and its expected output y, that breaks P′. Because the Auditor sees the actual code, it can focus on the … view at source ↗
Figure 3
Figure 3. Pass@1 performance as the number of iterations increases. view at source ↗
Figure 4
Figure 4. A case study on the problem, “Nearest Integer Not … view at source ↗
Original abstract

Large language models (LLMs) have made remarkable progress in code generation, but competitive programming remains a challenge. Recent training-based methods have improved code generation by using reinforcement learning (RL) with execution feedback. The more recent framework CURE further incorporates test generation into the training process, jointly training a Coder and a Tester within a single model. At inference time, the Coder generates many candidate programs, and the Tester generates tests from the problem description. The candidate that passes the most generated tests is selected as the final answer. However, CURE has two critical limitations. First, the Tester never reads any candidate code, so its tests often fail to expose implementation-specific bugs. Second, the Coder generates every candidate from scratch and never learns to fix a buggy program based on a failed test. To address these limitations, we propose FixAudit, which approaches competitive code generation from a new perspective: starting from a single initial candidate, it iteratively improves the candidate through a targeted test-and-repair debugging cycle. The framework trains one shared model with two specialized roles through four stages: the Fixer, which repairs the current candidate based on a failing test, and the Auditor, which reads the candidate code to generate new tests that expose its remaining bugs. We evaluate FixAudit on three benchmarks: APPS, CodeContests, and xCodeEval. Applied to a 7B model, the framework surpasses the average performance of the larger 32B baseline within the same model family under the zero-shot setting. Compared to strong baselines built on the same 7B base model, FixAudit improves average Pass@1 by 35.1% to 36.8% and average AvgPassRatio by 7.1% to 24.5%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FixAudit, an iterative test-and-repair framework for competitive code generation with LLMs. It addresses limitations in prior work like CURE by training a single model with specialized Fixer and Auditor roles across four stages: the Auditor generates tests by reading candidate code to expose implementation-specific bugs, and the Fixer learns repairs from failing tests. The approach starts from an initial candidate and iteratively improves it. Evaluations on APPS, CodeContests, and xCodeEval show a 7B model surpassing the average performance of a 32B baseline in the same family under zero-shot settings, with Pass@1 improvements of 35.1% to 36.8% and AvgPassRatio gains of 7.1% to 24.5% over strong same-size baselines.

Significance. If the results hold under rigorous controls, the work demonstrates that code-aware test generation combined with targeted repair can substantially boost smaller LLMs on competitive programming tasks, offering an alternative to model scaling. The shared-model multi-role training and iterative debugging cycle represent a practical advance in training methodologies for code generation, with potential for broader application in software engineering tasks requiring robustness to hidden tests.

major comments (3)
  1. [Experimental Evaluation] Experimental Evaluation section: The central performance claims (7B surpassing 32B baseline; 35.1–36.8% Pass@1 lift) are reported without details on run variance, number of trials, baseline implementation specifics (e.g., prompting for the 32B model), or statistical tests. This absence directly undermines assessment of whether the gains are robust or reproducible.
  2. [Training Methodology] Training Methodology (four-stage process): The Auditor reads candidate code to generate tests and the Fixer repairs based on failures, but the manuscript provides no description of data splits, negative sampling, or hold-out mechanisms to prevent the loop from rewarding patterns that pass self-generated tests without improving behavior on the benchmarks' independent hidden suites. This is load-bearing for the generalization claim.
  3. [Baseline Comparisons] Baseline Comparisons: The zero-shot claim that FixAudit on 7B exceeds the 32B model is central to the significance argument, yet the text lacks explicit controls for equivalent test usage, candidate sampling budgets, or prompt engineering differences between the 7B and 32B evaluations.
minor comments (2)
  1. [Abstract] Abstract: AvgPassRatio is used in the quantitative claims but is not defined or motivated in the abstract or early sections, which may hinder quick comprehension of the metric's meaning.
  2. [Introduction] Introduction: The motivation section could more explicitly contrast the proposed iterative cycle against CURE's non-iterative selection to clarify the novelty in the repair loop.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for improving the rigor and reproducibility of our experimental claims. We address each point below and commit to revisions that strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [Experimental Evaluation] Experimental Evaluation section: The central performance claims (7B surpassing 32B baseline; 35.1–36.8% Pass@1 lift) are reported without details on run variance, number of trials, baseline implementation specifics (e.g., prompting for the 32B model), or statistical tests. This absence directly undermines assessment of whether the gains are robust or reproducible.

    Authors: We agree that these details are essential for assessing robustness. In the revised manuscript, we will report standard deviations across multiple independent runs (e.g., 5 seeds), specify the exact number of trials per experiment, provide the precise prompting templates and sampling parameters used for the 32B baseline, and include statistical significance tests (paired t-tests with p-values) comparing FixAudit against baselines. These additions will be placed in the Experimental Evaluation section and its associated tables. revision: yes

  2. Referee: [Training Methodology] Training Methodology (four-stage process): The Auditor reads candidate code to generate tests and the Fixer repairs based on failures, but the manuscript provides no description of data splits, negative sampling, or hold-out mechanisms to prevent the loop from rewarding patterns that pass self-generated tests without improving behavior on the benchmarks' independent hidden suites. This is load-bearing for the generalization claim.

    Authors: We recognize that explicit safeguards against self-test overfitting are critical. The current training uses benchmark-provided hidden test suites for final evaluation and incorporates negative examples from failed repairs, but the manuscript lacks a dedicated description. We will expand the Training Methodology section to detail the data splits (train/validation/hold-out), negative sampling procedure (sampling incorrect code-test pairs), and hold-out mechanisms that ensure the Auditor's tests are validated against independent suites before use in the loop. This clarification will directly support the generalization claims. revision: yes

  3. Referee: [Baseline Comparisons] Baseline Comparisons: The zero-shot claim that FixAudit on 7B exceeds the 32B model is central to the significance argument, yet the text lacks explicit controls for equivalent test usage, candidate sampling budgets, or prompt engineering differences between the 7B and 32B evaluations.

    Authors: We agree that fair comparison requires explicit controls. In revision, we will add a dedicated subsection under Baseline Comparisons that specifies: (1) identical test budgets and sampling strategies across models, (2) the exact number of candidates generated per problem for both 7B and 32B, and (3) standardized prompt templates with no additional engineering for the larger model. These controls will be documented to ensure the 7B vs. 32B comparison is equitable under zero-shot conditions. revision: yes
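The statistical comparison promised in response 1 above can be illustrated with a short sketch; the numbers are invented placeholders, and the actual per-problem scores, seeds, and test choice are for the revision to specify.

```python
# Paired t-test over per-problem scores from matched runs, as the authors
# propose. Requires scipy; all numbers below are invented placeholders.
from scipy import stats

fixaudit_scores = [0.82, 0.61, 0.90, 0.45, 0.73]  # per-problem AvgPassRatio, FixAudit
baseline_scores = [0.74, 0.58, 0.85, 0.40, 0.69]  # same problems, baseline

t_stat, p_value = stats.ttest_rel(fixaudit_scores, baseline_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```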

Circularity Check

0 steps flagged

No circularity; empirical performance on public benchmarks

full rationale

The paper describes a four-stage training process for a shared model with Fixer and Auditor roles, then reports direct empirical metrics (Pass@1, AvgPassRatio) on the public benchmarks APPS, CodeContests, and xCodeEval against external baselines. No equations, fitted parameters, or self-referential definitions appear in the derivation chain. Central claims rest on held-out test-suite execution rather than quantities that reduce to the model's own generated tests by construction. No self-citation load-bearing steps or uniqueness theorems are invoked. This matches the default case of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about LLM fine-tuning and benchmark validity rather than new entities or heavily fitted parameters.

axioms (1)
  • domain assumption A single shared model can be effectively trained to perform both code repair and code-aware test generation roles without significant interference.
    The framework explicitly trains one model with two specialized roles through four stages.
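A toy illustration of this assumption: one set of weights serves both roles, distinguished only by the prompt. Everything here, including the prompt wording, is invented for illustration and is not the paper's training setup.

```python
from typing import Callable

def make_roles(generate: Callable[[str], str]):
    """Wrap a single LLM call into an Auditor and a Fixer via role prompts."""
    def auditor(spec: str, code: str) -> str:
        return generate(
            "You are an Auditor. Read the candidate code and propose one "
            "complete test case (input and expected output) that breaks it.\n"
            f"Problem:\n{spec}\nCandidate:\n{code}"
        )

    def fixer(spec: str, code: str, failing_test: str) -> str:
        return generate(
            "You are a Fixer. Edit the candidate so it passes the failing test.\n"
            f"Problem:\n{spec}\nCandidate:\n{code}\nFailing test:\n{failing_test}"
        )

    return auditor, fixer  # same underlying model, two specialized roles
```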

pith-pipeline@v0.9.0 · 5640 in / 1291 out tokens · 67791 ms · 2026-05-10T19:41:05.510343+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 23 canonical work pages · 10 internal anchors

  1. [1]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021)

  2. [2]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901

  3. [3]

    Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023. CodeT: Code Generation with Generated Tests. The Eleventh International Conference on Learning Representations (2023)

  4. [4]

    Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, and Xin Xia. 2025. Reasoning runtime behavior of a program with LLM: How far are we? In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 1869–1881

  5. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  6. [7]

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024. Teaching Large Language Models to Self-Debug. The Twelfth International Conference on Learning Representations (2024)

  7. [8]

    Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. 2024. ChatUniTest: A Framework for LLM-Based Test Generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 572–576

  8. [9]

    Yangruibo Ding, Jinjun Peng, Marcus Min, Gail Kaiser, Junfeng Yang, and Baishakhi Ray. 2024. SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning. Advances in Neural Information Processing Systems 37 (2024), 60275–60308

  9. [10]

    Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2024. Self-Collaboration Code Generation via ChatGPT. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–38

  10. [11]

    Shihan Dou, Yan Liu, Haoxiang Jia, Enyu Zhou, Limao Xiong, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, et al. 2024. StepCoder: Improving code generation with reinforcement learning from compiler feedback. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4571–4585

  11. [12]

    Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic Test Suite Generation for Object-Oriented Software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. 416–419

  12. [13]

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. 2024. RLEF: Grounding code LLMs in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089 (2024)

  13. [14]

    Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065 (2024)

  14. [15]

    Siqi Gu, Chunrong Fang, Quanjun Zhang, Fangyuan Tian, and Zhenyu Chen. 2024. TestArt: Improving LLM-Based Unit Test via Co-Evolution of Automated Generation and Repair Iteration. arXiv e-prints (2024)

  16. [17]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, Y. K. Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming—The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024)

  17. [18]

    Qi Guo, Xiaofei Xie, Shangqing Liu, Ming Hu, Xiaohong Li, and Lei Bu. 2025. Intention is All You Need: Refining Your Code from Your Intention. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society

  18. [19]

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=sD...

  19. [20]

    Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, and Heming Cui. 2023. AgentCoder: Multi-Agent-Based Code Generation with Iterative Testing and Optimisation. arXiv preprint arXiv:2312.13010 (2023)

  20. [21]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186 (2024)

  21. [22]

    Md. Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. 2024. MapCoder: Multi-Agent Code Generation for Competitive Problem Solving. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4912–4944

  22. [23]

    Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. 2023. TiCoder: Teaching Code Generation with Test Inspections. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics

  23. [24]

    Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. 2024. Self-Planning Code Generation with Large Language Models. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–30

  24. [25]

    Mohammad Abdullah Matin Khan, M. Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2024. XCodeEval: An Execution-Based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Lo...

  25. [26]

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 21314–21328

  26. [28]

    Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. 2025. S*: Test Time Scaling for Code Generation. arXiv preprint arXiv:2502.14382 (2025)

  27. [29]

    Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023. TACO: Topics in Algorithmic Code Generation Dataset. arXiv preprint arXiv:2312.14852 (2023)

  28. [30]

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Remi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-Level Code Generation with AlphaCode. Science 378, 6624 (2022), 1092–1097

  30. [32]

    Feng Lin, Dong Jae Kim, and Tse-Hsun (Peter) Chen. 2025. SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents. In 2025 IEEE/ACM 46th International Conference on Software Engineering (ICSE)

  31. [33]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. Advances in Neural Information Processing Systems 36 (2023), 21558–21572

  32. [34]

    Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, and Deheng Ye. 2023. RLTF: Reinforcement learning from unit test feedback. arXiv preprint arXiv:2307.04349 (2023)

  33. [35]

    Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, and Jie Tang. 2025. Dynamic Scaling of Unit Tests for Code Reward Modeling. arXiv preprint arXiv:2501.01054 (2025)

  34. [36]

    Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification. Proceedings of the ACM on Software Engineering 1, FSE (2024), 2332–2354

  35. [37]

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. 2025. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 20286–20332

  36. [38]

    Aniyan Ni, Pengcheng Yin, Yu Su, and Xiaodong Yin. 2023. LEVER: Learning to Verify Language-to-Code Generation with Execution. In International Conference on Machine Learning

  37. [39]

    Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2024. Is Self-Repair a Silver Bullet for Code Generation?. In The Twelfth International Conference on Learning Representations

  38. [40]

    OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/

  39. [41]

    Carlos Pacheco and Michael D. Ernst. 2007. Randoop: Feedback-Directed Random Testing for Java. In Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications Companion. 815–816

  40. [42]

    Tal Ridnik, Dedy Kredo, and Itamar Friedman. 2024. Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering. arXiv preprint arXiv:2401.08500 (2024)

  41. [43]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  43. [45]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300 (2024). doi:10.48550/arXiv.2402.03300

  44. [46]

    Jianjian Sheng et al. 2024. veRL: Volcano Engine Reinforcement Learning for LLM. https://github.com/volcengine/verl

  45. [47]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 36

  46. [48]

    Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. 2023. Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:2301.13816 (2023)

  47. [49]

    Benjamin Steenhoek, Michele Tufano, Neel Sundaresan, and Alexey Svyatkovskiy. 2024. Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation. In 2024 IEEE/ACM International Workshop on Deep Learning for Testing and Testing for Deep Learning (DeepTest)

  49. [51]

    Qwen Team. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning. https://qwenlm.github.io/blog/qwq-32b/

  50. [52]

    Zhao Tian and Junjie Chen. 2026. Aligning Requirement for Large Language Model’s Code Generation. In 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE ’26). doi:10.1145/3744916.3764572

  51. [53]

    Zhao Tian, Junjie Chen, and Xiangyu Zhang. 2025. Fixing Large Language Models’ Specification Misunderstanding for Better Code Generation. In 2025 IEEE/ACM 46th International Conference on Software Engineering (ICSE)

  52. [54]

    Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2020. Unit Test Case Generation with Transformers and Focal Context. arXiv preprint arXiv:2009.05617 (2020)

  53. [55]

    Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang. 2025. Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning. arXiv preprint arXiv:2506.03136 (2025)

  54. [56]

    Bingyang Wei. 2024. Requirements are All You Need: From Requirements to Code with LLMs. arXiv preprint arXiv:2406.10101 (2024)

  55. [57]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi T...

  56. [58]

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  57. [59]

    Huan Zhang, Wei Cheng, Yuhan Wu, and Wei Hu. 2024. A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1319–1331

  58. [60]

    Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. 2024. CodeDPO: Aligning Code Models with Self Generated and Verified Source Code. arXiv preprint arXiv:2410.05605 (2024)

  59. [61]

    Kechi Zhang, Ge Li, Jia Li, Yihong Dong, and Zhi Jin. 2025. Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points. arXiv preprint arXiv:2502.11475 (2025)

  60. [62]

    Kechi Zhang, Ge Li, Jia Li, Hu Zhu, and Zhi Jin. 2023. Self-Edit: Fault-Aware Code Editor for Code Generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics

  61. [63]

    Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan. 2023. Planning with Large Language Models for Code Generation. In The Eleventh International Conference on Learning Representations

  62. [64]

    Li Zhong, Zilong Wang, and Weiyi Shang. 2024. Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step. In Findings of the Association for Computational Linguistics: ACL 2024