pith. machine review for the scientific record.

arxiv: 2604.04580 · v1 · submitted 2026-04-06 · 💻 cs.SE

Recognition: no theorem link

Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:08 UTC · model grok-4.3

classification 💻 cs.SE
keywords: repository-level issue resolution · code repair · test evolution · coevolution · multi-agent LLM systems · behavioral constraints · SWE-bench · bug fixing

The pith

Repository-level issue resolution requires coevolving code and tests rather than optimizing against fixed tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that engineers fixing bugs in large repositories do not hold existing tests as unchanging oracles. They instead refine both the code and the tests together as they uncover missing assumptions or misread failure conditions in the original report. Most current LLM repair systems keep tests fixed and apply them only as post-generation filters, which produces under-constrained searches and brittle fixes. The authors introduce a multi-agent framework that explores code and test changes jointly, using mutual evaluation to narrow the space of behaviors matching the issue. Experiments show this yields higher repair success and better test reproduction on standard benchmarks than methods that freeze the tests.

Core claim

The authors claim that repository-level issue resolution is fundamentally not optimization under fixed tests, but search over evolving behavioral constraints. They operationalize this view with Agent-CoEvo, a coevolutionary multi-agent framework in which candidate code patches and test patches are jointly explored, iteratively refined through mutual evaluation, and recombined semantically to narrow the space of behavior consistent with the issue description. On SWE-bench Lite and SWT-bench Lite, the framework outperforms state-of-the-art agent-based and agentless baselines in both repair success and test reproduction quality.

What carries the argument

Agent-CoEvo, a coevolutionary multi-agent framework that treats tests as dynamic constraints which both guide and are revised by the code repair process.
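
A minimal, self-contained Python sketch of how such a loop could be wired, under editorial assumptions: the Patch/Test containers, the run stub, the recombination operator, and the selection rule below are hypothetical stand-ins for the LLM-backed and sandboxed components the paper describes only at a high level, not the authors' implementation.

```python
# Hypothetical sketch of the coevolution cycle, not the authors' implementation.
# Every operator below stands in for LLM-backed generation/recombination or
# sandboxed test execution, which the paper describes only at a high level.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Patch:
    diff: str

@dataclass(frozen=True)
class Test:
    source: str

def run(test: Test, patch: Patch | None) -> str:
    """Stand-in for sandboxed execution; patch=None means the unpatched repo."""
    return random.choice(["PASS", "FAIL"])  # placeholder outcome

def recombine(a: Patch, b: Patch) -> Patch:
    """Stand-in for LLM-driven semantic recombination of two partial fixes."""
    return Patch(a.diff + "\n" + b.diff)

def coevolve(code_pop: list[Patch], test_pop: list[Test], rounds: int = 5) -> Patch:
    # Eq. (1) from Figure 1: keep only tests that fail on the buggy repository.
    test_pop = [t for t in test_pop if run(t, None) == "FAIL"]
    for _ in range(rounds):
        # Mutual evaluation: score each patch by the retained tests it satisfies.
        # (Test-side refinement is omitted here for brevity.)
        score = {p: sum(run(t, p) == "PASS" for t in test_pop) for p in code_pop}
        code_pop.sort(key=score.__getitem__, reverse=True)
        code_pop = code_pop[: max(2, len(code_pop) // 2)]      # selection pressure
        code_pop.append(recombine(code_pop[0], code_pop[1]))   # semantic recombination
    return code_pop[0]

print(coevolve([Patch("fix-a"), Patch("fix-b"), Patch("fix-c")],
               [Test("t1"), Test("t2")]).diff)
```

The one structural commitment this sketch does make, following the abstract, is that tests are filtered and revised inside the loop rather than fixed up front.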

If this is right

  • Higher repair success rates on SWE-bench Lite and SWT-bench Lite than baselines that keep tests fixed.
  • Improved quality of reproduced tests that better align with the original issue description.
  • Fewer brittle or overfitted fixes because constraints evolve with the code.
  • A shift in automated repair from code-only optimization to coevolution of implementation and specification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar coevolution of implementation and specification could apply to other incomplete-specification settings such as API evolution or formal method assistance.
  • Repair benchmarks may need new metrics that measure alignment with revised intent rather than only passage of the original tests.
  • Interactive tools could let developers review and steer the evolved constraints to prevent unintended drift.

Load-bearing premise

Mutual evaluation and semantic recombination between code and test candidates will reliably narrow the space of behavior consistent with the issue description without introducing new inconsistencies or losing the original intent.
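
One operational reading of that premise, under editorial assumptions rather than the paper's text: a patch-test pair is mutually consistent when the test fails on the buggy code and passes once the patch is applied, and mutual evaluation keeps only such pairs. A toy sketch:

```python
# Toy mutual-consistency table; `outcome` is an assumed stand-in for sandboxed
# test execution, not the paper's evaluation harness.
def consistent_pairs(patches, tests, outcome):
    """Keep (patch, test) pairs where the test fails pre-patch and passes post-patch."""
    return [(p, t)
            for p in patches
            for t in tests
            if outcome(t, patch=None) == "FAIL" and outcome(t, patch=p) == "PASS"]

# Usage with opaque ids and a trivial oracle shaped like the assumed callable.
pairs = consistent_pairs(
    ["patch_a", "patch_b"], ["test_1", "test_2"],
    outcome=lambda t, patch: "FAIL" if patch is None else "PASS")
print(pairs)
```

The risk the premise carries is visible even here: the table can be fully satisfied by pairs that drift from the issue's intent, since nothing in it consults the issue description.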

What would settle it

A concrete case on the evaluation benchmarks in which the coevolved tests accept a patch that independent human review judges not to address the reported issue, or in which the evolved tests diverge from the bug report in ways that mask the actual fault.

Figures

Figures reproduced from arXiv: 2604.04580 by Kefan Li, Mengfei Wang, Mu Li, Ping Yang, Shihao Zheng, Weifeng Lv, Wei Wang, Yuan Yuan.

Figure 1. Overview of the Agent-CoEvo framework. From the accompanying text: among generated test patches, only those that fail on the buggy repository R are retained, P_test ← { t ∈ P_test | Run(R, t) = FAIL } (1); this filtering ensures that generated tests are behaviorally relevant to the reported issue. Importantly, this step does not assume that tests are complete or fully correct specifications; rather, their adequacy is progressively refined during coevolution… (Eq. (1) is transcribed as code after this figure list.)
Figure 2. Venn diagram illustrating the overlap of resolved…
Figure 3. Evaluation of resolved rates and average costs across five evolutionary iterations on SWE-bench Lite.
Figure 4. The issue description. The issue concerns the solve_poly_system function, which computes solutions for systems of polynomial equations. The defect arises from the absence of a dimensionality validation mechanism. When the number of equations is fewer than the number of variables, the system is underdetermined and admits infinitely many solutions. The original implementation failed to detect this condition…
Figure 5. Evolution of code patches. Iteration 2 produces two partial hypotheses; Iteration 3 synthesizes them…
Figure 6. The refinement of test patches. The crossover mechanism aligns the test cases with the correct…
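
Equation (1) in the Figure 1 caption is a plain fail-on-the-buggy-repo filter. Transcribed directly below; `run_test` is an assumed callable that executes one test patch against the unpatched repository R, not the paper's actual harness.

```python
# Direct transcription of Eq. (1); run_test is an assumed sandbox callable that
# returns "FAIL" when the test reproduces the reported misbehavior on R.
def filter_reproducing(test_patches, run_test):
    return [t for t in test_patches if run_test(t) == "FAIL"]
```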
Original abstract

Software engineers resolving repository-level issues do not treat existing tests as immutable correctness oracles. Instead, they iteratively refine both code and the tests used to characterize intended behavior, as new modifications expose missing assumptions or misinterpreted failure conditions. In contrast, most existing large language model (LLM)-based repair systems adopt a linear pipeline in which tests or other validation signals act mostly as post-hoc filters, treating behavioral constraints as fixed during repair. This formulation reduces repair to optimizing code under static and potentially misaligned constraints, leading to under-constrained search and brittle or overfitted fixes. We argue that repository-level issue resolution is fundamentally not optimization under fixed tests, but search over evolving behavioral constraints. To operationalize this view, we propose Agent-CoEvo, a coevolutionary multi-agent framework in which candidate code patches and test patches are jointly explored and iteratively refined. Rather than treating tests as immutable oracles, our framework models them as dynamic constraints that both guide and are revised by the repair process. Through mutual evaluation and semantic recombination, code and test candidates progressively narrow the space of behavior consistent with the issue description. Evaluated on SWE-bench Lite and SWT-bench Lite, Agent-CoEvo consistently outperforms state-of-the-art agent-based and agentless baselines in both repair success and test reproduction quality. Our findings suggest that enabling repair agents to revise behavioral constraints during search is critical for reliable issue resolution, pointing toward a shift from code-only optimization to coevolution of implementation and specification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that repository-level issue resolution is not optimization under fixed tests but search over evolving behavioral constraints. It proposes Agent-CoEvo, a multi-agent coevolutionary framework in which code patches and test patches are jointly explored and refined via mutual evaluation and semantic recombination to narrow the behavior space consistent with the issue description. On SWE-bench Lite and SWT-bench Lite, Agent-CoEvo outperforms state-of-the-art agent-based and agentless baselines in both repair success and test reproduction quality.

Significance. If the central claims hold, this work would be significant for automated program repair and LLM-based software engineering agents by shifting from linear fixed-oracle pipelines to dynamic coevolution of code and tests, aligning better with developer practice. A clear strength is the use of external benchmarks (SWE-bench Lite, SWT-bench Lite) rather than self-referential quantities, avoiding circularity. The empirical outperformance, if robustly validated, would support the broader claim that revising behavioral constraints during search improves reliability.

major comments (3)
  1. [Abstract] Abstract: The claim that 'through mutual evaluation and semantic recombination, code and test candidates progressively narrow the space of behavior consistent with the issue description' is load-bearing for the central thesis but provides no concrete mechanism (e.g., entailment checks against the bug report, embedding similarity, or rejection sampling on original failure conditions) to prevent test drift or loss of original intent. Without this, reported gains in repair success and test reproduction quality could stem from expanded search rather than faithful coevolution.
  2. [Evaluation section] Evaluation on SWE-bench Lite and SWT-bench Lite: No details are given on experimental controls for how test revisions are validated against the original issue intent, whether post-hoc adjustments occurred, or how 'test reproduction quality' is measured to confirm fidelity. This directly weakens support for the outperformance claims and the assertion that coevolution is critical for reliable resolution.
  3. [Agent-CoEvo framework] Framework description: The mutual evaluation step is presented as reliably narrowing consistent behavior, yet the manuscript does not specify rejection criteria or grounding steps that would address the risk of introducing new inconsistencies while preserving the bug report semantics. This is load-bearing for the weakest assumption in the coevolution operationalization.
minor comments (2)
  1. [Abstract] The term 'test reproduction quality' is used in the abstract and results without an explicit definition or formula in the provided text; add a precise metric description early in the evaluation section.
  2. Figure or table captions (if present in the full manuscript) should explicitly state the number of runs, statistical significance tests, and baseline configurations to improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies important areas for clarification in our presentation of the coevolution approach. We agree that greater specificity on mechanisms and controls will strengthen the manuscript and will make the requested revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'through mutual evaluation and semantic recombination, code and test candidates progressively narrow the space of behavior consistent with the issue description' is load-bearing for the central thesis but provides no concrete mechanism (e.g., entailment checks against the bug report, embedding similarity, or rejection sampling on original failure conditions) to prevent test drift or loss of original intent. Without this, reported gains in repair success and test reproduction quality could stem from expanded search rather than faithful coevolution.

    Authors: We acknowledge that the abstract states the high-level claim without enumerating the concrete safeguards. The full framework section describes mutual evaluation via embedding-based semantic similarity to the issue description combined with rejection sampling on consistency with the original failing conditions. We will revise the abstract to include a concise reference to these steps (entailment-style consistency checks and rejection on drift from the reported failure) so the central thesis is better grounded. revision: yes

  2. Referee: [Evaluation section] Evaluation on SWE-bench Lite and SWT-bench Lite: No details are given on experimental controls for how test revisions are validated against the original issue intent, whether post-hoc adjustments occurred, or how 'test reproduction quality' is measured to confirm fidelity. This directly weakens support for the outperformance claims and the assertion that coevolution is critical for reliable resolution.

    Authors: We agree that the evaluation section would benefit from explicit controls. We will add a dedicated paragraph describing: (1) validation of revised tests against the original issue description and failure reproduction on the pre-patch codebase, (2) confirmation that no post-hoc test adjustments were performed after patch generation, and (3) the exact metric for test reproduction quality (semantic alignment plus reproduction of the original failing behavior; one plausible composite is sketched after this list). These additions will directly support the reported gains. revision: yes

  3. Referee: [Agent-CoEvo framework] Framework description: The mutual evaluation step is presented as reliably narrowing consistent behavior, yet the manuscript does not specify rejection criteria or grounding steps that would address the risk of introducing new inconsistencies while preserving the bug report semantics. This is load-bearing for the weakest assumption in the coevolution operationalization.

    Authors: We accept that the current description leaves the rejection criteria implicit. We will expand the framework section with explicit rejection rules: a candidate test patch is rejected if its embedding similarity to the issue description falls below a threshold or if it fails to reproduce the original failure on the unmodified code (a minimal sketch of this gate follows below). We will also include a short example illustrating preservation of bug-report semantics during recombination. revision: yes
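
The reproduction-quality metric promised in response 2 has no formula in the provided text; below is a hedged sketch of one plausible composite, where `similarity` and `fails_on_buggy` are assumed callables and the equal weighting is illustrative, not the paper's definition.

```python
# Hypothetical composite for "test reproduction quality"; the weighting and both
# component callables are editorial assumptions, not the paper's definition.
def reproduction_quality(test_src, issue_text, similarity, fails_on_buggy, w=0.5):
    sem = similarity(test_src, issue_text)           # semantic alignment in [0, 1]
    rep = 1.0 if fails_on_buggy(test_src) else 0.0   # reproduces the original failure?
    return w * sem + (1.0 - w) * rep

# Usage with toy callables shaped like the assumed components.
q = reproduction_quality("def test(): ...", "solve_poly_system issue text",
                         similarity=lambda t, i: 0.8,
                         fails_on_buggy=lambda t: True)
```

The rejection rules promised in response 3 compose into a single accept/reject gate. A minimal sketch, assuming `embed` returns unit-normalized vectors (so a dot product is cosine similarity) and using an illustrative threshold of 0.7 that does not come from the paper:

```python
# Sketch of the stated rejection rules; embed() and reproduces_failure() are
# assumed components, and tau=0.7 is an illustrative threshold only.
def accept_test_patch(test_src, issue_text, embed, reproduces_failure, tau=0.7):
    cosine = sum(x * y for x, y in zip(embed(test_src), embed(issue_text)))
    return cosine >= tau and reproduces_failure(test_src)
```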

Circularity Check

0 steps flagged

No significant circularity; central claim and framework are independent of evaluation metrics

full rationale

The paper reframes issue resolution as coevolution of code and tests, operationalized via the Agent-CoEvo multi-agent framework using mutual evaluation and semantic recombination. This is presented as a conceptual shift from fixed-test optimization, with no equations, fitted parameters, or self-referential definitions in the abstract or described structure. Evaluation relies on external benchmarks (SWE-bench Lite, SWT-bench Lite) and comparisons to independent baselines, not on quantities defined from the method itself. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work are identifiable. The derivation chain remains self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the premise that tests are dynamic constraints that can be revised without losing fidelity to the issue description; no numerical free parameters are mentioned, and the framework itself is the primary invented construct.

invented entities (1)
  • Agent-CoEvo multi-agent coevolutionary framework (no independent evidence)
    purpose: To jointly explore and refine code patches and test patches via mutual evaluation and semantic recombination
    Newly introduced in the paper as the operationalization of the coevolution view

pith-pipeline@v0.9.0 · 5584 in / 1284 out tokens · 68876 ms · 2026-05-10T20:08:28.995836+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 33 canonical work pages · 8 internal anchors

  1. [1] Vaibhav Aggarwal, Ojasv Kamal, Abhinav Japesh, Zhijing Jin, and Bernhard Schölkopf. 2025. Dars: Dynamic action re-sampling to enhance coding agent performance by adaptive tree traversal. arXiv preprint arXiv:2503.14269 (2025).
  2. [2] Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, and Martin Hirzel. 2025. Otter: Generating Tests from Issues to Validate SWE Patches. arXiv preprint arXiv:2502.05368 (2025).
  3. [3] Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, and Martin Hirzel. 2025. Heterogeneous Prompting and Execution Feedback for SWE Issue Test Generation and Selection. arXiv preprint arXiv:2508.06365 (2025).
  4. [4] Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, and Saurabh Sinha. 2024. TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved? arXiv preprint arXiv:2412.02883 (2024).
  5. [5] Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, and William Wang. 2024. Swe-search: Enhancing software agents with monte carlo tree search and iterative refinement. arXiv preprint arXiv:2410.20285 (2024).
  6. [6] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
  7. [7] Earl T Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2014. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering 41, 5 (2014), 507–525.
  8. [8] Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. Repairagent: An autonomous, llm-based agent for program repair. arXiv preprint arXiv:2403.17134 (2024).
  9. [9] Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, et al. 2024. Coder: Issue resolving with multi-agent and task graphs. arXiv preprint arXiv:2406.01304 (2024).
  10. [10] Mark Chen. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  11. [11] Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, and Qianxiang Wang. 2025. Swe-exp: Experience-driven software issue resolution. arXiv preprint arXiv:2507.23361 (2025).
  12. [12] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023).
  13. [13] Xiancai Chen, Zhengwei Tao, Kechi Zhang, Changzhi Zhou, Wanli Gu, Yuanpeng He, Mengdi Zhang, Xunliang Cai, Haiyan Zhao, and Zhi Jin. 2025. Revisit self-debugging with self-generated tests for code generation. arXiv preprint arXiv:2501.12793 (2025).
  14. [14] Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 1 (1997), 119–139.
  15. [15] Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, et al. 2025. Trae agent: An llm-based agent for software engineering with test-time scaling. arXiv preprint arXiv:2507.23370 (2025).
  16. [16, 17] Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. 2024. Rlef: Grounding code llms in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089 (2024).
  17. [18] Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, and Zibin Zheng. 2025. Omnigirl: A multilingual and multimodal benchmark for github issue resolution. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 24–46.
  18. [19] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A survey on large language models for code generation. ACM Transactions on Software Engineering and Methodology (2024).
  19. [20] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023).
  20. [21] Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large language models are few-shot testers: Exploring llm-based general bug reproduction. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2312–2323.
  21. [22] Lara Khatib, Noble Saji Mathews, and Meiyappan Nagappan. 2025. AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests. arXiv preprint arXiv:2507.17542 (2025).
  22. [23] Konstantinos Kitsios, Marco Castelluccio, and Alberto Bacchelli. 2025. Automated generation of issue-reproducing tests by combining llms and search-based testing. arXiv preprint arXiv:2509.01616 (2025).
  23. [24, 25] Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. 2025. Swe-debate: Competitive multi-agent debate for software issue resolution. arXiv preprint arXiv:2507.23348 (2025).
  24. [26] Kefan Li, Yuan Yuan, Hongyue Yu, Tingyu Guo, and Shijie Cao. 2025. CoCoEvo: Co-Evolution of Programs and Test Cases to Enhance Code Generation. IEEE Transactions on Evolutionary Computation (2025).
  25. [27] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097.
  26. [28] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024).
  27. [29] Yizhou Liu, Pengfei Gao, Xinchen Wang, Jie Liu, Yexuan Shi, Zhao Zhang, and Chao Peng. 2024. Marscode agent: Ai-native automated bug fixing. arXiv preprint arXiv:2409.00899 (2024).
  28. [30] Yingwei Ma, Rongyu Cao, Yongchang Cao, Yue Zhang, Jue Chen, Yibo Liu, Yuchen Liu, Binhua Li, Fei Huang, and Yongbin Li. 2024. Lingma swe-gpt: An open development-process-centric language model for automated software improvement. arXiv preprint arXiv:2411.00622 (2024).
  29. [31] Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen, Fei Huang, and Binhua Li. 2025. Thinking longer, not larger: Enhancing software engineering agents via scaling test-time compute. arXiv preprint arXiv:2503.23803 (2025).
  30. [32] Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. 2025. Alibaba lingmaagent: Improving automated issue resolution via comprehensive repository exploration. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 238–249.
  31. [33] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36 (2023), 46534–46594.
  32. [34] Fangwen Mu, Junjie Wang, Lin Shi, Song Wang, Shoubin Li, and Qing Wang. 2025. EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair. arXiv preprint arXiv:2506.10484 (2025).
  33. [35] Niels Mündler, Mark Müller, Jingxuan He, and Martin Vechev. 2024. SWT-bench: Testing and validating real-world bug-fixes with code agents. Advances in Neural Information Processing Systems 37 (2024), 81857–81887.
  34. [36] Noor Nashid, Islem Bouzenia, Michael Pradel, and Ali Mesbah. 2025. Issue2Test: Generating Reproducing Test Cases from Issue Reports. arXiv preprint arXiv:2503.16320 (2025).
  35. [37] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning. PMLR, 26106–26128.
  36. [38] Pengyu Nie, Rahul Banerjee, Junyi Jessy Li, Raymond J Mooney, and Milos Gligoric. 2023. Learning deep semantics for test completion. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2111–2123.
  37. [39] Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Is self-repair a silver bullet for code generation? arXiv preprint arXiv:2306.09896 (2023).
  38. [40] Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. 2024. Repograph: Enhancing ai software engineering with repository-level code graph. arXiv preprint arXiv:2410.14684 (2024).
  39. [41] Ravin Ravi, Dylan Bradshaw, Stefano Ruberto, Gunel Jahangirova, and Valerio Terragni. 2025. LLMLOOP: Improving LLM-Generated Code and Tests through Automated Iterative Feedback Loops. ICSME, IEEE (2025).
  40. [42] Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2024. Specrover: Code intent extraction via llms. arXiv preprint arXiv:2408.02232 (2024).
  41. [43] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering 50, 1 (2023), 85–105.
  42. [44] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36 (2023), 8634–8652.
  43. [45] Atharv Sonwane, Isadora White, Hyunji Lee, Matheus Pereira, Lucas Caccia, Minseon Kim, Zhengyan Shi, Chinmay Singh, Alessandro Sordoni, Marc-Alexandre Côté, et al. 2025. BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills. arXiv preprint arXiv:2510.19898 (2025).
  44. [46] Xinchen Wang, Pengfei Gao, Xiangxin Meng, Chao Peng, Ruida Hu, Yun Lin, and Cuiyun Gao. 2024. AEGIS: An agent-based framework for general bug reproduction from issue descriptions. arXiv preprint arXiv:2411.18015 (2024).
  45. [47] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024).
  46. [48] You Wang, Michael Pradel, and Zhongxin Liu. 2025. Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study. arXiv preprint arXiv:2503.15223 (2025).
  47. [49] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489 (2024).
  48. [50] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.
  49. [51] Boyang Yang, Haoye Tian, Jiadong Ren, Shunfu Jin, Yang Liu, Feng Liu, and Bach Le. 2025. Enhancing repository-level software repair via repository-aware knowledge graphs. arXiv preprint arXiv:2503.21710 (2025).
  50. [52] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.
  51. [53] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604.
  52. [54] Albert Örwall. 2024. Moatless Tools. https://github.com/aorwall/moatless-tools. Accessed: 2024-11-13.