EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

Chunrong Fang; Guoqing Xie; Haichuan Hu; Jiawei Liu; Liang Xiao; Quanjun Zhang; Shengcheng Yu; Zhenyu Chen

arxiv: 2605.30105 · v1 · pith:QFOHJO5Onew · submitted 2026-05-28 · 💻 cs.SE


Haichuan Hu, Guoqing Xie, Quanjun Zhang, Jiawei Liu, Shengcheng Yu, Chunrong Fang, Zhenyu Chen, Liang Xiao

Pith reviewed 2026-06-29 06:14 UTC · model grok-4.3

classification 💻 cs.SE

keywords automated vulnerability repair · LLM agents · self-evolving agents · experience reuse · program repair · software security · agent frameworks · patch evaluation


The pith

EvoRepair lets LLMs accumulate and reuse repair experiences across vulnerabilities via a cyclic learn-and-repair loop with quality scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EvoRepair to address two gaps in LLM-based automated vulnerability repair: experience is not accumulated within a single repair task, and it is not reused across different vulnerabilities. In the framework, the agent retrieves past experiences to guide fixes, extracts new experiences from completed trajectories, and updates a shared experience bank using quality-aware scoring. The result is a system intended to improve as it encounters more cases. Readers would care because it turns isolated repair attempts into a growing store of domain knowledge that, according to the reported results, reduces repeated errors and raises success rates on standard benchmarks.

Core claim

EvoRepair is the first experience-based self-evolving AVR agent framework that enables LLMs to accumulate, refine, and leverage domain-specific knowledge across long-horizon vulnerability repairs through a cyclic learn-and-repair process that retrieves relevant past experiences to guide repair, extracts new experiences from repair trajectories, and updates an experience bank using quality-aware scoring.

What carries the argument

The experience bank maintained by a cyclic retrieve-repair-extract-update loop that applies quality-aware scoring to decide what repair trajectories to retain and reuse.
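
The retrieve-repair-extract-update loop can be illustrated with a minimal Python sketch. Everything named here (`Experience`, `ExperienceBank`, `repair_cycle`, the token-overlap retrieval, and the moving-average score update) is a hypothetical stand-in chosen for illustration; the abstract does not specify how experiences are represented, retrieved, or scored.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One stored lesson. The fields are illustrative, not the paper's schema."""
    summary: str
    score: float = 0.5  # assumed quality score in [0, 1]


@dataclass
class ExperienceBank:
    items: list = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> list:
        """Return up to k experiences ranked by token overlap times quality score."""
        q = set(query.lower().split())
        ranked = []
        for e in self.items:
            t = set(e.summary.lower().split())
            union = q | t
            overlap = len(q & t) / len(union) if union else 0.0
            if overlap > 0:  # skip experiences with nothing in common
                ranked.append((overlap * e.score, e))
        ranked.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in ranked[:k]]

    def update(self, used: list, new: Experience, success: bool, lr: float = 0.3) -> None:
        """Move scores of the reused experiences toward the outcome, then store the new one."""
        target = 1.0 if success else 0.0
        for e in used:
            e.score += lr * (target - e.score)
        self.items.append(new)


def repair_cycle(bank, vuln_description, attempt_repair, extract):
    """One retrieve -> repair -> extract -> update cycle.

    attempt_repair and extract are caller-supplied callables (assumed interfaces):
    attempt_repair(description, experiences) -> (success, trajectory)
    extract(trajectory, success) -> Experience
    """
    used = bank.retrieve(vuln_description)
    success, trajectory = attempt_repair(vuln_description, used)
    bank.update(used, extract(trajectory, success), success)
    return success
```

Ranking by overlap times score is one simple way to let quality influence reuse; an embedding-based retriever or a different scoring rule would slot into the same loop.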

If this is right

LLMs avoid repeating similar mistakes across iterative repairs of the same vulnerability.
Repair knowledge extracted from one vulnerability becomes available for unrelated future cases.
Repair success reaches 93.47 percent on PATCHEVAL and 87.00 percent on SEC-bench (90.46 percent overall) when using GPT-5-mini.
The approach exceeds recent LLM-based baselines such as LoopRepair by more than 33 percentage points on both benchmarks.
Transfer experiments on VUL4J indicate that the approach remains robust across models, programming languages, and datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same retrieve-and-score loop could be applied to other agentic code tasks such as general bug fixing or test generation.
Quality-aware scoring acts as a filter that may be essential for preventing degradation when experience volume grows large.
Explicit generalization of stored experiences beyond direct retrieval could further increase reuse efficiency.

Load-bearing premise

The cyclic learn-and-repair process together with quality-aware scoring of repair trajectories will produce net-positive, reusable experiences rather than noise or harmful patterns that degrade future performance.

What would settle it

An experiment that runs EvoRepair on a long sequence of vulnerabilities and measures whether success rate on fresh cases falls below a non-evolving baseline after multiple cycles.
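
One way to analyze such an experiment is sketched below, assuming each agent's results are recorded as a list of 0/1 outcomes on fresh cases, in the order the vulnerabilities were encountered. The function names and the fixed, non-overlapping window are illustrative choices, not part of the paper.

```python
def rolling_success(outcomes, window):
    """Success rate over each consecutive, non-overlapping window of outcomes."""
    return [
        sum(outcomes[i:i + window]) / window
        for i in range(0, len(outcomes) - window + 1, window)
    ]


def degradation_windows(evolving, baseline, window=50):
    """Indices of windows where the evolving agent scores below the baseline."""
    ev = rolling_success(evolving, window)
    base = rolling_success(baseline, window)
    return [i for i, (e, b) in enumerate(zip(ev, base)) if e < b]
```

A non-empty result in later windows, repeated across runs, would be evidence that the experience bank is accumulating harmful patterns; an empty result would support the premise.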

Figures

Figures reproduced from arXiv: 2605.30105 by Chunrong Fang, Guoqing Xie, Haichuan Hu, Jiawei Liu, Liang Xiao, Quanjun Zhang, Shengcheng Yu, Zhenyu Chen.

**Figure 1.** Yearly growth in reported CVEs.

**Figure 2.** Motivation example of EVOREPAIR (case study on CVE-2020-8132).

**Figure 3.** Overall workflow of EVOREPAIR. Built on top of a vanilla agent (upper-left), EVOREPAIR continually accumulates and summarizes domain-specific experience across multi-turn vulnerability repair trajectories, enabling self-evolution in repair performance. In each repair turn, the agent operates on the vulnerabilities that remain unfixed from the previous turn.

**Figure 4.** Turn-level performance on PATCHEVAL.

**Figure 5.** Turn-level performance on SEC-bench.

**Figure 6.** Relationship between number of retrieved experiences and repair performance.

Original abstract

Large Language Models (LLMs) have shown promise for automated vulnerability repair (AVR), but they still face several limitations, including the lack of intra-vulnerability experience accumulation and the lack of cross-vulnerability experience reuse. As a result, LLMs may repeatedly make similar mistakes during iterative repair and underutilize valuable repair knowledge from historical vulnerabilities. To address these challenges, we propose EvoRepair, the first experience-based self-evolving AVR agent framework that enables LLMs to accumulate, refine, and leverage domain-specific knowledge across long-horizon vulnerability repairs. EvoRepair follows a cyclic learn-and-repair process that retrieves relevant past experiences to guide repair, extracts new experiences from repair trajectories, and updates an experience bank using quality-aware scoring. We evaluate EvoRepair against 12 representative vulnerability repair baselines on PATCHEVAL and SEC-bench using GPT-5-mini. Results show that EvoRepair achieves the best overall performance, reaching 93.47% on PATCHEVAL, 87.00% on SEC-bench, and 90.46% overall. In particular, EvoRepair outperforms the latest LLM-based baseline LoopRepair by 39.56% and 33.50% on PATCHEVAL and SEC-bench, respectively, and surpasses IntentFix by 70.86% and 50.50%. Across both benchmarks, EvoRepair also exceeds the recent self-evolving agent Live-SWE-Agent by 6.98% overall. Additional transfer experiments on VUL4J further demonstrate the robustness of EvoRepair across models, programming languages, and datasets. These findings demonstrate that experience-based self-evolution substantially strengthens agentic AVR and goes beyond existing self-evolving techniques.
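
The margins quoted in the abstract can be turned into implied baseline scores with simple arithmetic. The sketch below assumes the margins are absolute percentage-point differences (the abstract writes them as "%", so this is an interpretation) and that the overall figure pools both benchmarks by instance count. The function names are illustrative.

```python
# Success rates quoted in the abstract, in percent.
evorepair = {"PATCHEVAL": 93.47, "SEC-bench": 87.00, "overall": 90.46}

# Reported margins, ASSUMED here to be absolute percentage-point gaps.
margins = {
    "LoopRepair": {"PATCHEVAL": 39.56, "SEC-bench": 33.50},
    "IntentFix": {"PATCHEVAL": 70.86, "SEC-bench": 50.50},
    "Live-SWE-Agent": {"overall": 6.98},
}


def implied_baselines(scores, margins):
    """Baseline score = EvoRepair score minus the reported margin."""
    return {
        name: {bench: round(scores[bench] - gap, 2) for bench, gap in gaps.items()}
        for name, gaps in margins.items()
    }


def implied_size_ratio(rate_a, rate_b, overall):
    """Ratio n_a / n_b implied if `overall` is the instance-weighted mean.

    From overall = (rate_a * n_a + rate_b * n_b) / (n_a + n_b).
    """
    return (overall - rate_b) / (rate_a - overall)


if __name__ == "__main__":
    print(implied_baselines(evorepair, margins))
    print(implied_size_ratio(evorepair["PATCHEVAL"], evorepair["SEC-bench"], evorepair["overall"]))
```

If the margins were instead relative improvements, the implied baselines would differ, which is one reason the reporting convention matters when comparing against the full results tables.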

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EvoRepair adds a quality-scored experience bank to vulnerability repair agents and claims large gains over recent baselines, but the abstract supplies almost no experimental protocol or implementation detail.


The core idea here is a cyclic process where the agent pulls past repair experiences to guide the current fix, then extracts and scores new trajectories before adding them to a shared bank. This is positioned as the first such self-evolving setup aimed at automated vulnerability repair, with explicit comparisons to Live-SWE-Agent and LoopRepair.

What stands out is the reported performance: 93.47% on PATCHEVAL and 87% on SEC-bench, with a 39-point edge over LoopRepair on the first benchmark and a 7-point overall lift over Live-SWE-Agent. The transfer tests on VUL4J across models and languages are also noted. The quality-aware scoring step is the main technical addition that could let the system accumulate reusable knowledge instead of repeating mistakes.

The main limitation is that none of the evaluation mechanics are described. There is no information on how experiences are represented, what the scoring function actually computes, how many items end up in the bank, or whether any controls were run to check for harmful pattern accumulation. The abstract also omits statistical tests, variance across runs, or error analysis, so the size of the reported improvements cannot be assessed from the given text. The assumption that the learn-and-repair loop produces net-positive experiences rather than noise is left unexamined.

This work sits at the intersection of LLM agents and security-focused code repair. Readers already working on experience reuse or self-improving repair systems would find the framework worth examining once the methods are filled in. The paper deserves a serious referee to verify the experimental setup and see whether the gains survive closer inspection of the baselines and scoring details.

Referee Report

2 major / 0 minor

Summary. The paper proposes EvoRepair, the first experience-based self-evolving agent framework for automated vulnerability repair (AVR). It introduces a cyclic learn-and-repair process in which an LLM agent retrieves relevant past experiences from an 'experience bank' to guide repairs, extracts new experiences from repair trajectories, and updates the bank using quality-aware scoring. The framework is evaluated against 12 baselines on PATCHEVAL and SEC-bench using GPT-5-mini, claiming best overall performance (93.47% on PATCHEVAL, 87.00% on SEC-bench, 90.46% overall) with large margins over LoopRepair, IntentFix, and Live-SWE-Agent; additional transfer results on VUL4J are reported.

Significance. If the empirical claims hold under rigorous evaluation, the work would be significant for the AVR and agentic LLM literature by demonstrating that explicit experience accumulation and cross-vulnerability reuse can substantially improve repair success rates beyond existing self-evolving techniques. The introduction of a reusable, quality-scored experience bank as a first-class component is a concrete architectural contribution that could be adopted more broadly.

major comments (2)

[Abstract] The performance claims (93.47% on PATCHEVAL, 87.00% on SEC-bench, and specific margins over LoopRepair and IntentFix) are presented without any description of the experimental protocol, baseline implementations, statistical tests, number of runs, or error analysis, rendering the central empirical claims impossible to evaluate from the supplied text.
[Abstract, paragraph describing the framework] The core assumption that the cyclic learn-and-repair process together with quality-aware scoring will produce net-positive, reusable experiences rather than noise or harmful patterns is stated but not supported by any analysis, ablation, or failure-case examination in the provided description; this assumption is load-bearing for the claim that experience-based self-evolution strengthens AVR.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract. We address each point below and will revise the manuscript accordingly to improve clarity and self-containment of the claims.


Referee: [Abstract] The performance claims (93.47% on PATCHEVAL, 87.00% on SEC-bench, and specific margins over LoopRepair and IntentFix) are presented without any description of the experimental protocol, baseline implementations, statistical tests, number of runs, or error analysis, rendering the central empirical claims impossible to evaluate from the supplied text.

Authors: We acknowledge that the abstract's brevity precludes full experimental details. The complete protocol, baseline implementations, use of GPT-5-mini, three-run averages with standard deviations, and error analysis appear in Sections 4 and 5 of the manuscript. To make the abstract more self-contained, we will add a concise clause noting the evaluation setup and directing readers to the full experimental sections.

Revision: yes

Referee: [Abstract, paragraph describing the framework] The core assumption that the cyclic learn-and-repair process together with quality-aware scoring will produce net-positive, reusable experiences rather than noise or harmful patterns is stated but not supported by any analysis, ablation, or failure-case examination in the provided description; this assumption is load-bearing for the claim that experience-based self-evolution strengthens AVR.

Authors: The abstract summarizes the framework at a high level. The manuscript supports the assumption with ablation studies on the experience bank (Section 5), quantitative metrics on experience quality and reuse rates, and failure-case analysis (Section 6) showing net-positive outcomes. We will revise the abstract to include a short qualifier indicating that these benefits are validated through ablations and analyses presented in the paper body.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical agent framework with no derivation chain


The paper presents an empirical agent framework for vulnerability repair, describing a cyclic learn-and-repair process evaluated on benchmarks (PATCHEVAL, SEC-bench, VUL4J) against baselines. No equations, fitted parameters, predictions derived from inputs, or mathematical derivations are present. Claims of performance improvements are supported by direct experimental comparisons rather than any self-referential reduction or self-citation chain that would make results equivalent to inputs by construction. The core assumption about experience accumulation is an empirical hypothesis tested via evaluation, not a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; ledger entries are therefore limited to elements explicitly named in the abstract.

axioms (1)

domain assumption: LLMs benefit from retrieving and updating domain-specific repair experiences across multiple vulnerabilities
This premise underpins the entire cyclic learn-and-repair process described in the abstract.

invented entities (1)

experience bank (no independent evidence)
purpose: Store and retrieve scored repair experiences to guide future repairs
Introduced as the central memory component of the EvoRepair framework.



Reference graph

Works this paper leans on

81 extracted references · 13 canonical work pages · 1 internal anchor

[1]

Last accessed: November 20, 2022

Infer static analyzer, 2022. Last accessed: November 20, 2022

2022
[2]

Last accessed: November 20, 2022

Spotbugs: Find bugs in java programs, 2022. Last accessed: November 20, 2022. 15

2022
[3]

Ahmad, S

B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce. On hardware security bug code fixes by prompting large language models.IEEE Transactions on Information Forensics and Security, 19:4043–4057, 2024

2024
[4]

Bao and S

K. Bao and S. Chen. A smart contract vulnerability detection method based on graph neural networks and zero-shot learning. InInternational Conference on Blockchain and Trustworthy Systems, pages 32–46. Springer, 2025

2025
[5]

Belleville, W

B. Belleville, W. Shen, S. V olckaert, A. M. Azab, and M. Franz. KALD: detecting direct pointer disclosure vulnerabilities.IEEE Trans. Dependable Secur. Comput., 18(3):1369–1377,
[6]

URL https://doi.org/10.1109/TDSC.2019.2915829

doi: 10.1109/TDSC.2019.2915829. URL https://doi.org/10.1109/TDSC.2019.2915829

work page doi:10.1109/tdsc.2019.2915829 2019
[7]

G. P. Bhandari, A. Naseer, and L. Moonen. Cvefixes: Automated collection of vulnerabilities and their fixes from open-source software.CoRR, abs/2107.08760, 2021. URL https://arxiv. org/abs/2107.08760

arXiv 2021
[8]

Bilge and T

L. Bilge and T. Dumitra¸ s. Before we knew it: an empirical study of zero-day attacks in the real world. InProceedings of the 2012 ACM conference on Computer and communications security, pages 833–844, 2012

2012
[9]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901
[10]

Q.-C. Bui, R. Scandariato, and N. E. D. Ferreyra. Vul4j: A dataset of reproducible java vul- nerabilities geared towards the study of program repair techniques. InProceedings of the 19th International Conference on Mining Software Repositories, pages 464–468, 2022

2022
[11]

Q.-C. Bui, R. Paramitha, D.-L. Vu, F. Massacci, and R. Scandariato. Apr4vul: an empirical study of automatic program repair techniques on real-world java vulnerabilities.Empirical software engineering, 29(1):18, 2024

2024
[12]

S. Chen, S. Lin, X. Gu, Y . Shi, H. Lian, L. Yun, D. Chen, W. Sun, L. Cao, and Q. Wang. Swe- exp: Experience-driven software issue resolution.arXiv preprint arXiv:2507.23361, 2025

arXiv 2025
[13]

Y . Chen, Y . Wang, S. Zhu, H. Yu, T. Feng, M. Zhang, M. Patwary, and J. You. Multi-agent evolve: Llm self-improve through co-evolution.arXiv preprint arXiv:2510.23595, 2025

arXiv 2025
[14]

Z. Chen, S. Kommrusch, and M. Monperrus. Neural transfer learning for repairing security vulnerabilities in C code.IEEE Trans. Software Eng., 49(1):147–165, 2023. doi: 10.1109/ TSE.2022.3147265. URL https://doi.org/10.1109/TSE.2022.3147265

work page doi:10.1109/tse.2022.3147265 2023
[15]

Cheng, Q

S. Cheng, Q. Yu, Y . Zhu, and Z. Huang. Automated vulnerability repair based on retrieval- augmented generation. In2025 7th International Conference on Information Science, Electri- cal and Automation Engineering (ISEAE), pages 941–947. IEEE, 2025

2025
[16]

J. Chi, Y . Qu, T. Liu, Q. Zheng, and H. Yin. Seqtrans: Automatic vulnerability fix via sequence to sequence learning.IEEE Trans. Software Eng., 49(2):564–585, 2023. doi: 10.1109/TSE. 2022.3156637. URL https://doi.org/10.1109/TSE.2022.3156637

work page doi:10.1109/tse 2023
[17]

Costin, H

A. Costin, H. Turtiainen, N. Yousefnezhad, V . Bogulean, and T. Hämäläinen. Evaluating zero- shot chatgpt performance on predicting cve data from vulnerability descriptions. InProceed- ings of the European Conference on Cyber Warfare and Security, number 1. Academic Con- ferences International Ltd, 2024

2024
[18]

Ding and L

Y . Ding and L. Zhang. Swe-replay: Efficient test-time scaling for software engineering agents. arXiv preprint arXiv:2601.22129, 2026

arXiv 2026
[19]

R. Duan, A. Bijlani, Y . Ji, O. Alrawi, Y . Xiong, M. Ike, B. Saltaformag- gio, and W. Lee. Automating patching of vulnerable open-source software ver- sions in application binaries. In26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27,

2019
[20]

URL https://www.ndss-symposium.org/ndss-paper/ automating-patching-of-vulnerable-open-source-software-versions-in-application-binaries/

The Internet Society, 2019. URL https://www.ndss-symposium.org/ndss-paper/ automating-patching-of-vulnerable-open-source-software-versions-in-application-binaries/. 16

2019
[21]

Fakih, R

M. Fakih, R. Dharmaji, H. Bouzidi, G. Q. Araya, O. Ogundare, M. Siddika, and M. A. A. Faruque. LLM4CVE: enabling iterative automated vulnerability repair with large language models. In28th Euromicro Conference on Digital System Design, DSD 2025, Salerno, Italy, September 10-12, 2025, pages 592–599. IEEE, 2025. doi: 10.1109/DSD67783.2025.00087. URL https:/...

work page doi:10.1109/dsd67783.2025.00087 2025
[22]

Fakih, R

M. Fakih, R. Dharmaji, H. Bouzidi, G. Q. Araya, O. Ogundare, M. A. Siddika, and M. A. Al Faruque. Llm4cve: Enabling iterative automated vulnerability repair with large language models. In2025 28th Euromicro Conference on Digital System Design (DSD), pages 592–599. IEEE, 2025

2025
[23]

M. A. Ferrag, A. Battah, N. Tihanyi, R. Jain, D. Maimut, F. Alwahedi, T. Lestable, N. S. Thandi, A. Mechri, M. Debbah, and L. C. Cordeiro. Securefalcon: Are we there yet in automated software vulnerability detection with llms?IEEE Trans. Software Eng., 51(4): 1248–1265, 2025. doi: 10.1109/TSE.2025.3548168. URL https://doi.org/10.1109/TSE.2025. 3548168

work page doi:10.1109/tse.2025.3548168 2025
[24]

M. Fu. Toward more effective deep learning-based automated software vulnerability predic- tion, classification, and repair. In2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pages 208–212. IEEE, 2023

2023
[25]

M. Fu, C. Tantithamthavorn, T. Le, V . Nguyen, and D. Phung. Vulrepair: a t5-based automated software vulnerability repair. InProceedings of the 30th ACM joint european software engi- neering conference and symposium on the foundations of software engineering, pages 935– 947, 2022

2022
[26]

X. Gao, S. Mechtaev, and A. Roychoudhury. Crash-avoiding program repair. InProceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 8–18, 2019

2019
[27]

X. Gao, B. Wang, G. J. Duck, R. Ji, Y . Xiong, and A. Roychoudhury. Beyond tests: Program vulnerability repair via crash constraint extraction.ACM Transactions on Software Engineer- ing and Methodology (TOSEM), 30(2):1–27, 2021

2021
[28]

W. Han, Y . Kwak, M. Yu, K. Kim, Y . Lee, H. Moon, and Y . Paek. Rethinking the ca- pability of fine-tuned language models for automated vulnerability repair.arXiv preprint arXiv:2512.22633, 2025

arXiv 2025
[29]

Z. Hao, H. Wang, J. Luo, J. Zhang, Y . Zhou, Q. Lin, C. Wang, H. Dong, and J. Chen. Recreate: Reasoning and creating domain agents driven by experience.arXiv preprint arXiv:2601.11100, 2026

Pith/arXiv arXiv 2026
[30]

S. Hong, J. Lee, J. Lee, and H. Oh. Saver: scalable, precise, and safe memory-error repair. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pages 271–283, 2020

2020
[31]

H. Hu, Y . Shang, W. Sun, and Q. Zhang. Tsapr: A tree search framework for automated program repair.arXiv preprint arXiv:2507.01827, 2025

arXiv 2025
[32]

T. Hu, R. Chen, S. Zhang, J. Yin, M. X. Feng, J. Liu, S. Zhang, W. Jiang, Y . Fang, S. Hu, et al. Controlled self-evolution for algorithmic code optimization.arXiv preprint arXiv:2601.07348, 2026

arXiv 2026
[33]

Huang, W

C. Huang, W. Yu, X. Wang, H. Zhang, Z. Li, R. Li, J. Huang, H. Mi, and D. Yu. R-zero: Self-evolving reasoning llm from zero data.arXiv preprint arXiv:2508.05004, 2025

Pith/arXiv arXiv 2025
[34]

Huang, J

K. Huang, J. Zhang, X. Meng, and Y . Liu. Template-guided program repair in the era of large language models. InICSE, pages 1895–1907, 2025

1907
[35]

Huang, D

Z. Huang, D. Lie, G. Tan, and T. Jaeger. Using safety properties to generate vulnerability patches. In2019 IEEE symposium on security and privacy (SP), pages 539–554. IEEE, 2019. 17

2019
[36]

R. Jiao, Y . Zhang, J. Li, and J. Ma. Hit the bullseye on the first shot: Improving llms using multi-sample self-reward feedback for vulnerability repair. In2025 40th IEEE/ACM Interna- tional Conference on Automated Software Engineering (ASE), pages 791–803. IEEE, 2025

2025
[37]

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

Pith/arXiv arXiv 2023
[38]

Jinseok, C

H. Jinseok, C. Dongwook, K. Jinyoung, K. Misoo, and L. Eunseok. Intentfix: Automated logic vulnerability repair via llm-driven intent modeling. InProceedings of the IEEE/ACM 48th International Conference on Software Engineering, ICSE ’26. Association for Computing Machinery, 2026

2026
[39]

W. Kim, S. Min, M. Gwon, D. Baik, H. Lee, H. Heo, M. Lee, M. W. Baek, Y . Jin, Y . Park, Y . Choi, T. Kim, S. Park, and I. Yun. Patchisland: Orchestration of llm agents for continuous vulnerability repair.arXiv preprint arXiv:2601.17471, 2026

arXiv 2026
[40]

Y . Kim, S. Shin, H. Kim, and J. Yoon. Logs in, patches out: Automated vulnerability repair via {Tree-of-Thought}{LLM}analysis. In34th USENIX Security Symposium (USENIX Security 25), pages 4401–4419, 2025

2025
[41]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

2022
[42]

Kulsum, H

U. Kulsum, H. Zhu, B. Xu, and M. d’Amorim. A case study of llm for automated vulnerability repair: Assessing impact of reasoning and patch validation feedback. InProceedings of the 1st ACM International Conference on AI-Powered Software, pages 103–111, 2024

2024
[43]

H. Lee, Z. Zhang, H. Lu, and L. Zhang. Sec-bench: Automated benchmarking of llm agents on real-world software security tasks.arXiv preprint arXiv:2506.11791, 2025

arXiv 2025
[44]

J. Lin, Y . Guo, Y . Han, S. Hu, Z. Ni, L. Wang, M. Chen, H. Liu, R. Chen, Y . He, et al. Se- agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents. arXiv preprint arXiv:2508.02085, 2025

arXiv 2025
[45]

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024
[47]

Q. Mao, Z. Li, X. Hu, K. Liu, X. Xia, and J. Sun. Towards explainable vulnerability detection with large language models.IEEE Trans. Software Eng., 51(10):2957–2971, 2025. doi: 10. 1109/TSE.2025.3605442. URL https://doi.org/10.1109/TSE.2025.3605442

work page doi:10.1109/tse.2025.3605442 2025
[48]

Noller, R

Y . Noller, R. Shariffdeen, X. Gao, and A. Roychoudhury. Trust enhancement issues in program repair. In44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pages 2228–2240. ACM, 2022. doi: 10.1145/3510003. 3510040. URL https://doi.org/10.1145/3510003.3510040

work page doi:10.1145/3510003 2022
[49]

Rastogi, A

A. Rastogi, A. Yang, A. Q. Jiang, A. H. Liu, A. Sablayrolles, A. Héliou, A. Martin, A. Agar- wal, A. Ehrenberg, A. Lo, et al. Devstral: Fine-tuning language models for coding agent applications.arXiv preprint arXiv:2509.25193, 2025

arXiv 2025
[50]

S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma. Codebleu: a method for automatic evaluation of code synthesis.arXiv preprint arXiv:2009.10297, 2020

Pith/arXiv arXiv 2009
[51]

Roucher, A

A. Roucher, A. V . del Moral, T. Wolf, L. von Werra, and E. Kaunismäki. smolagents: A smol library to build great agentic systems.Hugging Face, 2025. 18

2025
[52]

C. Seas, G. Fitzpatrick, J. A. H. Jr., and M. C. Carlisle. Automated vulnerability detection in source code using deep representation learning. In R. Paul and A. Kundu, editors,14th IEEE Annual Computing and Communication Workshop and Conference, CCWC 2024, Las Vegas, NV , USA, January 8-10, 2024, pages 484–490. IEEE, 2024. doi: 10.1109/CCWC60891.2024. 10...

work page doi:10.1109/ccwc60891.2024 2024
[53]

Neural Machine Translation of Rare Words with Subword Units

R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with sub- word units. InProceedings of the 54th Annual Meeting of the Association for Computa- tional Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Pa- pers. The Association for Computer Linguistics, 2016. doi: 10.18653/V1/P16-1162. URL https://doi.o...

work page doi:10.18653/v1/p16-1162 2016
[54]

Shahriar, S

A. Shahriar, S. J. Hisham, K. A. Rahman, M. R. Islam, M. S. Hossain, R.-H. Hwang, and Y .-D. Lin. 5gpt: 5g vulnerability detection by combining zero-shot capabilities of gpt-4 with domain aware strategies through prompt engineering.IEEE Transactions on Information Forensics and Security, 2025

2025
[55]

M. Shao, Y . Ding, C. Gao, J. Wang, and G. Zhu. Fix pattern-aware vulnerability patch gener- ation via in-context learning.ACM Transactions on Software Engineering and Methodology, 2026

2026
[56]

Shen and S

Z. Shen and S. Chen. A survey of automatic software vulnerability detection, program repair, and defect prediction techniques.Security and Communication Networks, 2020(1):8858010, 2020

2020
[57]

Y . Shin, A. Meneely, L. Williams, and J. A. Osborne. Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities.IEEE transactions on software engineering, 37(6):772–787, 2010

2010
[58]

Singh, A

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

Pith/arXiv arXiv 2025
[59]

P. Wang, X. Liu, and C. Xiao. Cve-bench: Benchmarking llm-based software engineer- ing agent’s ability to repair real-world CVE vulnerabilities. In L. Chiruzzo, A. Ritter, and L. Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chap- ter of the Association for Computational Linguistics: Human Language Technologies, NAACL 202...

work page doi:10.18653/v1/2025 2025
[61]

X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024

Pith/arXiv arXiv 2024
[62]

Z. Wei, J. Zeng, M. Wen, Z. Yu, K. Cheng, Y . Zhu, J. Guo, S. Zhou, L. Yin, X. Su, et al. Patcheval: A new benchmark for evaluating llms on patching real-world vulnerabilities.arXiv preprint arXiv:2511.11019, 2025

arXiv 2025
[63]

X. Wen, Z. Lin, Y . Yang, C. Gao, and D. Ye. Vul-r2: A reasoning LLM for automated vulnera- bility repair. In40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025, Seoul, Korea, Republic of, November 16-20, 2025, pages 26–38. IEEE, 2025. doi: 10.1109/ASE63991.2025.00011. URL https://doi.org/10.1109/ASE63991.2025.00011

work page doi:10.1109/ase63991.2025.00011 2025
[64]

Z. Weng, A. Antoniades, D. Nathani, Z. Zhang, X. Pu, and X. E. Wang. Group-evolving agents: Open-ended self-improvement via experience sharing.arXiv preprint arXiv:2602.04837, 2026. 19

arXiv 2026
[65]

R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y . Shen, Y . Wang, et al. Evolver: Self-evolving llm agents through an experience-driven lifecycle.arXiv preprint arXiv:2510.16079, 2025

Pith/arXiv arXiv 2025
[66]

C. S. Xia and L. Zhang. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using chatgpt. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 819–831, 2024

2024
[67]

C. S. Xia, Z. Wang, Y . Yang, Y . Wei, and L. Zhang. Live-swe-agent: Can software engineering agents self-evolve on the fly?arXiv preprint arXiv:2511.13646, 2025

arXiv 2025
[68]

W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y . Zhang. A-mem: Agentic memory for llm agents. InAdvances in Neural Information Processing Systems, 2025

2025
[69]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

[70]

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.

[71]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022.

[72]

Z. Ye, X. Sun, S. Cao, L. Bo, and B. Li. Well begun is half done: Location-aware and trace-guided iterative automated vulnerability repair. arXiv preprint arXiv:2512.20203, 2025.

[73]

Z. Ye, X. Sun, S. Cao, L. Bo, and B. Li. Well begun is half done: Location-aware and trace-guided iterative automated vulnerability repair. In Proceedings of the IEEE/ACM 48th International Conference on Software Engineering, ICSE ’26. Association for Computing Machinery, 2026.

[74]

J. Zhang, C. Wang, A. Li, W. Wang, T. Li, and Y. Liu. Vuladvisor: Natural language suggestion generation for software vulnerability repair. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 1932–1944, 2024.

[77]

M. Zhang, X. Wang, J. Zhang, X. Meng, J. Zhang, and C. Hu. Vulnresolver: A hybrid agent framework for llm-based automated vulnerability issue resolution. arXiv preprint arXiv:2601.13933, 2026.

[78]

Q. Zhang, Y. Zhao, W. Sun, C. Fang, Z. Wang, and L. Zhang. Program repair: Automated vs. manual. CoRR, abs/2203.05166, 2022. doi: 10.48550/ARXIV.2203.05166. URL https://doi.org/10.48550/arXiv.2203.05166

[79]

Q. Zhang, C. Fang, B. Yu, W. Sun, T. Zhang, and Z. Chen. Pre-trained model-based automated software vulnerability repair: How far are we? IEEE Transactions on Dependable and Secure Computing, 21(4):2507–2525, 2023.

[80]

A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024.

[81]

A. Zhao, Y. Wu, Y. Yue, T. Wu, Q. Xu, M. Lin, S. Wang, Q. Wu, Z. Zheng, and G. Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025.

[82]

X. Zhou, K. Kim, B. Xu, D. Han, and D. Lo. Out of sight, out of mind: Better automatic vulnerability repair by broadening input ranges and sources. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24. Association for Computing Machinery, 2024.

[83]

X. Zhou, K. Kim, B. Xu, D. Han, and D. Lo. Out of sight, out of mind: Better automatic vulnerability repair by broadening input ranges and sources. In Proceedings of the IEEE/ACM 46th international conference on software engineering, pages 1–13, 2024.

[84]

X. Zhou, S. Cao, X. Sun, and D. Lo. Large language model for vulnerability detection and repair: Literature review and the road ahead. ACM Trans. Softw. Eng. Methodol., 34(5):145:1–145:31, 2025. doi: 10.1145/3708522. URL https://doi.org/10.1145/3708522

