Towards Demystifying and Repairing LLM-in-the-Loop Vulnerabilities

Chenxi Yang; Jialin Rong; Lili Quan; Qiang Hu; Xiaofei Xie; Yongqiang Lyu; Yujie Ma

arxiv: 2605.28893 · v1 · pith:ZGDPW4V4new · submitted 2026-05-27 · 💻 cs.SE · cs.CR

Yujie Ma (1), Jialin Rong (1), Chenxi Yang (1), Lili Quan (2), Xiaofei Xie (2), Yongqiang Lyu (1), Qiang Hu (1) · (1) Tianjin University, (2) Singapore Management University

Pith reviewed 2026-06-29 11:18 UTC · model grok-4.3

classification 💻 cs.SE cs.CR

keywords: LLM-in-the-loop vulnerabilities · LLMCVE dataset · prompt injection · vulnerability repair · SWE-Agent · LLM security · software vulnerabilities

The pith

LLM-in-the-loop vulnerabilities are more difficult to repair than conventional software bugs, with prompt injection cases repaired correctly only 28.57 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs the first dataset of vulnerabilities that arise specifically because of how large language models are integrated into software systems. It collects thousands of reported issues from LLM components and narrows them to 205 cases where the LLM or its surrounding framework is central to the flaw. Analysis shows LLMs usually act as targets or spreaders of problems rather than their origin. Tests of automated repair tools on this dataset reveal lower success rates than on ordinary code vulnerabilities.

Core claim

The authors present LLMCVE, a dataset of 205 LLM-in-the-loop vulnerabilities drawn from 2,888 collected issues across 230 components. In these cases LLMs function more often as targets or propagation vectors than as root causes. When existing agent-based repair methods are applied, success rates drop below those for conventional vulnerabilities, reaching only 28.57 percent Pass@1 for prompt-injection examples.
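As a rough illustration of the metric, Pass@1 is the share of tasks whose single repair attempt passes validation. The sketch below uses a made-up outcome list: 28.57% is what a ratio such as 2 of 7 rounds to, but the per-category counts are not given in this summary, so the numbers are purely illustrative and the function name is hypothetical.

```python
def pass_at_1(outcomes):
    """Return the fraction of tasks whose first (only) repair attempt succeeded."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Hypothetical outcomes, not data from the paper: 2 successes out of 7 tasks.
hypothetical_outcomes = [True, False, False, True, False, False, False]
rate = pass_at_1(hypothetical_outcomes)
print(f"{rate:.2%}")  # 28.57%
```

With so small a denominator, a single extra success would move the rate by more than 14 points, which is one reason the underlying counts matter when comparing categories.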

What carries the argument

The LLMCVE dataset built through manual filtering of multi-source vulnerability reports to isolate cases where LLM integration introduces or amplifies the flaw.

If this is right

Existing repair agents need adaptation to handle prompt-related and LLM-dependent issues.
Security analysis of LLM-integrated systems must treat the model as a potential attack surface or vector rather than only a code generator.
Prompt injection vulnerabilities within LLM loops require specialized detection and mitigation beyond standard code fixes.
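To make the "model as attack vector" point concrete, here is a minimal sketch, assuming an application that lets a model propose shell commands. It treats model output as untrusted input and checks it against an allowlist before anything is executed. The allowlist and the function name are hypothetical and are not taken from the paper or from any framework it studies.

```python
import shlex

# Hypothetical allowlist of programs the application is willing to run.
ALLOWED_PROGRAMS = {"ls", "cat", "grep"}


def validate_llm_command(llm_output: str) -> list[str]:
    """Parse model output into an argv list and reject programs off the allowlist."""
    argv = shlex.split(llm_output)  # raises ValueError on unbalanced quotes
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in ALLOWED_PROGRAMS:
        raise ValueError(f"program not allowed: {argv[0]!r}")
    return argv


print(validate_llm_command("ls -l /tmp"))
```

The returned list is meant to be passed to `subprocess.run(argv)` without `shell=True`, so that shell metacharacters in the arguments are not interpreted. This covers only the command-injection path (CWE-78); it is not a defense against prompt injection in general.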

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future benchmarks for automated repair should include LLM-in-the-loop cases to avoid overestimating tool performance on real deployments.
Developers integrating LLMs may need new design patterns that isolate the model from direct user input flows.
Classification of vulnerabilities could shift to track whether the LLM component is the source, target, or conduit.
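One way such a classification could be recorded is sketched below. The role names follow the source/target/conduit wording above; the record layout, identifiers, and sample entries are hypothetical and do not reflect the actual LLMCVE schema. A record may carry several roles, since the paper notes that a single vulnerability may belong to multiple types.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class LLMRole(Enum):
    SOURCE = "source"    # the model or its output originates the flaw
    TARGET = "target"    # the model is what the attacker ultimately hits
    CONDUIT = "conduit"  # the model relays attacker input to another component


@dataclass(frozen=True)
class VulnRecord:
    vuln_id: str
    cwe: str
    roles: frozenset


def role_counts(records):
    """Tally how often each role appears; one record may count toward several roles."""
    counts = Counter()
    for record in records:
        counts.update(record.roles)
    return counts


# Placeholder entries for illustration only, not real dataset rows.
sample = [
    VulnRecord("EXAMPLE-0001", "CWE-78", frozenset({LLMRole.TARGET, LLMRole.CONDUIT})),
    VulnRecord("EXAMPLE-0002", "CWE-20", frozenset({LLMRole.TARGET})),
]
print(role_counts(sample))
```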

Load-bearing premise

That the manual review correctly separated the 205 LLM-in-the-loop cases from the larger collection, and that those cases represent typical risks in deployed LLM software.

What would settle it

A replication study that applies the same repair agents to the LLMCVE dataset and obtains Pass@1 rates comparable to those on standard vulnerability benchmarks would falsify the claim of greater difficulty.

Figures

Figures reproduced from arXiv: 2605.28893 by Chenxi Yang, Jialin Rong, Lili Quan, Qiang Hu, Xiaofei Xie, Yongqiang Lyu, Yujie Ma.

**Figure 2.** The overall workflow of our study. It comprises three main phases: (A) constructing a benchmark of 205 LLM-in-the…

**Figure 4.** Vulnerability categories in LLMCVE. (Adjacent text, Section 4.3, RQ2: Repair Effectiveness: "To answer RQ2, we analyze the Pass@1 repair success rates of the five agents across our benchmark. As shown in…")

**Figure 3.** Distribution of LLMCVE. Inflows and outflows differ because a single vulnerability may belong to multiple types. Panel (a), Risk Severity, lists 25.7%, 42.8%, 28.3%, and 3.2% against the legend Critical, High, Medium, Low; a further panel lists CWE-78, CWE-20, … (Adjacent text, Answer to RQ1: "LLM-in-the-loop vulnerabilities are predominantly high-impact and stem primarily from injection-based flaws, with LLMs more often acting as targets or propagation vectors rather than the root cause.")

**Figure 5.** CDF of Fix Steps by Agent. (Adjacent text, Section 4.4, RQ3: Repair Efficiency: "To answer RQ3, we analyze the average monetary cost, token consumption, execution time, and number of interaction steps for successful repair attempts, as presented in…")

**Figure 6.** Distribution of Patch Generation Failure Reasons.

**Figure 7.** Distribution of Patch Validation Failure Reasons.

**Figure 8.** The ineffective patch generated for CVE-2024-12909.

**Figure 9.** The over-restrictive patch for CVE-2024-5565. The…

Original abstract

Large Language Models (LLMs) have been actively integrated into modern software systems as critical components. LLM-in-the-loop vulnerabilities, where vulnerabilities are introduced by LLMs and their dependent downstream components, such as frameworks, introduce new risks. Although some benchmark datasets have been constructed to study the impact of such vulnerabilities, most work still analyzes them at the conventional software level, ignoring the harm actually caused by LLMs. Understanding real-world LLM-in-the-loop vulnerabilities is still an open problem. To address this gap, we build the first LLM-in-the-loop vulnerability dataset, LLMCVE, to facilitate the risk analysis of LLM-integrated software. To do so, we first collect 2,888 multi-source vulnerabilities across 230 popular LLM components. Then, through manual analysis, we identify 205 vulnerabilities that strictly fall under the concept of LLM-in-the-loop vulnerability. Our analysis shows that LLMs more often act as targets or propagation vectors than as the root cause of these vulnerabilities. Furthermore, based on LLMCVE, we evaluate the repair capabilities of existing agent-based vulnerability repair methods, such as SWE-Agent. Experimental results demonstrate that, compared to conventional software vulnerabilities, LLM-in-the-loop vulnerabilities are more challenging to fix precisely, especially those involving prompt injection, where the Pass@1 rate is only 28.57%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds the first dataset of LLM-in-the-loop vulnerabilities and reports lower repair success rates than on conventional CVEs, but the manual filtering step lacks the details needed to trust the comparison.

The core contribution is LLMCVE, a collection of 205 vulnerabilities drawn from 2,888 candidates across 230 LLM-related components. The authors manually isolate the cases that fit their LLM-in-the-loop definition and observe that, in these cases, LLMs act as targets or propagation vectors more often than as root causes. They then test SWE-Agent and similar tools, finding lower Pass@1 rates than on conventional software vulnerabilities, with prompt-injection examples at only 28.57%.

The dataset construction itself is the clearest new piece. Extending beyond existing CVE benchmarks to focus on LLM-integrated systems is a reasonable step, and the observation that LLMs more often serve as vectors than originators is worth recording.

The main weakness is the manual classification. The abstract states only that 205 items were identified through manual analysis and that they strictly fit the definition. No operational criteria, annotator count, or agreement statistic appears in the provided text. Without those, the 205-case set is hard to reproduce, and the reported performance gap could partly reflect how the boundary was drawn rather than an intrinsic property of the vulnerability class. The repair evaluation inherits that uncertainty.

This work is aimed at security researchers and tool builders who need concrete examples of LLM-related issues. The topic is timely and the dataset idea has value, but the methods section needs explicit documentation of the filtering process before the difficulty claims can be taken as settled. It is worth sending to referees rather than desk-rejecting.

Referee Report

1 major / 1 minor

Summary. The paper claims to construct LLMCVE, the first dataset of LLM-in-the-loop vulnerabilities, by collecting 2,888 multi-source vulnerabilities across 230 LLM components and using manual analysis to identify 205 that strictly match the concept. It analyzes the roles of LLMs (often as targets or propagation vectors) and evaluates agent-based repair tools such as SWE-Agent, reporting that these vulnerabilities are harder to fix than conventional ones, with a Pass@1 rate of only 28.57% on prompt-injection cases.

Significance. If the curation is made reproducible, the work supplies a new public dataset and empirical evidence of repair challenges specific to LLM-integrated systems, which could inform both vulnerability research and tool design. The direct experimental comparison of existing repair agents on the curated cases is a concrete contribution.

major comments (1)

[Abstract and LLMCVE construction section] The reduction from 2,888 collected items to 205 LLM-in-the-loop vulnerabilities is performed by manual analysis, yet no operational definition, decision criteria, number of annotators, or inter-rater agreement statistic is supplied. Because the dataset composition directly determines the repair-rate comparisons (including the 28.57% Pass@1 figure), this omission renders the central empirical claims non-reproducible.
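For readers unfamiliar with the statistic being requested, the following is a minimal sketch of Cohen's kappa for two annotators, one common inter-rater agreement measure. The paper does not say which statistic, if any, was used, and the label lists below are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical include/exclude decisions on ten candidate vulnerabilities.
annotator_a = ["in", "in", "in", "out", "out", "out", "out", "out", "out", "out"]
annotator_b = ["in", "in", "out", "out", "out", "out", "out", "out", "out", "in"]
print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

Here the annotators agree on 8 of 10 items, but because both exclude most candidates, chance alone would produce 58% agreement, so kappa comes out well below the raw agreement rate. This is why a bare "percentage agreed" figure would not answer the referee's request.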

minor comments (1)

[Abstract] The abstract states collection and filtering numbers but does not reference a table or appendix that would allow readers to inspect the classification process.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our dataset curation process. We address the single major comment below and will incorporate the requested details in a revised manuscript.

Referee: [Abstract and LLMCVE construction section] The reduction from 2,888 collected items to 205 LLM-in-the-loop vulnerabilities is performed by manual analysis, yet no operational definition, decision criteria, number of annotators, or inter-rater agreement statistic is supplied. Because the dataset composition directly determines the repair-rate comparisons (including the 28.57% Pass@1 figure), this omission renders the central empirical claims non-reproducible.

Authors: We agree that the current manuscript provides insufficient detail on the manual analysis step. The text states only that vulnerabilities were identified 'through manual analysis' to 'strictly fall under the concept of LLM-in-the-loop vulnerability,' without an operational definition, explicit decision criteria, annotator count, or agreement metric. Because the 205-case subset directly informs the reported repair rates, this information is necessary for reproducibility. In the revised manuscript we will add a dedicated subsection to the LLMCVE construction section that supplies: (1) a precise operational definition of LLM-in-the-loop vulnerabilities, (2) the decision criteria and annotation guidelines applied, (3) the number of annotators and their backgrounds, and (4) inter-rater agreement statistics. These additions will allow readers to verify the selection process and the validity of the empirical comparisons.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical data collection and evaluation are self-contained

The paper collects 2,888 vulnerabilities, applies manual analysis to select 205 as LLM-in-the-loop cases, analyzes their roles, and evaluates existing repair agents (e.g., SWE-Agent) on the resulting dataset. No equations, parameter fitting, derivations, or predictions appear. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The central claims rest on direct experimental measurement rather than any reduction to the paper's own inputs by construction. The manual classification step, while potentially under-specified for reproducibility, does not match any enumerated circularity pattern and does not force the reported Pass@1 gap by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the reliability of manual classification of vulnerabilities and the assumption that the collected sample from 230 components represents LLM-in-the-loop issues in practice.

axioms (1)

Domain assumption: Manual analysis can reliably and consistently identify vulnerabilities that strictly fall under the LLM-in-the-loop concept.
Invoked when filtering the 2,888 collected issues down to 205 cases.

pith-pipeline@v0.9.1-grok · 5787 in / 1310 out tokens · 36591 ms · 2026-06-29T11:18:41.136757+00:00 · methodology


[1] [1]

Sara Abdali, Richard Anarfi, Carlos J Barberan, Jia He, and Erfan Shayegani

[2] [2]

Securing large language models: Threats, vulnerabilities and responsible practices.arXiv preprint arXiv:2403.12503(2024)

work page arXiv 2024

[3] [3]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Aider AI. 2024. Aider: AI pair programming in your terminal. https://github. com/Aider-AI/aider Accessed: 2026-03-20

2024

[5] [5]

Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2024. Jail- breaking leading safety-aligned llms with simple adaptive attacks.arXiv preprint arXiv:2404.02151(2024)

work page arXiv 2024

[6] [6]

Sagiv Antebi, Noam Azulay, Edan Habler, Ben Ganon, Asaf Shabtai, and Yuval Elovici. 2024. Gpt in sheep’s clothing: The risk of customized gpts.arXiv preprint arXiv:2401.09075(2024)

work page arXiv 2024

[7] [7]

Berkay Berabi, Alexey Gronskiy, Veselin Raychev, Gishor Sivanrupan, Victor Chi- botaru, and Martin Vechev. 2024. Deepcode AI fix: Fixing security vulnerabilities with large language models.arXiv preprint arXiv:2402.13291(2024)

work page arXiv 2024

[8] [8]

Farzana Ahamed Bhuiyan, Md Bulbul Sharif, and Akond Rahman. 2021. Security bug report usage for software vulnerability research: a systematic mapping study. IEEE Access9 (2021), 28471–28495

2021

[9] [9]

Quang-Cuong Bui, Riccardo Scandariato, and Nicolás E Díaz Ferreyra. 2022. Vul4j: A dataset of reproducible java vulnerabilities geared towards the study of program repair techniques. InProceedings of the 19th International Conference on Mining Software Repositories. 464–468

2022

[10] [10]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[11] [11]

Satish Chitimoju. 2024. A Survey on the Security Vulnerabilities of Large Lan- guage Models and Their Countermeasures.Journal of Computational Innovation 4, 1 (2024)

2024

[12] [12]

Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. 2025. Security and privacy challenges of large language models: A survey.Comput. Surveys57, 6 (2025), 1–39

2025

[13] [13]

David de Fitero-Dominguez, Eva Garcia-Lopez, Antonio Garcia-Cabot, and Jose- Javier Martinez-Herraiz. 2024. Enhanced automated code vulnerability repair using large language models.Engineering Applications of Artificial Intelligence 138 (2024), 109291

2024

[14] [14]

Xiaoning Dong, Wenbo Hu, Wei Xu, and Tianxing He. 2025. Sata: A paradigm for llm jailbreak via simple assistive task linkage. InFindings of the Association for Computational Linguistics: ACL 2025. 1952–1987

2025

[15] Mohamad Fakih, Rahul Dharmaji, Halima Bouzidi, Gustavo Quiros Araya, Oluwatosin Ogundare, Mst Ayesha Siddika, and Mohammad Abdullah Al Faruque. 2025. Llm4cve: Enabling iterative automated vulnerability repair with large language models. In 2025 28th Euromicro Conference on Digital System Design (DSD). IEEE, 592–599.

[17] Tarek Gasmi, Ramzi Guesmi, Jihene Bennaceur, and Ines Belhadj. 2026. Bridging AI and software security: A comparative vulnerability assessment of LLM agent deployment paradigms. Information Sciences 740 (2026), 123231.

[18] Danielle Gonzalez, Holly Hastings, and Mehdi Mirakhorli. 2019. Automated characterization of software vulnerabilities. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 135–139.

[19] Junda He, Christoph Treude, and David Lo. 2025. Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead. ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–30.

[20] Xinyi Hou, Yanjie Zhao, and Haoyu Wang. 2025. On the (in) security of llm app stores. In 2025 IEEE Symposium on Security and Privacy (SP). IEEE, 317–335.

[21] Yiwei Hu, Zhen Liu, Kedie Shu, Shenghua Guan, Deqing Zou, Shouhuai Xu, Bin Yuan, and Hai Jin. 2025. SoK: Automated Vulnerability Repair: Methods, Tools, and Assessments. In 34th USENIX Security Symposium (USENIX Security 25). 4421–4440.

[22] Yanzhe Hu, Shenao Wang, Tianyuan Nie, Yanjie Zhao, and Haoyu Wang. 2025. Understanding Large Language Model Supply Chain: Structure, Domain, and Vulnerabilities. arXiv preprint arXiv:2504.20763 (2025).

[23] Kaifeng Huang, Bihuan Chen, You Lu, Susheng Wu, Dingji Wang, Yiheng Huang, Haowen Jiang, Zhuotong Zhou, Junming Cao, and Xin Peng. 2024. Lifting the Veil on Composition, Risks, and Mitigations of the Large Language Model Supply Chain. arXiv preprint arXiv:2410.21218 (2024).

[24] Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, and Yinzhi Cao. 2024. Pleak: Prompt leaking attacks against large language model applications. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 3600–3614.

[25] Umar Iqbal, Tadayoshi Kohno, and Franziska Roesner. 2024. LLM platform security: Applying a systematic evaluation framework to OpenAI’s ChatGPT plugins. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 7. 611–623.

[26] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023).

[27] René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 international symposium on software testing and analysis. 437–440.

[28] Youngjoon Kim, Sunguk Shin, Hyoungshick Kim, and Jiwon Yoon. 2025. Logs In, Patches Out: Automated Vulnerability Repair via Tree-of-Thought LLM Analysis. In 34th USENIX Security Symposium (USENIX Security 25). 4401–4419.

[29] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. 2023. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499 (2023).

[30] Jose Luna, Lili Quan, Ivan Tan, Lingxiao Jiang, Ming Hu, Qiang Hu, and Xiaofei Xie. 2026. Security and Safety Threats in the Large Language Model Supply Chain: A Systematic Survey and Taxonomy. Available at SSRN 6327419 (2026).

[31] Yujie Ma, Lili Quan, Xiaofei Xie, Qiang Hu, Jiongchi Yu, Yao Zhang, and Sen Chen. 2025. Understanding the Supply Chain and Risks of Large Language Model Applications. arXiv preprint arXiv:2507.18105 (2025).

[32] MITRE Corporation. 2026. Common Weakness Enumeration (CWE). https://cwe.mitre.org/ Accessed: 2026-03-27.

[33] Vitor Hugo Galhardo Moia, Rodrigo Duarte de Meneses, and Igor Jochem Sanz. An Analysis of Real-World Vulnerabilities and Root Causes in the LLM Supply Chain. In Simpósio Brasileiro de Segurança da Informação e de Sistemas Computacionais (SBSeg). SBC, 388–396.

[35] National Vulnerability Database (NVD). 2024. CVE-2024-12909 Detail. https://nvd.nist.gov/vuln/detail/CVE-2024-12909. Accessed: 2026-03-26.

[36] National Vulnerability Database (NVD). 2024. CVE-2024-5565 Detail. https://nvd.nist.gov/vuln/detail/CVE-2024-5565. Accessed: 2026-03-26.

[37] National Vulnerability Database (NVD). 2024. CVE-2024-8309 Detail. https://nvd.nist.gov/vuln/detail/CVE-2024-8309. Accessed: 2026-03-26.

[38] David Noever. 2023. Can large language models find and fix vulnerable software? arXiv preprint arXiv:2308.10345 (2023).

[39] Yu Nong, Haoran Yang, Long Cheng, Hongxin Hu, and Haipeng Cai. 2025. APPATCH: Automated adaptive prompting large language models for Real-World software vulnerability patching. In 34th USENIX Security Symposium (USENIX Security 25). 4481–4500.

[40] OWASP Foundation. 2025. OWASP Top 10 for LLM Applications 2025. https://genai.owasp.org/llm-top-10/ Accessed: 2026-03-27.

[41] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining zero-shot vulnerability repair with large language models. In 2023 IEEE symposium on security and privacy (SP). IEEE, 2339–2356.

[42] Rodrigo Pedro, Daniel Castro, Paulo Carreira, and Nuno Santos. 2023. From prompt injections to sql injection attacks: How protected is your llm-integrated web application? arXiv preprint arXiv:2308.01990 (2023).

[43] Protect AI. 2024. Huntr. https://huntr.com/

[44] Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, et al. 2023. Taskweaver: A code-first agent framework. arXiv preprint arXiv:2311.17541 (2023).

[45] Muhammad Shahzad, Muhammad Zubair Shafiq, and Alex X Liu. 2012. A large scale exploratory analysis of software vulnerability life cycles. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 771–781.

[46] Yuchen Shao, Yuheng Huang, Jiawei Shen, Lei Ma, Ting Su, and Chengcheng Wan. 2024. Are llms correctly integrated into software systems? arXiv preprint arXiv:2407.05138 (2024).

[47] Zhuoxiang Shen, Jiarun Dai, Yuan Zhang, and Min Yang. 2025. Security Debt in LLM Agent Applications: A Measurement Study of Vulnerabilities and Mitigation Trade-offs. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 559–570.

[48] Dongxun Su, Yanjie Zhao, Xinyi Hou, Shenao Wang, and Haoyu Wang. 2025. Gpt store mining and analysis. In Proceedings of the 16th International Conference on Internetware. 344–354.

[49] Guanhong Tao, Siyuan Cheng, Zhuo Zhang, Junmin Zhu, Guangyu Shen, and Xiangyu Zhang. 2023. Opening a Pandora’s box: things you should know in the era of custom GPTs. arXiv preprint arXiv:2401.00905 (2023).

[50] Peiran Wang, Xiaogeng Liu, and Chaowei Xiao. 2025. Cve-bench: Benchmarking llm-based software engineering agent’s ability to repair real-world cve vulnerabilities. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 4207–4224.

[51] Shenao Wang, Yanjie Zhao, Xinyi Hou, and Haoyu Wang. 2025. Large language model supply chain: A research agenda. ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–46.

[52] Shenao Wang, Yanjie Zhao, Zhao Liu, Quanchen Zou, and Haoyu Wang. 2025. Sok: Understanding vulnerabilities in the large language model supply chain. arXiv preprint arXiv:2502.12497 (2025).

Yujie Ma¹, Jialin Rong¹, Chenxi Yang¹, Lili Quan², Xiaofei Xie², Yongqiang Lyu¹, Qiang Hu¹. ¹Tianjin University, ²Singapore Management University

[53] Weizhe Wang, Wei Ma, Qiang Hu, Yao Zhang, Jianfei Sun, Bin Wu, Yang Liu, Guangquan Xu, and Lingxiao Jiang. 2025. VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities. arXiv preprint arXiv:2509.03331 (2025).

[54] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024).

[55] Yi Wu, Nan Jiang, Hung Viet Pham, Thibaud Lutellier, Jordan Davis, Lin Tan, Petr Babkin, and Sameena Shah. 2023. How effective are neural networks for fixing security vulnerabilities. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1282–1294.

[56] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489 (2024).

[57] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.

[58] Yinglin Xie, Xinyi Hou, Yanjie Zhao, Kai Chen, and Haoyu Wang. 2025. LLM app squatting and cloning. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 64–74.

[59] Chuan Yan, Ruomai Ren, Mark Huasong Meng, Liuhuo Wan, Tian Yang Ooi, and Guangdong Bai. 2024. Exploring chatgpt app ecosystem: Distribution, deployment and security. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1370–1382.

[60] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.

[61] John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. 2025. Swe-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798 (2025).

[63] Zheng Yu, Ziyi Guo, Yuhang Wu, Jiahao Yu, Meng Xu, Dongliang Mu, Yan Chen, and Xinyu Xing. 2025. PATCHAGENT: A Practical Program Repair Agent Mimicking Human Expertise. In 34th USENIX Security Symposium (USENIX Security 25). 4381–4400.

[64] Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. 2024. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. arXiv preprint arXiv:2410.02644 (2024).

[65] Lan Zhang, Qingtian Zou, Anoop Singhal, Xiaoyan Sun, and Peng Liu. 2024. Evaluating large language models for real-world vulnerability repair in c/c++ code. In Proceedings of the 10th ACM International Workshop on Security and Privacy Analytics. 49–58.

[66] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604.

[67] Yanjie Zhao, Xinyi Hou, Shenao Wang, and Haoyu Wang. 2025. Llm app store analysis: A vision and roadmap. ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–25.