pith. machine review for the scientific record.

arxiv: 2604.10800 · v1 · submitted 2026-04-12 · 💻 cs.SE · cs.AI · cs.CR · cs.LG · cs.PL

Recognition: unknown

Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:11 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CR · cs.LG · cs.PL
keywords cross-language vulnerability detection · LLM agents · execution grounding · software security · universal abstract syntax tree · agentic AI · code repair · closed-loop reasoning

The pith

Execution confirmation of exploitability before any repair enables trustworthy LLM agents for cross-language code vulnerability analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a unified framework for analyzing and fixing software vulnerabilities across languages that forces LLM-driven agents to verify issues through actual execution before taking any repair action. This addresses the core problem that probabilistic model predictions can compound errors when acted upon without evidence. The approach normalizes Java, Python, and C++ code into a shared Universal Abstract Syntax Tree, fuses graph and language-model embeddings via learned gating for detection, and uses a closed-loop process of validation followed by iterative repair only upon confirmed exploitability. A sympathetic reader would care because the results show measurable gains in accuracy and resolution rates while demonstrating that grounding in observable execution is a practical way to increase reliability in agentic AI systems for security tasks.
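The gated fusion step can be sketched numerically. The following is an editorial illustration, not the paper's implementation: the function name, dimensions, and the sigmoid form of the gate are assumptions standing in for the learned two-way gating over GraphSAGE and Qwen2.5-Coder embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_way_gated_fusion(graph_emb, code_emb, W, b):
    """Illustrative two-way gating: a learned gate decides, per sample,
    how much weight the structural (GraphSAGE-style) embedding receives
    versus the semantic (code-LLM-style) embedding."""
    gate_input = np.concatenate([graph_emb, code_emb], axis=-1)
    alpha = 1.0 / (1.0 + np.exp(-(gate_input @ W + b)))  # sigmoid gate in [0, 1]
    fused = alpha * graph_emb + (1.0 - alpha) * code_emb
    return fused, alpha  # alpha doubles as a per-sample explainability signal

d = 8
graph_emb = rng.normal(size=(2, d))    # structural stream (e.g. GNN over the uAST)
code_emb = rng.normal(size=(2, d))     # semantic stream (e.g. pooled LLM states)
W = rng.normal(size=(2 * d, d)) * 0.1  # gate parameters, learned in the real system
b = np.zeros(d)

fused, alpha = two_way_gated_fusion(graph_emb, code_emb, W, b)
print(fused.shape)
```

Reading off `alpha` per sample is what gives the gating its "intrinsic explainability at no additional cost": a value near 1 means the structural evidence dominated that decision, near 0 means the semantic stream did.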

Core claim

A hybrid structural-semantic detection stage using uAST normalization and two-way gated fusion of GraphSAGE and Qwen2.5-Coder embeddings feeds into execution-grounded agentic validation, after which validation-aware iterative repair proceeds only under the invariant that no repair action is taken without execution-based confirmation of exploitability. This lifecycle resolves 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate, with intra-language accuracies of 89.84-92.02% and zero-shot cross-language F1 of 74.43-80.12%. Ablations confirm that removing uAST drops cross-language F1 by 23.42% and disabling validation increases unnecessary repairs by 131.7%, establishing that both components are necessary to the end-to-end result.

What carries the argument

The execution-grounded agentic validation stage under the strict invariant of no repair without execution-based confirmation of exploitability, supported by uAST cross-language normalization and gated hybrid embeddings that also supply per-sample explainability.

If this is right

  • Intra-language detection accuracy reaches 89.84-92.02%.
  • Zero-shot cross-language F1 scores reach 74.43-80.12%.
  • 69.74% of vulnerabilities are resolved end-to-end at a 12.27% total failure rate.
  • Removing uAST normalization degrades cross-language F1 by 23.42%.
  • Disabling the validation stage increases unnecessary repairs by 131.7%.
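The lifecycle behind these numbers can be sketched as a small control loop. This is an editorial sketch under assumptions, not the authors' code: `detect`, `confirm_by_execution`, and `propose_patch` are hypothetical stand-ins for the three stages, and the toy integer payload exists only to exercise the invariant.

```python
# Hypothetical sketch of the verify-before-fix invariant: repair is attempted
# only after an execution-based check confirms exploitability, and each
# candidate patch is re-validated before being accepted.

def closed_loop_lifecycle(sample, detect, confirm_by_execution, propose_patch,
                          max_repair_attempts=3):
    if not detect(sample):
        return "clean"                   # detector sees no issue
    if not confirm_by_execution(sample):
        return "flagged-unconfirmed"     # invariant: no repair without execution evidence
    for _ in range(max_repair_attempts):
        patched = propose_patch(sample)
        if not confirm_by_execution(patched):
            return "repaired"            # the exploit no longer reproduces
        sample = patched                 # feed execution feedback into the next attempt
    return "failed"

# Toy stand-ins: an integer payload is "exploitable" while positive,
# and each patch attempt halves it.
result = closed_loop_lifecycle(
    sample=4,
    detect=lambda s: s > 0,
    confirm_by_execution=lambda s: s > 0,
    propose_patch=lambda s: s // 2,
)
print(result)
```

Under this reading, the 12.27% total failure rate corresponds to the `failed` branch, and the 131.7% rise in unnecessary repairs when validation is disabled corresponds to deleting the `flagged-unconfirmed` guard.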

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verify-before-repair loop could apply to other agentic code tasks such as automated test generation or refactoring where runtime checks are feasible.
  • Domains without easy execution environments would require surrogate verification mechanisms to achieve comparable trustworthiness.
  • The gating weights already provide intrinsic explainability that could be surfaced to developers for auditing LLM decisions in security tools.
  • Extending the approach to additional languages would test how well the uAST schema scales while keeping semantic fidelity.

Load-bearing premise

Execution-based confirmation of exploitability is always feasible and sufficient to catch all relevant vulnerability types while the uAST representation preserves all necessary semantic details across languages without significant loss.

What would settle it

Finding a substantial fraction of vulnerabilities that cannot be confirmed as exploitable through execution tests, or measuring a sharp drop in cross-language F1 when the uAST is applied to a language outside the tested set without schema extensions.

Figures

Figures reproduced from arXiv: 2604.10800 by Jugal Gajjar.

Figure 1. Three-stage lifecycle architecture. The Fusion Detector combines graph and LLM embed…
Figure 2. Intra-language detection performance. The hybrid model achieves 89.84–92.02% accuracy…
Figure 3. End-to-end pipeline metrics. The full system resolves 69.74% of vulnerabilities, eliminates…
Figure 4. Architecture selection grid comparing graph encoders and LLM backbones across nine…
Figure 5. Validation performance across languages. (a) Flag…
Figure 6. Repair performance metrics across languages. (a) Success rates: 81.37–87.27%. (b)…
Figure 7. Python validation agent workflow showing hypothesis generation, payload construction,…
Figure 8. Java validation agent workflow utilizing bytecode instrumentation via AspectJ…
Figure 9. C++ validation agent workflow utilizing AddressSanitizer, UndefinedBehaviorSanitizer,…
read the original abstract

Learned classifiers deployed in agentic pipelines face a fundamental reliability problem: predictions are probabilistic inferences, not verified conclusions, and acting on them without grounding in observable evidence leads to compounding failures across downstream stages. Software vulnerability analysis makes this cost concrete and measurable. We address this through a unified cross-language vulnerability lifecycle framework built around three LLM-driven reasoning stages-hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair-governed by a strict invariant: no repair action is taken without execution-based confirmation of exploitability. Cross-language generalization is achieved via a Universal Abstract Syntax Tree (uAST) normalizing Java, Python, and C++ into a shared structural schema, combined with a hybrid fusion of GraphSAGE and Qwen2.5-Coder-1.5B embeddings through learned two-way gating, whose per-sample weights provide intrinsic explainability at no additional cost. The framework achieves 89.84-92.02% intra-language detection accuracy and 74.43-80.12% zero-shot cross-language F1, resolving 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate. Ablations establish necessity: removing uAST degrades cross-language F1 by 23.42%, while disabling validation increases unnecessary repairs by 131.7%. These results demonstrate that execution-grounded closed-loop reasoning is a principled and practically deployable mechanism for trustworthy LLM-driven agentic AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces a unified cross-language vulnerability lifecycle framework consisting of hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair. It enforces the invariant that repairs are only taken after execution-based confirmation of exploitability. Cross-language generalization is facilitated by a Universal Abstract Syntax Tree (uAST) and a hybrid fusion of GraphSAGE and Qwen2.5-Coder embeddings via learned two-way gating. Empirical results show 89.84-92.02% intra-language detection accuracy, 74.43-80.12% zero-shot cross-language F1, and 69.74% end-to-end vulnerability resolution at a 12.27% failure rate, with ablations confirming the importance of uAST and validation.

Significance. Should the assumptions about execution feasibility and uAST semantic preservation hold, the work would represent a meaningful contribution to trustworthy agentic AI in software engineering. By grounding LLM inferences in observable execution evidence, it mitigates the risk of compounding errors in vulnerability analysis pipelines. The reported metrics and ablation studies provide concrete evidence of practicality, and the intrinsic explainability from gating weights is a nice addition. This could influence future designs of closed-loop reasoning systems for code.

major comments (3)
  1. [§5 (Evaluation)] §5 (Evaluation) and ablation results: The central trustworthiness claim and the 69.74% end-to-end resolution rate rest on execution-based confirmation being feasible and sufficient for every relevant vulnerability class. The manuscript reports that disabling validation increases bad repairs by 131.7% and that uAST removal drops cross-language F1 by 23.42%, but provides no systematic breakdown or test cases for vulnerability types where execution grounding is infeasible or inconclusive (e.g., non-crashing logic errors, race conditions, or environment-dependent exploits). This is load-bearing for the claim that the invariant produces trustworthy outcomes, as the ablations establish necessity but not sufficiency or boundary conditions.
  2. [§4.2 (uAST Construction)] §4.2 (uAST Construction) and cross-language results: The zero-shot F1 scores of 74.43-80.12% depend on the claim that uAST normalizes Java, Python, and C++ while preserving all semantics needed for detection and validation. Although the ablation shows a 23.42% F1 drop without uAST, the paper contains no direct evidence (such as semantic equivalence tests, round-trip fidelity metrics, or manual analysis of lost constructs) that the representation is lossless for vulnerability-relevant details across all language pairs.
  3. [§5.3 (Ablations)] §5.3 (Ablations) and Table reporting failure rates: The 12.27% total failure rate is presented as evidence of practical deployability, yet the evaluation does not decompose this rate into categories (e.g., cases where the agent cannot produce executable confirmation tests versus cases where confirmation is negative). Without this, it is difficult to assess whether the strict invariant leaves a non-negligible fraction of vulnerabilities unaddressed.
minor comments (3)
  1. [§4.1 (Hybrid Fusion)] The description of the learned two-way gating mechanism would be clearer if accompanied by an explicit equation showing how per-sample weights are computed from the two embedding streams and applied during fusion.
  2. [Figure 1] Figure 1 (pipeline overview) would benefit from explicit arrows or labels indicating the exact points at which execution feedback is injected back into the repair stage.
  3. [Related Work] A small number of citations to recent 2024 works on LLM-based code repair and verification appear to be missing from the related-work section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify areas where additional analysis would strengthen the trustworthiness claims. We respond point-by-point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: §5 (Evaluation) and ablation results: The central trustworthiness claim and the 69.74% end-to-end resolution rate rest on execution-based confirmation being feasible and sufficient for every relevant vulnerability class. The manuscript reports that disabling validation increases bad repairs by 131.7% and that uAST removal drops cross-language F1 by 23.42%, but provides no systematic breakdown or test cases for vulnerability types where execution grounding is infeasible or inconclusive (e.g., non-crashing logic errors, race conditions, or environment-dependent exploits). This is load-bearing for the claim that the invariant produces trustworthy outcomes, as the ablations establish necessity but not sufficiency or boundary conditions.

    Authors: We agree that the current presentation would benefit from explicit discussion of boundary conditions. Our evaluation uses vulnerability datasets (primarily from sources containing executable test cases) where confirmation via execution is feasible by design. For classes such as non-crashing logic errors or certain race conditions, execution grounding can be inconclusive without additional oracles. In the revision we will add a new subsection in §5 that (1) enumerates the vulnerability categories present in our test sets, (2) reports the fraction of cases where the agent could not synthesize a confirming test, and (3) discusses fallback strategies (e.g., flagging for manual review) when execution evidence is unavailable. This will clarify both the scope and the limitations of the invariant. revision: yes

  2. Referee: §4.2 (uAST Construction) and cross-language results: The zero-shot F1 scores of 74.43-80.12% depend on the claim that uAST normalizes Java, Python, and C++ while preserving all semantics needed for detection and validation. Although the ablation shows a 23.42% F1 drop without uAST, the paper contains no direct evidence (such as semantic equivalence tests, round-trip fidelity metrics, or manual analysis of lost constructs) that the representation is lossless for vulnerability-relevant details across all language pairs.

    Authors: The uAST schema was constructed to retain the structural and data-flow elements most relevant to vulnerability patterns (control-flow graphs, call sites, buffer and pointer operations). The large ablation drop supports that these elements are retained for the detection task. Nevertheless, we acknowledge the absence of explicit fidelity metrics. In the revised manuscript we will add (a) a table of representative cross-language construct mappings with vulnerability relevance, (b) a small-scale manual audit of 50 randomly sampled functions showing preservation or loss of key constructs, and (c) a brief discussion of language-specific features deliberately omitted because they fall outside typical vulnerability patterns. revision: yes

  3. Referee: §5.3 (Ablations) and Table reporting failure rates: The 12.27% total failure rate is presented as evidence of practical deployability, yet the evaluation does not decompose this rate into categories (e.g., cases where the agent cannot produce executable confirmation tests versus cases where confirmation is negative). Without this, it is difficult to assess whether the strict invariant leaves a non-negligible fraction of vulnerabilities unaddressed.

    Authors: We will revise §5.3 and the associated table to provide the requested decomposition. Using execution logs already collected during the experiments, we will break the 12.27% failure rate into three mutually exclusive categories: (1) agent unable to generate any executable test, (2) generated test executed but returned negative confirmation of exploitability, and (3) runtime or environment errors preventing confirmation. The revised table will report both absolute counts and percentages, allowing readers to evaluate how often the invariant results in an unaddressed vulnerability versus a correctly rejected false positive. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation with ablations is self-contained

full rationale

The paper reports measured performance (intra-language accuracy 89.84-92.02%, zero-shot cross-language F1 74.43-80.12%, end-to-end resolution 69.74%) and component ablations (uAST removal drops F1 23.42%, validation removal increases bad repairs 131.7%). No mathematical derivation, equations, or first-principles predictions exist that could reduce to inputs by construction. The strict invariant (no repair without execution confirmation) and uAST are design choices whose necessity is tested directly via ablation against external datasets, not self-referential fitting or self-citation chains. This is the common honest case of an empirical systems paper whose central claims are falsifiable by replication on held-out data.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claims rest on the assumption that uAST captures sufficient cross-language semantics and that execution grounding reliably verifies exploitability; these are presented as novel without external independent validation in the abstract.

free parameters (1)
  • learned two-way gating weights
    Per-sample weights for fusing GraphSAGE and Qwen2.5-Coder embeddings that also provide explainability.
axioms (1)
  • domain assumption uAST normalization preserves all necessary structural and semantic information for vulnerability detection across Java, Python, and C++
    Invoked to justify cross-language generalization without language-specific retraining.
invented entities (1)
  • Universal Abstract Syntax Tree (uAST) no independent evidence
    purpose: Normalizing Java, Python, and C++ into a shared structural schema for cross-language analysis
    Newly introduced component central to the cross-language capability.

pith-pipeline@v0.9.0 · 5569 in / 1383 out tokens · 77589 ms · 2026-05-10T15:11:36.247721+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 21 canonical work pages · 8 internal anchors

  1. [1]

    Learning on graph with laplacian regularization.Advances in neural information processing systems, 19, 2006

    Rie Ando and Tong Zhang. Learning on graph with laplacian regularization.Advances in neural information processing systems, 19, 2006

  2. [2]

    Dos and don’ts of machine learning in computer security

    Daniel Arp, Erwin Quiring, and Feargus Pendlebury et al. Dos and don’ts of machine learning in computer security. In 31st USENIX Security Symposium (USENIX Security 22), pages 3971–3988, 2022

  3. [3]

    Generating vulnerability security fixes with code language models.Information and Software Technology, 185:107786, 2025

    Guru Bhandari, Nikola Gavric, and Andrii Shalaginov. Generating vulnerability security fixes with code language models. Information and Software Technology, 185:107786, 2025

  4. [4]

    Cvefixes: automated collection of vulnerabilities and their fixes from open-source software

    Guru Bhandari, Amara Naseer, and Leon Moonen. Cvefixes: automated collection of vulnerabilities and their fixes from open-source software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, pages 30–39, 2021

  5. [5]

    Vulnerability-fix-dataset: A curated repository for training and evaluating automated vulnerability remediation models

    Sitanath Biswas. Vulnerability-fix-dataset: A curated repository for training and evaluating automated vulnerability remediation models. Kaggle, https://www.kaggle.com/datasets/jiscecseaiml/vulnerability-fix-dataset, 2026

  6. [6]

    tree-sitter/tree-sitter: v0.26.6

    Max Brunsfeld, Amaan Qureshi, and Andrew Hlynskyi et al. tree-sitter/tree-sitter: v0.26.6. Zenodo, 2026

  7. [7]

    Vul4j: A dataset of reproducible java vulnerabilities geared towards the study of program repair techniques

    Quang-Cuong Bui, Riccardo Scandariato, and Nicolás E Díaz Ferreyra. Vul4j: A dataset of reproducible java vulnerabilities geared towards the study of program repair techniques. In Proceedings of the 19th International Conference on Mining Software Repositories, pages 464–468, 2022

  8. [8]

    Saikat Chakraborty, Rahul Krishna, and Yangruibo et al. Ding. Deep learning based vulnerability detection: Are we there yet?IEEE Transactions on Software Engineering, 48(9):3280–3296, 2021

  9. [9]

    Yizheng Chen, Zhoujie Ding, and Lamya et al. Alowain. Diversevul: A new vulnerable source code dataset for deep learning based vulnerability detection. InProceedings of the 26th international symposium on research in attacks, intrusions and defenses, pages 654–668, 2023

  10. [10]

    Zimin Chen, Steve Kommrusch, and Michele et al. Tufano. Sequencer: Sequence-to- sequence learning for end-to-end program repair.IEEE Transactions on Software Engineering, 47(9):1943–1959, 2019

  11. [11]

    Benefits and risks of ai in health care: narrative review.Interactive Journal of Medical Research, 13(1):e53616, 2024

    Margaret Chustecki. Benefits and risks of ai in health care: narrative review.Interactive Journal of Medical Research, 13(1):e53616, 2024

  12. [12]

    Python vulnerability remediation: A specialized dataset for instruction-tuning llms on python security patches

    Cmonplz. Python vulnerability remediation: A specialized dataset for instruction-tuning llms on python security patches. Hugging Face Datasets, https://huggingface.co/datasets/ cmonplz/Python_Vulnerability_Remediation, 2026

  13. [13]

    Code_vulnerability_security_dpo: A preference-aligned dataset for securing code generation through direct preference optimization

    CyberNative. Code_vulnerability_security_dpo: A preference-aligned dataset for securing code generation through direct preference optimization. Hugging Face Datasets, https://huggingface.co/datasets/CyberNative/Code_Vulnerability_ Security_DPO, 2026

  14. [14]

    Yangruibo Ding, Yanjun Fu, and Omniyyah et al. Ibrahim. Vulnerability detection with code language models: How far are we?arXiv preprint arXiv:2403.18624, 2024

  15. [15]

    Zhangyin Feng, Daya Guo, and Duyu Tang et al. Codebert: A pre-trained model for programming and natural languages. In Findings of the association for computational linguistics: EMNLP 2020, pages 1536–1547, 2020

  16. [16]

    Linevul: A transformer-based line-level vulnerability prediction

    Michael Fu and Chakkrit Tantithamthavorn. Linevul: A transformer-based line-level vulnerability prediction. In Proceedings of the 19th international conference on mining software repositories, pages 608–620, 2022

  17. [17]

    Bridging semantics & structure for software vulnerability detection using hybrid network models.arXiv preprint arXiv:2510.10321, 2025

    Jugal Gajjar, Kaustik Ranaware, and Kamalasankari Subramaniakuppusamy. Bridging semantics & structure for software vulnerability detection using hybrid network models.arXiv preprint arXiv:2510.10321, 2025

  18. [18]

    Mlcpd: A unified multi-language code parsing dataset with universal ast schema.arXiv preprint arXiv:2510.16357, 2025

    Jugal Gajjar and Kamalasankari Subramaniakuppusamy. Mlcpd: A unified multi-language code parsing dataset with universal ast schema.arXiv preprint arXiv:2510.16357, 2025

  19. [19]

    Malcodeai: Autonomous vulnerability detection and remediation via language agnostic code reasoning

    Jugal Gajjar, Kamalasankari Subramaniakuppusamy, and Noha El Kachach. Malcodeai: Autonomous vulnerability detection and remediation via language agnostic code reasoning. In 2025 IEEE International Conference on Information Reuse and Integration and Data Science (IRI), pages 31–36. IEEE, 2025

  20. [20]

    Jugal Gajjar, Kamalasankari Subramaniakuppusamy, and Relsy Puthal et al. Securefixagent: A hybrid llm agent for automated python static vulnerability repair. arXiv preprint arXiv:2509.16275, 2025

  21. [21]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing.arXiv preprint arXiv:2305.11738, 2023

  22. [22]

    Daya Guo, Shuo Ren, and Shuai et al. Lu. Graphcodebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366, 2020

  23. [23]

    Daya Guo, Qihao Zhu, and Dejian et al. Yang. Deepseek-coder: when the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196, 2024

  24. [24]

    Inductive representation learning on large graphs.Advances in neural information processing systems, 30, 2017

    Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs.Advances in neural information processing systems, 30, 2017

  25. [25]

    Nima Shiri Harzevili, Alvine Boaye Belle, and Junjie et al. Wang. A survey on automated software vulnerability detection using machine learning and deep learning.arXiv preprint arXiv:2306.11673, 2023

  26. [26]

    Edward J Hu, Yelong Shen, and Phillip et al. Wallis. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  27. [27]

    Large Language Models Cannot Self-Correct Reasoning Yet

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet.arXiv preprint arXiv:2310.01798, 2023

  28. [28]

    Binyuan Hui, Jian Yang, and Zeyu et al. Cui. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

  29. [29]

    Ziwei Ji, Nayeon Lee, and Rita et al. Frieske. Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023

  30. [30]

    An overview of aspectj

    Gregor Kiczales, Erik Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm, and William G Griswold. An overview of aspectj. InEuropean Conference on Object-Oriented Programming, pages 327–354. Springer, 2001

  31. [31]

    Youngjoon Kim, Sunguk Shin, and Hyoungshick et al. Kim. Logs in, patches out: Automated vulnerability repair via tree-of-thought llm analysis. In34th USENIX Security Symposium (USENIX Security 25), pages 4401–4419, 2025

  32. [32]

    Semi-Supervised Classification with Graph Convolutional Networks

    Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907, 2016

  33. [33]

    Llvm: A compilation framework for lifelong program analysis & transformation

    Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. InInternational symposium on code generation and optimization, 2004. CGO 2004., pages 75–86. IEEE, 2004

  34. [34]

    Claire Le Goues, ThanhVu Nguyen, and Stephanie et al. Forrest. Genprog: A generic method for automatic software repair.Ieee transactions on software engineering, 38(1):54–72, 2011

  35. [35]

    Youpeng Li, Kartik Joshi, and Xinda et al. Wang. Mavul: Multi-agent vulnerability detection via contextual reasoning and interactive refinement.arXiv preprint arXiv:2510.00317, 2025

  36. [36]

    Nelson F Liu, Kevin Lin, and John et al. Hewitt. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

  37. [37]

    Automatic patch generation by learning correct code

    Fan Long and Martin Rinard. Automatic patch generation by learning correct code. In Proceedings of the 43rd annual ACM SIGPLAN-SIGACT symposium on principles of programming languages, pages 298–312, 2016

  38. [38]

    Thibaud Lutellier, Hung Viet Pham, and Lawrence Pang et al. Coconut: combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis, pages 101–114, 2020

  39. [39]

    Are static analysis violations really fixed? a closer look at realistic usage of sonarqube

    Diego Marcilio, Rodrigo Bonifácio, and Eduardo Monteiro et al. Are static analysis violations really fixed? a closer look at realistic usage of sonarqube. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pages 209–219. IEEE, 2019

  40. [40]

    vulnerable-code_chitchat_doss1232: A conversational dataset for instruction-tuning security-focused dialogue agents

    MarioVar. vulnerable-code_chitchat_doss1232: A conversational dataset for instruction-tuning security-focused dialogue agents. Hugging Face Datasets, https://huggingface.co/datasets/MarioVar/vulnerable-code_chitchat_doss1232, 2026

  41. [41]

    Chao Ni, Liyu Shen, and Xiaohu et al. Yang. Megavul: Ac/c++ vulnerability dataset with comprehensive code representations. InProceedings of the 21st International Conference on Mining Software Repositories, pages 738–742, 2024

  42. [42]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  43. [43]

    Code_vulnerability_labeled_dataset

    Abdellah Oumida and Mohammed Sbaihi. Code_vulnerability_labeled_dataset. Hugging Face Datasets, https://huggingface.co/datasets/lemon42-ai/Code_Vulnerability_ Labeled_Dataset, 2025

  44. [44]

    Clang: A c language family frontend for llvm

    LLVM Project. Clang: A c language family frontend for llvm. https://clang.llvm.org/, 2026

  45. [45]

    Redbaron: A bottom-up approach to fst (full syntax tree) for python

    RedBaron Project. Redbaron: A bottom-up approach to fst (full syntax tree) for python. https://github.com/PyCQA/redbaron, 2026

  46. [46]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, and Fabian et al. Gloeckle. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

  47. [47]

    Source code vulnerability: A compact collection of labeled vulnerable code in c++, java, python, and more

    Marat Saratov. Source code vulnerability: A compact collection of labeled vulnerable code in c++, java, python, and more. Kaggle, https://www.kaggle.com/datasets/ maratsaratov/source-code-vulnerability, 2026

  48. [48]

    Minjae Seo, Wonwoo Choi, and Myoungsung et al. You. Autopatch: Multi-agent framework for patching real-world cve vulnerabilities.arXiv preprint arXiv:2505.04195, 2025

  49. [49]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  50. [50]

    cppvul: A curated dataset of c++ vulnerabilities for security-oriented language model training

    Shyyshawarma. cppvul: A curated dataset of c++ vulnerabilities for security-oriented language model training. Hugging Face Datasets, https://huggingface.co/datasets/Shyyshawarma/cppvul, 2026

  51. [51]

    Shiyu Sun, Shu Wang, and Xinda et al. Wang. Exploring security commits in python. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 171–181. IEEE, 2023

  52. [52]

    Javaparser: A set of libraries for analyzing, transforming, and generating java source code.https://github.com/javaparser/javaparser, 2026

    JavaParser Team. Javaparser: A set of libraries for analyzing, transforming, and generating java source code.https://github.com/javaparser/javaparser, 2026

  53. [53]

    Vulnerable programming dataset: A comprehensive collection of 550 unique code vulnerabilities across 10 programming languages

    Sunny Thakur. Vulnerable programming dataset: A comprehensive collection of 550 unique code vulnerabilities across 10 programming languages. Kaggle, https://www.kaggle.com/ datasets/cyberprince/vulnerable-programming-dataset, 2026

  54. [54]

    Deep learning aided software vulnerability detection: A survey.arXiv preprint arXiv:2503.04002, 2025

    Md Nizam Uddin, Yihe Zhang, and Xiali Hei. Deep learning aided software vulnerability detection: A survey.arXiv preprint arXiv:2503.04002, 2025

  55. [55]

    Saad Ullah, Praneeth Balasubramanian, and Wenbo Guo et al. From cve entries to verifiable exploits: An automated multi-agent framework for reproducing cves. arXiv preprint arXiv:2509.01835, 2025

  56. [56]

    Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change.Advances in Neural Information Processing Systems, 36:38975–38987, 2023

  57. [57]

    Graph Attention Networks

    Petar Veličković, Guillem Cucurull, and Arantxa Casanova et al. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017

  58. [58]

    Sixuan Wang, Chen Huang, and Dongjin et al. Yu. Vulgrab: Graph-embedding-based code vulnerability detection with bi-directional gated graph neural network.Software: Practice and Experience, 53(8):1631–1658, 2023

  59. [59]

    Xinchen Wang, Ruida Hu, and Cuiyun et al. Gao. Reposvul: A repository-level high-quality vulnerability dataset. InProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pages 472–483, 2024

  60. [60]

    Yue Wang, Weishi Wang, and Shafiq et al. Joty. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 8696–8708, 2021

  61. [61]

    Ziliang Wang, Ge Li, and Jia et al. Li. Vulagent: Hypothesis-validation based multi-agent vulnerability detection.arXiv preprint arXiv:2509.11523, 2025

  62. [62]

    Boyang Yang, Zijian Cai, and Fengling et al. Liu. A survey of llm-based automated program repair: Taxonomies, design paradigms, and applications.arXiv preprint arXiv:2506.23749, 2025

  63. [63]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  64. [64]

    Ziyao Zhang, Chong Wang, and Yanlin et al. Wang. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation.Proceedings of the ACM on Software Engineering, 2(ISSTA):481–503, 2025

  65. [65]

    Xin Zhou, Sicong Cao, and Xiaobing et al. Sun. Large language model for vulnerability detection and repair: Literature review and the road ahead.ACM Transactions on Software Engineering and Methodology, 34(5):1–31, 2025

  66. [66]

    Yaqin Zhou, Shangqing Liu, and Jingkai Siow et al. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems, 32, 2019