pith. machine review for the scientific record.

arxiv: 2604.10800 · v1 · submitted 2026-04-12 · 💻 cs.SE · cs.AI · cs.CR · cs.LG · cs.PL

Recognition: unknown

Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:11 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CR · cs.LG · cs.PL
keywords cross-language vulnerability detection · LLM agents · execution grounding · software security · universal abstract syntax tree · agentic AI · code repair · closed-loop reasoning

The pith

Execution confirmation of exploitability before any repair enables trustworthy LLM agents for cross-language code vulnerability analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a unified framework for analyzing and fixing software vulnerabilities across languages that forces LLM-driven agents to verify issues through actual execution before taking any repair action. This addresses the core problem that probabilistic model predictions can compound errors when acted upon without evidence. The approach normalizes Java, Python, and C++ code into a shared Universal Abstract Syntax Tree, fuses graph and language-model embeddings via learned gating for detection, and uses a closed-loop process of validation followed by iterative repair only upon confirmed exploitability. A sympathetic reader would care because the results show measurable gains in accuracy and resolution rates while demonstrating that grounding in observable execution is a practical way to increase reliability in agentic AI systems for security tasks.
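The gated fusion step can be sketched numerically. The following is an editorial illustration, not the paper's implementation: the function name, dimensions, and the sigmoid form of the gate are assumptions standing in for the learned two-way gating over GraphSAGE and Qwen2.5-Coder embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_way_gated_fusion(graph_emb, code_emb, W, b):
    """Illustrative two-way gating: a learned gate decides, per sample,
    how much weight the structural (GraphSAGE-style) embedding receives
    versus the semantic (code-LLM-style) embedding."""
    gate_input = np.concatenate([graph_emb, code_emb], axis=-1)
    alpha = 1.0 / (1.0 + np.exp(-(gate_input @ W + b)))  # sigmoid gate in [0, 1]
    fused = alpha * graph_emb + (1.0 - alpha) * code_emb
    return fused, alpha  # alpha doubles as a per-sample explainability signal

d = 8
graph_emb = rng.normal(size=(2, d))    # structural stream (e.g. GNN over the uAST)
code_emb = rng.normal(size=(2, d))     # semantic stream (e.g. pooled LLM states)
W = rng.normal(size=(2 * d, d)) * 0.1  # gate parameters, learned in the real system
b = np.zeros(d)

fused, alpha = two_way_gated_fusion(graph_emb, code_emb, W, b)
print(fused.shape)
```

Reading off `alpha` per sample is what gives the gating its "intrinsic explainability at no additional cost": a value near 1 means the structural evidence dominated that decision, near 0 means the semantic stream did.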

Core claim

A hybrid structural-semantic detection stage using uAST normalization and two-way gated fusion of GraphSAGE and Qwen2.5-Coder embeddings feeds into execution-grounded agentic validation, after which validation-aware iterative repair proceeds only under the invariant that no repair action is taken without execution-based confirmation of exploitability. This lifecycle resolves 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate, with intra-language accuracies of 89.84-92.02% and zero-shot cross-language F1 of 74.43-80.12%. Ablations confirm that removing uAST drops cross-language F1 by 23.42% and disabling validation increases unnecessary repairs by 131.7%, establishing that both components are necessary to the end-to-end result.

What carries the argument

The execution-grounded agentic validation stage under the strict invariant of no repair without execution-based confirmation of exploitability, supported by uAST cross-language normalization and gated hybrid embeddings that also supply per-sample explainability.

If this is right

  • Intra-language detection accuracy reaches 89.84-92.02%.
  • Zero-shot cross-language F1 scores reach 74.43-80.12%.
  • 69.74% of vulnerabilities are resolved end-to-end at a 12.27% total failure rate.
  • Removing uAST normalization degrades cross-language F1 by 23.42%.
  • Disabling the validation stage increases unnecessary repairs by 131.7%.
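The lifecycle behind these numbers can be sketched as a small control loop. This is an editorial sketch under assumptions, not the authors' code: `detect`, `confirm_by_execution`, and `propose_patch` are hypothetical stand-ins for the three stages, and the toy integer payload exists only to exercise the invariant.

```python
# Hypothetical sketch of the verify-before-fix invariant: repair is attempted
# only after an execution-based check confirms exploitability, and each
# candidate patch is re-validated before being accepted.

def closed_loop_lifecycle(sample, detect, confirm_by_execution, propose_patch,
                          max_repair_attempts=3):
    if not detect(sample):
        return "clean"                   # detector sees no issue
    if not confirm_by_execution(sample):
        return "flagged-unconfirmed"     # invariant: no repair without execution evidence
    for _ in range(max_repair_attempts):
        patched = propose_patch(sample)
        if not confirm_by_execution(patched):
            return "repaired"            # the exploit no longer reproduces
        sample = patched                 # feed execution feedback into the next attempt
    return "failed"

# Toy stand-ins: an integer payload is "exploitable" while positive,
# and each patch attempt halves it.
result = closed_loop_lifecycle(
    sample=4,
    detect=lambda s: s > 0,
    confirm_by_execution=lambda s: s > 0,
    propose_patch=lambda s: s // 2,
)
print(result)
```

Under this reading, the 12.27% total failure rate corresponds to the `failed` branch, and the 131.7% rise in unnecessary repairs when validation is disabled corresponds to deleting the `flagged-unconfirmed` guard.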

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verify-before-repair loop could apply to other agentic code tasks such as automated test generation or refactoring where runtime checks are feasible.
  • Domains without easy execution environments would require surrogate verification mechanisms to achieve comparable trustworthiness.
  • The gating weights already provide intrinsic explainability that could be surfaced to developers for auditing LLM decisions in security tools.
  • Extending the approach to additional languages would test how well the uAST schema scales while keeping semantic fidelity.

Load-bearing premise

Execution-based confirmation of exploitability is always feasible and sufficient to catch all relevant vulnerability types while the uAST representation preserves all necessary semantic details across languages without significant loss.

What would settle it

Finding a substantial fraction of vulnerabilities that cannot be confirmed as exploitable through execution tests, or measuring a sharp drop in cross-language F1 when the uAST is applied to a language outside the tested set without schema extensions.

Figures

Figures reproduced from arXiv: 2604.10800 by Jugal Gajjar.

Figure 1. Three-stage lifecycle architecture. The Fusion Detector combines graph and LLM embed…
Figure 2. Intra-language detection performance. The hybrid model achieves 89.84–92.02% accuracy…
Figure 3. End-to-end pipeline metrics. The full system resolves 69.74% of vulnerabilities, eliminates…
Figure 4. Architecture selection grid comparing graph encoders and LLM backbones across nine…
Figure 5. Validation performance across languages. (a) Flag…
Figure 6. Repair performance metrics across languages. (a) Success rates: 81.37–87.27%. (b)…
Figure 7. Python validation agent workflow showing hypothesis generation, payload construction,…
Figure 8. Java validation agent workflow utilizing bytecode instrumentation via AspectJ…
Figure 9. C++ validation agent workflow utilizing AddressSanitizer, UndefinedBehaviorSanitizer,…
read the original abstract

Learned classifiers deployed in agentic pipelines face a fundamental reliability problem: predictions are probabilistic inferences, not verified conclusions, and acting on them without grounding in observable evidence leads to compounding failures across downstream stages. Software vulnerability analysis makes this cost concrete and measurable. We address this through a unified cross-language vulnerability lifecycle framework built around three LLM-driven reasoning stages-hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair-governed by a strict invariant: no repair action is taken without execution-based confirmation of exploitability. Cross-language generalization is achieved via a Universal Abstract Syntax Tree (uAST) normalizing Java, Python, and C++ into a shared structural schema, combined with a hybrid fusion of GraphSAGE and Qwen2.5-Coder-1.5B embeddings through learned two-way gating, whose per-sample weights provide intrinsic explainability at no additional cost. The framework achieves 89.84-92.02% intra-language detection accuracy and 74.43-80.12% zero-shot cross-language F1, resolving 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate. Ablations establish necessity: removing uAST degrades cross-language F1 by 23.42%, while disabling validation increases unnecessary repairs by 131.7%. These results demonstrate that execution-grounded closed-loop reasoning is a principled and practically deployable mechanism for trustworthy LLM-driven agentic AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces a unified cross-language vulnerability lifecycle framework consisting of hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair. It enforces the invariant that repairs are only taken after execution-based confirmation of exploitability. Cross-language generalization is facilitated by a Universal Abstract Syntax Tree (uAST) and a hybrid fusion of GraphSAGE and Qwen2.5-Coder embeddings via learned two-way gating. Empirical results show 89.84-92.02% intra-language detection accuracy, 74.43-80.12% zero-shot cross-language F1, and 69.74% end-to-end vulnerability resolution at a 12.27% failure rate, with ablations confirming the importance of uAST and validation.

Significance. Should the assumptions about execution feasibility and uAST semantic preservation hold, the work would represent a meaningful contribution to trustworthy agentic AI in software engineering. By grounding LLM inferences in observable execution evidence, it mitigates the risk of compounding errors in vulnerability analysis pipelines. The reported metrics and ablation studies provide concrete evidence of practicality, and the intrinsic explainability from gating weights is a nice addition. This could influence future designs of closed-loop reasoning systems for code.

major comments (3)
  1. [§5 (Evaluation)] §5 (Evaluation) and ablation results: The central trustworthiness claim and the 69.74% end-to-end resolution rate rest on execution-based confirmation being feasible and sufficient for every relevant vulnerability class. The manuscript reports that disabling validation increases bad repairs by 131.7% and that uAST removal drops cross-language F1 by 23.42%, but provides no systematic breakdown or test cases for vulnerability types where execution grounding is infeasible or inconclusive (e.g., non-crashing logic errors, race conditions, or environment-dependent exploits). This is load-bearing for the claim that the invariant produces trustworthy outcomes, as the ablations establish necessity but not sufficiency or boundary conditions.
  2. [§4.2 (uAST Construction)] §4.2 (uAST Construction) and cross-language results: The zero-shot F1 scores of 74.43-80.12% depend on the claim that uAST normalizes Java, Python, and C++ while preserving all semantics needed for detection and validation. Although the ablation shows a 23.42% F1 drop without uAST, the paper contains no direct evidence (such as semantic equivalence tests, round-trip fidelity metrics, or manual analysis of lost constructs) that the representation is lossless for vulnerability-relevant details across all language pairs.
  3. [§5.3 (Ablations)] §5.3 (Ablations) and Table reporting failure rates: The 12.27% total failure rate is presented as evidence of practical deployability, yet the evaluation does not decompose this rate into categories (e.g., cases where the agent cannot produce executable confirmation tests versus cases where confirmation is negative). Without this, it is difficult to assess whether the strict invariant leaves a non-negligible fraction of vulnerabilities unaddressed.
minor comments (3)
  1. [§4.1 (Hybrid Fusion)] The description of the learned two-way gating mechanism would be clearer if accompanied by an explicit equation showing how per-sample weights are computed from the two embedding streams and applied during fusion.
  2. [Figure 1] Figure 1 (pipeline overview) would benefit from explicit arrows or labels indicating the exact points at which execution feedback is injected back into the repair stage.
  3. [Related Work] A small number of citations to recent 2024 works on LLM-based code repair and verification appear to be missing from the related-work section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify areas where additional analysis would strengthen the trustworthiness claims. We respond point-by-point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: §5 (Evaluation) and ablation results: The central trustworthiness claim and the 69.74% end-to-end resolution rate rest on execution-based confirmation being feasible and sufficient for every relevant vulnerability class. The manuscript reports that disabling validation increases bad repairs by 131.7% and that uAST removal drops cross-language F1 by 23.42%, but provides no systematic breakdown or test cases for vulnerability types where execution grounding is infeasible or inconclusive (e.g., non-crashing logic errors, race conditions, or environment-dependent exploits). This is load-bearing for the claim that the invariant produces trustworthy outcomes, as the ablations establish necessity but not sufficiency or boundary conditions.

    Authors: We agree that the current presentation would benefit from explicit discussion of boundary conditions. Our evaluation uses vulnerability datasets (primarily from sources containing executable test cases) where confirmation via execution is feasible by design. For classes such as non-crashing logic errors or certain race conditions, execution grounding can be inconclusive without additional oracles. In the revision we will add a new subsection in §5 that (1) enumerates the vulnerability categories present in our test sets, (2) reports the fraction of cases where the agent could not synthesize a confirming test, and (3) discusses fallback strategies (e.g., flagging for manual review) when execution evidence is unavailable. This will clarify both the scope and the limitations of the invariant. revision: yes

  2. Referee: §4.2 (uAST Construction) and cross-language results: The zero-shot F1 scores of 74.43-80.12% depend on the claim that uAST normalizes Java, Python, and C++ while preserving all semantics needed for detection and validation. Although the ablation shows a 23.42% F1 drop without uAST, the paper contains no direct evidence (such as semantic equivalence tests, round-trip fidelity metrics, or manual analysis of lost constructs) that the representation is lossless for vulnerability-relevant details across all language pairs.

    Authors: The uAST schema was constructed to retain the structural and data-flow elements most relevant to vulnerability patterns (control-flow graphs, call sites, buffer and pointer operations). The large ablation drop supports that these elements are retained for the detection task. Nevertheless, we acknowledge the absence of explicit fidelity metrics. In the revised manuscript we will add (a) a table of representative cross-language construct mappings with vulnerability relevance, (b) a small-scale manual audit of 50 randomly sampled functions showing preservation or loss of key constructs, and (c) a brief discussion of language-specific features deliberately omitted because they fall outside typical vulnerability patterns. revision: yes

  3. Referee: §5.3 (Ablations) and Table reporting failure rates: The 12.27% total failure rate is presented as evidence of practical deployability, yet the evaluation does not decompose this rate into categories (e.g., cases where the agent cannot produce executable confirmation tests versus cases where confirmation is negative). Without this, it is difficult to assess whether the strict invariant leaves a non-negligible fraction of vulnerabilities unaddressed.

    Authors: We will revise §5.3 and the associated table to provide the requested decomposition. Using execution logs already collected during the experiments, we will break the 12.27% failure rate into three mutually exclusive categories: (1) agent unable to generate any executable test, (2) generated test executed but returned negative confirmation of exploitability, and (3) runtime or environment errors preventing confirmation. The revised table will report both absolute counts and percentages, allowing readers to evaluate how often the invariant results in an unaddressed vulnerability versus a correctly rejected false positive. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation with ablations is self-contained

full rationale

The paper reports measured performance (intra-language accuracy 89.84-92.02%, zero-shot cross-language F1 74.43-80.12%, end-to-end resolution 69.74%) and component ablations (uAST removal drops F1 23.42%, validation removal increases bad repairs 131.7%). No mathematical derivation, equations, or first-principles predictions exist that could reduce to inputs by construction. The strict invariant (no repair without execution confirmation) and uAST are design choices whose necessity is tested directly via ablation against external datasets, not self-referential fitting or self-citation chains. This is the common honest case of an empirical systems paper whose central claims are falsifiable by replication on held-out data.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claims rest on the assumption that uAST captures sufficient cross-language semantics and that execution grounding reliably verifies exploitability; these are presented as novel without external independent validation in the abstract.

free parameters (1)
  • learned two-way gating weights
    Per-sample weights for fusing GraphSAGE and Qwen2.5-Coder embeddings that also provide explainability.
axioms (1)
  • domain assumption uAST normalization preserves all necessary structural and semantic information for vulnerability detection across Java, Python, and C++
    Invoked to justify cross-language generalization without language-specific retraining.
invented entities (1)
  • Universal Abstract Syntax Tree (uAST) no independent evidence
    purpose: Normalizing Java, Python, and C++ into a shared structural schema for cross-language analysis
    Newly introduced component central to the cross-language capability.

pith-pipeline@v0.9.0 · 5569 in / 1383 out tokens · 77589 ms · 2026-05-10T15:11:36.247721+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 21 canonical work pages · 8 internal anchors

  1. [1]

    Learning on graph with laplacian regularization.Advances in neural information processing systems, 19, 2006

    Rie Ando and Tong Zhang. Learning on graph with laplacian regularization.Advances in neural information processing systems, 19, 2006

  2. [2]

    Dos and don’ts of machine learning in computer security

    Daniel Arp, Erwin Quiring, and Feargus Pendlebury et al. Dos and don’ts of machine learning in computer security. In 31st USENIX Security Symposium (USENIX Security 22), pages 3971–3988, 2022

  3. [3]

    Generating vulnerability security fixes with code language models.Information and Software Technology, 185:107786, 2025

    Guru Bhandari, Nikola Gavric, and Andrii Shalaginov. Generating vulnerability security fixes with code language models. Information and Software Technology, 185:107786, 2025

  4. [4]

    Cvefixes: automated collection of vulnerabilities and their fixes from open-source software

    Guru Bhandari, Amara Naseer, and Leon Moonen. Cvefixes: automated collection of vulnerabilities and their fixes from open-source software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, pages 30–39, 2021

  5. [5]

    Vulnerability-fix-dataset: A curated repository for training and evaluating automated vulnerability remediation models

    Sitanath Biswas. Vulnerability-fix-dataset: A curated repository for training and evaluating automated vulnerability remediation models. Kaggle, https://www.kaggle.com/datasets/jiscecseaiml/vulnerability-fix-dataset, 2026

  6. [6]

    tree-sitter/tree-sitter: v0.26.6

    Max Brunsfeld, Amaan Qureshi, and Andrew Hlynskyi et al. tree-sitter/tree-sitter: v0.26.6. Zenodo, 2026

  7. [7]

    Vul4j: A dataset of reproducible java vulnerabilities geared towards the study of program repair techniques

    Quang-Cuong Bui, Riccardo Scandariato, and Nicolás E Díaz Ferreyra. Vul4j: A dataset of reproducible java vulnerabilities geared towards the study of program repair techniques. In Proceedings of the 19th International Conference on Mining Software Repositories, pages 464–468, 2022

  8. [8]

    Saikat Chakraborty, Rahul Krishna, and Yangruibo et al. Ding. Deep learning based vulnerability detection: Are we there yet?IEEE Transactions on Software Engineering, 48(9):3280–3296, 2021

  9. [9]

    Yizheng Chen, Zhoujie Ding, and Lamya et al. Alowain. Diversevul: A new vulnerable source code dataset for deep learning based vulnerability detection. InProceedings of the 26th international symposium on research in attacks, intrusions and defenses, pages 654–668, 2023

  10. [10]

    Zimin Chen, Steve Kommrusch, and Michele et al. Tufano. Sequencer: Sequence-to- sequence learning for end-to-end program repair.IEEE Transactions on Software Engineering, 47(9):1943–1959, 2019

  11. [11]

    Benefits and risks of ai in health care: narrative review.Interactive Journal of Medical Research, 13(1):e53616, 2024

    Margaret Chustecki. Benefits and risks of ai in health care: narrative review.Interactive Journal of Medical Research, 13(1):e53616, 2024

  12. [12]

    Python vulnerability remediation: A specialized dataset for instruction-tuning llms on python security patches

    Cmonplz. Python vulnerability remediation: A specialized dataset for instruction-tuning llms on python security patches. Hugging Face Datasets, https://huggingface.co/datasets/ cmonplz/Python_Vulnerability_Remediation, 2026

  13. [13]

    Code_vulnerability_security_dpo: A preference-aligned dataset for securing code generation through direct preference optimization

    CyberNative. Code_vulnerability_security_dpo: A preference-aligned dataset for securing code generation through direct preference optimization. Hugging Face Datasets, https://huggingface.co/datasets/CyberNative/Code_Vulnerability_ Security_DPO, 2026

  14. [14]

    Yangruibo Ding, Yanjun Fu, and Omniyyah et al. Ibrahim. Vulnerability detection with code language models: How far are we?arXiv preprint arXiv:2403.18624, 2024

  15. [15]

    Zhangyin Feng, Daya Guo, and Duyu Tang et al. Codebert: A pre-trained model for programming and natural languages. In Findings of the association for computational linguistics: EMNLP 2020, pages 1536–1547, 2020

  16. [16]

    Linevul: A transformer-based line-level vulnerability prediction

    Michael Fu and Chakkrit Tantithamthavorn. Linevul: A transformer-based line-level vulnerability prediction. In Proceedings of the 19th international conference on mining software repositories, pages 608–620, 2022

  17. [17]

    Bridging semantics & structure for software vulnerability detection using hybrid network models.arXiv preprint arXiv:2510.10321, 2025

    Jugal Gajjar, Kaustik Ranaware, and Kamalasankari Subramaniakuppusamy. Bridging semantics & structure for software vulnerability detection using hybrid network models.arXiv preprint arXiv:2510.10321, 2025

  18. [18]

    Mlcpd: A unified multi-language code parsing dataset with universal ast schema.arXiv preprint arXiv:2510.16357, 2025

    Jugal Gajjar and Kamalasankari Subramaniakuppusamy. Mlcpd: A unified multi-language code parsing dataset with universal ast schema.arXiv preprint arXiv:2510.16357, 2025

  19. [19]

    Malcodeai: Autonomous vulnerability detection and remediation via language agnostic code reasoning

    Jugal Gajjar, Kamalasankari Subramaniakuppusamy, and Noha El Kachach. Malcodeai: Autonomous vulnerability detection and remediation via language agnostic code reasoning. In 2025 IEEE International Conference on Information Reuse and Integration and Data Science (IRI), pages 31–36. IEEE, 2025

  20. [20]

    Jugal Gajjar, Kamalasankari Subramaniakuppusamy, and Relsy Puthal et al. Securefixagent: A hybrid llm agent for automated python static vulnerability repair. arXiv preprint arXiv:2509.16275, 2025

  21. [21]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing.arXiv preprint arXiv:2305.11738, 2023

  22. [22]

    Daya Guo, Shuo Ren, and Shuai et al. Lu. Graphcodebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366, 2020

  23. [23]

    Daya Guo, Qihao Zhu, and Dejian et al. Yang. Deepseek-coder: when the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196, 2024

  24. [24]

    Inductive representation learning on large graphs.Advances in neural information processing systems, 30, 2017

    Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs.Advances in neural information processing systems, 30, 2017

  25. [25]

    Nima Shiri Harzevili, Alvine Boaye Belle, and Junjie et al. Wang. A survey on automated software vulnerability detection using machine learning and deep learning.arXiv preprint arXiv:2306.11673, 2023

  26. [26]

    Edward J Hu, Yelong Shen, and Phillip et al. Wallis. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  27. [27]

    Large Language Models Cannot Self-Correct Reasoning Yet

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet.arXiv preprint arXiv:2310.01798, 2023

  28. [28]

    Binyuan Hui, Jian Yang, and Zeyu et al. Cui. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

  29. [29]

    Ziwei Ji, Nayeon Lee, and Rita et al. Frieske. Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023

  30. [30]

    An overview of aspectj

    Gregor Kiczales, Erik Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm, and William G Griswold. An overview of aspectj. InEuropean Conference on Object-Oriented Programming, pages 327–354. Springer, 2001

  31. [31]

    Youngjoon Kim, Sunguk Shin, and Hyoungshick et al. Kim. Logs in, patches out: Automated vulnerability repair via tree-of-thought llm analysis. In34th USENIX Security Symposium (USENIX Security 25), pages 4401–4419, 2025

  32. [32]

    Semi-Supervised Classification with Graph Convolutional Networks

    Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907, 2016

  33. [33]

    Llvm: A compilation framework for lifelong program analysis & transformation

    Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. InInternational symposium on code generation and optimization, 2004. CGO 2004., pages 75–86. IEEE, 2004

  34. [34]

    Claire Le Goues, ThanhVu Nguyen, and Stephanie et al. Forrest. Genprog: A generic method for automatic software repair.Ieee transactions on software engineering, 38(1):54–72, 2011

  35. [35]

    Youpeng Li, Kartik Joshi, and Xinda et al. Wang. Mavul: Multi-agent vulnerability detection via contextual reasoning and interactive refinement.arXiv preprint arXiv:2510.00317, 2025

  36. [36]

    Nelson F Liu, Kevin Lin, and John et al. Hewitt. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

  37. [37]

    Automatic patch generation by learning correct code

    Fan Long and Martin Rinard. Automatic patch generation by learning correct code. In Proceedings of the 43rd annual ACM SIGPLAN-SIGACT symposium on principles of programming languages, pages 298–312, 2016

  38. [38]

    Thibaud Lutellier, Hung Viet Pham, and Lawrence Pang et al. Coconut: combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis, pages 101–114, 2020

  39. [39]

    Are static analysis violations really fixed? a closer look at realistic usage of sonarqube

    Diego Marcilio, Rodrigo Bonifácio, and Eduardo Monteiro et al. Are static analysis violations really fixed? a closer look at realistic usage of sonarqube. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pages 209–219. IEEE, 2019

  40. [40]

    vulnerable-code_chitchat_doss1232: A conversational dataset for instruction-tuning security-focused dialogue agents

    MarioVar. vulnerable-code_chitchat_doss1232: A conversational dataset for instruction-tuning security-focused dialogue agents. Hugging Face Datasets, https://huggingface.co/datasets/MarioVar/vulnerable-code_chitchat_doss1232, 2026

  41. [41]

    Chao Ni, Liyu Shen, and Xiaohu et al. Yang. Megavul: Ac/c++ vulnerability dataset with comprehensive code representations. InProceedings of the 21st International Conference on Mining Software Repositories, pages 738–742, 2024

  42. [42]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  43. [43]

    Code_vulnerability_labeled_dataset

    Abdellah Oumida and Mohammed Sbaihi. Code_vulnerability_labeled_dataset. Hugging Face Datasets, https://huggingface.co/datasets/lemon42-ai/Code_Vulnerability_ Labeled_Dataset, 2025

  44. [44]

    Clang: A c language family frontend for llvm

    LLVM Project. Clang: A c language family frontend for llvm. https://clang.llvm.org/, 2026

  45. [45]

    Redbaron: A bottom-up approach to fst (full syntax tree) for python

    RedBaron Project. Redbaron: A bottom-up approach to fst (full syntax tree) for python. https://github.com/PyCQA/redbaron, 2026

  46. [46]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, and Fabian et al. Gloeckle. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

  47. [47]

    Source code vulnerability: A compact collection of labeled vulnerable code in c++, java, python, and more

    Marat Saratov. Source code vulnerability: A compact collection of labeled vulnerable code in c++, java, python, and more. Kaggle, https://www.kaggle.com/datasets/ maratsaratov/source-code-vulnerability, 2026

  48. [48]

    Minjae Seo, Wonwoo Choi, and Myoungsung et al. You. Autopatch: Multi-agent framework for patching real-world cve vulnerabilities.arXiv preprint arXiv:2505.04195, 2025

  49. [49]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  50. [50]

    cppvul: A curated dataset of c++ vulnerabilities for security-oriented language model training

    Shyyshawarma. cppvul: A curated dataset of c++ vulnerabilities for security-oriented language model training. Hugging Face Datasets, https://huggingface.co/datasets/Shyyshawarma/cppvul, 2026

  51. [51]

    Shiyu Sun, Shu Wang, and Xinda et al. Wang. Exploring security commits in python. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 171–181. IEEE, 2023

  52. [52]

    Javaparser: A set of libraries for analyzing, transforming, and generating java source code.https://github.com/javaparser/javaparser, 2026

    JavaParser Team. Javaparser: A set of libraries for analyzing, transforming, and generating java source code.https://github.com/javaparser/javaparser, 2026

  53. [53]

    Vulnerable programming dataset: A comprehensive collection of 550 unique code vulnerabilities across 10 programming languages

    Sunny Thakur. Vulnerable programming dataset: A comprehensive collection of 550 unique code vulnerabilities across 10 programming languages. Kaggle, https://www.kaggle.com/ datasets/cyberprince/vulnerable-programming-dataset, 2026

  54. [54]

    Deep learning aided software vulnerability detection: A survey.arXiv preprint arXiv:2503.04002, 2025

    Md Nizam Uddin, Yihe Zhang, and Xiali Hei. Deep learning aided software vulnerability detection: A survey.arXiv preprint arXiv:2503.04002, 2025

  55. [55]

    Saad Ullah, Praneeth Balasubramanian, and Wenbo Guo et al. From cve entries to verifiable exploits: An automated multi-agent framework for reproducing cves. arXiv preprint arXiv:2509.01835, 2025

  56. [56]

    Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change.Advances in Neural Information Processing Systems, 36:38975–38987, 2023

  57. [57]

    Graph Attention Networks

    Petar Veličković, Guillem Cucurull, and Arantxa Casanova et al. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017

  58. [58]

    Sixuan Wang, Chen Huang, and Dongjin et al. Yu. Vulgrab: Graph-embedding-based code vulnerability detection with bi-directional gated graph neural network.Software: Practice and Experience, 53(8):1631–1658, 2023

  59. [59]

    Xinchen Wang, Ruida Hu, and Cuiyun et al. Gao. Reposvul: A repository-level high-quality vulnerability dataset. InProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pages 472–483, 2024

  60. [60]

    Yue Wang, Weishi Wang, and Shafiq et al. Joty. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 8696–8708, 2021

  61. [61]

    Ziliang Wang, Ge Li, and Jia et al. Li. Vulagent: Hypothesis-validation based multi-agent vulnerability detection.arXiv preprint arXiv:2509.11523, 2025

  62. [62]

    Boyang Yang, Zijian Cai, and Fengling et al. Liu. A survey of llm-based automated program repair: Taxonomies, design paradigms, and applications.arXiv preprint arXiv:2506.23749, 2025

  63. [63]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  64. [64]

    Ziyao Zhang, Chong Wang, and Yanlin et al. Wang. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation.Proceedings of the ACM on Software Engineering, 2(ISSTA):481–503, 2025

  65. [65]

    Xin Zhou, Sicong Cao, and Xiaobing et al. Sun. Large language model for vulnerability detection and repair: Literature review and the road ahead.ACM Transactions on Software Engineering and Methodology, 34(5):1–31, 2025

  66. [66]

    Yaqin Zhou, Shangqing Liu, and Jingkai Siow et al. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems, 32, 2019