Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis
Pith reviewed 2026-05-10 15:11 UTC · model grok-4.3
The pith
Confirming exploitability through execution before any repair enables trustworthy LLM agents for cross-language code vulnerability analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A hybrid structural-semantic detection stage, built on uAST normalization and two-way gated fusion of GraphSAGE and Qwen2.5-Coder embeddings, feeds into execution-grounded agentic validation; validation-aware iterative repair then proceeds only under the invariant that no repair action is taken without execution-based confirmation of exploitability. This lifecycle resolves 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate, with intra-language accuracies of 89.84-92.02% and zero-shot cross-language F1 of 74.43-80.12%. Ablations confirm that removing uAST drops cross-language F1 by 23.42% and that disabling validation increases unnecessary repairs by 131.7%, establishing that both components are necessary.
What carries the argument
The execution-grounded agentic validation stage under the strict invariant of no repair without execution-based confirmation of exploitability, supported by uAST cross-language normalization and gated hybrid embeddings that also supply per-sample explainability.
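The two-way gating that carries the explainability claim can be sketched in a few lines. The projection widths, single scalar gate, and randomly initialized weights below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_way_gated_fusion(g, s, params):
    """Blend a structural (GraphSAGE-style) embedding g with a semantic
    (code-LLM) embedding s via a learned per-sample gate."""
    g_proj = g @ params["Wg"]                    # project both streams
    s_proj = s @ params["Ws"]                    # to a shared width
    gate_in = np.concatenate([g_proj, s_proj], axis=-1)
    alpha = sigmoid(gate_in @ params["w_gate"])  # per-sample weight in [0, 1]
    fused = alpha * g_proj + (1.0 - alpha) * s_proj
    return fused, alpha  # alpha doubles as the "free" explainability signal

# Stand-ins for trained weights; dimensions are assumptions, except that
# Qwen2.5-Coder-1.5B's hidden size is 1536.
params = {
    "Wg": rng.normal(size=(128, 64)) / np.sqrt(128),
    "Ws": rng.normal(size=(1536, 64)) / np.sqrt(1536),
    "w_gate": rng.normal(size=(128, 1)) / np.sqrt(128),
}
g = rng.normal(size=(4, 128))   # batch of GraphSAGE embeddings
s = rng.normal(size=(4, 1536))  # batch of code-LLM embeddings
fused, alpha = two_way_gated_fusion(g, s, params)
```

A gate value near 1 means the structural stream dominated for that sample, near 0 the semantic stream, which is why the per-sample weights can be surfaced directly as an audit signal.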
If this is right
- Intra-language detection accuracy reaches 89.84-92.02%.
- Zero-shot cross-language F1 scores reach 74.43-80.12%.
- 69.74% of vulnerabilities are resolved end-to-end at a 12.27% total failure rate.
- Removing uAST normalization degrades cross-language F1 by 23.42%.
- Disabling the validation stage increases unnecessary repairs by 131.7%.
Where Pith is reading between the lines
- The same verify-before-repair loop could apply to other agentic code tasks such as automated test generation or refactoring where runtime checks are feasible.
- Domains without easy execution environments would require surrogate verification mechanisms to achieve comparable trustworthiness.
- The gating weights already provide intrinsic explainability that could be surfaced to developers for auditing LLM decisions in security tools.
- Extending the approach to additional languages would test how well the uAST schema scales while keeping semantic fidelity.
Load-bearing premise
Execution-based confirmation of exploitability is always feasible and sufficient to catch all relevant vulnerability types while the uAST representation preserves all necessary semantic details across languages without significant loss.
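What this premise assumes can be made concrete with a toy normalization table. The node-kind names below follow tree-sitter's grammar conventions; the shared labels and tables are hypothetical, since the paper's actual uAST schema (which also carries data-flow and call-site information) is not reproduced here:

```python
# Toy per-language node-kind tables (tree-sitter-style grammar names); the
# real uAST schema is richer and also encodes data flow and pointer operations.
UAST_KIND = {
    "java":   {"method_declaration": "FunctionDef", "if_statement": "If",
               "method_invocation": "Call"},
    "python": {"function_definition": "FunctionDef", "if_statement": "If",
               "call": "Call"},
    "cpp":    {"function_definition": "FunctionDef", "if_statement": "If",
               "call_expression": "Call"},
}

def normalize(lang: str, node_kind: str) -> str:
    """Map a language-specific AST node kind onto a shared uAST label,
    falling back to a catch-all bucket for unmapped constructs."""
    return UAST_KIND[lang].get(node_kind, "Other")
```

Constructs that fall through to the `Other` bucket are exactly where semantic loss can hide: anything unmapped becomes invisible to a cross-language detector, which is what the premise assumes away.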
What would settle it
Finding a substantial fraction of vulnerabilities that cannot be confirmed as exploitable through execution tests, or measuring a sharp drop in cross-language F1 when the uAST is applied to a language outside the tested set without schema extensions.
Original abstract
Learned classifiers deployed in agentic pipelines face a fundamental reliability problem: predictions are probabilistic inferences, not verified conclusions, and acting on them without grounding in observable evidence leads to compounding failures across downstream stages. Software vulnerability analysis makes this cost concrete and measurable. We address this through a unified cross-language vulnerability lifecycle framework built around three LLM-driven reasoning stages-hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair-governed by a strict invariant: no repair action is taken without execution-based confirmation of exploitability. Cross-language generalization is achieved via a Universal Abstract Syntax Tree (uAST) normalizing Java, Python, and C++ into a shared structural schema, combined with a hybrid fusion of GraphSAGE and Qwen2.5-Coder-1.5B embeddings through learned two-way gating, whose per-sample weights provide intrinsic explainability at no additional cost. The framework achieves 89.84-92.02% intra-language detection accuracy and 74.43-80.12% zero-shot cross-language F1, resolving 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate. Ablations establish necessity: removing uAST degrades cross-language F1 by 23.42%, while disabling validation increases unnecessary repairs by 131.7%. These results demonstrate that execution-grounded closed-loop reasoning is a principled and practically deployable mechanism for trustworthy LLM-driven agentic AI.
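The abstract's governing invariant amounts to a small control loop. A minimal sketch, in which the `synthesize_exploit` and `repair` callables are hypothetical stand-ins for the paper's agentic stages:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    code: str                   # the flagged function
    predicted_vulnerable: bool  # detector output (a probabilistic inference)

def lifecycle(finding, synthesize_exploit, repair, max_rounds=3):
    """Verify-before-fix loop: no repair is attempted without an executable
    confirmation of exploitability, and every patch is re-validated."""
    if not finding.predicted_vulnerable:
        return "not_flagged"
    exploit = synthesize_exploit(finding.code)  # try to build a confirming test
    if exploit is None:
        return "unconfirmed_no_repair"          # invariant forbids repairing
    if not exploit(finding.code):
        return "confirmed_safe_no_repair"       # detector false positive caught
    code = finding.code
    for _ in range(max_rounds):                 # validation-aware iterative repair
        code = repair(code)
        if not exploit(code):                   # re-execute against the patch
            return "repaired_and_verified"
    return "repair_failed"

# Toy instantiation: "exploitable" means the code still calls eval().
result = lifecycle(
    Finding(code="eval(user_input)", predicted_vulnerable=True),
    synthesize_exploit=lambda code: (lambda c: "eval(" in c),
    repair=lambda code: code.replace("eval(", "safe_parse("),
)
```

The two early returns are the point of the design: a finding with no executable confirmation is never repaired, which is what drives the reported 131.7% reduction in unnecessary repairs relative to the no-validation ablation.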
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a unified cross-language vulnerability lifecycle framework consisting of hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair. It enforces the invariant that repairs are only taken after execution-based confirmation of exploitability. Cross-language generalization is facilitated by a Universal Abstract Syntax Tree (uAST) and a hybrid fusion of GraphSAGE and Qwen2.5-Coder embeddings via learned two-way gating. Empirical results show 89.84-92.02% intra-language detection accuracy, 74.43-80.12% zero-shot cross-language F1, and 69.74% end-to-end vulnerability resolution at a 12.27% failure rate, with ablations confirming the importance of uAST and validation.
Significance. Should the assumptions about execution feasibility and uAST semantic preservation hold, the work would represent a meaningful contribution to trustworthy agentic AI in software engineering. By grounding LLM inferences in observable execution evidence, it mitigates the risk of compounding errors in vulnerability analysis pipelines. The reported metrics and ablation studies provide concrete evidence of practicality, and the intrinsic explainability from gating weights is a nice addition. This could influence future designs of closed-loop reasoning systems for code.
Major comments (3)
- [§5 (Evaluation), ablations] The central trustworthiness claim and the 69.74% end-to-end resolution rate rest on execution-based confirmation being feasible and sufficient for every relevant vulnerability class. The manuscript reports that disabling validation increases unnecessary repairs by 131.7% and that uAST removal drops cross-language F1 by 23.42%, but it provides no systematic breakdown or test cases for vulnerability types where execution grounding is infeasible or inconclusive (e.g., non-crashing logic errors, race conditions, or environment-dependent exploits). This is load-bearing for the claim that the invariant produces trustworthy outcomes, as the ablations establish necessity but not sufficiency or boundary conditions.
- [§4.2 (uAST Construction), cross-language results] The zero-shot F1 scores of 74.43-80.12% depend on the claim that uAST normalizes Java, Python, and C++ while preserving all semantics needed for detection and validation. Although the ablation shows a 23.42% F1 drop without uAST, the paper offers no direct evidence (such as semantic-equivalence tests, round-trip fidelity metrics, or a manual analysis of lost constructs) that the representation is lossless for vulnerability-relevant details across all language pairs.
- [§5.3 (Ablations), failure-rate table] The 12.27% total failure rate is presented as evidence of practical deployability, yet the evaluation does not decompose it into categories (e.g., cases where the agent cannot produce executable confirmation tests versus cases where confirmation is negative). Without this, it is difficult to assess whether the strict invariant leaves a non-negligible fraction of vulnerabilities unaddressed.
Minor comments (3)
- [§4.1 (Hybrid Fusion)] The description of the learned two-way gating mechanism would be clearer if accompanied by an explicit equation showing how per-sample weights are computed from the two embedding streams and applied during fusion.
- [Figure 1] Figure 1 (pipeline overview) would benefit from explicit arrows or labels indicating the exact points at which execution feedback is injected back into the repair stage.
- [Related Work] A small number of citations to recent 2024 works on LLM-based code repair and verification appear to be missing from the related-work section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify areas where additional analysis would strengthen the trustworthiness claims. We respond point-by-point below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [§5 (Evaluation), ablations] The central trustworthiness claim and the 69.74% end-to-end resolution rate rest on execution-based confirmation being feasible and sufficient for every relevant vulnerability class. The manuscript reports that disabling validation increases unnecessary repairs by 131.7% and that uAST removal drops cross-language F1 by 23.42%, but it provides no systematic breakdown or test cases for vulnerability types where execution grounding is infeasible or inconclusive (e.g., non-crashing logic errors, race conditions, or environment-dependent exploits). This is load-bearing for the claim that the invariant produces trustworthy outcomes, as the ablations establish necessity but not sufficiency or boundary conditions.
Authors: We agree that the current presentation would benefit from explicit discussion of boundary conditions. Our evaluation uses vulnerability datasets (primarily from sources containing executable test cases) where confirmation via execution is feasible by design. For classes such as non-crashing logic errors or certain race conditions, execution grounding can be inconclusive without additional oracles. In the revision we will add a new subsection in §5 that (1) enumerates the vulnerability categories present in our test sets, (2) reports the fraction of cases where the agent could not synthesize a confirming test, and (3) discusses fallback strategies (e.g., flagging for manual review) when execution evidence is unavailable. This will clarify both the scope and the limitations of the invariant. revision: yes
Referee: [§4.2 (uAST Construction), cross-language results] The zero-shot F1 scores of 74.43-80.12% depend on the claim that uAST normalizes Java, Python, and C++ while preserving all semantics needed for detection and validation. Although the ablation shows a 23.42% F1 drop without uAST, the paper offers no direct evidence (such as semantic-equivalence tests, round-trip fidelity metrics, or a manual analysis of lost constructs) that the representation is lossless for vulnerability-relevant details across all language pairs.
Authors: The uAST schema was constructed to retain the structural and data-flow elements most relevant to vulnerability patterns (control-flow graphs, call sites, buffer and pointer operations). The large ablation drop supports that these elements are retained for the detection task. Nevertheless, we acknowledge the absence of explicit fidelity metrics. In the revised manuscript we will add (a) a table of representative cross-language construct mappings with vulnerability relevance, (b) a small-scale manual audit of 50 randomly sampled functions showing preservation or loss of key constructs, and (c) a brief discussion of language-specific features deliberately omitted because they fall outside typical vulnerability patterns. revision: yes
Referee: [§5.3 (Ablations), failure-rate table] The 12.27% total failure rate is presented as evidence of practical deployability, yet the evaluation does not decompose it into categories (e.g., cases where the agent cannot produce executable confirmation tests versus cases where confirmation is negative). Without this, it is difficult to assess whether the strict invariant leaves a non-negligible fraction of vulnerabilities unaddressed.
Authors: We will revise §5.3 and the associated table to provide the requested decomposition. Using execution logs already collected during the experiments, we will break the 12.27% failure rate into three mutually exclusive categories: (1) agent unable to generate any executable test, (2) generated test executed but returned negative confirmation of exploitability, and (3) runtime or environment errors preventing confirmation. The revised table will report both absolute counts and percentages, allowing readers to evaluate how often the invariant results in an unaddressed vulnerability versus a correctly rejected false positive. revision: yes
Circularity Check
No circularity: empirical evaluation with ablations is self-contained
Full rationale
The paper reports measured performance (intra-language accuracy 89.84-92.02%, zero-shot cross-language F1 74.43-80.12%, end-to-end resolution 69.74%) and component ablations (uAST removal drops F1 by 23.42%, validation removal increases unnecessary repairs by 131.7%). There are no mathematical derivations, equations, or first-principles predictions that could reduce the conclusions to the inputs by construction. The strict invariant (no repair without execution confirmation) and the uAST are design choices whose necessity is tested directly via ablation against external datasets, not through self-referential fitting or self-citation chains. This is the common honest case of an empirical systems paper whose central claims are falsifiable by replication on held-out data.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Learned two-way gating weights
Axioms (1)
- Domain assumption: uAST normalization preserves all necessary structural and semantic information for vulnerability detection across Java, Python, and C++
Invented entities (1)
- Universal Abstract Syntax Tree (uAST): no independent evidence