pith. machine review for the scientific record.

arxiv: 2604.19049 · v1 · submitted 2026-04-21 · 💻 cs.CR · cs.AI · cs.SE

Recognition: unknown

Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:07 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SE
keywords LLM defect discovery · adversarial multi-agent review · false positive filtering · security vulnerability detection · precision in AI reports · empirical validation gate · multi-agent critique

The pith

Adversarial multi-agent review with kill mandates filters most false positives from LLM defect reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Refute-or-Promote as a structured way to raise precision in LLM-assisted defect discovery. It generates candidate reports through stratified context hunting and then routes them through promotion gates where dedicated adversarial agents receive kill mandates to disprove them. Context asymmetry and cross-model critique reduce shared blind spots, and a mandatory empirical gate requires concrete testing before any candidate advances to disclosure. In a 31-day run across seven targets the process discarded roughly 79 percent of 171 candidates, with survivors producing externally accepted CVEs, C++ standard changes, compiler fixes, and other real outcomes.
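As a quick consistency check on those aggregates, the stated rates imply the following rough counts (the rates and totals come from the paper; the arithmetic is ours):

```python
# Rough counts implied by the reported aggregates; the rates and totals
# come from the review above, the arithmetic is ours.
total, kill_rate = 171, 0.79          # 31-day campaign, retrospective aggregate
killed = round(total * kill_rate)     # ~135 candidates discarded before disclosure
survivors = total - killed            # ~36 candidates advanced

subset_n, subset_rate = 30, 0.83      # consolidated-protocol subset (lcms2, wolfSSL)
subset_killed = round(subset_n * subset_rate)  # ~25 of 30 killed prospectively
```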

Core claim

Refute-or-Promote is an inference-time pattern that combines stratified context hunting for candidate generation, adversarial agents given explicit kill mandates at each gate, context asymmetry between reviewers, cross-model critique to catch correlated errors, and a final mandatory empirical validation step. No defect was found autonomously by the agents; the contribution lies in the external filtering structure that removed most plausible-but-wrong reports before they reached maintainers.

What carries the argument

The Refute-or-Promote pattern of successive promotion gates at which adversarial agents attempt to disprove each LLM-generated defect candidate.
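A minimal sketch of that gate structure, with hypothetical names (`Candidate`, `run_gates`) standing in for whatever the authors actually implemented: each gate is an adversarial refutation attempt under a kill mandate, and a candidate is promoted only if every gate, including the final empirical one, fails to disprove it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Candidate:
    claim: str
    evidence: str
    kill_reason: Optional[str] = None  # set by whichever gate refutes it

# A gate is a kill mandate: it returns a refutation string, or None if the
# adversarial reviewer failed to disprove the candidate.
Gate = Callable[[Candidate], Optional[str]]

def run_gates(candidates: List[Candidate], gates: List[Gate]) -> List[Candidate]:
    """Advance candidates through successive promotion gates; the first
    successful refutation at any gate kills the candidate."""
    survivors = []
    for cand in candidates:
        for gate in gates:
            cand.kill_reason = gate(cand)
            if cand.kill_reason is not None:
                break
        if cand.kill_reason is None:
            survivors.append(cand)
    return survivors

# Toy gates: an adversarial reviewer that kills unsupported claims, and an
# empirical gate that kills anything lacking a concrete reproducer.
def adversarial_review(c: Candidate) -> Optional[str]:
    return None if c.evidence else "no supporting context found"

def empirical_gate(c: Candidate) -> Optional[str]:
    return None if "reproducer" in c.evidence else "no failing test demonstrated"
```

In this toy run, a candidate backed only by reviewer endorsement passes the adversarial gate but dies at the empirical one, mirroring the unanimously endorsed non-existent padding oracle the paper reports.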

If this is right

  • High elimination rates mean far fewer incorrect reports reach human maintainers.
  • Surviving candidates produced four CVEs plus accepted changes to the C++ standard and compilers.
  • The empirical gate proved necessary when all reviewers initially endorsed a non-existent bug.
  • A simplified variant also resolved previously unsolved instances on SWE-bench Verified.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pattern indicates that LLM systems for discovery tasks gain reliability more from external refute structures than from internal improvements alone.
  • Automating the empirical gate with targeted test harnesses could allow the method to scale to larger candidate volumes.
  • Similar staged kill-mandate review could apply to other domains where LLMs produce plausible but unverifiable outputs, such as hypothesis generation in science.
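If the empirical gate were automated as the second bullet suggests, one plausible harness shape (entirely hypothetical, not the paper's tooling) would require each candidate to ship a concrete reproducer script and advance it only on an observed failure:

```python
import os
import subprocess
import sys
import tempfile

def automated_empirical_gate(reproducer_source: str, timeout_s: int = 30) -> bool:
    """Return True only if the reproducer demonstrably triggers the defect.

    Convention assumed here: the reproducer exits non-zero when the defect
    fires, zero when the code under test behaves correctly.
    """
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(reproducer_source)
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode != 0
    except subprocess.TimeoutExpired:
        return False  # a hang is "not demonstrated", not proof of a defect
    finally:
        os.unlink(path)
```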

Load-bearing premise

Adversarial agents supplied with kill mandates and separated contexts can eliminate false-positive defect reports without also discarding genuine defects, with the empirical gate acting as the final check.

What would settle it

A test set containing both confirmed real defects and known fabricated reports where the pipeline either advances a fabricated report to disclosure or discards a real defect that independent verification later confirms.

original abstract

LLM-assisted defect discovery has a precision crisis: plausible-but-wrong reports overwhelm maintainers and degrade credibility for real findings. We present Refute-or-Promote, an inference-time reliability pattern combining Stratified Context Hunting (SCH) for candidate generation, adversarial kill mandates, context asymmetry, and a Cross-Model Critic (CMC). Adversarial agents attempt to disprove candidates at each promotion gate; cold-start reviewers are intended to reduce anchoring cascades; cross-family review can catch correlated blind spots that same-family review misses. Over a 31-day campaign across 7 targets (security libraries, the ISO C++ standard, major compilers), the pipeline killed roughly 79% of 171 candidates before advancing to disclosure (retrospective aggregate); on a consolidated-protocol subset (lcms2, wolfSSL; n=30), the prospective kill rate was 83%. Outcomes: 4 CVEs (3 public, 1 embargoed); LWG 4549 accepted to the C++ working paper; 5 merged C++ editorial PRs; 3 compiler conformance bugs; 8 merged security-related fixes without CVE; an RFC 9000 errata filed under committee review; and 1+ FIPS 140-3 normative compliance issues under coordinated disclosure -- all evaluated by external acceptance, not benchmarks. The most instructive failure: ten dedicated reviewers unanimously endorsed a non-existent Bleichenbacher padding oracle in OpenSSL's CMS module; it was killed only by a single empirical test, motivating the mandatory empirical gate. No vulnerability was discovered autonomously; the contribution is external structure that filters LLM agents' persistent false positives. As a preliminary transfer test beyond defect discovery, a simplified cross-family critique variant also solved five previously unsolved SymPy instances on SWE-bench Verified and one SWE-rebench hard task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Refute-or-Promote, an inference-time adversarial stage-gated multi-agent methodology for LLM-assisted defect discovery. It combines Stratified Context Hunting (SCH) for candidate generation, adversarial agents with kill mandates, context asymmetry to mitigate anchoring, a Cross-Model Critic (CMC), and a mandatory empirical gate. Over a 31-day campaign on 7 targets, the pipeline reports retrospective kill rates of ~79% on 171 candidates and prospective rates of 83% on a 30-candidate subset (lcms2, wolfSSL), yielding 4 CVEs, acceptance of LWG 4549 into the C++ working paper, 5 merged editorial PRs, 3 compiler bugs, 8 merged security fixes, an RFC 9000 errata, and FIPS issues—all validated via external acceptance rather than internal benchmarks. The paper highlights a failure case in which 10 reviewers unanimously endorsed a non-existent Bleichenbacher padding oracle in OpenSSL CMS, killed only by empirical testing, and includes a preliminary transfer test on SWE-bench Verified.

Significance. If the filtering claims hold, the work provides a practical, externally validated pattern for raising precision in LLM-driven security analysis and standards work, with demonstrated real-world impact through CVEs and accepted changes. The explicit use of external acceptance (CVEs, PR merges, working-paper acceptance) as the evaluation criterion, rather than synthetic benchmarks, is a strength, as is the detailed documentation of the unanimous false-positive endorsement that motivated the empirical gate. The preliminary SWE-bench transfer result suggests broader applicability beyond defect discovery.

major comments (2)
  1. [Abstract and evaluation section] The central claim that adversarial kill mandates, context asymmetry, and the mandatory empirical gate reliably eliminate persistent false positives without collateral loss of true defects is not supported by any measurement of false-negative rate. No controlled experiment on labeled data or ground-truth defect pool is reported to quantify how many true defects (if any) were discarded at any gate; only promoted items' external acceptance is shown.
  2. [Methodology description] The account of how cold-start reviewers and cross-family critique prevent loss of true positives lacks any formal safeguard, backtracking rule, or ablation showing survival rates for known-true candidates; without this, the assertion that the structure 'filters … persistent false positives' without discarding true defects remains unquantified.
minor comments (2)
  1. [Abstract] The abstract refers to a 'consolidated-protocol subset' without defining the protocol differences or selection criteria, which obscures interpretation of the 83% prospective kill rate.
  2. [Evaluation section] Candidate lists, per-gate kill reasons, and decision logs are not provided even in summary form, limiting reproducibility and independent assessment of the 79%/83% aggregates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the real-world impact demonstrated through external validations such as CVEs and standards acceptances. We address the major comments below, agreeing that certain quantifications are absent and proposing partial revisions to clarify limitations while preserving the manuscript's emphasis on external evaluation criteria.

point-by-point responses
  1. Referee: [Abstract and evaluation section] The central claim that adversarial kill mandates, context asymmetry, and the mandatory empirical gate reliably eliminate persistent false positives without collateral loss of true defects is not supported by any measurement of false-negative rate. No controlled experiment on labeled data or ground-truth defect pool is reported to quantify how many true defects (if any) were discarded at any gate; only promoted items' external acceptance is shown.

    Authors: We agree that no controlled false-negative rate is reported, as the real-world setting provides no complete ground-truth pool of all possible defects across the targets. Evaluation instead relies on external acceptance of promoted items (4 CVEs, LWG 4549, merged PRs, etc.). The mandatory empirical gate is motivated by the documented case of unanimous reviewer endorsement of a non-existent Bleichenbacher oracle, which was killed only by testing. We do not assert zero collateral loss but note that the structure surfaced multiple externally validated defects. We will revise the evaluation section to explicitly discuss this limitation and the rationale for prioritizing external validation over synthetic FNR metrics. revision: partial

  2. Referee: [Methodology description] The account of how cold-start reviewers and cross-family critique prevent loss of true positives lacks any formal safeguard, backtracking rule, or ablation showing survival rates for known-true candidates; without this, the assertion that the structure 'filters … persistent false positives' without discarding true defects remains unquantified.

    Authors: We acknowledge the absence of formal safeguards, backtracking rules, or ablations on known-true candidates. Cold-start reviewers and cross-family critique are intended to mitigate anchoring and correlated blind spots, but without labeled data these mechanisms cannot be quantified via survival rates. The 31-day campaign did promote multiple defects that received external acceptance. We will revise the methodology section to clarify the design intent of these elements and to note the lack of ablations as a limitation, while emphasizing that the pipeline's value is shown through real-world outcomes rather than internal benchmarks. revision: partial

Circularity Check

0 steps flagged

No circularity: outcomes rest on external acceptances independent of internal pipeline definitions

full rationale

The paper presents a multi-agent review methodology and reports aggregate kill rates plus specific external outcomes (4 CVEs, LWG 4549 acceptance, merged PRs, etc.) from a real-world campaign. These results are evaluated by third-party acceptance rather than by any internal prediction, fitted parameter, or quantity defined by the methodology itself. No equations, self-citations, ansatzes, or uniqueness theorems appear in the provided text that would reduce the reported findings to the inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about LLM behavior and the effectiveness of adversarial review rather than explicit free parameters or new physical entities.

axioms (1)
  • domain assumption LLM agents produce plausible-but-wrong defect reports at high rates that require external adversarial filtering to reach usable precision.
    Stated as the precision crisis motivating the entire pipeline.

pith-pipeline@v0.9.0 · 5639 in / 1485 out tokens · 55765 ms · 2026-05-10T03:07:36.637875+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 31 canonical work pages · 6 internal anchors
