pith. machine review for the scientific record.

arxiv: 2604.19049 · v1 · submitted 2026-04-21 · 💻 cs.CR · cs.AI · cs.SE

Recognition: unknown

Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:07 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SE
keywords LLM defect discovery · adversarial multi-agent review · false positive filtering · security vulnerability detection · precision in AI reports · empirical validation gate · multi-agent critique

The pith

Adversarial multi-agent review with kill mandates filters most false positives from LLM defect reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Refute-or-Promote as a structured way to raise precision in LLM-assisted defect discovery. It generates candidate reports through stratified context hunting and then routes them through promotion gates where dedicated adversarial agents receive kill mandates to disprove them. Context asymmetry and cross-model critique reduce shared blind spots, and a mandatory empirical gate requires concrete testing before any candidate advances to disclosure. In a 31-day run across seven targets the process discarded roughly 79 percent of 171 candidates, with survivors producing externally accepted CVEs, C++ standard changes, compiler fixes, and other real outcomes.
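As a quick consistency check on those aggregates, the stated rates imply the following rough counts (the rates and totals come from the paper; the arithmetic is ours):

```python
# Rough counts implied by the reported aggregates; the rates and totals
# come from the review above, the arithmetic is ours.
total, kill_rate = 171, 0.79          # 31-day campaign, retrospective aggregate
killed = round(total * kill_rate)     # ~135 candidates discarded before disclosure
survivors = total - killed            # ~36 candidates advanced

subset_n, subset_rate = 30, 0.83      # consolidated-protocol subset (lcms2, wolfSSL)
subset_killed = round(subset_n * subset_rate)  # ~25 of 30 killed prospectively
```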

Core claim

Refute-or-Promote is an inference-time pattern that combines stratified context hunting for candidate generation, adversarial agents given explicit kill mandates at each gate, context asymmetry between reviewers, cross-model critique to catch correlated errors, and a final mandatory empirical validation step. No defect was found autonomously by the agents; the contribution lies in the external filtering structure that removed most plausible-but-wrong reports before they reached maintainers.

What carries the argument

The Refute-or-Promote pattern of successive promotion gates at which adversarial agents attempt to disprove each LLM-generated defect candidate.
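A minimal sketch of that gate structure, with hypothetical names (`Candidate`, `run_gates`) standing in for whatever the authors actually implemented: each gate is an adversarial refutation attempt under a kill mandate, and a candidate is promoted only if every gate, including the final empirical one, fails to disprove it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Candidate:
    claim: str
    evidence: str
    kill_reason: Optional[str] = None  # set by whichever gate refutes it

# A gate is a kill mandate: it returns a refutation string, or None if the
# adversarial reviewer failed to disprove the candidate.
Gate = Callable[[Candidate], Optional[str]]

def run_gates(candidates: List[Candidate], gates: List[Gate]) -> List[Candidate]:
    """Advance candidates through successive promotion gates; the first
    successful refutation at any gate kills the candidate."""
    survivors = []
    for cand in candidates:
        for gate in gates:
            cand.kill_reason = gate(cand)
            if cand.kill_reason is not None:
                break
        if cand.kill_reason is None:
            survivors.append(cand)
    return survivors

# Toy gates: an adversarial reviewer that kills unsupported claims, and an
# empirical gate that kills anything lacking a concrete reproducer.
def adversarial_review(c: Candidate) -> Optional[str]:
    return None if c.evidence else "no supporting context found"

def empirical_gate(c: Candidate) -> Optional[str]:
    return None if "reproducer" in c.evidence else "no failing test demonstrated"
```

In this toy run, a candidate backed only by reviewer endorsement passes the adversarial gate but dies at the empirical one, mirroring the unanimously endorsed non-existent padding oracle the paper reports.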

If this is right

  • High elimination rates mean far fewer incorrect reports reach human maintainers.
  • Surviving candidates produced four CVEs plus accepted changes to the C++ standard and compilers.
  • The empirical gate proved necessary when all reviewers initially endorsed a non-existent bug.
  • A simplified variant also resolved previously unsolved instances on SWE-bench Verified.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pattern indicates that LLM systems for discovery tasks gain reliability more from external refute structures than from internal improvements alone.
  • Automating the empirical gate with targeted test harnesses could allow the method to scale to larger candidate volumes.
  • Similar staged kill-mandate review could apply to other domains where LLMs produce plausible but unverifiable outputs, such as hypothesis generation in science.
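If the empirical gate were automated as the second bullet suggests, one plausible harness shape (entirely hypothetical, not the paper's tooling) would require each candidate to ship a concrete reproducer script and advance it only on an observed failure:

```python
import os
import subprocess
import sys
import tempfile

def automated_empirical_gate(reproducer_source: str, timeout_s: int = 30) -> bool:
    """Return True only if the reproducer demonstrably triggers the defect.

    Convention assumed here: the reproducer exits non-zero when the defect
    fires, zero when the code under test behaves correctly.
    """
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(reproducer_source)
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode != 0
    except subprocess.TimeoutExpired:
        return False  # a hang is "not demonstrated", not proof of a defect
    finally:
        os.unlink(path)
```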

Load-bearing premise

Adversarial agents supplied with kill mandates and separated contexts can eliminate false-positive defect reports without also discarding genuine defects, with the empirical gate acting as the final check.

What would settle it

A test set containing both confirmed real defects and known fabricated reports where the pipeline either advances a fabricated report to disclosure or discards a real defect that independent verification later confirms.

original abstract

LLM-assisted defect discovery has a precision crisis: plausible-but-wrong reports overwhelm maintainers and degrade credibility for real findings. We present Refute-or-Promote, an inference-time reliability pattern combining Stratified Context Hunting (SCH) for candidate generation, adversarial kill mandates, context asymmetry, and a Cross-Model Critic (CMC). Adversarial agents attempt to disprove candidates at each promotion gate; cold-start reviewers are intended to reduce anchoring cascades; cross-family review can catch correlated blind spots that same-family review misses. Over a 31-day campaign across 7 targets (security libraries, the ISO C++ standard, major compilers), the pipeline killed roughly 79% of 171 candidates before advancing to disclosure (retrospective aggregate); on a consolidated-protocol subset (lcms2, wolfSSL; n=30), the prospective kill rate was 83%. Outcomes: 4 CVEs (3 public, 1 embargoed); LWG 4549 accepted to the C++ working paper; 5 merged C++ editorial PRs; 3 compiler conformance bugs; 8 merged security-related fixes without CVE; an RFC 9000 errata filed under committee review; and 1+ FIPS 140-3 normative compliance issues under coordinated disclosure -- all evaluated by external acceptance, not benchmarks. The most instructive failure: ten dedicated reviewers unanimously endorsed a non-existent Bleichenbacher padding oracle in OpenSSL's CMS module; it was killed only by a single empirical test, motivating the mandatory empirical gate. No vulnerability was discovered autonomously; the contribution is external structure that filters LLM agents' persistent false positives. As a preliminary transfer test beyond defect discovery, a simplified cross-family critique variant also solved five previously unsolved SymPy instances on SWE-bench Verified and one SWE-rebench hard task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Refute-or-Promote, an inference-time adversarial stage-gated multi-agent methodology for LLM-assisted defect discovery. It combines Stratified Context Hunting (SCH) for candidate generation, adversarial agents with kill mandates, context asymmetry to mitigate anchoring, a Cross-Model Critic (CMC), and a mandatory empirical gate. Over a 31-day campaign on 7 targets, the pipeline reports retrospective kill rates of ~79% on 171 candidates and prospective rates of 83% on a 30-candidate subset (lcms2, wolfSSL), yielding 4 CVEs, acceptance of LWG 4549 into the C++ working paper, 5 merged editorial PRs, 3 compiler bugs, 8 merged security fixes, an RFC 9000 errata, and FIPS issues—all validated via external acceptance rather than internal benchmarks. The paper highlights a failure case in which 10 reviewers unanimously endorsed a non-existent Bleichenbacher padding oracle in OpenSSL CMS, killed only by empirical testing, and includes a preliminary transfer test on SWE-bench Verified.

Significance. If the filtering claims hold, the work provides a practical, externally validated pattern for raising precision in LLM-driven security analysis and standards work, with demonstrated real-world impact through CVEs and accepted changes. The explicit use of external acceptance (CVEs, PR merges, working-paper acceptance) as the evaluation criterion, rather than synthetic benchmarks, is a strength, as is the detailed documentation of the unanimous false-positive endorsement that motivated the empirical gate. The preliminary SWE-bench transfer result suggests broader applicability beyond defect discovery.

major comments (2)
  1. [Abstract and evaluation section] The central claim that adversarial kill mandates, context asymmetry, and the mandatory empirical gate reliably eliminate persistent false positives without collateral loss of true defects is not supported by any measurement of false-negative rate. No controlled experiment on labeled data or ground-truth defect pool is reported to quantify how many true defects (if any) were discarded at any gate; only promoted items' external acceptance is shown.
  2. [Methodology description] The account of how cold-start reviewers and cross-family critique prevent loss of true positives lacks any formal safeguard, backtracking rule, or ablation showing survival rates for known-true candidates; without this, the assertion that the structure 'filters … persistent false positives' without discarding true defects remains unquantified.
minor comments (2)
  1. [Abstract] The abstract refers to a 'consolidated-protocol subset' without defining the protocol differences or selection criteria, which obscures interpretation of the 83% prospective kill rate.
  2. [Evaluation section] Candidate lists, per-gate kill reasons, and decision logs are not provided even in summary form, limiting reproducibility and independent assessment of the 79%/83% aggregates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the real-world impact demonstrated through external validations such as CVEs and standards acceptances. We address the major comments below, agreeing that certain quantifications are absent and proposing partial revisions to clarify limitations while preserving the manuscript's emphasis on external evaluation criteria.

point-by-point responses
  1. Referee: [Abstract and evaluation section] The central claim that adversarial kill mandates, context asymmetry, and the mandatory empirical gate reliably eliminate persistent false positives without collateral loss of true defects is not supported by any measurement of false-negative rate. No controlled experiment on labeled data or ground-truth defect pool is reported to quantify how many true defects (if any) were discarded at any gate; only promoted items' external acceptance is shown.

    Authors: We agree that no controlled false-negative rate is reported, as the real-world setting provides no complete ground-truth pool of all possible defects across the targets. Evaluation instead relies on external acceptance of promoted items (4 CVEs, LWG 4549, merged PRs, etc.). The mandatory empirical gate is motivated by the documented case of unanimous reviewer endorsement of a non-existent Bleichenbacher oracle, which was killed only by testing. We do not assert zero collateral loss but note that the structure surfaced multiple externally validated defects. We will revise the evaluation section to explicitly discuss this limitation and the rationale for prioritizing external validation over synthetic FNR metrics. revision: partial

  2. Referee: [Methodology description] The account of how cold-start reviewers and cross-family critique prevent loss of true positives lacks any formal safeguard, backtracking rule, or ablation showing survival rates for known-true candidates; without this, the assertion that the structure 'filters … persistent false positives' without discarding true defects remains unquantified.

    Authors: We acknowledge the absence of formal safeguards, backtracking rules, or ablations on known-true candidates. Cold-start reviewers and cross-family critique are intended to mitigate anchoring and correlated blind spots, but without labeled data these mechanisms cannot be quantified via survival rates. The 31-day campaign did promote multiple defects that received external acceptance. We will revise the methodology section to clarify the design intent of these elements and to note the lack of ablations as a limitation, while emphasizing that the pipeline's value is shown through real-world outcomes rather than internal benchmarks. revision: partial

Circularity Check

0 steps flagged

No circularity: outcomes rest on external acceptances independent of internal pipeline definitions

full rationale

The paper presents a multi-agent review methodology and reports aggregate kill rates plus specific external outcomes (4 CVEs, LWG 4549 acceptance, merged PRs, etc.) from a real-world campaign. These results are evaluated by third-party acceptance rather than by any internal prediction, fitted parameter, or quantity defined by the methodology itself. No equations, self-citations, ansatzes, or uniqueness theorems appear in the provided text that would reduce the reported findings to the inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about LLM behavior and the effectiveness of adversarial review rather than explicit free parameters or new physical entities.

axioms (1)
  • domain assumption LLM agents produce plausible-but-wrong defect reports at high rates that require external adversarial filtering to reach usable precision.
    Stated as the precision crisis motivating the entire pipeline.

pith-pipeline@v0.9.0 · 5639 in / 1485 out tokens · 55765 ms · 2026-05-10T03:07:36.637875+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 31 canonical work pages · 6 internal anchors
