Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

Bhavik Shangari; Chandra Khatri; Sunny Nehra; Vipul Dholariya; Vivek Dahiya

arxiv: 2605.23243 · v5 · pith:GM5KJ4KJnew · submitted 2026-05-22 · 💻 cs.CR · cs.AI

Vivek Dahiya, Sunny Nehra, Vipul Dholariya, Bhavik Shangari, Chandra Khatri

Pith reviewed 2026-06-30 16:23 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords: cybersecurity · LLM evaluation · vulnerability detection · penetration testing · vertical foundation models · false positive rates · black-box testing · white-box benchmarks

The pith

Domain-specialized agents using structured penetration-testing methodology detect over 50 percent of vulnerabilities per family, while frontier LLMs reach only 4-8 percent coverage even with tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six frontier LLMs and two specialized models on white-box function-level vulnerability detection across C, Java and Python plus black-box testing of five production web applications containing 118 confirmed vulnerabilities. Frontier models show 10-50 percent false positive rates on the white-box task and cover only 4-8 percent of ground-truth issues in black-box mode, rising to 10-19 percent when given external tools. Encoding structured penetration-testing steps into domain-specialized agents lifts per-family detection above 50 percent, and a single-GPU specialized defense model reaches 0.904 precision with a 9.7 percent false positive rate. The authors locate the gap in missing end-to-end security traces, failure-heavy data and multi-step attack chains, and propose self-play security testing to generate the needed data. These results support building vertical foundation models purpose-built for cybersecurity rather than relying on general frontier scale.

Core claim

Frontier LLMs are not ready for cybersecurity tasks: they produce high false positive rates and low ground-truth coverage. Encoding structured methodology in domain-specialized agents raises per-family detection above 50 percent, and a domain-specialized defense model achieves the highest precision and lowest false positive rate. The paper takes this as evidence that methodology, rather than scale, is the decisive factor.
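
For readers who want the two white-box metrics pinned down, here is a minimal sketch. This page does not say how the paper defines false positive rate, so the code assumes the conventional FP / (FP + TN). The `ConfusionCounts` container and the example counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Function-level detection outcomes for one model (hypothetical container)."""
    tp: int  # vulnerable functions flagged as vulnerable
    fp: int  # benign functions flagged as vulnerable
    tn: int  # benign functions correctly passed
    fn: int  # vulnerable functions missed


def precision(c: ConfusionCounts) -> float:
    """Share of flagged functions that are truly vulnerable: TP / (TP + FP)."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def false_positive_rate(c: ConfusionCounts) -> float:
    """Share of benign functions wrongly flagged: FP / (FP + TN) (assumed definition)."""
    benign = c.fp + c.tn
    return c.fp / benign if benign else 0.0


# Illustrative counts only; these are not figures from the paper.
example = ConfusionCounts(tp=90, fp=10, tn=90, fn=10)
print(f"precision={precision(example):.3f} fpr={false_positive_rate(example):.3f}")
```

Under this definition the two metrics have different denominators, so a model can pair high precision with a high false positive rate when benign functions are rare in the test set.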

What carries the argument

Dual-mode benchmarks (white-box VulnLLM-R and black-box testing on five apps with 118 ground-truth vulnerabilities) that isolate the effect of encoding structured penetration-testing methodology into agents versus relying on general model scale.

If this is right

Every tested frontier model over-predicts vulnerabilities at 10-50 percent false positive rates on white-box detection.
Black-box coverage stays below 20 percent even when frontier models are paired with external security tools.
A domain-specialized defense model reaches 0.904 precision and 9.7 percent false positive rate on one GPU.
Absence of end-to-end request/response sequences, failure cases and multi-step attack chains in training data is the core bottleneck.
Self-play security testing is proposed as a scalable way to generate the missing structured traces.
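
To make the "missing traces" point concrete, the sketch below shows one way a structured security-testing trace could be recorded. The schema is an assumption made for illustration: `TraceStep`, `SecurityTrace`, and all field names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TraceStep:
    """One request/response exchange in a security test (hypothetical schema)."""
    request: str          # request line sent to the target
    response_status: int  # HTTP status code returned by the target
    succeeded: bool       # whether this step advanced the test


@dataclass
class SecurityTrace:
    """An ordered test attempt, kept whether or not it ultimately worked."""
    cwe_family: str
    steps: List[TraceStep] = field(default_factory=list)

    @property
    def is_multi_step(self) -> bool:
        """True when the trace chains more than one exchange."""
        return len(self.steps) > 1

    @property
    def failure_ratio(self) -> float:
        """Fraction of steps that did not advance the test."""
        if not self.steps:
            return 0.0
        failed = sum(1 for s in self.steps if not s.succeeded)
        return failed / len(self.steps)


# Illustrative trace: two failed probes followed by one that succeeds.
trace = SecurityTrace(
    cwe_family="CWE-89",
    steps=[
        TraceStep("GET /items?id=1'", 500, False),
        TraceStep("GET /items?id=1 OR 1=1", 403, False),
        TraceStep("GET /items?id=1 UNION SELECT 1", 200, True),
    ],
)
print(trace.is_multi_step, round(trace.failure_ratio, 2))
```

Keeping the failed steps in the record, rather than only the final successful request, is what makes such a trace "failure-heavy" in the sense the summary describes.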

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training pipelines for any security model would need to prioritize failure traces and long attack chains over additional clean code examples.
The same methodology-encoding approach could be tested on other high-stakes verticals such as financial transaction monitoring.
If the performance gap persists after larger frontier models are released, it would strengthen the case that domain-specific data curation is required rather than raw scale.

Load-bearing premise

The five production-style applications and 118 ground-truth vulnerabilities across more than twenty CWE families give a representative measure of real-world performance.

What would settle it

A frontier model achieving above 50 percent per-family detection on the same black-box applications without any domain-specialized methodology or additional structured traces.
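
The settling test above depends on how per-family detection is scored. The sketch below is one possible reading, in which a model clears the bar only if every CWE family exceeds 50 percent. The paper may aggregate differently, and the identifiers and data here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple


def per_family_detection(
    ground_truth: Sequence[Tuple[str, str]],  # (vulnerability id, CWE family)
    detected_ids: Set[str],
) -> Dict[str, float]:
    """Detection rate within each CWE family."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for vuln_id, family in ground_truth:
        totals[family] += 1
        if vuln_id in detected_ids:
            hits[family] += 1
    return {family: hits[family] / totals[family] for family in totals}


def overall_coverage(
    ground_truth: Sequence[Tuple[str, str]], detected_ids: Set[str]
) -> float:
    """Share of all ground-truth vulnerabilities that were detected."""
    if not ground_truth:
        return 0.0
    found = sum(1 for vuln_id, _ in ground_truth if vuln_id in detected_ids)
    return found / len(ground_truth)


# Illustrative data only; not the paper's ground-truth inventory.
truth = [
    ("V1", "CWE-79"), ("V2", "CWE-79"),
    ("V3", "CWE-89"), ("V4", "CWE-89"),
    ("V5", "CWE-352"),
]
detected = {"V1", "V3", "V4"}

rates = per_family_detection(truth, detected)
clears_bar = all(rate > 0.5 for rate in rates.values())
print(rates, overall_coverage(truth, detected), clears_bar)
```

In this toy case overall coverage is 60 percent, yet the model fails the strict per-family reading because one family sits at exactly 50 percent and another at zero. This is why the aggregation rule matters when comparing against the paper's threshold.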

Figures

Figures reproduced from arXiv: 2605.23243 by Bhavik Shangari, Chandra Khatri, Sunny Nehra, Vipul Dholariya, Vivek Dahiya.

Original abstract

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Specialized agents outperform frontier models on this benchmark, but the five-app test set is too narrow to support broad claims about methodology over scale.

The paper's main result is that domain-specialized agents reach above 50% per-family detection on the black-box tasks while frontier models stay at 4-19% coverage, and one specialized defense model posts 0.904 precision with a 9.7% false-positive rate. That comparison is the concrete takeaway.

What the work does is run a dual-mode evaluation: white-box function-level detection across C, Java, and Python plus black-box testing on five production-style applications that contain 118 labeled vulnerabilities spanning more than 20 CWE families. They test six frontier models against two specialized ones under four paradigms and report the gap. Releasing the benchmark is a clear positive.

The soft spot is scale and selection. Five applications are not enough to establish that the results reflect typical enterprise surfaces rather than the particular codebases chosen. The abstract gives no diversity metrics, no comparison against public vulnerability corpora, and no statistical tests or error bars on the reported percentages. The training-data-bottleneck argument follows from the numbers but is not backed by any analysis of actual training traces.

This paper is for groups working on LLM agents for security or on vertical models. Anyone deciding whether to invest in general scaling versus domain data will find the head-to-head numbers worth seeing.

It deserves peer review because the question is timely and the measurements are direct, even though the methods section will need expansion and the benchmark will need external validation before the conclusions can be treated as settled.

Referee Report

2 major / 2 minor

Summary. The paper evaluates frontier LLMs (GPT-5.4, Codex 5.3, Claude Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro, Gemini 3 Flash) and two domain-specialized models on cybersecurity via a dual-mode benchmark: white-box function-level detection (VulnLLM-R across C/Java/Python) and black-box testing on five production-style applications containing 118 ground-truth vulnerabilities across 20+ CWE families. It reports 10-50% false positives in white-box detection, 4-19% ground-truth coverage in black-box even with tools like Playwright/Burp, >50% per-family detection via structured agents, and top precision (0.904) plus lowest FPR (9.7%) for the specialized defense model. The authors attribute results to training-data bottlenecks (lack of end-to-end traces, failure-heavy data, multi-step chains) and advocate vertical foundation models plus self-play data generation; the benchmark will be open-sourced.

Significance. If the benchmark is representative, the work supplies concrete empirical evidence that methodology and specialization outperform raw scale in cybersecurity (e.g., frontier coverage remains 4-19% while specialized agents exceed 50% per family), strengthening the case for vertical models. The planned open-sourcing of the five applications, 118 vulnerabilities, and VulnLLM-R constitutes a reproducible contribution that future work can directly extend or falsify.

major comments (2)

[Benchmark construction (methods section)] The description of the five applications and 118 vulnerabilities supplies no quantitative justification for representativeness: no diversity metrics, codebase-size statistics, comparison against public vulnerability corpora (e.g., CVE, OWASP, or Juliet), or external validation that the 20+ CWE families and production-style apps capture typical enterprise attack surfaces. Representativeness directly underpins the central claims of 4-19% frontier coverage and “methodology, not scale” as the primary lever.
[Results reporting (abstract and black-box evaluation tables)] Concrete percentages (4-8%, 10-19%, per-family >50%) are presented without error bars, confidence intervals, or statistical significance tests across the 118 vulnerabilities or 20+ CWE families, weakening the robustness of the comparative claims between frontier and specialized models.
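
To show why the missing error bars matter at this sample size, here is a rough sketch that computes a Wilson score interval for a coverage proportion out of 118 vulnerabilities. The count of 7 detections is a hypothetical value chosen to fall inside the reported 4-8% range; it is not a figure from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical: 7 of 118 ground-truth vulnerabilities detected (about 5.9%).
low, high = wilson_interval(7, 118)
print(f"coverage={7 / 118:.3f} interval=({low:.3f}, {high:.3f})")
```

With only 118 items, an interval around a single-digit coverage figure spans several percentage points, so neighbouring models in the 4-8% band may not be distinguishable from one another.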

minor comments (2)

[Abstract] Model identifiers in the abstract (GPT-5.4, Codex 5.3, etc.) use approximate version strings; provide exact checkpoint names, API versions, and prompting templates used in each paradigm for reproducibility.
[VulnLLM-R description] The white-box VulnLLM-R benchmark is introduced without a table summarizing its size, language distribution, or ground-truth labeling procedure; a summary table would clarify the 10-50% false-positive claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and commit to revisions that directly strengthen the manuscript's claims on benchmark representativeness and statistical robustness.

Referee: [Benchmark construction (methods section)] The description of the five applications and 118 vulnerabilities supplies no quantitative justification for representativeness: no diversity metrics, codebase-size statistics, comparison against public vulnerability corpora (e.g., CVE, OWASP, or Juliet), or external validation that the 20+ CWE families and production-style apps capture typical enterprise attack surfaces. Representativeness directly underpins the central claims of 4-19% frontier coverage and “methodology, not scale” as the primary lever.

Authors: We agree that explicit quantitative justification is needed to support claims about representativeness. In the revised manuscript we will add a new subsection to Methods that reports: (i) codebase statistics (LOC, endpoints, and language breakdown) for each of the five applications; (ii) a direct comparison of the observed 20+ CWE families against the distribution in OWASP Top 10 and a sample of recent CVEs; and (iii) diversity metrics such as Shannon entropy over CWE classes. Because the applications and vulnerability set will be open-sourced, these metrics can be independently verified.
Revision: yes
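
The diversity metric named in the response can be sketched as follows. This is a minimal illustration of Shannon entropy over class labels, with a normalized variant added for comparability. The function names and the toy label list are assumptions and do not reflect the paper's actual CWE distribution.

```python
import math
from collections import Counter
from typing import Sequence


def shannon_entropy_bits(labels: Sequence[str]) -> float:
    """Shannon entropy (base 2) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(labels: Sequence[str]) -> float:
    """Entropy divided by its maximum, log2(number of distinct classes)."""
    k = len(set(labels))
    return shannon_entropy_bits(labels) / math.log2(k) if k > 1 else 0.0


# Toy label list: three findings in one class, one in another.
toy_labels = ["CWE-79", "CWE-79", "CWE-79", "CWE-89"]
print(round(shannon_entropy_bits(toy_labels), 4), round(normalized_entropy(toy_labels), 4))
```

A normalized value near 1 would indicate that vulnerabilities are spread evenly across families, while a value near 0 would indicate that a few families dominate the benchmark.
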
Referee: [Results reporting (abstract and black-box evaluation tables)] Concrete percentages (4-8%, 10-19%, per-family >50%) are presented without error bars, confidence intervals, or statistical significance tests across the 118 vulnerabilities or 20+ CWE families, weakening the robustness of the comparative claims between frontier and specialized models.

Authors: We accept that the absence of uncertainty quantification weakens the comparative claims. In the revision we will add 95% bootstrap confidence intervals (resampled over the 118 vulnerabilities) to all reported coverage, precision, and FPR figures in the abstract, tables, and text. We will also include pairwise statistical tests (McNemar’s test with Bonferroni correction) between frontier and specialized models, reported both overall and per CWE family where sample sizes permit.
Revision: yes
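
The analysis promised in this response could look roughly like the sketch below: a percentile bootstrap over per-vulnerability outcomes, an exact two-sided McNemar test on paired outcomes, and a Bonferroni adjustment. It uses only the standard library. The outcome vectors are hypothetical, and the rebuttal does not say which bootstrap variant or McNemar form the authors would use.

```python
import math
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(outcomes: Sequence[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 detection outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def mcnemar_exact(a: Sequence[int], b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value from the discordant pairs of a and b."""
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical paired outcomes over 118 vulnerabilities (1 = detected).
specialized = [1] * 60 + [0] * 58
frontier = [1] * 10 + [0] * 108

ci_low, ci_high = bootstrap_ci(frontier)
p_value = mcnemar_exact(specialized, frontier)
print((ci_low, ci_high), p_value, bonferroni([p_value, 0.5]))
```

McNemar's test fits here because both systems are scored on the same 118 vulnerabilities, so only the pairs where exactly one system succeeds carry information about the difference.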

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation

The paper reports direct measurements of LLM performance on two benchmarks (VulnLLM-R white-box detection and black-box testing of five applications with 118 ground-truth vulnerabilities) against explicit ground truth. No equations, fitted parameters, predictions derived from inputs, or self-citations are used to support the central claims; results are presented as observed detection rates, precision, and false-positive rates. The evaluation is self-contained against the stated benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark study with no mathematical derivations; relies on standard assumptions that the chosen applications and vulnerabilities constitute valid test cases for cybersecurity capability.
