Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
Pith reviewed 2026-06-30 16:23 UTC · model grok-4.3
The pith
Domain-specialized agents that encode structured penetration-testing methodology detect over 50 percent of vulnerabilities per family, while frontier LLMs reach only 4-8 percent ground-truth coverage on their own and 10-19 percent with external security tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier LLMs are not ready for cybersecurity tasks because they produce high false positive rates and low ground-truth coverage; structured methodology encoded in domain-specialized agents raises per-family detection above 50 percent and a domain-specialized defense model achieves the highest precision and lowest false positive rate, showing that methodology rather than scale is the decisive factor.
What carries the argument
Dual-mode benchmarks (white-box VulnLLM-R and black-box testing on five apps with 118 ground-truth vulnerabilities) that isolate the effect of encoding structured penetration-testing methodology into agents versus relying on general model scale.
If this is right
- Every tested frontier model over-predicts vulnerabilities at 10-50 percent false positive rates on white-box detection.
- Black-box coverage stays below 20 percent even when frontier models are paired with external security tools.
- A domain-specialized defense model reaches 0.904 precision and 9.7 percent false positive rate on one GPU.
- Absence of end-to-end request/response sequences, failure cases and multi-step attack chains in training data is the core bottleneck.
- Self-play security testing is proposed as a scalable way to generate the missing structured traces.
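The precision and false positive rate figures above can be made concrete with a small sketch. It assumes the standard confusion-matrix definitions (precision = TP / (TP + FP), FPR = FP / (FP + TN)); the review does not state the paper's exact formulas, and the `Confusion` class and `tally` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for a binary vulnerable / not-vulnerable classifier."""
    tp: int = 0
    fp: int = 0
    tn: int = 0
    fn: int = 0

    @property
    def precision(self) -> float:
        # Share of flagged functions that are truly vulnerable.
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def false_positive_rate(self) -> float:
        # Share of safe functions that were wrongly flagged.
        negatives = self.fp + self.tn
        return self.fp / negatives if negatives else 0.0


def tally(truth: Sequence[bool], predicted: Sequence[bool]) -> Confusion:
    """Count TP/FP/TN/FN from parallel ground-truth and prediction labels."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return Confusion(tp, fp, tn, fn)
```

A model that flags almost everything can keep recall high while its false positive rate climbs, which is the over-prediction pattern the first bullet describes.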
Where Pith is reading between the lines
- Training pipelines for any security model would need to prioritize failure traces and long attack chains over additional clean code examples.
- The same methodology-encoding approach could be tested on other high-stakes verticals such as financial transaction monitoring.
- If the performance gap persists after larger frontier models are released, it would strengthen the case that domain-specific data curation is required rather than raw scale.
Load-bearing premise
The five production-style applications and 118 ground-truth vulnerabilities across more than twenty CWE families give a representative measure of real-world performance.
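One way to probe this premise is to measure how evenly the ground-truth set spreads across CWE families. The sketch below uses Shannon entropy over CWE classes, a diversity metric the simulated rebuttal further down also proposes. The function names are hypothetical, and the labels passed in would come from the benchmark's ground-truth inventory, which this review does not reproduce.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy(labels: Iterable[str]) -> float:
    """Entropy in bits of the empirical distribution over class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def normalized_entropy(labels: Iterable[str]) -> float:
    """Entropy divided by its maximum, log2(k) for k distinct classes.

    1.0 means vulnerabilities are spread evenly across families;
    values near 0.0 mean a few families dominate.
    """
    labels = list(labels)
    k = len(set(labels))
    if k <= 1:
        return 0.0
    return shannon_entropy(labels) / math.log2(k)
```

A low normalized entropy would suggest that a handful of families drive the headline coverage numbers, which would weaken any claim of representativeness.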
What would settle it
A frontier model achieving above 50 percent per-family detection on the same black-box applications without any domain-specialized methodology or additional structured traces.
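Checking that threshold requires a per-family detection rate. The sketch below assumes it means the fraction of ground-truth vulnerabilities in each CWE family that a model finds; the paper may define it differently. The function names and the `(vuln_id, cwe_family)` record shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def per_family_detection(
    ground_truth: Iterable[Tuple[str, str]],
    detected_ids: Set[str],
) -> Dict[str, float]:
    """Map each CWE family to the fraction of its ground-truth items detected.

    ground_truth: (vuln_id, cwe_family) pairs, one per seeded vulnerability.
    detected_ids: ids of ground-truth vulnerabilities the model reported.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for vuln_id, family in ground_truth:
        totals[family] += 1
        if vuln_id in detected_ids:
            hits[family] += 1
    return {family: hits[family] / totals[family] for family in totals}


def overall_coverage(
    ground_truth: Iterable[Tuple[str, str]],
    detected_ids: Set[str],
) -> float:
    """Fraction of all ground-truth vulnerabilities that were detected."""
    ids = {vuln_id for vuln_id, _ in ground_truth}
    return len(ids & detected_ids) / len(ids) if ids else 0.0


def families_above(rates: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Families whose detection rate strictly exceeds the threshold."""
    return {family for family, rate in rates.items() if rate > threshold}
```

Per-family rates and overall coverage can diverge sharply when families differ in size, so a settling experiment would need to report both.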
Original abstract
We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates frontier LLMs (GPT-5.4, Codex 5.3, Claude Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro, Gemini 3 Flash) and two domain-specialized models on cybersecurity via a dual-mode benchmark: white-box function-level detection (VulnLLM-R across C/Java/Python) and black-box testing on five production-style applications containing 118 ground-truth vulnerabilities across 20+ CWE families. It reports 10-50% false positive rates in white-box detection; 4-8% ground-truth coverage in black-box testing, rising only to 10-19% with tools such as Playwright MCP and Burp Suite MCP; >50% per-family detection via structured agents; and the highest precision (0.904) plus lowest FPR (9.7%) for the specialized defense model. The authors attribute the results to training-data bottlenecks (lack of end-to-end traces, failure-heavy data, multi-step chains) and advocate vertical foundation models plus self-play data generation; the benchmark will be open-sourced.
Significance. If the benchmark is representative, the work supplies concrete empirical evidence that methodology and specialization outperform raw scale in cybersecurity (e.g., frontier coverage remains 4-19% while specialized agents exceed 50% per family), strengthening the case for vertical models. The planned open-sourcing of the five applications, 118 vulnerabilities, and VulnLLM-R constitutes a reproducible contribution that future work can directly extend or falsify.
major comments (2)
- [Benchmark construction (methods section)] Benchmark construction (methods section describing the five applications and 118 vulnerabilities): no quantitative justification is supplied for representativeness—no diversity metrics, codebase-size statistics, comparison against public vulnerability corpora (e.g., CVE, OWASP, or Juliet), or external validation that the 20+ CWE families and production-style apps capture typical enterprise attack surfaces. This directly underpins the central claims of 4-19% frontier coverage and “methodology, not scale” as the primary lever.
- [Results reporting (abstract and black-box evaluation tables)] Results reporting (abstract and black-box evaluation tables): concrete percentages (4-8%, 10-19%, per-family >50%) are presented without error bars, confidence intervals, or statistical significance tests across the 118 vulnerabilities or 20+ CWE families, weakening the robustness of the comparative claims between frontier and specialized models.
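The uncertainty quantification requested here could be approximated with a percentile bootstrap over per-vulnerability outcomes. This is a minimal sketch, not the authors' procedure: `bootstrap_ci` is a hypothetical helper, and each outcome is assumed to be 1 if a ground-truth vulnerability was detected and 0 otherwise.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for a detection rate.

    outcomes: one 0/1 entry per ground-truth vulnerability.
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("outcomes must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    # Resample vulnerabilities with replacement and record each mean rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]
```

With only 118 items, and far fewer per CWE family, such intervals are likely to be wide, which is why the referee's point bears on the comparative claims.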
minor comments (2)
- [Abstract] Model identifiers in the abstract (GPT-5.4, Codex 5.3, etc.) use approximate version strings; provide the exact checkpoint names, API versions, and prompting templates used in each paradigm for reproducibility.
- [VulnLLM-R description] The white-box VulnLLM-R benchmark is introduced without a table summarizing its size, language distribution, or ground-truth labeling procedure; a summary table would clarify the 10-50% false-positive claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and commit to revisions that directly strengthen the manuscript's claims on benchmark representativeness and statistical robustness.
-
Referee: [Benchmark construction (methods section)] Benchmark construction (methods section describing the five applications and 118 vulnerabilities): no quantitative justification is supplied for representativeness—no diversity metrics, codebase-size statistics, comparison against public vulnerability corpora (e.g., CVE, OWASP, or Juliet), or external validation that the 20+ CWE families and production-style apps capture typical enterprise attack surfaces. This directly underpins the central claims of 4-19% frontier coverage and “methodology, not scale” as the primary lever.
Authors: We agree that explicit quantitative justification is needed to support claims about representativeness. In the revised manuscript we will add a new subsection to Methods that reports: (i) codebase statistics (LOC, endpoints, and language breakdown) for each of the five applications; (ii) a direct comparison of the observed 20+ CWE families against the distribution in OWASP Top 10 and a sample of recent CVEs; and (iii) diversity metrics such as Shannon entropy over CWE classes. Because the applications and vulnerability set will be open-sourced, these metrics can be independently verified. revision: yes
-
Referee: [Results reporting (abstract and black-box evaluation tables)] Results reporting (abstract and black-box evaluation tables): concrete percentages (4-8%, 10-19%, per-family >50%) are presented without error bars, confidence intervals, or statistical significance tests across the 118 vulnerabilities or 20+ CWE families, weakening the robustness of the comparative claims between frontier and specialized models.
Authors: We accept that the absence of uncertainty quantification weakens the comparative claims. In the revision we will add 95% bootstrap confidence intervals (resampled over the 118 vulnerabilities) to all reported coverage, precision, and FPR figures in the abstract, tables, and text. We will also include pairwise statistical tests (McNemar’s test with Bonferroni correction) between frontier and specialized models, reported both overall and per CWE family where sample sizes permit. revision: yes
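The paired test the authors name can be sketched as follows. This assumes the exact (binomial) form of McNemar's test on paired detect/miss outcomes for two models over the same vulnerabilities, followed by a Bonferroni adjustment. The rebuttal does not say which variant of McNemar's test would be used, and both function names are hypothetical.

```python
from math import comb
from typing import List, Sequence


def mcnemar_exact(model_a: Sequence[bool], model_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired detect/miss outcomes.

    model_a[i] and model_b[i] say whether each model detected
    ground-truth vulnerability i. Only discordant pairs matter.
    """
    if len(model_a) != len(model_b):
        raise ValueError("both models must be scored on the same items")
    only_a = sum(1 for a, b in zip(model_a, model_b) if a and not b)
    only_b = sum(1 for a, b in zip(model_a, model_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    # Under H0 each discordant pair favors either model with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Because only discordant pairs carry information, small CWE families may have too few of them for a meaningful per-family test, which matches the authors' caveat "where sample sizes permit".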
Circularity Check
No significant circularity: purely empirical evaluation
The paper reports direct measurements of LLM performance on two benchmarks (VulnLLM-R white-box detection and black-box testing of five applications with 118 ground-truth vulnerabilities) against explicit ground truth. No equations, fitted parameters, predictions derived from inputs, or self-citations are used to support the central claims; results are presented as observed detection rates, precision, and false-positive rates. The evaluation is self-contained against the stated benchmarks with no reduction of outputs to inputs by construction.