pith. machine review for the scientific record.

arXiv: 2604.13764 · v1 · submitted 2026-04-15 · 💻 cs.CR

Recognition: unknown

RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 12:36 UTC · model grok-4.3

classification 💻 cs.CR
keywords vulnerability detection · SAST benchmarking · LLM security tools · Python vulnerabilities · rule-based scanners · security-specialized scanners · F3 score evaluation · false positive analysis

The pith

On real-world vulnerable code, security-specialized scanners outperform general-purpose LLMs by a wide margin, and the LLMs in turn beat rule-based tools nearly threefold under recall-weighted scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RealVuln, a benchmark built from 26 intentionally vulnerable Python repositories containing 796 hand-labeled entries that cover both true vulnerabilities and false-positive traps. It evaluates fifteen scanners in three categories: rule-based static analysis tools, general-purpose large language models, and security-specialized scanners, ranked primarily by the F3 score, which weights recall nine times higher than precision. The evaluation produces a stable three-tier ordering in which the top security-specialized scanner reaches an F3 of 73.0, the strongest general-purpose LLM reaches 51.7, and the best rule-based tool reaches only 17.7. A sympathetic reader should care because the results supply concrete, open data on which class of tool is likeliest to catch real flaws in practice, rather than relying on vendor claims or synthetic tests. The authors release the full labeled dataset, scanner outputs, and scoring scripts so the benchmark can grow into a community resource.
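To make the recall weighting concrete, here is a minimal F-beta computation in Python. It is a sketch under assumptions: the function name and the worked counts are illustrative and are not taken from the paper's released scoring scripts; with beta = 3 the formula weights recall by beta squared, i.e. nine times more than precision.

    # Minimal F-beta sketch (beta=3 weights recall 9x over precision).
    # Counts below are illustrative, not the paper's released outputs.
    def f_beta(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # A high-recall, moderate-precision scanner scores much better under F3 than F1.
    print(round(f_beta(tp=550, fp=400, fn=126, beta=3.0), 3))  # ~0.78 (F3)
    print(round(f_beta(tp=550, fp=400, fn=126, beta=1.0), 3))  # ~0.68 (F1)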

Core claim

We present RealVuln, the first open-source benchmark comparing Rule-Based SAST, General-Purpose LLMs, and Security-Specialized scanners on 26 intentionally vulnerable Python repositories with 796 hand-labeled entries. Under the F3 metric a clear three-tier ranking emerges with the security-specialized scanner Kolega.Dev at 73.0, the best general-purpose LLM Claude Sonnet 4.6 at 51.7, and the best rule-based tool Semgrep at 17.7; the hierarchy persists across alternative weightings of recall and precision even though exact positions inside each tier shift.

What carries the argument

The RealVuln benchmark dataset of 676 labeled vulnerabilities and 120 false-positive traps drawn from 26 educational and CTF Python repositories, scored with the F3 metric that penalizes missed vulnerabilities far more heavily than false alarms.

If this is right

  • Security teams choosing tools for Python vulnerability detection should consider specialized scanners first when recall is the dominant requirement.
  • General-purpose LLMs constitute a usable middle tier that already exceeds rule-based SAST performance under the same recall-heavy metric.
  • The open release of labeled data, raw scanner outputs, and scoring scripts allows the community to extend the benchmark and test new scanners.
  • The three-tier separation remains visible under every beta value tested, indicating the ordering is not an artifact of one particular precision-recall tradeoff (a sketch of such a beta sweep follows this list).
  • An interactive dashboard makes the raw scanner decisions on each labeled instance directly inspectable for further analysis.
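As a complement to the beta-value bullet above, the sketch below shows one way a beta sweep could be computed from per-scanner true-positive, false-positive, and false-negative counts. The scanner names and counts are placeholders chosen for illustration, not the paper's released per-scanner outputs.

    # Hypothetical beta sweep: recompute F-beta rankings for several beta values
    # and check whether the tier ordering survives. Counts are placeholders only;
    # fn = 676 - tp against the 676 labeled vulnerabilities in RealVuln.
    def f_beta(tp, fp, fn, beta):
        if tp == 0:
            return 0.0
        p, r = tp / (tp + fp), tp / (tp + fn)
        return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

    scanners = {
        "specialized-scanner": (547, 350, 129),
        "general-llm":         (310, 120, 366),
        "rule-based-sast":     (70,  90,  606),
    }

    for beta in (1, 2, 3, 4, 5):
        ranked = sorted(scanners, key=lambda s: f_beta(*scanners[s], beta), reverse=True)
        print(f"beta={beta}: " + " > ".join(ranked))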

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the tiered ordering generalizes, organizations may obtain better security coverage by adding a specialized scanner layer rather than relying only on general LLMs or legacy rule sets.
  • The current focus on educational and CTF repositories leaves open whether the same gaps would appear in large, messy production codebases that contain more context and legacy patterns.
  • Extending the benchmark to other languages would test whether the observed performance hierarchy is language-specific or reflects deeper differences in how each scanner class handles code semantics.

Load-bearing premise

The 796 hand-labeled entries in the 26 educational and CTF repositories form an accurate and representative ground truth for the kinds of vulnerabilities that appear in real-world code.

What would settle it

An independent study that labels a fresh collection of production Python codebases and finds either no performance gap between the three scanner categories or a reversal of the reported ranking would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.13764 by Faizan Raza, John Pellew.

Figure 1
Figure 1. Precision vs. recall for all 15 scanners, colored by category. Kolega.Dev occupies the high-recall right side (0.809) with moderate precision, GP-LLMs cluster in the center with higher precision but lower recall, and Rule-Based SAST tools are confined to the low-recall left. The two Security-Specialized scanners sit at opposite extremes: Kolega.Dev (high recall) and SecLab Agent (high precision, very low … view at source ↗
Figure 2
Figure 2. F3 score (strict) vs. cost per 100k LOC. Kolega.Dev achieves the highest F3 at the lowest cost among all non-free scanners. The best GP-LLM (Sonnet) costs 3.3× more for 29% lower F3. … (0.205) and Snyk (0.282), though lower than the General-Purpose LLMs (0.6–0.9), which achieve their precision by flagging far fewer findings overall. This is a deliberate design tradeoff: Kolega.Dev is optimized for F3-weighte… view at source ↗
read the original abstract

How do security scanners perform on real-world code? We present RealVuln, the first open-source benchmark comparing Rule-Based SAST, General-Purpose LLMs, and Security-Specialized scanners on 26 intentionally vulnerable Python repositories (educational and Capture-The-Flag applications) with 796 hand-labeled entries (676 vulnerabilities, 120 false-positive traps). We test 15 scanners (3 Rule-Based SAST, 10 General-Purpose LLM, 2 Security-Specialized) and rank them by F3 score (beta=3, weighting recall 9x over precision). A clear three-tier ranking emerges under all metrics. Under F3, the Security-Specialized scanner Kolega.Dev (73.0) leads, followed by the best General-Purpose LLM, Claude Sonnet 4.6 (51.7), which in turn scores nearly 3x higher than the best Rule-Based tool, Semgrep (17.7). Under F1, Sonnet 4.6 leads (60.9) with Kolega.Dev at 52.4. Rankings within tiers shift with beta, but the three-tier hierarchy holds across all weightings. All code, ground-truth data, scanner outputs, and scoring scripts are released under an open-source license. An interactive dashboard is at https://realvuln.kolega.dev/. RealVuln is a living benchmark: versioned, community-driven, with a roadmap toward multi-language coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RealVuln, an open benchmark evaluating 15 scanners (3 rule-based SAST, 10 general-purpose LLMs, 2 security-specialized) on 26 intentionally vulnerable Python repositories (educational and CTF) containing 796 hand-labeled entries (676 vulnerabilities and 120 false-positive traps). It reports a consistent three-tier performance hierarchy under F3 (beta=3) and other metrics, with security-specialized scanners (Kolega.Dev at 73.0) outperforming the best general LLM (Claude Sonnet 4.6 at 51.7), which in turn outperforms the best rule-based tool (Semgrep at 17.7). All ground-truth data, scanner outputs, and scoring scripts are released, and the benchmark is positioned as living and community-driven.

Significance. If the hand labels are accurate and reproducible, the work supplies a valuable, fully open empirical comparison of scanner categories on vulnerable code, with released artifacts enabling direct verification and extension. The explicit three-tier finding and metric robustness analysis provide actionable guidance for tool selection in security contexts.

major comments (2)
  1. [Dataset section] The manuscript states that the 796 entries were 'hand-labeled' but supplies no protocol details such as the number of annotators, inter-rater reliability statistics, the disagreement-resolution method, or any external audit. Because every reported score and the three-tier ranking are computed directly from these 676 positive and 120 negative labels, the absence of this information makes the ground-truth reliability unverifiable from the text alone.
  2. [Results and Evaluation sections] The claim that the three-tier hierarchy is stable 'across all weightings' is asserted, yet the text notes that the top two positions swap under F1; a supplementary table or explicit sensitivity analysis quantifying rank changes for beta values between 1 and 5 would strengthen the robustness argument.
minor comments (2)
  1. [Abstract] The phrase 'real-world code' appears in the title and abstract while the body correctly qualifies the repositories as educational and CTF; a brief parenthetical clarification in the abstract would avoid potential misreading.
  2. [Methods] Scanner configuration: While scripts are released, the main text should list the exact command-line flags or API parameters used for each of the 15 scanners to allow readers to reproduce the exact outputs without inspecting the repository.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve transparency and robustness.

read point-by-point responses
  1. Referee: [Dataset section] The manuscript states that the 796 entries were 'hand-labeled' but supplies no protocol details such as the number of annotators, inter-rater reliability statistics, the disagreement-resolution method, or any external audit. Because every reported score and the three-tier ranking are computed directly from these 676 positive and 120 negative labels, the absence of this information makes the ground-truth reliability unverifiable from the text alone.

    Authors: We agree that the absence of annotation protocol details limits verifiability of the ground truth. In the revised manuscript, we will add a dedicated subsection to the Dataset section that specifies the number of annotators, inter-rater reliability statistics, the disagreement-resolution procedure, and any external audit steps performed. This will make the labeling process fully transparent and allow readers to assess label reliability directly (a minimal sketch of one such reliability statistic follows these responses). revision: yes

  2. Referee: [Results and Evaluation sections] The claim that the three-tier hierarchy is stable 'across all weightings' is asserted, yet the text notes that the top two positions swap under F1; a supplementary table or explicit sensitivity analysis quantifying rank changes for beta values between 1 and 5 would strengthen the robustness argument.

    Authors: The manuscript already acknowledges the top-two swap under F1 while asserting that the three-tier hierarchy remains intact. To strengthen this claim as suggested, we will add a supplementary table in the revised version that reports rankings for F-beta scores across beta values from 1 to 5, explicitly quantifying any rank changes within and between tiers. This sensitivity analysis will provide concrete evidence for the stability of the observed hierarchy. revision: yes
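To illustrate the kind of inter-rater reliability statistic the rebuttal proposes to report, here is a minimal Cohen's kappa sketch for two annotators with binary vulnerable/not-vulnerable labels. The annotator count, label scheme, and data are assumptions made for illustration, not details taken from the paper.

    # Illustrative Cohen's kappa for two annotators on binary labels.
    # Annotator count, label scheme, and data are assumed, not from the paper.
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        categories = set(labels_a) | set(labels_b)
        expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
        return (observed - expected) / (1 - expected)

    # Example: two annotators label ten candidate findings (1 = vulnerable).
    ann_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
    ann_b = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
    print(round(cohens_kappa(ann_a, ann_b), 3))  # ~0.583; 1.0 = perfect agreement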

Circularity Check

0 steps flagged

No circularity detected in empirical benchmark results

full rationale

The paper conducts a direct empirical evaluation by hand-labeling 796 code entries across 26 repositories, executing 15 scanners, and computing standard performance metrics (F3 with beta=3, F1) to produce rankings. No equations, derivations, fitted parameters, or first-principles predictions are present that could reduce the reported scores or three-tier hierarchy to the inputs by construction. The central claims derive from the released dataset and scanner outputs rather than any self-referential loop, self-citation chain, or ansatz smuggling. This is a standard benchmark study whose validity hinges on label quality and representativeness, not on circular reasoning in the analysis pipeline.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on two domain assumptions: that the chosen educational and CTF repositories plus hand labels form valid ground truth, and that F3 with beta=3 is an appropriate primary metric. No free parameters are fitted to the results themselves.

free parameters (1)
  • F3 beta = 3
    The paper selects beta=3 to weight recall nine times higher than precision as an explicit design choice for the benchmark.
axioms (1)
  • domain assumption: The 26 repositories and 796 hand-labeled entries accurately represent real-world vulnerable code and false-positive traps
    Stated in the abstract as the basis for the benchmark construction.

pith-pipeline@v0.9.0 · 5575 in / 1312 out tokens · 46152 ms · 2026-05-10T12:36:47.067788+00:00 · methodology

discussion (0)

