RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code
Pith reviewed 2026-05-10 12:36 UTC · model grok-4.3
The pith
On real-world vulnerable code, security-specialized scanners outperform general-purpose LLMs by a wide margin, and the LLMs in turn beat rule-based tools nearly threefold under recall-weighted scoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present RealVuln, the first open-source benchmark comparing Rule-Based SAST, General-Purpose LLM, and Security-Specialized scanners on 26 intentionally vulnerable Python repositories with 796 hand-labeled entries. Under the F3 metric, a clear three-tier ranking emerges: the security-specialized scanner Kolega.Dev scores 73.0, the best general-purpose LLM, Claude Sonnet 4.6, scores 51.7, and the best rule-based tool, Semgrep, scores 17.7. The hierarchy persists across alternative weightings of recall and precision, even though exact positions within each tier shift.
What carries the argument
The RealVuln benchmark dataset of 676 labeled vulnerabilities and 120 false-positive traps drawn from 26 educational and CTF Python repositories, scored with the F3 metric that penalizes missed vulnerabilities far more heavily than false alarms.
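The recall-heavy weighting can be made concrete: for an F-beta score with beta = 3, recall counts beta² = 9 times as much as precision in the weighted harmonic mean, which is why a high-recall scanner can top the F3 ranking despite modest precision. A minimal sketch of the arithmetic (the confusion counts below are illustrative, not taken from the paper):

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """F-beta score: recall is weighted beta^2 times as heavily as precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative: a scanner that finds 500 of the 676 labeled vulnerabilities
# while also raising 300 spurious findings.
print(round(100 * f_beta(tp=500, fp=300, fn=176), 1))  # → 72.6
```

Note how the low precision (0.625) barely dents the score because recall (≈0.74) dominates at beta = 3; under F1 the same counts would score far lower.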
If this is right
- Security teams choosing tools for Python vulnerability detection should consider specialized scanners first when recall is the dominant requirement.
- General-purpose LLMs constitute a usable middle tier that already exceeds rule-based SAST performance under the same recall-heavy metric.
- The open release of labeled data, raw scanner outputs, and scoring scripts allows the community to extend the benchmark and test new scanners.
- The three-tier separation remains visible under every beta value tested, indicating the ordering is not an artifact of one particular precision-recall tradeoff.
- An interactive dashboard makes the raw scanner decisions on each labeled instance directly inspectable for further analysis.
Where Pith is reading between the lines
- If the tiered ordering generalizes, organizations may obtain better security coverage by adding a specialized scanner layer rather than relying only on general LLMs or legacy rule sets.
- The current focus on educational and CTF repositories leaves open whether the same gaps would appear in large, messy production codebases that contain more context and legacy patterns.
- Extending the benchmark to other languages would test whether the observed performance hierarchy is language-specific or reflects deeper differences in how each scanner class handles code semantics.
Load-bearing premise
The 796 hand-labeled entries in the 26 educational and CTF repositories form an accurate and representative ground truth for the kinds of vulnerabilities that appear in real-world code.
What would settle it
An independent study that labels a fresh collection of production Python codebases and finds either no performance gap between the three scanner categories or a reversal of the reported ranking would falsify the central claim.
Figures
original abstract
How do security scanners perform on real-world code? We present RealVuln, the first open-source benchmark comparing Rule-Based SAST, General-Purpose LLMs, and Security-Specialized scanners on 26 intentionally vulnerable Python repositories (educational and Capture-The-Flag applications) with 796 hand-labeled entries (676 vulnerabilities, 120 false-positive traps). We test 15 scanners (3 Rule-Based SAST, 10 General-Purpose LLM, 2 Security-Specialized) and rank them by F3 score (beta=3, weighting recall 9x over precision). A clear three-tier ranking emerges under all metrics. Under F3, the Security-Specialized scanner Kolega.Dev (73.0) leads, followed by the best General-Purpose LLM, Claude Sonnet 4.6 (51.7), which in turn scores nearly 3x higher than the best Rule-Based tool, Semgrep (17.7). Under F1, Sonnet 4.6 leads (60.9) with Kolega.Dev at 52.4. Rankings within tiers shift with beta, but the three-tier hierarchy holds across all weightings. All code, ground-truth data, scanner outputs, and scoring scripts are released under an open-source license. An interactive dashboard is at https://realvuln.kolega.dev/. RealVuln is a living benchmark: versioned, community-driven, with a roadmap toward multi-language coverage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RealVuln, an open benchmark evaluating 15 scanners (3 rule-based SAST, 10 general-purpose LLMs, 2 security-specialized) on 26 intentionally vulnerable Python repositories (educational and CTF) containing 796 hand-labeled entries (676 vulnerabilities and 120 false-positive traps). It reports a consistent three-tier performance hierarchy under F3 (beta=3) and other metrics, with security-specialized scanners (Kolega.Dev at 73.0) outperforming the best general LLM (Claude Sonnet 4.6 at 51.7), which in turn outperforms the best rule-based tool (Semgrep at 17.7). All ground-truth data, scanner outputs, and scoring scripts are released, and the benchmark is positioned as living and community-driven.
Significance. If the hand labels are accurate and reproducible, the work supplies a valuable, fully open empirical comparison of scanner categories on vulnerable code, with released artifacts enabling direct verification and extension. The explicit three-tier finding and metric robustness analysis provide actionable guidance for tool selection in security contexts.
major comments (2)
- [Dataset section] The manuscript states that the 796 entries were 'hand-labeled' but supplies no protocol details: the number of annotators, inter-rater reliability statistics, the disagreement-resolution method, and any external audit all go unreported. Because every reported score and the three-tier ranking are computed directly from these 676 positive and 120 negative labels, the ground-truth reliability is unverifiable from the text alone.
- [Results and Evaluation sections] The claim that the three-tier hierarchy is stable 'across all weightings' is asserted rather than demonstrated: the text itself notes that the top two positions swap under F1. A supplementary table or explicit sensitivity analysis quantifying rank changes for beta values between 1 and 5 would strengthen the robustness argument.
minor comments (2)
- [Abstract] The phrase 'real-world code' appears in the title and abstract, while the body correctly qualifies the repositories as educational and CTF; a brief parenthetical clarification in the abstract would avoid misreading.
- [Methods] Scanner configuration: While scripts are released, the main text should list the exact command-line flags or API parameters used for each of the 15 scanners, so that readers can reproduce the outputs without inspecting the repository.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve transparency and robustness.
point-by-point responses
- Referee: [Dataset section] The manuscript states that the 796 entries were 'hand-labeled' but supplies no protocol details: the number of annotators, inter-rater reliability statistics, the disagreement-resolution method, and any external audit all go unreported. Because every reported score and the three-tier ranking are computed directly from these 676 positive and 120 negative labels, the ground-truth reliability is unverifiable from the text alone.
Authors: We agree that the absence of annotation protocol details limits verifiability of the ground truth. In the revised manuscript, we will add a dedicated subsection to the Dataset section that specifies the number of annotators, inter-rater reliability statistics, the disagreement-resolution procedure, and any external audit steps performed. This will make the labeling process fully transparent and allow readers to assess label reliability directly. revision: yes
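If the revised protocol reports agreement between annotators, one standard statistic would be Cohen's kappa over binary vulnerable / not-vulnerable judgments. A minimal sketch under that assumption (the two-annotator setup and the label vectors are invented for illustration, not the paper's data):

```python
def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two annotators giving binary (0/1) labels."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of entries both annotators labeled the same.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's marginal positive rate.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

# Invented example: two annotators labeling 10 entries (1 = vulnerable).
ann1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
ann2 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(ann1, ann2), 2))  # → 0.58
```

Reporting kappa (or an equivalent chance-corrected statistic) alongside raw agreement would let readers judge label reliability directly.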
- Referee: [Results and Evaluation sections] The claim that the three-tier hierarchy is stable 'across all weightings' is asserted rather than demonstrated: the text itself notes that the top two positions swap under F1. A supplementary table or explicit sensitivity analysis quantifying rank changes for beta values between 1 and 5 would strengthen the robustness argument.
Authors: The manuscript already acknowledges the top-two swap under F1 while asserting that the three-tier hierarchy remains intact. To strengthen this claim as suggested, we will add a supplementary table in the revised version that reports rankings for F-beta scores across beta values from 1 to 5, explicitly quantifying any rank changes within and between tiers. This sensitivity analysis will provide concrete evidence for the stability of the observed hierarchy. revision: yes
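The proposed sensitivity analysis amounts to sweeping beta and re-ranking. A sketch of that procedure, using invented confusion counts (not the paper's data) chosen so that, as in the paper, a precise LLM-style scanner leads at beta = 1 while a high-recall specialized-style scanner leads at higher beta:

```python
def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F-beta score: recall weighted beta^2 times precision."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# Hypothetical (tp, fp, fn) counts per scanner -- NOT the paper's data.
scanners = {
    "specialized-A": (520, 350, 156),  # high recall, modest precision
    "llm-B": (420, 60, 256),           # high precision, lower recall
    "rules-C": (120, 30, 556),         # precise but very low recall
}

for beta in (1, 2, 3, 4, 5):
    ranking = sorted(scanners,
                     key=lambda s: f_beta(*scanners[s], beta),
                     reverse=True)
    print(beta, ranking)
# With these counts, llm-B leads at beta = 1 but
# specialized-A leads for every beta >= 2.
```

Tabulating such rankings per beta is exactly the supplementary evidence the referee asks for: it shows where within-tier swaps occur while the tier boundaries stay fixed.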
Circularity Check
No circularity detected in empirical benchmark results
full rationale
The paper conducts a direct empirical evaluation by hand-labeling 796 code entries across 26 repositories, executing 15 scanners, and computing standard performance metrics (F3 with beta=3, F1) to produce rankings. No equations, derivations, fitted parameters, or first-principles predictions are present that could reduce the reported scores or three-tier hierarchy to the inputs by construction. The central claims derive from the released dataset and scanner outputs rather than any self-referential loop, self-citation chain, or ansatz smuggling. This is a standard benchmark study whose validity hinges on label quality and representativeness, not on circular reasoning in the analysis pipeline.
Axiom & Free-Parameter Ledger
free parameters (1)
- F3 beta = 3
axioms (1)
- domain assumption: The 26 repositories and 796 hand-labeled entries accurately represent real-world vulnerable code and false-positive traps
Reference graph
Works this paper leans on
- [1] Guru Bhandari, Amara Naseer, and Leon Moonen. CVEFixes: Automated collection of vulnerabilities and their fixes from open-source software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21). ACM, 2021. doi:10.1145/3475960.3475985
- [2] Tim Boland and Paul E. Black. Juliet 1.1 C/C++ and Java test suite. Computer, 45(10):88–90, 2012. doi:10.1109/MC.2012.345
- [3] Cycode. We benchmarked the best SAST tools. Blog post, 2024. URL https://cycode.com/blog/benchmarking-top-sast-products/
- [4] Aurelien Delaitre, Bertrand Stivalet, Paul E. Black, Vadim Okun, Terry Cohen, and Athos Ribeiro. SATE V report: Ten years of static analysis tool expositions. Technical Report NIST SP 500-326, National Institute of Standards and Technology, 2018. URL https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.500-326.pdf
- [5] Aurelien Delaitre, Bertrand Stivalet, and Paul E. Black. SATE VI report. Technical report, National Institute of Standards and Technology, 2023. URL https://samate.nist.gov/SATE6.html
- [6] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624, 2024. URL https://arxiv.org/abs/2403.18624. Accepted at ICSE 2025.
- [7] DryRun Security. 2025 SAST accuracy report. Technical report, 2025. URL https://www.dryrun.security/sast-accuracy-report
- [8] Richard A. Dubniczky, Krisztofer Zoltán Horvát, Tamás Bisztray, Mohamed Amine Ferrag, Lucas C. Cordeiro, and Norbert Tihanyi. CASTLE: Benchmarking dataset for static code analyzers and LLMs towards CWE detection. arXiv preprint arXiv:2503.09433, 2025. URL https://arxiv.org/abs/2503.09433
- [9] Jake Feiglin and Guy Dar. SastBench: A benchmark for testing agentic SAST triage. arXiv preprint arXiv:2601.02941, January 2026. URL https://arxiv.org/abs/2601.02941. Rival Labs.
- [10] FluidAttacks. Tool benchmark: Boosting SAST accuracy through pentesting. Technical report. URL https://fluidattacks.com/benchmarking-top-appsec-tools-and-pentesting/
- [11] Ghost Security. Exorcising the SAST demons: Contextual application security testing (CAST). Technical report, 2025. URL https://reports.ghostsecurity.com/cast.pdf
- [12] GitHub Security Lab. Seclab taskflow agent: Multi-stage threat-model-driven security auditor. Software (open source), 2025. URL https://github.com/GitHubSecurityLab/seclab-taskflow-agent
- [13] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2024. URL https://arxiv.org/abs/2310.06770
- [14] Yikun Li, Ting Zhang, Ratnadira Widyasari, Yan Naing Tun, Huu Hung Nguyen, Tan Bui, Ivana Clairine Irsan, Yiran Cheng, Xiang Lan, Han Wei Ang, Frank Liauw, Martin Weyssow, Hong Jin Kang, Eng Lieh Ouh, Lwin Khin Shar, and David Lo. CleanVul: Automatic function-level vulnerability detection in code commits using LLM heuristics. arXiv preprint arXiv:2411.1727…
- [15] OpenAI. Introducing SWE-bench Verified. Blog post, February 2025. URL https://openai.com/index/introducing-swe-bench-verified/
- [16] OWASP. OWASP benchmark project, 2024. URL https://github.com/OWASP-Benchmark/BenchmarkJava. Ongoing project.
- [17] Semgrep, Inc. Semgrep: Lightweight static analysis for many languages. Software, 2024. URL https://semgrep.dev
- [18] Snyk Ltd. Snyk Code: Developer-first SAST. Software, 2024. URL https://snyk.io
- [19] SonarSource SA. SonarQube: Code quality and security. Software, 2024. URL https://www.sonarsource.com/products/sonarqube/
- [20] Xinchen Wang, Ruida Hu, Cuiyun Gao, Xin-Cheng Wen, Yujia Chen, and Qing Liao. ReposVul: A repository-level high-quality vulnerability dataset. arXiv preprint arXiv:2401.13169, 2024. URL https://arxiv.org/abs/2401.13169
- [21] Alperen Yildiz, Sin G. Teo, Yiling Lou, Yebo Feng, Chong Wang, and Dinil M. Divakaran. Benchmarking LLMs and LLM-based agents in practical vulnerability detection for code repositories. arXiv preprint arXiv:2503.03586. URL https://arxiv.org/abs/2503.03586
- [22] ZeroPath Team. Towards actual SAST benchmarks. Blog post, November 2024. URL https://zeropath.com/blog/toward-actual-benchmarks
- [23] Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. CVE-Bench: A benchmark for AI agents' ability to exploit real-world web application vulnerabilities. arXiv preprint arXiv:2503.17332. URL https://arxiv.org/abs/2503.17332