LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, Gianluca Stringhini · 2024 · 2024 IEEE Symposium on Security and Privacy (SP) · DOI 10.1109/sp54263.2024.00210

3 Pith papers cite this work, alongside 80 external citations.

representative citing papers

Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

cs.SE · 2026-05-14 · unverdicted · novelty 6.0

Veritas detects memory corruption vulnerabilities in stripped binaries by combining static value-flow slicing, dual-view LLM reasoning, and multi-agent runtime validation, reporting 90% recall, zero false positives on 623 exhaustive cases, and discovery of a real Apple CVE.

Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap

cs.CR · 2026-01-30 · unverdicted · novelty 6.0

Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection, achieving high scores on unpaired normal code but failing on paired vulnerable-patched code, semantic perturbations, and gap analysis, while reasoning supervision reduces symptoms at the cost of recall.

VulWeaver: Weaving Broken Semantics for Grounded Vulnerability Detection

cs.SE · 2026-04-12 · unverdicted · novelty 5.0

VulWeaver improves Java vulnerability detection to 0.75 F1 by enhancing dependency graphs with LLM semantic fixes, extracting full context from slices plus implicit usage info, and applying type-specific meta-prompting with majority voting.

citing papers explorer

Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries · cs.SE · 2026-05-14 · unverdicted · none · ref 47

Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap · cs.CR · 2026-01-30 · unverdicted · none · ref 44

VulWeaver: Weaving Broken Semantics for Grounded Vulnerability Detection · cs.SE · 2026-04-12 · unverdicted · none · ref 46
