Recognition: unknown
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
Pith reviewed 2026-05-10 02:20 UTC · model grok-4.3
The pith
Large language models fail to identify malicious events in raw security logs; the best frontier model achieves only 3.8 percent average recall.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Cyber Defense Benchmark requires LLM agents to query an in-memory SQLite database of 75,000 to 135,000 Windows event logs per episode and explicitly flag malicious timestamps. Ground truth comes from Sigma rules derived from 106 attack procedures spanning 86 MITRE ATT&CK sub-techniques. Across evaluations of Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash, the top model reaches only 3.8 percent average recall, no run finds all flags, and no model meets the 50 percent recall threshold on every tactic.
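To make the episode mechanics concrete, here is a minimal sketch of that loop using only Python's standard library: an agent queries an in-memory SQLite table of events, flags timestamps, and is scored on recall against hidden ground truth. The table schema, column names, and toy records are illustrative assumptions, not the benchmark's actual layout.

```python
# A minimal sketch of the episode mechanics, not the benchmark's real schema.
import sqlite3

# Build an in-memory log database (the real benchmark loads 75,000-135,000 records).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (timestamp TEXT, event_id INTEGER, image TEXT)")
db.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2026-01-07T10:00:01", 1, "C:\\Windows\\System32\\svchost.exe"),
        ("2026-01-07T10:00:05", 1, "C:\\Users\\victim\\mimikatz.exe"),
        ("2026-01-07T10:00:09", 4624, "-"),
    ],
)

# Ground truth: timestamps of events matched by Sigma rules (hidden from the agent).
ground_truth = {"2026-01-07T10:00:05"}

# The agent iteratively queries the logs, then explicitly flags suspect timestamps.
rows = db.execute(
    "SELECT timestamp FROM events WHERE event_id = 1 AND image LIKE '%mimikatz%'"
).fetchall()
flags = {ts for (ts,) in rows}

# CTF-style scoring: recall over the malicious-event timestamps.
recall = len(flags & ground_truth) / len(ground_truth)
print(f"recall = {recall:.2f}")  # 1.00 on this toy example
```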
What carries the argument
The Gymnasium reinforcement-learning environment that converts real attack recordings into time-shifted, entity-obfuscated SQLite databases and scores agents on iterative SQL queries against Sigma-rule ground truth.
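A hedged sketch of what a deterministic time shift and entity obfuscation could look like in practice; the paper's actual transformation, field names, and seed handling are not reproduced here, so the offset range and pseudonym scheme below are assumptions.

```python
# Illustrative only: one plausible deterministic time-shift and entity-obfuscation scheme.
import hashlib
import random
from datetime import datetime, timedelta

def obfuscate_entity(value: str, seed: str) -> str:
    """Deterministically map a hostname/username to a stable pseudonym."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()[:8]
    return f"host-{digest}"

def shift_timestamp(ts: str, seed: str) -> str:
    """Shift the recording into a new time window, identically for every event."""
    offset_days = random.Random(seed).randint(30, 365)
    shifted = datetime.fromisoformat(ts) + timedelta(days=offset_days)
    return shifted.isoformat()

record = {"timestamp": "2020-09-21T14:03:12", "hostname": "WORKSTATION5"}
seed = "campaign-17"  # hypothetical campaign seed
print(shift_timestamp(record["timestamp"], seed), obfuscate_entity(record["hostname"], seed))
```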
If this is right
- No current frontier LLM meets the minimum performance bar for unsupervised threat hunting in a security operations center.
- Results on curated security question-and-answer benchmarks do not predict success on open-ended log analysis tasks.
- Agents must develop better strategies for evidence gathering and iterative querying to handle real security data volumes.
- The benchmark provides a concrete, reproducible test that future models or agent designs can be measured against.
Where Pith is reading between the lines
- The benchmark could be expanded to test agents with access to additional tools such as log search interfaces or external knowledge bases.
- Persistent memory across episodes or better long-term planning might be required before LLMs can handle extended investigations.
- Hybrid setups that combine LLM reasoning with traditional rule-based detectors could bridge the current performance gap.
Load-bearing premise
The time-shifted and entity-obfuscated log database with SQL-only access accurately represents the main difficulties of real-world security operations center threat hunting.
What would settle it
An LLM agent that achieves at least 50 percent recall on malicious events for every one of the 13 tested ATT&CK tactics across multiple full campaigns in the benchmark.
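A small sketch of that pass criterion, assuming recall is computed per tactic over the flagged timestamps; the tactic names and counts are placeholders rather than the benchmark's real data.

```python
# Pass criterion sketch: >= 50% recall on every tactic (illustrative data only).
ground_truth = {
    "credential-access": {"t1", "t2", "t3", "t4"},
    "lateral-movement": {"t5", "t6"},
    "persistence": {"t7", "t8", "t9"},
}
flags = {"t1", "t2", "t3", "t5", "t7", "t8", "t9"}  # timestamps the agent flagged

recall_by_tactic = {
    tactic: len(flags & truth) / len(truth) for tactic, truth in ground_truth.items()
}
passes = all(r >= 0.5 for r in recall_by_tactic.values())
print(recall_by_tactic, "PASS" if passes else "FAIL")
```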
original abstract
We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Cyber Defense Benchmark, a Gymnasium RL environment that presents LLM agents with in-memory SQLite databases of 75k–135k time-shifted, entity-obfuscated Windows event logs drawn from 106 real attack procedures (86 MITRE ATT&CK sub-techniques). Agents must iteratively issue SQL queries to locate malicious timestamps and explicitly flag them; performance is scored CTF-style against Sigma-rule ground truth. On 26 campaigns covering 105 procedures, five frontier models (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, Gemini 3 Flash) achieve at most 3.8 % average recall (Claude), with zero full successes; no model meets the proposed passing threshold of ≥50 % recall on every tactic.
Significance. If the benchmark environment is accepted as a faithful proxy for core threat-hunting reasoning, the results supply a concrete, reproducible demonstration that current LLM agents remain far from unsupervised SecOps deployment, despite strong curated-Q&A performance. The use of real attack recordings, deterministic simulation, and external Sigma ground truth constitutes a clear empirical contribution that future agent work can build upon.
major comments (2)
- [Abstract and §3] Abstract (final paragraph) and §3 (Benchmark Design, implied): the claim that the observed failures demonstrate LLMs are “poorly suited for open-ended, evidence-driven threat hunting” treats the Gymnasium SQL-only interface as load-bearing. Real SOC workflows routinely supply SIEM search UIs, log aggregation views, external IOC enrichment, and free-text notes; the present setup’s restriction to iterative SQL against obfuscated entities may primarily test query-generation and JOIN proficiency rather than attack-pattern recognition. Without ablation runs on non-obfuscated logs or richer tool interfaces, the generalization rests on an untested environment-fidelity assumption.
- [§4] §4 (Evaluation Protocol, implied): the reported 3.8 % recall and “no run ever finds all flags” figures are presented without stated numbers of independent trials per campaign, temperature settings, or prompt templates. Given the stochastic nature of LLM agents and the large state space (75k–135k records), these omissions prevent assessment of whether the dramatic failure rates are statistically stable or sensitive to minor prompting variations.
minor comments (2)
- [Abstract] Abstract: the relationship between the 106 procedures, 26 campaigns, and 13 tactics is stated but not tabulated; a small table or sentence clarifying coverage would improve readability.
- [Abstract] The passing criterion (“≥50 % recall on every ATT&CK tactic”) is introduced without justification against operational SOC thresholds; a brief reference or sensitivity analysis would strengthen the interpretation.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to the manuscript.
point-by-point responses
- Referee: [Abstract and §3] The claim that observed failures demonstrate LLMs are poorly suited for open-ended, evidence-driven threat hunting treats the Gymnasium SQL-only interface as load-bearing. Real SOC workflows supply SIEM UIs, IOC enrichment, and richer views; without ablations on non-obfuscated logs or richer tool interfaces, the generalization rests on an untested environment-fidelity assumption.
Authors: We agree that the benchmark deliberately restricts agents to iterative SQL queries against time-shifted and entity-obfuscated logs, which foregrounds query formulation, JOIN reasoning, and evidence chaining rather than the full spectrum of SOC tooling. This design isolates a core, reproducible component of threat hunting. At the same time, we recognize that the strong wording in the abstract and §3 risks overgeneralization without supporting ablations. In the revised version we will qualify the relevant claims to specify that the results apply to this controlled SQL-only setting, add explicit discussion of the interface limitations, and outline planned extensions to richer interfaces and non-obfuscated data. This is a partial revision consisting of textual clarifications and expanded limitations text. revision: partial
- Referee: [§4] The reported 3.8% recall and “no run ever finds all flags” figures are presented without stated numbers of independent trials per campaign, temperature settings, or prompt templates. Given LLM stochasticity and the large state space, these omissions prevent assessment of statistical stability.
Authors: The referee correctly notes that the evaluation protocol section lacks these reproducibility details. The original submission omitted them primarily for length reasons. We will revise §4 to report the number of independent trials conducted per campaign, the temperature settings used, the prompt templates, and any observed variance across runs. These additions will allow readers to evaluate the stability of the reported performance figures. This is a full revision to improve methodological transparency. revision: yes
Circularity Check
No circularity: direct empirical measurements on new benchmark
full rationale
The paper introduces a Gymnasium environment wrapping OTRF Security-Datasets into SQLite DBs with time-shifting and entity obfuscation, defines ground truth via Sigma rules, and reports direct performance metrics (recall, full-success rate) for five LLMs across 26 campaigns. The central claim follows immediately from these observed values (Claude Opus at 3.8% average recall, zero full successes, no model passing the 50% recall bar on all tactics) without any derivation, fitted parameter, self-referential definition, or load-bearing self-citation. The protocol is externally verifiable against the cited public datasets and rules; no step reduces the result to its own inputs by construction.
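For readers unfamiliar with Sigma, a deliberately simplified stand-in for rule matching makes the "ground truth = events matched by detection rules" premise concrete. Real Sigma rules are YAML documents with richer condition logic, and the field names and rule below are assumptions, not the paper's actual ruleset.

```python
# Simplified stand-in for Sigma-style matching; not the SigmaHQ or Chainsaw engines.
def matches(rule_selection: dict, event: dict) -> bool:
    """True if every field in the selection matches the event (substring semantics
    for the '|contains' modifier, exact match otherwise)."""
    for field, expected in rule_selection.items():
        name, _, modifier = field.partition("|")
        value = str(event.get(name, ""))
        if modifier == "contains":
            if expected.lower() not in value.lower():
                return False
        elif value != expected:
            return False
    return True

rule = {"EventID": "1", "Image|contains": "mimikatz"}  # hypothetical rule
events = [
    {"timestamp": "2026-01-07T10:00:01", "EventID": "1", "Image": "C:\\Windows\\explorer.exe"},
    {"timestamp": "2026-01-07T10:00:05", "EventID": "1", "Image": "C:\\Users\\victim\\mimikatz.exe"},
]
ground_truth = {e["timestamp"] for e in events if matches(rule, e)}
print(ground_truth)  # {'2026-01-07T10:00:05'}
```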
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The 106 attack procedures from the OTRF Security-Datasets corpus are representative of real-world threats.
- domain assumption: Sigma-rule-derived ground truth accurately labels malicious events in the simulated logs.
Reference graph
Works this paper leans on
- [1] Rodriguez, J. B., et al. (2020). Mordor: Pre-recorded Security Events. OTRF Security-Datasets. github.com/OTRF/Security-Datasets
- [2]
- [3]
- [4] Liu, Y., et al. (2024). CTI-Bench: Evaluating LLMs in Cyber Threat Intelligence. ACL Findings 2024.
- [5] Tann, W., et al. (2023). Using LLMs for Cybersecurity CTF Challenges and Certification Questions. IEEE SSCI 2023.
- [6] Yang, J., et al. (2023). InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. NeurIPS 2023.
- [7]
- [8] Happe, A., et al. (2023). Getting pwn'd by AI: Penetration Testing with Large Language Models. ESEC/FSE 2023.
- [9] Sigma Project. (2015-2024). SigmaHQ: Generic Signature Format for SIEM Systems. github.com/SigmaHQ/sigma
- [10] WithSecureLabs. (2022). Chainsaw: Rapidly Search and Hunt Through Windows Forensic Artefacts. github.com/WithSecureLabs/chainsaw
- [11] Strom, B., et al. (2018). MITRE ATT&CK: Design and Philosophy. MITRE Technical Report MTR180110.
- [12] Towers, M., et al. (2023). Gymnasium: A Standard Interface for Reinforcement Learning Environments. arXiv:2407.17032.
- [13] Berrevoets, J., et al. (2024). CAI-Bench: Benchmarks for AI Capabilities in Cybersecurity. Preprint.
- [14] LiteLLM. (2023). Call all LLM APIs using the OpenAI format. github.com/BerriAI/litellm
- [15] Wu, Y., et al. (2025). ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation. arxiv.org/abs/2507.14201
- [16] Begimher, D., et al. (2026). SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents. arxiv.org/abs/2604.12040
- [17] Simbian AI. (2025). AI SOC LLM Leaderboard: Benchmarking LLMs on End-to-End Agentic Alert Investigation. simbian.ai/best-ai-for-cybersecurity
- [18] Deason, L., et al. (2025). CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning. arXiv:2509.20166.
discussion (0)