A new benchmark shows frontier LLMs achieve only 3.8% average recall identifying malicious events from raw logs and fail to meet 50% recall thresholds on most tactics.
CyberMetric: A benchmark dataset for evaluating large language models knowledge in cybersecurity
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CR 2years
2026 2roles
background 1polarities
background 1representative citing papers
LanG presents a governance-aware agentic AI platform for unified security operations that reports strong performance on incident correlation, rule generation, attack reconstruction, and AI safety guardrails in an open-source package.
citing papers explorer
-
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
A new benchmark shows frontier LLMs achieve only 3.8% average recall identifying malicious events from raw logs and fail to meet 50% recall thresholds on most tactics.
-
LanG -- A Governance-Aware Agentic AI Platform for Unified Security Operations
LanG presents a governance-aware agentic AI platform for unified security operations that reports strong performance on incident correlation, rule generation, attack reconstruction, and AI safety guardrails in an open-source package.