{"total":15,"items":[{"citing_arxiv_id":"2605.24949","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"APT-Agent: Automated Penetration Testing using Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-24T08:54:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"APT-Agent automates penetration testing with LLMs using rectification and memory modules, achieving 84.29% end-to-end success on Metasploitable 2 versus lower rates for baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21773","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection","primary_cat":"cs.CR","submitted_at":"2026-05-20T22:07:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false positives on complex noisy data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21694","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents","primary_cat":"cs.CR","submitted_at":"2026-05-20T19:52:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PocketAgents introduces a manifest-driven library for LLM-based autonomous defense agents, evaluated in 18 closed-loop trials against a DarkSide-inspired attack where 13 trials produced validated blocking 
actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07830","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios","primary_cat":"cs.CR","submitted_at":"2026-05-08T14:57:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Large language models (LLMs) have rapidly evolved into autonomous agents capable of complex reasoning, tool use, and long-horizon planning [30, 31]. In cybersecurity, this shift has accelerated the automation of offensive operations such as penetration testing, vulnerability discovery, and red-teaming [3, 5], extending from isolated coding tasks to multi-step exploits [29, 37]. Characterizing agent behavior in this setting is therefore central to both risk assessment and governance. 
Recent threat assessments identify LLM agents as a major accelerant of offensive cyber operations‡, motivating benchmarks that span Capture The Flag (CTF)-based subtask decomposition [33], real-world Common Vulnerabilities and Exposures (CVE) exploitation [36], and bug-bounty work-"},{"citing_arxiv_id":"2605.06486","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Autonomous Adversary: Red-Teaming in the age of LLM","primary_cat":"cs.CR","submitted_at":"2026-05-07T16:07:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Expert-defined action plans for LLM agents achieve higher task completion in lateral-movement scenarios than fully autonomous or self-scaffolded modes, but failures remain common due to brittle commands and state handling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04499","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis","primary_cat":"cs.CR","submitted_at":"2026-05-06T05:02:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pen-Strategist fine-tunes Qwen-3-14B with RL on a pentesting reasoning dataset and pairs it with a CNN step classifier, reporting 87% better strategy derivation, 47.5% more subtask completions than baselines, and gains on CTFKnow and user studies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02346","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT 
Networks","primary_cat":"cs.CR","submitted_at":"2026-05-04T08:47:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"APIOT is the first LLM framework to complete the full autonomous discovery-to-remediation cycle on bare-metal OT devices, reaching 90% success across 290 runs on Zephyr RTOS.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00741","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems","primary_cat":"cs.CR","submitted_at":"2026-05-01T15:42:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/energy reductions on testbed workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27143","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-04-29T19:54:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Targeted prompting and system interventions enable local LLMs such as Llama 3.1 70B to exploit 83% of tested Linux privilege escalation vulnerabilities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"work [10], which balances diversity in command generation with output coherence. Platform defaults are used for all other parameters. 
Model Refusals. The only LLM exhibiting model refusal was Llama3.1 8B, which refused to generate exploitation commands in a substantial fraction of iterations. While various jailbreaking methods have been researched [33], we use Llama-3.1-8B-Lexi-Uncensored-V23, a fine-tuned variant of Llama3.1 8B with safety filters removed. This is representative of real-world offensive use, where practitioners routinely deploy uncensored variants. We note that this fine-tune preserves the base model's architecture and general capabilities; the removal of safety training does not introduce new failure modes (hallucinations, repetition)"},{"citing_arxiv_id":"2604.06019","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments","primary_cat":"cs.CR","submitted_at":"2026-04-07T16:16:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CritBench evaluates five LLMs on 81 tasks in IEC 61850 environments, showing reliable performance on static analysis and single-tool reconnaissance but degradation on dynamic live-system tasks that require sequential reasoning, with domain-specific tools improving results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05719","ref_index":121,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hackers or Hallucinators? 
A Comprehensive Analysis of LLM-Based Automated Penetration Testing","primary_cat":"cs.CR","submitted_at":"2026-04-07T11:19:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and interact with the environment [116]. However, this definition struggles to finely reveal the design-level differences among various PT frameworks. Taking PentestGPT-v2 [27] as an example, the authors define their system as a single-agent system, yet it actually incorporates roles such as a summarizer. In contrast, methods like ARACNE [81] and AutoAttacker [121] treat the summarizer as an independent agent role. If we follow the single-agent classification of PentestGPT-v2, it becomes difficult to distinguish it from other single-agent systems. 
Therefore, drawing on existing studies [51, 91, 87], this paper uniformly defines an agent as a LLM assigned specific roles and responsibilities, possessing an independent context"},{"citing_arxiv_id":"2604.05440","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LanG -- A Governance-Aware Agentic AI Platform for Unified Security Operations","primary_cat":"cs.CR","submitted_at":"2026-04-07T05:22:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LanG presents a governance-aware agentic AI platform for unified security operations that reports strong performance on incident correlation, rule generation, attack reconstruction, and AI safety guardrails in an open-source package.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"behaviour, extract system prompts, or bypass safety filters and AI usage policies. The OWASP Top 10 for LLM Applications [30] catalogues these risks, including insecure output handling, excessive agency, and model denial-of-service, as systemic threats to any production LLM deployment. Second, uncontrolled tool access granted to LLM agents may lead to privilege escalation or data exfiltration [31], [32]. Third, LLM hallucinations [33], fabricated IoCs, malformed rules, or incorrect MITRE ATT&CK mappings can erode analyst trust and, in the worst case, cause operational damage or total/partial service(s) shutdown. Fourth, regulatory frameworks such as the EU AI Act [34] and the NIST AI Risk Management Framework [35] increasingly require auditability, transparency,"},{"citing_arxiv_id":"2512.22753","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise 
Software","primary_cat":"cs.SE","submitted_at":"2025-12-28T02:55:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.13021","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"xOffense: An Autonomous Multi-Agent Framework for Penetration Testing with Domain-Adapted Large Language Models","primary_cat":"cs.CR","submitted_at":"2025-09-16T12:45:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"xOffense automates penetration testing via a fine-tuned Qwen3-32B LLM in a multi-agent setup with specialized agents for reconnaissance, vulnerability scanning, and exploitation, reporting 79.17% sub-task completion on AutoPenBench and AI-Pentest-Benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.04984","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Frontier Models are Capable of In-context Scheming","primary_cat":"cs.AI","submitted_at":"2024-12-06T12:09:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}