APIOT is the first LLM framework to complete the full autonomous discovery-to-remediation cycle on bare-metal OT devices, reaching 90% success across 290 runs on Zephyr RTOS.
hub
AUTOATTACKER: A large language model guided system to implement automatic cyber-attacks
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
ZERO-APT is a closed-loop framework that integrates an LLM attacker, configurable LLM defender, and judge agent to achieve 79% attack success rate, 0.860 causal consistency, and full decision auditability in penetration testing under intelligent defense.
HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false positives on complex noisy data.
LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.
AI agents enable adaptive computer worms that propagate autonomously by reasoning about targets and synthesizing attacks using LLMs on stolen compute.
APT-Agent automates penetration testing with LLMs using rectification and memory modules, achieving 84.29% end-to-end success on Metasploitable 2 versus lower rates for baselines.
PocketAgents introduces a manifest-driven library for LLM-based autonomous defense agents, evaluated in 18 closed-loop trials against a DarkSide-inspired attack where 13 trials produced validated blocking actions.
ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/energy reductions on testbed workloads.
CritBench evaluates five LLMs on 81 tasks in IEC 61850 environments, showing reliable performance on static analysis and single-tool reconnaissance but degradation on dynamic live-system tasks that require sequential reasoning, with domain-specific tools improving results.
RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.
Expert-defined action plans for LLM agents achieve higher task completion in lateral-movement scenarios than fully autonomous or self-scaffolded modes, but failures remain common due to brittle commands and state handling.
Pen-Strategist fine-tunes Qwen-3-14B with RL on a pentesting reasoning dataset and pairs it with a CNN step classifier, reporting 87% better strategy derivation, 47.5% more subtask completions than baselines, and gains on CTFKnow and user studies.
Targeted prompting and system interventions enable local LLMs such as Llama 3.1 70B to exploit 83% of tested Linux privilege escalation vulnerabilities.
LanG presents a governance-aware agentic AI platform for unified security operations that reports strong performance on incident correlation, rule generation, attack reconstruction, and AI safety guardrails in an open-source package.
xOffense automates penetration testing via a fine-tuned Qwen3-32B LLM in a multi-agent setup with specialized agents for reconnaissance, vulnerability scanning, and exploitation, reporting 79.17% sub-task completion on AutoPenBench and AI-Pentest-Benchmark.
citing papers explorer
-
Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
-
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
-
Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents
Targeted prompting and system interventions enable local LLMs such as Llama 3.1 70B to exploit 83% of tested Linux privilege escalation vulnerabilities.
-
LanG -- A Governance-Aware Agentic AI Platform for Unified Security Operations
LanG presents a governance-aware agentic AI platform for unified security operations that reports strong performance on incident correlation, rule generation, attack reconstruction, and AI safety guardrails in an open-source package.