pith. machine review for the scientific record.

arxiv: 2605.06486 · v1 · submitted 2026-05-07 · 💻 cs.CR

Recognition: unknown

Autonomous Adversary: Red-Teaming in the Age of LLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-08 09:03 UTC · model grok-4.3

classification 💻 cs.CR
keywords language model agents · red-teaming · adversary emulation · lateral movement · MITRE ATT&CK · autonomous agents · cybersecurity evaluation

The pith

Language model agents for red-teaming succeed more often with expert-defined plans than when running fully autonomously.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores the use of language model agents to enhance red-team operations, including attack planning and emulating multi-step adversary activities such as lateral movement. It tests these agents in a controlled setting across two scenarios, comparing fully autonomous execution, self-scaffolded planning, and expert-defined action plans. Findings suggest that providing expert plans improves task completion rates, yet all methods face high failure rates due to issues like unreliable command execution, environmental instability, and problems with managing credentials and state. Readers should care because reliable autonomous agents could transform how organizations test their defenses against advanced threats.

Core claim

Using frameworks such as MITRE ATT&CK, language model agents can support core offensive functions in red-teaming, and benchmarks in lateral-movement scenarios show that expert-defined action plans yield higher task-completion rates than fully autonomous or self-scaffolded modes, although failures remain common due to brittle command invocation and state handling errors.

What carries the argument

Benchmarking three operational modalities—fully autonomous execution, self-scaffolded planning, and expert-defined action plans—within ordered task chains for lateral movement, validated deterministically via an LLM-as-a-Judge paradigm in an instrumented environment.
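The evaluation design described above can be made concrete with a small sketch. The following is an editorial illustration of an ordered task chain with explicit validation predicates, not the paper's actual harness; all names, commands, and predicates are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of the evaluation harness: each scenario is an ordered
# chain of tasks, and each task carries an explicit validation predicate
# checked against observed execution artifacts. Names are illustrative.

@dataclass
class Task:
    name: str
    command: str                     # action the agent should carry out
    validate: Callable[[str], bool]  # predicate over execution artifacts

def run_chain(tasks: list[Task], execute: Callable[[str], str]) -> int:
    """Execute tasks in order; stop at the first failed predicate.

    Returns the number of tasks whose validation predicate passed,
    i.e. the chain's task-completion count.
    """
    completed = 0
    for task in tasks:
        artifact = execute(task.command)
        if not task.validate(artifact):
            break  # downstream tasks depend on this state
        completed += 1
    return completed

# Toy two-step lateral-movement fragment with string-match predicates.
chain = [
    Task("enumerate_shares", "net view \\\\target",
         lambda out: "ADMIN$" in out),
    Task("remote_session", "evil-winrm -i target",
         lambda out: "session opened" in out),
]

fake_exec = {"net view \\\\target": "Shares: ADMIN$ C$",
             "evil-winrm -i target": "error: access denied"}.get

print(run_chain(chain, lambda cmd: fake_exec(cmd, "")))  # → 1
```

Stopping at the first failed predicate mirrors the ordered-chain framing in the abstract: a later lateral-movement step is meaningless if the state the earlier step should have produced never materialized.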

If this is right

  • Expert-defined plans offer a higher-performing starting point for integrating language model agents into adversary emulation.
  • Improvements in command reliability, environment stability, and credential management are needed to make any modality viable for practical use.
  • Systematic evaluation with explicit validation predicates can identify specific weaknesses in current language model agent designs for cyber operations.
  • The mapping to MITRE ATT&CK highlights where language model agents align with established adversary tactics and techniques.
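The ATT&CK mapping in the last bullet can be sketched as a simple lookup from chain steps to tactic and technique identifiers. The technique IDs below are real ATT&CK entries; the step names and the specific mapping are an editorial example, not taken from the paper.

```python
# Illustrative mapping of lateral-movement chain steps to MITRE ATT&CK
# (tactic, technique) IDs. IDs are genuine ATT&CK entries; the mapping
# itself is a hypothetical example.
ATTACK_MAPPING = {
    "harvest_credentials": ("TA0006", "T1003"),      # Credential Access / OS Credential Dumping
    "use_valid_account":   ("TA0008", "T1078"),      # Lateral Movement / Valid Accounts
    "smb_admin_share":     ("TA0008", "T1021.002"),  # Lateral Movement / SMB/Windows Admin Shares
    "winrm_session":       ("TA0008", "T1021.006"),  # Lateral Movement / Windows Remote Management
}

def coverage(steps: list[str]) -> set[str]:
    """Tactics a task chain touches, per the mapping above."""
    return {ATTACK_MAPPING[s][0] for s in steps if s in ATTACK_MAPPING}

print(sorted(coverage(["harvest_credentials", "smb_admin_share"])))
# → ['TA0006', 'TA0008']
```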

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid systems combining expert guidance with agent autonomy may be the most effective way to scale red-teaming in the short term.
  • Persistent state and credential errors point to a need for agents that maintain better internal models of the environment or interface with external memory tools.
  • These evaluation methods could be applied to other phases of attack campaigns, such as reconnaissance or persistence, to build a fuller picture of agent capabilities.
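The second bullet's point about external memory can be illustrated with a minimal sketch of a structure an agent could consult before issuing commands, aimed at the credential- and state-handling errors the paper reports. The class, field names, and gating logic are editorial assumptions, not the paper's design.

```python
from dataclasses import dataclass, field

# Hypothetical external memory for an agent's view of the environment,
# targeting the recurring credential/state errors noted in the findings.

@dataclass
class EnvironmentMemory:
    credentials: dict[str, tuple[str, str]] = field(default_factory=dict)  # host -> (user, secret)
    compromised: set[str] = field(default_factory=set)

    def record_credential(self, host: str, user: str, secret: str) -> None:
        self.credentials[host] = (user, secret)

    def mark_compromised(self, host: str) -> None:
        self.compromised.add(host)

    def can_move_to(self, host: str) -> bool:
        """Gate lateral movement: only attempt a hop with a known credential."""
        return host in self.credentials

mem = EnvironmentMemory()
mem.record_credential("fileserver01", "svc_backup", "<redacted>")
mem.mark_compromised("workstation07")
print(mem.can_move_to("fileserver01"), mem.can_move_to("dc01"))  # → True False
```

The design choice here is to make state explicit and checkable rather than implicit in the agent's context window, so a stale or forgotten credential fails a cheap lookup instead of a live command.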

Load-bearing premise

That results from a controlled adversary-emulation setup with instrumented agents and predefined validation rules will hold in uncontrolled real-world red-teaming against live targets.

What would settle it

Running the same language model agents on actual production-like networks without instrumentation or fixed validation predicates and checking if the performance gap between expert plans and autonomous modes disappears or reverses.

Figures

Figures reproduced from arXiv: 2605.06486 by Mohamed Gaber, Mohammad Mamun, Scott Buffett, Sherif Saad.

Figure 1. Framework overview. Step 1: the objective and operational context (knowledge base) are …
Figure 2. Enterprise Active Directory testbed. We evaluate adversary behaviour in a post-compromise Windows AD environment …
Figure 3. System prompt used for enterprise attack graph generation in Fully Autonomous mode.
Figure 4. Cumulative number of tasks completed on Scenario 1 (a 9-step attack chain) as a function …
Figure 5. Task completion rates across LMAs grouped by Scenario-1 (S1) and Scenario-2 (S2).
read the original abstract

Language Model Agents (LMAs) are emerging as a powerful primitive for augmenting red-team operations. They can support attack planning, adversary emulation, and the orchestration of multi-step activity such as lateral movement, a core enabling capability of advanced persistent threat (APT) campaigns. Using frameworks such as MITRE ATT&CK, we analyze where these agents intersect with core offensive functions and assess current strengths and limitations of LMAs with an emphasis on governance and realistic evaluation. We benchmark LMAs across two lateral-movement scenarios in a controlled adversary-emulation environment, where LMAs interact with instrumented cyber agents, observe execution artifacts, and iteratively adapt based on environmental feedback. Each scenario is formalized as an ordered task chain with explicit validation predicates, leveraging an LLM-as-a-Judge paradigm to ensure deterministic outcome verification. We compare three operational modalities: fully autonomous execution, self-scaffolded planning, and expert-defined action plans. Preliminary findings indicate that expert-defined action plans yield higher task-completion rates relative to other operational modes. However, failure remains frequent across all modalities, largely attributable to brittle command invocation, environmental and deployment instability, and recurring errors in credential management and state handling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates the application of Language Model Agents (LMAs) to red-teaming tasks, including adversary emulation and lateral movement, framed within the MITRE ATT&CK framework. It evaluates three operational modalities—fully autonomous execution, self-scaffolded planning, and expert-defined action plans—in a controlled environment using instrumented cyber agents and an LLM-as-Judge for verification. The preliminary findings suggest that expert-defined action plans achieve higher task-completion rates, but failures are common across all modes due to issues like brittle command invocation, environmental instability, and credential management errors.

Significance. The work provides an initial structured assessment of LMAs in offensive security contexts, leveraging established frameworks like MITRE ATT&CK and LLM-as-Judge verification. If the performance differences hold under more rigorous testing and realistic conditions, it could help shape best practices for deploying such agents, underscoring the importance of expert guidance to mitigate current limitations in autonomy.

major comments (2)
  1. The claim of superior performance for expert-defined action plans is based on preliminary qualitative findings. The abstract and results lack quantitative metrics, error bars, sample sizes, or detailed tables, which weakens the evidential support for the central comparative claim.
  2. The controlled adversary-emulation setup with explicit validation predicates and instrumented agents is used to compare modalities. However, as the paper notes brittleness and instability as key failure modes, it is unclear if this setup accurately captures real-world red-teaming challenges, where such controls are absent. This is load-bearing for translating the findings to practical recommendations.
minor comments (1)
  1. The abstract mentions 'two lateral-movement scenarios' but does not specify the number of tasks or runs, which would help contextualize the preliminary findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to improve the paper.

read point-by-point responses
  1. Referee: The claim of superior performance for expert-defined action plans is based on preliminary qualitative findings. The abstract and results lack quantitative metrics, error bars, sample sizes, or detailed tables, which weakens the evidential support for the central comparative claim.

    Authors: We agree that the current version presents the comparative results in a preliminary manner without sufficient quantitative detail. In the revised manuscript we will expand the results section to report explicit task-completion rates for each of the three modalities, the number of trials per scenario, and a summary table of outcomes. Where repeated executions were performed we will include measures of variability. These additions will provide clearer quantitative support for the observed performance differences while preserving the preliminary framing of the study. revision: yes

  2. Referee: The controlled adversary-emulation setup with explicit validation predicates and instrumented agents is used to compare modalities. However, as the paper notes brittleness and instability as key failure modes, it is unclear if this setup accurately captures real-world red-teaming challenges, where such controls are absent. This is load-bearing for translating the findings to practical recommendations.

    Authors: We acknowledge that the controlled environment, while enabling reproducible LLM-as-Judge verification, limits direct extrapolation to uncontrolled real-world red-teaming. In the revision we will add an expanded limitations subsection that explicitly discusses how the documented failure modes (brittle command invocation, environmental instability, and credential/state errors) may behave differently without instrumentation and validation predicates. We will also outline concrete directions for future work that move toward more realistic deployments. This will better situate the practical implications of the findings. revision: yes
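The referee's request for quantitative detail amounts to reporting per-modality completion rates with trial counts and uncertainty. A minimal sketch of such a report, using a Wilson score interval; the counts below are invented for illustration and are not the paper's data.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial completion rate."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))) / denom
    return (centre - half, centre + half)

# Illustrative counts only -- not results from the paper.
results = {"expert_plan": (7, 10), "self_scaffolded": (4, 10), "autonomous": (2, 10)}

for mode, (ok, n) in results.items():
    lo, hi = wilson_interval(ok, n)
    print(f"{mode:16s} {ok}/{n}  95% CI [{lo:.2f}, {hi:.2f}]")
```

With trial counts this small, the intervals for the three modalities overlap heavily, which is exactly why the referee's call for sample sizes and variability matters before the expert-plan advantage can be treated as established.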

Circularity Check

0 steps flagged

No circularity detected; empirical benchmarking relies on external frameworks and direct observation

full rationale

The paper's core contribution is an empirical comparison of three LMA operational modalities (autonomous, self-scaffolded, expert-defined) on lateral-movement task chains in an instrumented environment. Task-completion rates are measured via explicit validation predicates and LLM-as-Judge verification against MITRE ATT&CK mappings. No equations, fitted parameters, or self-referential definitions appear; results are reported as observed outcomes rather than derived predictions. The setup uses external standards and does not reduce any claim to its own inputs by construction, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on domain assumptions about LLM agent adaptability and the reliability of automated judging rather than introducing free parameters or new entities.

axioms (2)
  • domain assumption LLM agents can observe execution artifacts and iteratively adapt based on environmental feedback in cyber environments
    Central to the benchmarking setup described in the abstract.
  • domain assumption The LLM-as-a-Judge paradigm ensures deterministic outcome verification for task chains
    Used to validate success in the controlled scenarios.
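The second axiom, deterministic verification under an LLM-as-a-Judge, can be approximated by letting an explicit predicate decide the verdict and restricting the judge to confirmation. This is an editorial sketch of one way to achieve reproducible verdicts, not the paper's design; the judge is stubbed.

```python
import re

# Editorial sketch: determinism comes from the explicit predicate; a
# (stubbed) LLM judge may only confirm it. Any disagreement is resolved
# in favour of the predicate, keeping verdicts reproducible across runs.

def predicate_verdict(artifact: str, pattern: str) -> bool:
    return re.search(pattern, artifact) is not None

def judge_task(artifact: str, pattern: str, llm_judge=None) -> str:
    deterministic = predicate_verdict(artifact, pattern)
    if llm_judge is not None and llm_judge(artifact) != deterministic:
        # In a real system, log the disagreement; the predicate still decides.
        pass
    return "PASS" if deterministic else "FAIL"

print(judge_task("Authentication succeeded for CONTOSO\\svc_backup",
                 r"Authentication succeeded"))  # → PASS
```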

pith-pipeline@v0.9.0 · 5507 in / 1277 out tokens · 88805 ms · 2026-05-08T09:03:43.694728+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Cybench: A framework for evaluating cybersecurity capabilities and risks of language models

    Zhang, A.K., Perry, N., Dulepet, R., Ji, J., Menders, C., Lin, J.W., Jones, E., Hussein, G., Liu, S., Jasper, D. and Peetathawatchai, P., 2024. Cybench: A framework for evaluating cybersecurity capabilities and risks of language models. arXiv preprint arXiv:2408.08926

  2. [2]

    Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks

    Singer, B., Lucas, K., Adiga, L., Jain, M., Bauer, L. and Sekar, V., 2025. On the feasibility of using llms to execute multistage network attacks. arXiv preprint arXiv:2501.16466

  3. [3]

    Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark

    Shao, Minghao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu et al. ”Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark.” arXiv preprint arXiv:2508.05674 (2025)

  4. [4]

    From Promise to Peril: Rethinking Cybersecurity Red and Blue Teaming in the Age of LLMs

    A. Abuadbba, C. Hicks, K. Moore, V. Mavroudis, B. Hasircioglu, D. Goel, and P. Jennings, “From Promise to Peril: Rethinking Cybersecurity Red and Blue Teaming in the Age of LLMs,”arXiv preprint arXiv:2506.13434, 2025. doi: 10.48550/arXiv.2506.13434

  5. [5]

    PentestGPT: Evaluating and harnessing large language models for automated penetration testing

    Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y., Pinzger, M. and Rass, S., 2024. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), pp. 847–864

  6. [6]

    AutoAttacker: A large language model guided system to implement automatic cyber-attacks

    Xu, J., Stokes, J.W., McDonald, G., Bai, X., Marshall, D., Wang, S., Swaminathan, A. and Li, Z., 2024. AutoAttacker: A large language model guided system to implement automatic cyber-attacks. arXiv preprint arXiv:2403.01038

  7. [7]

    Forewarned is Forearmed: A Survey on Large Language Model-based Agents in Autonomous Cyberattacks

    Xu, M., Fan, J., Huang, X., Zhou, C., Kang, J., Niyato, D., Mao, S., Han, Z., Shen, X. and Lam, K.-Y., 2025. Forewarned is Forearmed: A Survey on Large Language Model-based Agents in Autonomous Cyberattacks. arXiv preprint arXiv:2505.12786

  8. [8]

    Gioacchini, L., Mellia, M., Drago, I., Delsanto, A., Siracusano, G., and Bifulco, R.,

  9. [9]

    AutoPenBench: Benchmarking Generative Agents for Penetration Testing

    Gioacchini, L., Mellia, M., Drago, I., Delsanto, A., Siracusano, G. and Bifulco, R., 2024. AutoPenBench: Benchmarking Generative Agents for Penetration Testing. arXiv preprint arXiv:2410.03225

  10. [10]

    Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents

    Sanz-Gómez, M., Mayoral-Vilches, V., Balassone, F., Navarrete-Lozano, L.J., Chavez, C.R. and de Torres, M.D.M., 2025. Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents. arXiv preprint arXiv:2510.24317

  11. [11]

    Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

    Folkerts, L., Payne, W., Inman, S., Giavridis, P., Skinner, J., Deverett, S., Aung, J., Zorer, E., Schmatz, M., Ghanem, M. and Wilkinson, J., 2026. Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios. arXiv preprint arXiv:2603.11214