Autonomous Adversary: Red-Teaming in the Age of LLMs
Pith reviewed 2026-05-08 09:03 UTC · model grok-4.3
The pith
Language model agents for red-teaming succeed more often with expert-defined plans than when running fully autonomously.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using frameworks such as MITRE ATT&CK, language model agents can support core offensive functions in red-teaming. In lateral-movement benchmarks, expert-defined action plans yield higher task-completion rates than fully autonomous or self-scaffolded modes, although failures remain common due to brittle command invocation and state-handling errors.
What carries the argument
Benchmarking three operational modalities—fully autonomous execution, self-scaffolded planning, and expert-defined action plans—on ordered task chains for lateral movement, with each task validated deterministically through explicit predicates and an LLM-as-a-Judge paradigm in an instrumented environment.
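To make that machinery concrete, a minimal sketch of the structure in code: an ordered task chain whose steps carry explicit validation predicates, executed under one of the three modalities. All names and interfaces below are illustrative assumptions, not the paper's actual harness.

```python
# Hypothetical skeleton of the benchmark described above. `agent` is any
# object exposing an attempt() method; nothing here is from the paper.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable

class Modality(Enum):
    AUTONOMOUS = auto()       # agent plans and acts with no scaffold
    SELF_SCAFFOLDED = auto()  # agent writes its own plan, then executes it
    EXPERT_PLAN = auto()      # agent executes an expert-defined action plan

@dataclass
class Task:
    name: str                         # e.g. "move laterally to the file server"
    attck_technique: str              # MITRE ATT&CK ID, e.g. "T1021.002"
    validate: Callable[[dict], bool]  # explicit predicate over observed artifacts

def run_chain(agent, tasks: list[Task], modality: Modality) -> int:
    """Execute tasks in order; stop at the first failed validation predicate.
    The return value is the unit behind per-modality completion rates."""
    completed = 0
    for task in tasks:
        artifacts = agent.attempt(task, modality)  # commands issued + observed output
        if not task.validate(artifacts):           # deterministic outcome check
            break
        completed += 1
    return completed
```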
If this is right
- Expert-defined plans offer a higher-performing starting point for integrating language model agents into adversary emulation.
- Improvements in command reliability, environment stability, and credential management are needed to make any modality viable for practical use.
- Systematic evaluation with explicit validation predicates can identify specific weaknesses in current language model agent designs for cyber operations (one such predicate is sketched after this list).
- The mapping to MITRE ATT&CK highlights where language model agents align with established adversary tactics and techniques.
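To illustrate the third bullet, here is one validation predicate of the kind such an evaluation relies on; the telemetry keys, target address, and account name are invented for illustration, not taken from the paper's testbed.

```python
# Hypothetical validation predicate over the instrumented environment's
# telemetry; real predicates depend on the testbed's instrumentation.
def lateral_move_succeeded(telemetry: dict) -> bool:
    """True iff a session was opened on the target host as the harvested user."""
    return (telemetry.get("target_host") == "10.0.0.7"         # hypothetical target
            and telemetry.get("session_user") == "svc-backup"  # hypothetical account
            and telemetry.get("session_open") is True)
```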
Where Pith is reading between the lines
- Hybrid systems combining expert guidance with agent autonomy may be the most effective way to scale red-teaming in the short term.
- Persistent state and credential errors point to a need for agents that maintain better internal models of the environment or interface with external memory tools (see the sketch after this list).
- These evaluation methods could be applied to other phases of attack campaigns, such as reconnaissance or persistence, to build a fuller picture of agent capabilities.
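A minimal sketch of the external-memory idea from the second bullet, assuming hypothetical `SessionState` and `Credential` types; the design point is that harvested credentials and host state survive context-window truncation instead of being re-derived from conversation history.

```python
# Hypothetical external memory for an agent's credentials and host state.
# All types and method names here are assumptions, not the paper's design.
from dataclasses import dataclass, field

@dataclass
class Credential:
    username: str
    secret: str       # password, hash, or token
    source_host: str  # where it was harvested

@dataclass
class SessionState:
    credentials: dict[str, Credential] = field(default_factory=dict)
    reachable_hosts: set[str] = field(default_factory=set)

    def record_credential(self, host: str, cred: Credential) -> None:
        self.credentials[host] = cred

    def credential_for(self, host: str) -> Credential | None:
        # Queried before each lateral-movement step, so the agent does not
        # have to reconstruct credentials from its context window.
        return self.credentials.get(host)
```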
Load-bearing premise
That results from a controlled adversary-emulation setup with instrumented agents and predefined validation rules will hold in uncontrolled real-world red-teaming against live targets.
What would settle it
Running the same language model agents on actual production-like networks without instrumentation or fixed validation predicates and checking if the performance gap between expert plans and autonomous modes disappears or reverses.
Original abstract
Language Model Agents (LMAs) are emerging as a powerful primitive for augmenting red-team operations. They can support attack planning, adversary emulation, and the orchestration of multi-step activity such as lateral movement, a core enabling capability of advanced persistent threat (APT) campaigns. Using frameworks such as MITRE ATT&CK, we analyze where these agents intersect with core offensive functions and assess current strengths and limitations of LMAs with an emphasis on governance and realistic evaluation. We benchmark LMAs across two lateral-movement scenarios in a controlled adversary-emulation environment, where LMAs interact with instrumented cyber agents, observe execution artifacts, and iteratively adapt based on environmental feedback. Each scenario is formalized as an ordered task chain with explicit validation predicates, leveraging an LLM-as-a-Judge paradigm to ensure deterministic outcome verification. We compare three operational modalities: fully autonomous execution, self-scaffolded planning, and expert-defined action plans. Preliminary findings indicate that expert-defined action plans yield higher task-completion rates relative to other operational modes. However, failure remains frequent across all modalities, largely attributable to brittle command invocation, environmental and deployment instability, and recurring errors in credential management and state handling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the application of Language Model Agents (LMAs) to red-teaming tasks, including adversary emulation and lateral movement, framed within the MITRE ATT&CK framework. It evaluates three operational modalities—fully autonomous execution, self-scaffolded planning, and expert-defined action plans—in a controlled environment using instrumented cyber agents and an LLM-as-a-Judge for verification. The preliminary findings suggest that expert-defined action plans achieve higher task-completion rates, but failures are common across all modes due to issues like brittle command invocation, environmental instability, and credential management errors.
Significance. The work provides an initial structured assessment of LMAs in offensive security contexts, leveraging established frameworks like MITRE ATT&CK and LLM-as-Judge verification. If the performance differences hold under more rigorous testing and realistic conditions, it could help shape best practices for deploying such agents, underscoring the importance of expert guidance to mitigate current limitations in autonomy.
major comments (2)
- The claim of superior performance for expert-defined action plans is based on preliminary qualitative findings. The abstract and results lack quantitative metrics, error bars, sample sizes, or detailed tables, which weakens the evidential support for the central comparative claim.
- The controlled adversary-emulation setup with explicit validation predicates and instrumented agents is used to compare modalities. However, as the paper notes brittleness and instability as key failure modes, it is unclear if this setup accurately captures real-world red-teaming challenges, where such controls are absent. This is load-bearing for translating the findings to practical recommendations.
minor comments (1)
- The abstract mentions 'two lateral-movement scenarios' but does not specify the number of tasks or runs, which would help contextualize the preliminary findings.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to improve the paper.
Point-by-point responses
Referee: The claim of superior performance for expert-defined action plans is based on preliminary qualitative findings. The abstract and results lack quantitative metrics, error bars, sample sizes, or detailed tables, which weakens the evidential support for the central comparative claim.
Authors: We agree that the current version presents the comparative results in a preliminary manner without sufficient quantitative detail. In the revised manuscript we will expand the results section to report explicit task-completion rates for each of the three modalities, the number of trials per scenario, and a summary table of outcomes. Where repeated executions were performed we will include measures of variability. These additions will provide clearer quantitative support for the observed performance differences while preserving the preliminary framing of the study.
Revision: yes
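For concreteness, the reporting we intend to add will follow this shape; the sketch below uses placeholder outcome counts, not our measured data.

```python
# Illustrative reporting: per-modality completion rates with a bootstrap
# confidence interval over repeated runs. Outcome lists are placeholders.
import random

def completion_rate(outcomes: list[bool]) -> float:
    return sum(outcomes) / len(outcomes)

def bootstrap_ci(outcomes: list[bool], n_resamples: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    rng = random.Random(0)  # fixed seed for reproducible intervals
    rates = sorted(
        completion_rate(rng.choices(outcomes, k=len(outcomes)))
        for _ in range(n_resamples)
    )
    return (rates[int(alpha / 2 * n_resamples)],
            rates[int((1 - alpha / 2) * n_resamples) - 1])

# One bool per trial: did the agent complete the full task chain?
runs = {"expert_plan": [True] * 7 + [False] * 3,
        "self_scaffolded": [True] * 4 + [False] * 6,
        "autonomous": [True] * 3 + [False] * 7}
for modality, outcomes in runs.items():
    lo, hi = bootstrap_ci(outcomes)
    print(f"{modality}: {completion_rate(outcomes):.0%} "
          f"(95% CI {lo:.0%}-{hi:.0%})")
```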
Referee: The controlled adversary-emulation setup with explicit validation predicates and instrumented agents is used to compare modalities. However, as the paper notes brittleness and instability as key failure modes, it is unclear if this setup accurately captures real-world red-teaming challenges, where such controls are absent. This is load-bearing for translating the findings to practical recommendations.
Authors: We acknowledge that the controlled environment, while enabling reproducible LLM-as-Judge verification, limits direct extrapolation to uncontrolled real-world red-teaming. In the revision we will add an expanded limitations subsection that explicitly discusses how the documented failure modes (brittle command invocation, environmental instability, and credential/state errors) may behave differently without instrumentation and validation predicates. We will also outline concrete directions for future work that move toward more realistic deployments. This will better situate the practical implications of the findings.
Revision: yes
Circularity Check
No circularity detected; empirical benchmarking relies on external frameworks and direct observation.
Full rationale
The paper's core contribution is an empirical comparison of three LMA operational modalities (autonomous, self-scaffolded, expert-defined) on lateral-movement task chains in an instrumented environment. Task-completion rates are measured via explicit validation predicates and LLM-as-Judge verification against MITRE ATT&CK mappings. No equations, fitted parameters, or self-referential definitions appear; results are reported as observed outcomes rather than derived predictions. The setup uses external standards and does not reduce any claim to its own inputs by construction, satisfying the criteria for a self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLM agents can observe execution artifacts and iteratively adapt based on environmental feedback in cyber environments.
- Domain assumption: The LLM-as-a-Judge paradigm ensures deterministic outcome verification for task chains.
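The second axiom is stronger than what LLM judges typically guarantee. Below is a minimal sketch of how a harness could approximate it, preferring hard predicates and constraining the judge to a yes/no answer; the `judge` callable and prompt format are assumptions, not the paper's implementation.

```python
# Deterministic-first verification: use a hard predicate when one exists,
# fall back to a constrained LLM judge otherwise. Even at temperature 0,
# judge outputs are only approximately deterministic across providers,
# which is why the predicate path comes first.
from typing import Callable, Optional

def verify(task_name: str, artifacts: str,
           judge: Callable[[str], str],
           predicate: Optional[Callable[[str], bool]] = None) -> bool:
    if predicate is not None:
        return predicate(artifacts)  # fully deterministic path
    prompt = (f"Task: {task_name}\nObserved artifacts:\n{artifacts}\n"
              "Answer exactly YES or NO: do the artifacts show task success?")
    return judge(prompt).strip().upper() == "YES"
```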
Reference graph
Works this paper leans on
- [1] Zhang, A.K., Perry, N., Dulepet, R., Ji, J., Menders, C., Lin, J.W., Jones, E., Hussein, G., Liu, S., Jasper, D. and Peetathawatchai, P., 2024. Cybench: A framework for evaluating cybersecurity capabilities and risks of language models. arXiv preprint arXiv:2408.08926.
- [2] Singer, B., Lucas, K., Adiga, L., Jain, M., Bauer, L. and Sekar, V., 2025. On the feasibility of using LLMs to execute multistage network attacks (Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks, November 2025). arXiv preprint arXiv:2501.16466.
- [3] Shao, M., Rani, N., Milner, K., Xi, H., Udeshi, M., Aggarwal, S., Putrevu, V.S.C. et al., 2025. Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark. arXiv preprint arXiv:2508.05674.
- [4] Abuadbba, A., Hicks, C., Moore, K., Mavroudis, V., Hasircioglu, B., Goel, D. and Jennings, P., 2025. From Promise to Peril: Rethinking Cybersecurity Red and Blue Teaming in the Age of LLMs. arXiv preprint arXiv:2506.13434.
- [5] Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y., Pinzger, M. and Rass, S., 2024. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), pp. 847–864.
- [6] Xu, J., Stokes, J.W., McDonald, G., Bai, X., Marshall, D., Wang, S., Swaminathan, A. and Li, Z., 2024. AutoAttacker: A large language model guided system to implement automatic cyber-attacks. arXiv preprint arXiv:2403.01038.
- [7]
- [8] Gioacchini, L., Mellia, M., Drago, I., Delsanto, A., Siracusano, G. and Bifulco, R.
- [9] Gioacchini, L., Mellia, M., Drago, I., Delsanto, A., Siracusano, G. and Bifulco, R., 2024. AutoPenBench: Benchmarking Generative Agents for Penetration Testing. arXiv preprint arXiv:2410.03225.
- [10] Sanz-Gómez, M., Mayoral-Vilches, V., Balassone, F., Navarrete-Lozano, L.J., Chavez, C.R. and de Torres, M.D.M., 2025. Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents. arXiv preprint arXiv:2510.24317.
- [11] Folkerts, L., Payne, W., Inman, S., Giavridis, P., Skinner, J., Deverett, S., Aung, J., Zorer, E., Schmatz, M., Ghanem, M. and Wilkinson, J., 2026. Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios. arXiv preprint arXiv:2603.11214.