Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges
Pith reviewed 2026-05-10 02:34 UTC · model grok-4.3
The pith
Even the best of ten evaluated LLM agents achieves only 35 percent average checkpoint completion on realistic Capture the Flag challenges, with the largest gaps on non-standard discovery and longer-horizon tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepRed places agents in virtualized attacker environments with full terminal access and records complete execution traces; checkpoint lists derived from public writeups plus an automated summarise-then-judge pipeline then assign partial credit for each challenge. The resulting evaluation of ten models shows the best performer completing 35 percent of checkpoints on average, with markedly lower scores on challenges that demand non-standard discovery or extended planning sequences.
What carries the argument
Partial-credit scoring via challenge-specific checkpoints extracted from public writeups, together with a summarise-then-judge pipeline that labels completion from full execution logs.
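The partial-credit idea described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the challenge names, checkpoint counts, and boolean labels below are all invented, and the sketch assumes each challenge's score is the unweighted fraction of its checkpoints completed.

```python
# Minimal sketch of partial-credit checkpoint scoring (illustrative only;
# challenge names and checkpoint labels are hypothetical, not from the paper).

def challenge_score(completed: list[bool]) -> float:
    """Fraction of checkpoints an agent completed for one challenge."""
    return sum(completed) / len(completed)

def average_completion(runs: dict[str, list[bool]]) -> float:
    """Unweighted mean of per-challenge checkpoint-completion fractions."""
    return sum(challenge_score(c) for c in runs.values()) / len(runs)

# Hypothetical judge output: one boolean per checkpoint, per challenge.
runs = {
    "web-sqli":   [True, True, False, False],  # found injection, dumped creds
    "priv-esc":   [True, False, False],        # initial foothold only
    "crypto-rsa": [True, True, True, False],
}

print(round(average_completion(runs), 3))  # → 0.528
```

The key property this buys, versus binary scoring, is that all three hypothetical challenges above would count as 0 under solved/unsolved, yet the agent made measurable progress on each.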
If this is right
- Agents perform reliably on common challenge categories but drop sharply on tasks that require non-standard discovery.
- Longer-horizon adaptation remains the clearest performance bottleneck.
- Full execution traces enable post-hoc analysis of exactly where agents stall or loop.
- Current commercial models are not yet capable of reliable autonomous progress on realistic offensive tasks.
Where Pith is reading between the lines
- The same partial-credit approach could be reused on other agent benchmarks to expose incremental progress instead of binary outcomes.
- The observed discovery and planning gaps suggest that adding explicit exploration mechanisms or longer context memory would produce the largest gains.
- If future agents close these gaps, the same benchmark format could serve as a safety test before deployment in real networks.
Load-bearing premise
Checkpoints taken from public writeups plus the automated summarise-then-judge process give an unbiased and sufficiently complete measure of meaningful progress toward solving each challenge.
What would settle it
Re-evaluating the same ten challenges with an expanded, independently verified checkpoint list or with human experts judging the same logs would show whether the 35 percent figure and the category gaps hold or shrink substantially.
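The human-expert comparison proposed above reduces to a standard rater-agreement check. The sketch below is a hypothetical illustration (the paper reports no such validation): it computes raw agreement and Cohen's kappa between invented human and automated-judge labels on the same binary checkpoint decisions.

```python
# Sketch: agreement between an automated judge and human labels on the same
# checkpoint decisions. All labels are invented for illustration.

def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary label sequences of equal length."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal frequencies of each rater.
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    p_exp = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_obs - p_exp) / (1 - p_exp)

human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]  # hypothetical expert labels
judge = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]  # hypothetical pipeline labels

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
print(round(agreement, 2), round(cohen_kappa(human, judge), 2))  # → 0.8 0.6
```

A kappa well above chance on a sampled subset of traces would support the headline numbers; a low kappa would suggest the 35 percent figure partly measures the judge rather than the agents.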
Original abstract
Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepRed, an open-source benchmark for LLM agents performing Capture the Flag challenges in isolated virtualized Kali environments with terminal and optional web-search tools. It defines partial-credit scoring via challenge-specific checkpoints extracted from public writeups, labels execution traces with an automated summarise-then-judge LLM pipeline, and reports results for ten commercially available LLMs across ten VM-based CTF challenges. The headline finding is that the strongest model reaches only 35% average checkpoint completion, performing better on common challenge categories and worse on tasks needing non-standard discovery or longer-horizon adaptation.
Significance. If the partial-credit methodology proves reliable, the work supplies a reproducible, open benchmark that moves CTF agent evaluation beyond binary solved/unsolved outcomes and provides concrete evidence of current limitations in exploratory and adaptive cybersecurity tasks. The provision of full execution traces and the open-source release are clear strengths that enable follow-on research.
major comments (3)
- [Section 3.2] Section 3.2 (Checkpoint Definition): the claim that checkpoints derived from public writeups constitute a complete and unbiased proxy for meaningful partial progress is not supported by evidence that alternative successful paths, failed branches, or non-standard actions are systematically captured; public writeups typically document only one route, which directly affects the validity of both the 35% aggregate score and the reported performance gap between common and non-standard challenges.
- [Section 4.2] Section 4.2 (Summarise-then-Judge Pipeline): no validation is reported for the LLM summarizer and judge (e.g., human agreement rates, error analysis on long terminal logs, or cases of partial success mislabeled as failure), which is load-bearing for the central empirical claims given that checkpoint completion is the sole quantitative outcome measure.
- [Results] Results section (Table 2 or equivalent): the 35% figure and category-wise comparisons lack error bars, statistical tests, or details on how the ten challenges and ten models were selected, making it impossible to assess whether the observed limitations are robust or artifacts of the particular sample.
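The missing-error-bars objection can be partially met even under a single-run design: a nonparametric bootstrap over the ten per-challenge scores yields a confidence interval for the average without re-running any agent. The per-challenge scores below are invented for illustration, chosen only to average near the reported headline figure.

```python
# Sketch: bootstrap CI for an average checkpoint-completion score over ten
# challenges. The scores are hypothetical, not taken from the paper.
import random

scores = [0.75, 0.50, 0.60, 0.25, 0.40, 0.10, 0.00, 0.30, 0.35, 0.20]

random.seed(0)
means = []
for _ in range(10_000):
    sample = random.choices(scores, k=len(scores))  # resample challenges
    means.append(sum(sample) / len(sample))
means.sort()
lo, hi = means[249], means[9749]  # 2.5th and 97.5th percentile bootstrap means
print(f"mean={sum(scores) / len(scores):.2f}  95% CI=[{lo:.2f}, {hi:.2f}]")
```

With only ten challenges the interval is wide, which is itself the point of the referee's comment: it makes explicit how much the category-wise comparisons could move under resampling.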
minor comments (2)
- [Abstract] Abstract and §1: the selection criteria for the ten challenges and ten models should be stated explicitly to allow readers to judge potential selection bias.
- [Figure captions] Figure captions and §5: clarify whether the reported checkpoint percentages are averaged across all checkpoints per challenge or weighted by difficulty.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each of the three major comments point by point below, indicating the revisions we will make to improve the work.
Point-by-point responses
Referee: [Section 3.2] Section 3.2 (Checkpoint Definition): the claim that checkpoints derived from public writeups constitute a complete and unbiased proxy for meaningful partial progress is not supported by evidence that alternative successful paths, failed branches, or non-standard actions are systematically captured; public writeups typically document only one route, which directly affects the validity of both the 35% aggregate score and the reported performance gap between common and non-standard challenges.
Authors: We agree that checkpoints extracted from public writeups represent only documented successful paths and do not systematically capture alternative routes, failed branches, or non-standard actions. This is an inherent limitation of the approach and could affect the interpretation of the 35% average checkpoint completion and the category-wise performance differences. Our rationale for using writeups was that they provide expert-validated milestones that are reproducible and publicly accessible, serving as a practical proxy for partial progress. In the revised manuscript we will expand Section 3.2 to explicitly acknowledge this bias, discuss its potential impact on the reported results, and note that the benchmark is designed to be extensible so that additional checkpoints from multiple sources can be incorporated in future iterations. revision: partial
Referee: [Section 4.2] Section 4.2 (Summarise-then-Judge Pipeline): no validation is reported for the LLM summarizer and judge (e.g., human agreement rates, error analysis on long terminal logs, or cases of partial success mislabeled as failure), which is load-bearing for the central empirical claims given that checkpoint completion is the sole quantitative outcome measure.
Authors: We recognize that the absence of validation for the summarise-then-judge pipeline is a significant gap, given its central role in producing the quantitative results. The original submission omitted this validation primarily due to the substantial effort required to manually annotate lengthy terminal traces. For the revised version we will add a validation subsection in Section 4.2 that reports human-expert agreement rates on a sampled subset of traces, together with an error analysis focused on long logs and instances of partial success. This will provide direct evidence of the pipeline's reliability. revision: yes
Referee: [Results] Results section (Table 2 or equivalent): the 35% figure and category-wise comparisons lack error bars, statistical tests, or details on how the ten challenges and ten models were selected, making it impossible to assess whether the observed limitations are robust or artifacts of the particular sample.
Authors: We accept that greater transparency and statistical context are needed. The ten challenges were chosen to span representative CTF categories drawn from public platforms, and the ten models were selected as a cross-section of commercially available LLMs at the time of the study. Each model-challenge pair was evaluated in a single run because of the high computational cost of full VM-based agent executions. Consequently, we cannot retroactively supply error bars from repeated trials without new experiments. In the revision we will add explicit selection criteria for both challenges and models, clarify the single-run design, and discuss the resulting limitations on statistical inference. We will also include any feasible measures of variability across categories. revision: partial
Circularity Check
No circularity: empirical benchmark with direct measurements
full rationale
This is a purely empirical benchmark paper that evaluates LLM agents on CTF challenges via recorded execution traces, checkpoints extracted from public writeups, and an automated summarise-then-judge pipeline. The central result (35% average checkpoint completion) is a direct aggregate measurement from the described evaluation process on ten models and ten challenges; no equations, fitted parameters, derivations, or self-citations are invoked to produce it. The study contains no load-bearing self-citations, uniqueness theorems, ansatzes, or renamings that reduce claims to inputs by construction. The evaluation pipeline is presented as a methodological choice whose validity can be assessed externally against the raw logs and writeups, satisfying the criteria for a self-contained, non-circular empirical result.