
arxiv: 2605.26195 · v2 · pith:672TUPNYnew · submitted 2026-05-25 · 💻 cs.CR · cs.AI

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

Pith reviewed 2026-06-29 21:17 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords self-evolving agents · cybersecurity agents · LLM agents · scaffold optimization · CTF challenges · vulnerability exploitation · penetration testing

The pith

CyberEvolver lets LLM cybersecurity agents revise their own scaffolds from failed execution traces, lifting success rates by 13.6 percent on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CyberEvolver as a framework that lets LLM agents for cybersecurity tasks iteratively rewrite their own operating structures after unsuccessful runs. It tackles the problems of unstructured revision spaces, sparse or hidden feedback, and compounding mistakes by splitting the agent into four layers, mapping execution logs to specific diagnosis signals, and running a population-based beam search to keep multiple variants alive. If the approach holds, agents could adjust to new targets and failure patterns without repeated human redesign of their scaffolds. This would matter for tasks such as capture-the-flag challenges, vulnerability exploitation, and penetration testing, where fixed designs often fall short across different environments and models.

Core claim

CyberEvolver decomposes scaffold optimization into a four-layer evolvable agent architecture, converts noisy execution logs into actionable revision signals via a trace-to-diagnosis mechanism, and preserves diversity through population-based beam search. Together these let the agent improve its own structure over iterations: across CTF, exploitation, and penetration-testing tasks with four open-source LLMs, it achieves a 13.6 percent average gain in success rate over the seed agent and beats six human-designed agents and two adapted self-improvement baselines.

What carries the argument

The four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, paired with the trace-to-diagnosis mechanism and population-based beam search.
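
How the three pieces might fit together can be sketched as a loop. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the layer names, the `rollout` stand-in, the `diagnose` rule, the `mutate` operator, and the beam width are all hypothetical. Only the overall shape (attempt, diagnose with a progress score, keep several variants per generation) follows the description above.

```python
import random
from dataclasses import dataclass

# Hypothetical layer names; the summary only says there are four layers.
LAYERS = ("system", "instance", "domain", "policy")


@dataclass(frozen=True)
class Agent:
    # Each layer is reduced to an integer "quality" level for illustration.
    layers: tuple = (0, 0, 0, 0)


def rollout(agent, rng):
    """Stand-in for running the agent on a challenge; returns a fake trace."""
    weakest = min(range(len(LAYERS)), key=lambda i: agent.layers[i])
    noise = rng.random() * 0.1
    return {"weakest_layer": weakest, "progress": sum(agent.layers) + noise}


def diagnose(trace):
    """Map a trace to (index of the layer to revise, progress score)."""
    return trace["weakest_layer"], trace["progress"]


def mutate(agent, layer_idx, rng):
    """Revise one layer; a revision may help (+1) or change nothing (0)."""
    new_layers = list(agent.layers)
    new_layers[layer_idx] += rng.choice((0, 1))
    return Agent(tuple(new_layers))


def evolve(seed, generations=3, beam_width=2, children=3, rng_seed=0):
    rng = random.Random(rng_seed)
    beam = [seed]
    history = []
    for _ in range(generations):
        scored = []
        for parent in beam:
            layer_idx, _ = diagnose(rollout(parent, rng))
            for _ in range(children):
                child = mutate(parent, layer_idx, rng)
                _, score = diagnose(rollout(child, rng))
                scored.append((score, child))
        # Keep several variants alive instead of a single best lineage.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        beam = [agent for _, agent in scored[:beam_width]]
        history.append(scored[0][0])
    return beam, history
```

Calling `evolve(Agent())` returns the surviving beam and the best progress score per generation. Keeping `beam_width` above 1 is the design choice that stands in for diversity preservation here: a weak revision on one lineage does not end the search.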

If this is right

  • Agents built this way adapt across diverse targets and failure modes without fixed human scaffolds.
  • The same framework works with multiple open-source LLMs on CTF challenges, vulnerability exploitation, and penetration testing.
  • It produces higher success rates than six existing human-designed cybersecurity agents.
  • It also exceeds two self-improvement methods taken from other domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition and diversity-preserving search could be tested in agent settings outside cybersecurity where feedback is also sparse.
  • If the diagnosis mapping proves stable, teams might shift effort from hand-crafting scaffolds to curating better log-to-signal translators.
  • Maintaining a population of variants may prevent the kind of premature convergence seen in single-lineage self-improvement loops.

Load-bearing premise

The trace-to-diagnosis step can turn noisy or obscured execution logs into reliable revision signals without errors compounding across iterations.

What would settle it

Run repeated self-evolution cycles on a set of tasks with deliberately obscured logs and check whether success rates stop rising or begin to fall after the first few iterations.
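
One way to make this check concrete is to label a per-iteration success-rate series as rising, plateaued, or falling. The sketch below is illustrative only: the function name `classify_trend` and the `tolerance` parameter are assumptions, and no values from the paper are used.

```python
def classify_trend(success_rates, tolerance=0.01):
    """Label a per-iteration success-rate series.

    success_rates[i] is the success rate after iteration i (0 = seed agent).
    tolerance is an assumed minimum change that counts as real movement.
    Returns (label, index of the first iteration within tolerance of the peak).
    """
    if len(success_rates) < 2:
        raise ValueError("need at least two iterations")
    peak = max(success_rates)
    # First iteration whose rate is already within tolerance of the peak.
    plateau_start = next(
        i for i, rate in enumerate(success_rates) if peak - rate <= tolerance
    )
    last = success_rates[-1]
    if peak - last > tolerance:
        return "falling", plateau_start
    if plateau_start < len(success_rates) - 1:
        return "plateaued", plateau_start
    return "rising", plateau_start
```

A "plateaued" or "falling" label early in the series on obscured-log tasks would be evidence against the load-bearing premise; a "rising" label would not by itself confirm it.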

Figures

Figures reproduced from arXiv: 2605.26195 by Changyi Li, Hong Geng, Jiarun Dai, Lichen Xu, Min Yang, Xudong Pan, Yihe Fan.

Figure 1: Left: CyberEvolver consistently improves over the seed agent and outperforms existing self-improving methods across benchmarks, with performance continuing to increase over generations. Right: On a 488-point challenge solved by only 4.1% of 1,096 competing teams, the seed agent initially makes little progress, while the evolved agent eventually identifies the blind SQL injection channel, extracts the admin…
Figure 2: Overview of CyberEvolver. An evolvable agent A = (LS, LI, LD, LP) interacts with target challenge C and improves itself through a closed-loop evolutionary process. In each iteration, the current agent attempts the challenge to produce a rollout trajectory τ, which is summarized into a compact trajectory record z, diagnosed into structured failure analysis d and progress score s, and used to select promis…
Figure 3: Comparison of CyberEvolver with ACE baselines across four benchmarks and four models.
Figure 4: Distribution of local sibling edit distances across backbone models and benchmarks.
Figure 5: Best reported NYU CTF Bench performance over time. Agent-framework results use the…
Figure 6: Best reported AutoPenBench performance over time. The plotted values follow each…
Figure 7: Sibling code-diff cosine distance per (backbone, benchmark) cell, computed over 12,031…
Figure 8: Layer activation rate per (backbone, benchmark) cell, expressed as the percentage of the…
Figure 9: Composition of the 18,624 Domain Knowledge (LD) actions taken across the 12 (backbone, benchmark) cells, normalized to 100% within each cell. Each bar gives the share of three LD action kinds: create adds a new skill module, replace rewrites an existing skill file, and delete removes one. Creates dominate every cell (51–93%); deletes are a thin red sliver at the top of the two Qwen3 cells (AutoPenBench, …
Figure 10: HGM evolution tree on CVEBench Zero-Day with Kimi-K2.5. The full run generates…
Figure 11: Per-target solve rates and search saturation under HGM on CVEBench Zero-Day with…
Figure 12: Evolution tree for cookie_injection. The chosen branch discovers the cookie oracle in Gen 1 and becomes a binary-search extractor in Gen 2. $ step 1-3 action: curl the homepage, view page source result: Standard shopping page, price $2.90. No login form visible. $ step 4-6 action: Try URL parameter injection: ?id=1', ?cat=1' OR 1=1-- result: No injection point found, all parameters reflected but not proce…
Figure 13: cookie_injection action-score card for Gen 0. The root agent never identifies the trackingId cookie as an injection surface.
Figure 14: cookie_injection action-score card for Gen 1, corresponding to node child2 in…
Figure 15: cookie_injection action-score card for Gen 2, corresponding to node c2_1 in Figure 12. - Includes session decoding and information_schema verification. - Problem: extraction is still linear, so each character costs many probes. Skill C: web-sqli-cookie-oracle (child2, Score 55) - Adds UNION NULL column-count probing and explicit oracle calibration. - Problem: still uses linear character scans, exhausting…
Figure 16: Evolution tree for securinotes. The chosen branch learns Meteor/DDP in Gen 1, binary-searches the count oracle in Gen 2, and stabilizes a persistent extractor in Gen 3. Key bottleneck: This is not a traditional HTTP web challenge. The visible HTTP surface is mostly a Meteor shell and JavaScript bundle; the relevant data path is DDP (Distributed Data Protocol) over WebSocket. Standard SQLi, cookie testing,…
Figure 17: securinotes action-score card for Gen 0. The root agent finds a WebSocket endpoint but keeps reasoning like an HTTP fuzzer. $ step 1 action: load_skill --name web-meteor-ddp result: DDP protocol knowledge loaded. $ step 2 action: pip install websocket-client result: Library installed. $ step 3-5 action: Write Python script: connect WebSocket, send DDP connect+sub result: Connection works! Masked frames ac…
Figure 18: securinotes action-score card for Gen 1, corresponding to node child2 in…
Figure 19: securinotes action-score card for Gen 2, corresponding to node c2_2 in…
Figure 20: securinotes action-score card for Gen 3, corresponding to node c2_2_2 in…
Figure 21: Evolution tree for unlimited_subway. The winning path converges on exact-offset static analysis plus a single-shot exploit under _alarm(5). G.1.3 unlimited_subway — Timed Canary Bypass (Pwn, DeepSeek-V3.1 Gen3 + Minimax-M2.5 Gen2) Benchmark note. This case is drawn from the NYU-CTF benchmark. The rendered lineage below shows the DeepSeek-V3.1 Gen3 solve; the title records the companion Minimax-M2.5 Gen2 s…
Figure 22: unlimited_subway action-score card for Gen 0. The root script is killed by the alarm before useful observations arrive. $ step 1 action: Write a single-script exploit that attempts OOB canary leak and immediate overflow result: The script follows the new canary-bypass guidance. $ step 2 action: Use p.recvuntil(b">") to sync with the binary prompt result: The receive blocks under the alarm constraint. $ st…
Figure 23: unlimited_subway action-score card for Gen 1, corresponding to node child2 in…
Figure 24: unlimited_subway action-score card for Gen 2, corresponding to node c2_2 in Figure 21. $ step 1 action: load_skill --name pwn-static-analysis-prioritization result: Loaded "analyze first, exploit second" methodology. $ step 2 action: gdb -batch -ex "disass main" -ex "disass vuln" ./subway result: Frame=0x40, buffer at RBP-0x30, canary at RBP-0x8. $ step 3 action: Calculate: OOB read index = (0x30-0x8)/8 …
Figure 25: unlimited_subway action-score card for Gen 3, corresponding to node c2_2_2 in…
Figure 26: Evolution tree for apb-vm2. The topology is separated from the score/action cards; purple marks the selected branch and green marks the solving path. Challenge Profile Field Value Challenge web_security/apb-in-vitro-web_security-vm2 Category Web (PHP LFI) Vulnerability PHP include on a user-controlled page parameter, with a partial path-traversal filter Key Constraint Relative-path traversal blocked; abso…
Figure 27: apb-vm2 action-score card for Gen 0. The top bar gives the tree node ID and score; the body expands the detailed action trace. $ step 1 action: load_skill --name web-lfi-bypass result: Error-driven LFI methodology loaded. $ step 2-4 action: Reasoning Protocol: state Hypothesis / Evidence / Prerequisites / Success Criteria for each step result: Decision rationale becomes visible in the trace. $ step 5-9 ac…
Figure 28: apb-vm2 action-score card for Gen 1, corresponding to node child2 in…
Figure 29: apb-vm2 action-score card for Gen 2, corresponding to node c2_2 in…
Figure 30: apb-vm2 action-score card for Gen 3, corresponding to node c2_2_1 in…
Figure 31: Evolution tree for apb-vm6. The topology is separated from the score/action cards; purple marks the selected branch and green marks the solving path. “server-side error noise” will keep optimizing for LFI bypasses indefinitely. Once the eval surface is recognized, a second trap appears: the agent tends to inject a full <?php...?> block, but the parameter is already inside a string-concatenation context, s…
Figure 32: apb-vm6 action-score card for Gen 0. The top bar gives the tree node ID and score; the body expands the detailed action trace. $ step 1-2 action: Adopt HYPOTHESIS / IF TRUE / IF FALSE / TOOL CHECK / ACTION format result: Each step now states its falsifier explicitly. $ step 3-6 action: TOOL CHECK separates environment failures from server responses result: gobuster timeout reclassified as environment, not…
Figure 33: apb-vm6 action-score card for Gen 1, corresponding to node child1 in…
Figure 34: apb-vm6 action-score card for Gen 2, corresponding to node c1_1 in…
Figure 35: apb-vm6 action-score card for Gen 3, corresponding to node c1_1_0 in…
Figure 36: Evolution tree for ezrop. Purple marks the surviving branch; the amber leaf at Gen 3 marks the most-explored stall (no flag). Challenge Profile Field Value Challenge 2022q-pwn-ezrop Category Pwn (NX, alphabetic-only filter, pre-built ROP array) Vulnerability 0x100 read into a 100-byte buffer; check() rejects non-alphabetic bytes; the buffer overflow lands inside a pre-constructed ROP array whose first ent…
Figure 37: ezrop action-score card for Gen 0. $ step 1 action: Adopt HYPOTHESIS / VALIDATION / PROGRESS scaffold result: Reasoning becomes structured; each command states what it would falsify. $ step 2 action: load_skill --name pwn-restricted-charset result: Partial overwrite, alphabetic gadget hunting, CSU-style constrained pivots. $ step 3-10 action: Search for printable jump targets and partial-byte overwrites r…
Figure 38: ezrop action-score card for Gen 1, node child2. L4 System Template — Reasoning scaffold becomes structured + ## Reasoning Protocol + For each command, state: + - HYPOTHESIS: what server-side / binary behavior you assume + - VALIDATION: what observation would falsify it + - PROGRESS: what the action is supposed to advance L2 Skill — pwn-restricted-charset added - Partial-overwrite of saved RIP within the a…
Figure 39: ezrop action-score card for Gen 2, node c2_1. $ step 1-3 action: Tighten heredoc, abs-path, GDB-input rules in instance template result: Operational stack stabilises. $ step 4-8 action: Place '\n' at buf[7]; corrupt rop[0] to enter the staged chain result: Newline bypass + rop[0] hijack working as designed. $ step 9-15 action: Reach a libc leak stage result: Address disclosure achieved within the survivin…
Figure 40: ezrop action-score card for Gen 3, node c2_1_2 (best stall).
Figure 41: Evolution tree for no_pass_needed. The chosen path runs through child0 on the left of Gen 1. G.2.2 no_pass_needed — JWT Knowledge Without Delivery Discipline (Web, Kimi-K2.5, Gen3 Stall) Benchmark note. This case is drawn from the NYU-CTF benchmark. Challenge Profile Field Value Challenge 2021q-web-no_pass_needed Category Web (JWT Auth Bypass under Unstable Service) Vulnerability JWT-based authentication …
Figure 42: no_pass_needed action-score card for Gen 0. $ step 1 action: load_skill --name web-jwt-auth-bypass result: Adds alg:none, RS->HS confusion, kid abuse, time-claim manipulation. $ step 2 action: Fix Python literal mistakes result: true/null replaced by True/None across the attacker harness. $ step 3-10 action: Rotate through JWT bypass techniques result: Menu-driven: each technique tested once, all on a cra…
Figure 43: no_pass_needed action-score card for Gen 1, node child0. Generation 1: JWT Family Acquired, Delivery Model Unchanged (SR 0). L2 Skill — web-jwt-auth-bypass added - alg:none and unsigned token replay. - RS256 -> HS256 algorithm confusion (use the public key as HMAC key). - kid header abuse: file read, SQL injection, command injection. - Claim manipulation: iat / exp / nbf, role escalation, user impersonati…
Figure 44: no_pass_needed action-score card for Gen 2, node c0_2. $ step 1-2 action: Add 'Tried Set' bookkeeping + stabilization checks result: Repeats and target instability are tracked explicitly. $ step 3-8 action: Lock to Hybrid Auth, vary forged claims result: Closest the lineage gets to the human move (freeze delivery, vary claims). $ step 9-16 action: Slip back to standalone JWT to test new tricks result: Ser…
Figure 45: no_pass_needed action-score card for Gen 3, node c0_2_1 (best stall). + - Validate JSON payload locally before transmission. + - On unstable services, batch related tests behind one session. The skill is good domain knowledge. The branch stops dying because it forgot JWT existed. It still cycles through bypass mechanisms one by one on a delivery path that periodically crashes the validator. Generation 2: …
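
The captions for Figures 12 and 15 contrast linear character scans with a binary-search extractor over a yes/no oracle. The difference in probe cost can be illustrated with a purely local simulation. This is a hedged sketch, not code from the paper: `make_oracle`, `extract_linear`, `extract_binary`, and the printable-ASCII range are all assumptions, and the "oracle" is just a Python closure over a known string.

```python
LOW, HIGH = 32, 126  # assumed alphabet: printable ASCII code points


def make_oracle(secret):
    """Local stand-in for a boolean side channel; counts how often it is probed."""
    calls = {"n": 0}

    def oracle(position, threshold):
        # Answers: is the character at `position` greater than `threshold`?
        calls["n"] += 1
        return ord(secret[position]) > threshold

    return oracle, calls


def extract_linear(oracle, length):
    out = []
    for pos in range(length):
        code = LOW
        # Walk upward one code point at a time until the oracle says "not greater".
        while code < HIGH and oracle(pos, code):
            code += 1
        out.append(chr(code))
    return "".join(out)


def extract_binary(oracle, length):
    out = []
    for pos in range(length):
        lo, hi = LOW, HIGH
        # Halve the candidate range on every probe.
        while lo < hi:
            mid = (lo + hi) // 2
            if oracle(pos, mid):
                lo = mid + 1
            else:
                hi = mid
        out.append(chr(lo))
    return "".join(out)
```

With a 95-symbol alphabet the binary variant needs at most 7 probes per character, while the linear variant can need more than 90, which is consistent with the captions' remark that linear scans exhaust the probe budget.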
Original abstract

LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low-diversity updates can cause errors to compound over repeated iterations. CyberEvolver addresses these challenges with a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population-based beam search strategy that preserves diverse agent variants during evolution. We evaluate CyberEvolver on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, CyberEvolver improves the seed agent's success rate by 13.6% on average, and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces CyberEvolver, a self-evolving LLM agent framework for cybersecurity that iteratively revises its scaffold using a four-layer evolvable agent architecture, a trace-to-diagnosis mechanism to convert noisy execution logs into revision signals, and a population-based beam search to maintain diversity. It evaluates the approach on CTF challenges, vulnerability exploitation, and penetration-testing tasks with four open-source LLMs, claiming a 13.6% average success-rate improvement over the seed agent while outperforming six human-designed cybersecurity agents and two adapted self-improvement baselines.

Significance. If the performance claims hold under rigorous controls, the work would offer a concrete demonstration that structured self-evolution can mitigate the limitations of fixed scaffolds in domains with sparse and obscured feedback, advancing adaptive agent design for security applications.

major comments (2)
  1. [Abstract] The central claim of a 13.6% average success-rate lift and outperformance of six human-designed agents plus two baselines is stated without any mention of task counts, run variance, statistical tests, or failure-mode breakdowns, rendering the empirical contribution unverifiable from the given evidence.
  2. [Abstract] The trace-to-diagnosis mechanism is presented as the key step that converts obscured logs into non-compounding revision signals, yet the manuscript supplies no ablation, error analysis, or validation that this conversion step succeeds reliably; because the four-layer architecture and beam search both depend on its output, any systematic mis-diagnosis would invalidate attribution of the reported gains to self-evolution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions focused on the abstract. We agree that the abstract can be strengthened for clarity and verifiability while preserving its concise nature. Below we respond point-by-point and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The central claim of a 13.6% average success-rate lift and outperformance of six human-designed agents plus two baselines is stated without any mention of task counts, run variance, statistical tests, or failure-mode breakdowns, rendering the empirical contribution unverifiable from the given evidence.

    Authors: We agree that the abstract should supply sufficient context for the performance claims. In the revised manuscript we will expand the final sentence of the abstract to state the evaluation scope (CTF, exploitation, and penetration-testing tasks across four LLMs), note that the 13.6% figure is an average over multiple independent runs with reported standard deviation, and indicate that statistical significance was assessed with appropriate tests (e.g., paired t-test or Wilcoxon signed-rank). Failure-mode breakdowns and per-task counts remain in the main experimental section (Section 4) but will be briefly referenced in the abstract for completeness. revision: yes

  2. Referee: [Abstract] The trace-to-diagnosis mechanism is presented as the key step that converts obscured logs into non-compounding revision signals, yet the manuscript supplies no ablation, error analysis, or validation that this conversion step succeeds reliably; because the four-layer architecture and beam search both depend on its output, any systematic mis-diagnosis would invalidate attribution of the reported gains to self-evolution.

    Authors: We acknowledge the referee's concern that the abstract presents the trace-to-diagnosis component without accompanying validation. The current manuscript contains component ablations in Section 5 that quantify the contribution of trace-to-diagnosis relative to the other layers and the beam search; however, a dedicated error analysis of diagnosis accuracy (e.g., manual inspection of a sample of traces and mis-diagnosis rate) is not present. We will therefore add a short paragraph and accompanying table in the revised manuscript that reports diagnosis reliability on a held-out set of traces and shows that mis-diagnosis events are too infrequent to account for the observed gains. This addition will be summarized concisely in the abstract as well. revision: yes
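
The first response mentions assessing significance with a paired test. As a hedged illustration of what a paired comparison between seed and evolved agents involves, the sketch below uses an exact sign-flip permutation test, which is a different technique from the paired t-test or Wilcoxon signed-rank test named in the rebuttal and is chosen here only because it needs no external library. The function name and any inputs are hypothetical; no numbers from the paper are used.

```python
from itertools import product


def paired_sign_flip_pvalue(seed_scores, evolved_scores):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis, each per-task difference is equally likely
    to have either sign, so every sign assignment is enumerated.
    Cost grows as 2**n, so this is only practical for small n.
    """
    if len(seed_scores) != len(evolved_scores):
        raise ValueError("scores must be paired")
    diffs = [e - s for s, e in zip(seed_scores, evolved_scores)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(sign * d for sign, d in zip(signs, diffs)))
        # Small slack guards against floating-point ties.
        if flipped >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total
```

With four pairs that all improve by the same amount, only the all-positive and all-negative sign assignments match the observed total, so the p-value is 2/16. This shows why a handful of tasks cannot support a strong significance claim, and why the referee asks for task counts.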

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

Full rationale

The paper's central claims consist of empirical performance measurements (13.6% average success-rate lift, outperformance of six human-designed agents and two adapted baselines) obtained on external CTF, vulnerability exploitation, and penetration-testing tasks. These results are evaluated against independent benchmarks rather than quantities fitted inside the same experiments or reduced by construction to the paper's own equations or self-citations. The four-layer architecture, trace-to-diagnosis mechanism, and population beam search are presented as design choices whose effectiveness is tested externally; no self-definitional loops, fitted-input predictions, or load-bearing self-citation chains appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Only abstract available; ledger therefore limited to explicitly stated domain assumptions and newly introduced components.

axioms (1)
  • domain assumption: Execution feedback in cybersecurity is sparse and often obscured by the environment.
    Explicitly listed as one of the three core challenges the framework must solve.
invented entities (3)
  • four-layer evolvable agent architecture (no independent evidence)
    purpose: decomposes scaffold optimization into structured components
    Newly proposed decomposition to make the search space tractable.
  • trace-to-diagnosis mechanism (no independent evidence)
    purpose: converts noisy execution logs into actionable revision signals
    Newly proposed conversion step.
  • population-based beam search strategy (no independent evidence)
    purpose: preserves diverse agent variants during evolution
    Newly proposed search strategy to avoid compounding errors.

pith-pipeline@v0.9.1-grok · 5781 in / 1218 out tokens · 32548 ms · 2026-06-29T21:17:26.966699+00:00 · methodology


Reference graph

Works this paper leans on

122 extracted references · 35 canonical work pages · 8 internal anchors

  1. [1]

    Abramovich, M

    T. Abramovich, M. Udeshi, M. Shao, K. Lieret, H. Xi, K. Milner, S. Jancheska, J. Yang, C. E. Jimenez, F. Khorrami, P. Krishnamurthy, B. Dolan-Gavitt, M. Shafique, K. Narasimhan, R. Karri, and O. Press. EnIGMA: Interactive tools substantially assist LM agents in finding security vulnerabilities, 2024. URLhttps://arxiv.org/abs/2409.16165

  2. [2]

    Agent skills

    Anthropic. Agent skills. https://platform.claude.com/docs/en/agents-and-tools/ agent-skills/overview, 2025. Accessed: 2026-05-07

  3. [3]

    Applis, Y

    L. Applis, Y . Zhang, S. Liang, N. Jiang, L. Tan, and A. Roychoudhury. Unified software engineering agent as AI software engineer, 2025. URL https://arxiv.org/abs/2506. 14683

  4. [4]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei...

  5. [5]

    URL https://proceedings.neurips.cc/paper_ files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

    Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_ files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  6. [6]

    CSAW CTF Qualification Round 2023

    CTFtime. CSAW CTF Qualification Round 2023. https://ctftime.org/event/2087/,

  7. [7]

    Scoreboard reports 1,096 teams total

  8. [8]

    DeepSeek-V3 technical report, 2024

    DeepSeek-AI. DeepSeek-V3 technical report, 2024. URL https://arxiv.org/abs/2412. 19437

  9. [9]

    The temperature parameter, n.d

    DeepSeek-AI. The temperature parameter, n.d.. URL https://api-docs.deepseek.com/ quick_start/parameter_settings. DeepSeek API Docs; accessed 2026-05-07

  10. [10]

    Create chat completion, n.d

    DeepSeek-AI. Create chat completion, n.d.. URL https://api-docs.deepseek.com/api/ create-chat-completion. DeepSeek API Docs; accessed 2026-05-07

  11. [11]

    G. Deng, Y . Liu, V . Mayoral-Vilches, P. Liu, Y . Li, Y . Xu, T. Zhang, Y . Liu, M. Pinzger, and S. Rass. Pentestgpt: An llm-empowered automatic penetration testing tool.arXiv preprint arXiv:2308.06782, 2023. URLhttps://arxiv.org/abs/2308.06782

  12. [12]

    X. Deng, J. Da, E. Pan, Y . Y . He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, K. Sampath, M. Krishnan, S. Kundurthy, S. Hendryx, Z. Wang, V . Bharadwaj, J. Holm, R. Aluri, C. B. C. Zhang, N. Jacobson, B. Liu, and B. Kenstler. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025. URL https://scale.com/research/ ...

  13. [13]

    Dodge, M

    J. Dodge, M. Sap, A. Marasovi ´c, W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, and M. Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286–1305, Onl...

  14. [14]

    Fernando, D

    C. Fernando, D. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. InInternational Conference on Machine Learning, 2024. 10

  15. [15]

    Fernando, D

    C. Fernando, D. S. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. InProceedings of the 41st Interna- tional Conference on Machine Learning, 2024. URL https://openreview.net/forum? id=HKkiX32Zw1

  16. [16]

    Gioacchini, M

    L. Gioacchini, M. Mellia, I. Drago, A. Delsanto, G. Siracusano, and R. Bifulco. Autopenbench: Benchmarking generative agents for penetration testing, 2024. URL https://arxiv.org/ abs/2410.03225

  17. [17]

    Welcome to the eternal september of open source

    GitHub. Welcome to the eternal september of open source. here’s what we plan to do for maintainers, 2026. URL https://github.blog/open-source/maintainers/ welcome-to-the-eternal-september-of-open-source-heres-what-we-plan-to-do-for-maintainers/ . GitHub Blog

  18. [18]

    Bug bounty policy, n.d

    Global. Bug bounty policy, n.d. URL https://global.com/bug-bounty-policy/. Exam- ple bug bounty policy requiring actionable proof-of-concept evidence

  19. [19]

    S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. InInternational Conference on Learning Representations, 2025

  20. [20]

    Huang, J

    H. Huang, J. Shi, J. Chen, T. Zhang, Y . Li, C. Yang, E. L. Ouh, L. K. Shar, and D. Lo. Penforge: On-the-fly expert agent construction for automated penetration testing.arXiv preprint arXiv:2601.06910, 2026. URLhttps://arxiv.org/abs/2601.06910

  21. [21]

    AI-generated pull requests overwhelming, hard to review carefully, 2026

    ITK Discourse Contributors. AI-generated pull requests overwhelming, hard to review carefully, 2026. URL https://discourse.itk.org/t/ ai-generated-pull-requests-overwhelming-hard-to-review-carefully/7728 . Community discussion on maintainer burden from AI-generated pull requests

  22. [22]

    Z. Ji, D. Wu, W. Jiang, P. Ma, Z. Li, and S. Wang. Measuring and augmenting large language models for solving capture-the-flag challenges, 2025. URL https://arxiv.org/abs/2506. 17644

  23. [23]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  24. [24]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team. Kimi K2.5: Visual agentic intelligence, 2026. URL https://arxiv.org/abs/2602.02276

  25. [25]

    moonshotai/Kimi-K2.5

    Kimi Team. moonshotai/Kimi-K2.5. https://huggingface.co/moonshotai/Kimi-K2.5. Hugging Face model card; accessed 2026-05-07

  27. [28]

    URL https://arxiv.org/abs/2508.07382

  28. [29]

    H. Kong, D. Hu, J. Ge, L. Li, T. Li, and B. Wu. Vulnbot: Autonomous penetration testing for a multi-agent collaborative framework, 2025. URL https://arxiv.org/abs/2501.13411

  29. [30]

    D. Lee, G. eun Bae, and I. yun. CTFusion: A CTF-based benchmark for LLM agent evaluation. URL https://openreview.net/forum?id=2zQJHLbyqM

  31. [32]

    H. Lee, Z. Zhang, H. Lu, and L. Zhang. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. In Advances in Neural Information Processing Systems. URL https://openreview.net/forum?id=QQhQIqons0

  33. [34]

    J. Lin, Y. Guo, Y. Han, S. Hu, Z. Ni, L. Wang, M. Chen, H. Liu, R. Chen, Y. He, D. Jiang, B. Jiao, C. Hu, and H. Wang. SE-Agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents, 2025. URL https://arxiv.org/abs/2508.02085

  34. [35]

    J. W. Lin, E. K. Jones, D. J. Jasper, E. J.-s. Ho, A. Wu, A. T. Yang, N. Perry, A. Zou, M. Fredrikson, J. Z. Kolter, P. Liang, D. Boneh, and D. E. Ho. Comparing ai agents to cybersecurity professionals in real-world penetration testing, 2025. URL https://arxiv.org/abs/2512.09882

  35. [36]

    A. Z. Liu, J. Choi, S. Sohn, Y. Fu, J. Kim, D.-K. Kim, X. Wang, J. Yoo, and H. Lee. Skillact: Using skill abstractions improves LLM agents. In ICML 2024 Workshop on LLMs and Cognition, 2024. URL https://openreview.net/forum?id=6LG3cIRrF4

  37. [38]

    A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, 2023

  38. [39]

    W. Mai, G. Hong, Q. Liu, J. Chen, J. Dai, X. Pan, Y. Zhang, and M. Yang. Shell or nothing: Real-world benchmarks and memory-activated agents for automated penetration testing. arXiv preprint arXiv:2509.09207, 2025. URL https://arxiv.org/abs/2509.09207

  39. [40]

    MiniMax M2.5: Built for real-world productivity

    MiniMax. MiniMax M2.5: Built for real-world productivity, 2026. URL https://www.minimax.io/news/minimax-m25. MiniMax official blog

  40. [41]

    MiniMaxAI/MiniMax-M2.5

    MiniMaxAI. MiniMaxAI/MiniMax-M2.5. https://huggingface.co/MiniMaxAI/MiniMax-M2.5, 2026. Hugging Face model card; accessed 2026-05-07

  41. [42]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  42. [43]

    CSAW Quals 2023: web/cookie-injection challenge metadata

    NYU CTF Bench. CSAW Quals 2023: web/cookie-injection challenge metadata. https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/blob/main/test/2023/CSAW-Quals/web/cookie-injection/challenge.json, 2024. Challenge metadata lists the dynamic scoring parameters and final point value of 488

  43. [44]

    Gpt-5.3-codex system card

    OpenAI. Gpt-5.3-codex system card. OpenAI, Feb. 2026. URL https://openai.com/index/gpt-5-3-codex-system-card/

  44. [45]

    S. Ouyang, J. Yan, I.-H. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. Le, S. Daruki, X. Tang, V. Tirumalashetty, G. Lee, M. Rofouei, H. Lin, J. Han, C.-Y. Lee, and T. Pfister. ReasoningBank: Scaling agent self-evolving with reasoning memory. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? ...

  45. [46]

    Vulnerability disclosure cheat sheet

    OWASP Foundation. Vulnerability disclosure cheat sheet, 2024. URL https://cheatsheetseries.owasp.org/cheatsheets/Vulnerability_Disclosure_Cheat_Sheet.html. OWASP Cheat Sheet Series

  46. [47]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  47. [48]

    Qwen/Qwen3-235B-A22B-Instruct-2507

    Qwen Team. Qwen/Qwen3-235B-A22B-Instruct-2507. https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507, 2025. Hugging Face model card; accessed 2026-05-07

  48. [49]

    M. Robeyns, M. Szummer, and L. Aitchison. A self-improving coding agent. In Scaling Self-Improving Foundation Models without Human Supervision, 2025. URL https://openreview.net/forum?id=rShJCyLsOr

  49. [50]

    AXE: Grey-Box Exploitability Confirmation for Localized Vulnerability Reports

    A. Sajadi, T. Nguyen, K. Damevski, and P. Chatterjee. Axe: An agentic exploit engine for confirming zero-day vulnerability reports. arXiv preprint arXiv:2602.14345, 2026. URL https://arxiv.org/abs/2602.14345

  50. [51]

    A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959. doi: 10.1147/rd.33.0210

  51. [52]

    J. Schmidhuber. Gödel machines: Fully self-referential optimal universal self-improvers. In Artificial General Intelligence, pages 199–226. Springer, 2007. doi: 10.1007/978-3-540-68677-4_7

  52. [53]

    Y. Shang, Y. Li, K. Zhao, L. Ma, J. Liu, F. Xu, and Y. Li. Agentsquare: Automatic llm agent search in modular design space. In International Conference on Learning Representations, 2025.

  53. [54]

    M. Shao, S. Jancheska, M. Udeshi, B. Dolan-Gavitt, H. Xi, K. Milner, B. Chen, M. Yin, S. Garg, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique. Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security, 2024. URL https://arxiv.org/abs/2406.05590

  54. [55]

    M. Shao, H. Xi, N. Rani, M. Udeshi, V. S. C. Putrevu, K. Milner, B. Dolan-Gavitt, S. K. Shukla, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique. CRAKEN: Cybersecurity llm agent with knowledge-based execution, 2025. URL https://arxiv.org/abs/2505.17107

  55. [56]

    N. Shinn, F. Cassano, A. Gopinath, K. R. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vAElhFcKW6

  57. [58]

    B. Singer, K. Lucas, L. Adiga, M. Jain, L. Bauer, and V. Sekar. Incalmo: An autonomous llm-assisted system for red teaming multi-host networks, 2025. URL https://arxiv.org/abs/2501.16466

  58. [59]

    R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2 edition, 2018. URL https://mitpress.mit.edu/9780262352703/reinforcement-learning/

  60. [61]

    M. Udeshi, M. Shao, H. Xi, N. Rani, K. Milner, V. S. C. Putrevu, B. Dolan-Gavitt, S. K. Shukla, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique. D-cipher: Dynamic collaborative intelligent multi-agent system with planner and heterogeneous executors for offensive security, 2025. URL https://arxiv.org/abs/2502.10931

  62. [63]

    G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=ehfRiF0R3a

  63. [64]

    W. Wang, P. Piekos, L. Nanbo, F. Laakom, Y. Chen, M. Ostaszewski, M. Zhuge, and J. Schmidhuber. Huxley-gödel machine: Human-level coding agent development by an approximation of the optimal self-improving machine, 2025. URL https://arxiv.org/abs/2510.21614

  64. [65]

    Z. Wang, T. Shi, J. He, M. Cai, J. Zhang, and D. Song. Cybergym: Evaluating ai agents’ real-world cybersecurity capabilities at scale, 2025. URL https://arxiv.org/abs/2506.02548

  65. [66]

    Z. Z. Wang, J. Mao, D. Fried, and G. Neubig. Agent workflow memory, 2024. URL https://arxiv.org/abs/2409.07429

  66. [67]

    C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992. doi: 10.1007/BF00992698

  67. [68]

    C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as optimizers. In International Conference on Learning Representations, 2024

  68. [69]

    J. Yang, A. Prabhakar, K. R. Narasimhan, and S. Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=fvKaLF1ns8

  69. [70]

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=mXpq6ut8J3

  70. [71]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

  71. [72]

    A. K. Zhang, J. Ji, C. Menders, R. Dulepet, T. Qin, R. Y. Wang, J. Wu, K. Liao, J. Li, J. Hu, et al. Bountybench: Dollar impact of ai agent attackers and defenders on real-world cybersecurity systems. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=pIsP4lMlFd.

  72. [73]

    A. K. Zhang, N. Perry, R. Dulepet, J. Ji, C. Menders, J. W. Lin, E. Jones, G. Hussein, S. Liu, D. J. Jasper, et al. Cybench: A framework for evaluating cybersecurity capabilities and risks of language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tc90LV0yRL

  74. [75]

    Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

    J. Zhang, S. Hu, C. Lu, R. Lange, and J. Clune. Darwin Godel Machine: Open-ended evolution of self-improving agents, 2025. URL https://arxiv.org/abs/2505.22954

  75. [76]

    J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, B. Zheng, B. Liu, Y. Luo, and C. Wu. AFlow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=z5uVAKwmjf

  76. [77]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

    Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li, U. Thakker, J. Zou, and K. Olukotun. Agentic context engineering: Evolving contexts for self-improving language models, 2025. URL https://arxiv.org/abs/2510.04618

  77. [78]

    A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang. ExpeL: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024. doi: 10.1609/aaai.v38i17.29936. URL https://doi.org/10.1609/aaai.v38i17.29936

  78. [79]

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents, 2023. URL https://arxiv.org/abs/2307.13854

  79. [80]

    Y. Zhu, T. Jin, Y. Pruksachatkun, A. Zhang, S. Liu, S. Cui, S. Kapoor, S. Longpre, K. Meng, R. Weiss, F. Barez, R. Gupta, J. Dhamala, J. Merizian, M. Giulianelli, H. Coppock, C. Ududec, J. Sekhon, J. Steinhardt, A. Kellermann, S. Schwettmann, M. Zaharia, I. Stoica, P. Liang, and D. Kang. Establishing best practices for building rigorous agentic benchmar...

  80. [81]

    Y. Zhu, A. Kellermann, D. Bowman, P. Li, A. Gupta, A. Danda, R. Fang, C. Jensen, E. Ihli, J. Benn, J. Geronimo, A. Dhir, S. Rao, K. Yu, T. Stone, and D. Kang. Cve-bench: A benchmark for ai agents’ ability to exploit real-world web application vulnerabilities. In Proceedings of the 42nd International Conference on Machine Learning, pages 79850–79867, 2025....

Showing first 80 references.