CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

Changyi Li; Hong Geng; Jiarun Dai; Lichen Xu; Min Yang; Xudong Pan; Yihe Fan

arxiv: 2605.26195 · v2 · pith:672TUPNYnew · submitted 2026-05-25 · 💻 cs.CR · cs.AI

Yihe Fan, Changyi Li, Lichen Xu, Xudong Pan, Jiarun Dai, Hong Geng, Min Yang

Pith reviewed 2026-06-29 21:17 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords: self-evolving agents · cybersecurity agents · LLM agents · scaffold optimization · CTF challenges · vulnerability exploitation · penetration testing

The pith

CyberEvolver lets LLM cybersecurity agents revise their own scaffolds from failed execution traces, lifting the seed agent's success rate by 13.6% on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CyberEvolver as a framework that lets LLM agents for cybersecurity tasks iteratively rewrite their own operating structures after unsuccessful runs. It tackles the problems of unstructured revision spaces, sparse or hidden feedback, and compounding mistakes by splitting the agent into four layers, mapping execution logs to specific diagnosis signals, and running a population-based beam search to keep multiple variants alive. If the approach holds, agents could adjust to new targets and failure patterns without repeated human redesign of their scaffolds. This would matter for tasks such as capture-the-flag challenges, vulnerability exploitation, and penetration testing, where fixed designs often fall short across different environments and models.

Core claim

CyberEvolver decomposes scaffold optimization into a four-layer evolvable agent architecture, converts noisy execution logs into actionable revision signals via a trace-to-diagnosis mechanism, and preserves diversity through population-based beam search. Together these let the agent improve its own structure over iterations. The reported result is a 13.6% average gain in success rate over the seed agent, ahead of six human-designed agents and two adapted self-improvement baselines, on CTF, exploitation, and penetration-testing tasks with four open-source LLMs.

What carries the argument

The four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, paired with the trace-to-diagnosis mechanism and population-based beam search.
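
As a rough illustration of how a population-based beam search can keep several scaffold variants alive at once, the following Python sketch evolves a toy population. Everything in it is an assumption made for illustration: the `Agent` dataclass, the placeholder layer contents, the `mutate` and `score` callables, and the beam width are hypothetical stand-ins, not the paper's implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Agent:
    # Four evolvable layers; plain strings stand in for real scaffold content.
    layers: Tuple[str, str, str, str]


def beam_search(
    seed: Agent,
    mutate: Callable[[Agent, random.Random], Agent],
    score: Callable[[Agent], float],
    beam_width: int = 3,
    children_per_parent: int = 2,
    generations: int = 4,
    rng_seed: int = 0,
) -> List[Tuple[float, Agent]]:
    """Keep the top `beam_width` variants per generation, not a single lineage."""
    rng = random.Random(rng_seed)
    beam = [(score(seed), seed)]
    for _gen in range(generations):
        # Parents stay eligible, so the best score never decreases.
        candidates = list(beam)
        for _parent_score, parent in beam:
            for _k in range(children_per_parent):
                child = mutate(parent, rng)
                candidates.append((score(child), child))
        # Drop duplicate scaffolds, then keep the highest-scoring few.
        unique = {agent: s for s, agent in candidates}
        ranked = sorted(unique.items(), key=lambda kv: kv[1], reverse=True)
        beam = [(s, agent) for agent, s in ranked[:beam_width]]
    return beam


def toy_mutate(agent: Agent, rng: random.Random) -> Agent:
    # Hypothetical revision: append one character to one randomly chosen layer.
    layers = list(agent.layers)
    i = rng.randrange(4)
    layers[i] = layers[i] + rng.choice("ab")
    return Agent((layers[0], layers[1], layers[2], layers[3]))


def toy_score(agent: Agent) -> float:
    # Toy objective: number of 'a' characters across all layers.
    return float(sum(layer.count("a") for layer in agent.layers))


seed = Agent(("", "", "", ""))
final_beam = beam_search(seed, toy_mutate, toy_score)
best_score, best_agent = final_beam[0]
```

In a real setting the score would come from running the agent on a challenge, which is far more expensive than this toy objective, so the beam width and the number of children per parent trade search diversity against rollout cost.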

If this is right

Agents built this way adapt across diverse targets and failure modes without fixed human scaffolds.
The same framework works with multiple open-source LLMs on CTF challenges, vulnerability exploitation, and penetration testing.
It produces higher success rates than six existing human-designed cybersecurity agents.
It also exceeds two self-improvement methods taken from other domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same decomposition and diversity-preserving search could be tested in agent settings outside cybersecurity where feedback is also sparse.
If the diagnosis mapping proves stable, teams might shift effort from hand-crafting scaffolds to curating better log-to-signal translators.
Maintaining a population of variants may prevent the kind of premature convergence seen in single-lineage self-improvement loops.

Load-bearing premise

The trace-to-diagnosis step can turn noisy or obscured execution logs into reliable revision signals without errors compounding across iterations.

What would settle it

Run repeated self-evolution cycles on a set of tasks with deliberately obscured logs and check whether success rates stop rising or begin to fall after the first few iterations.
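
One way to operationalize this check, sketched under assumptions: given a per-generation success-rate series from one run, report the last generation that improved on the running best and whether the final generation ended below that best. The function name, the `tolerance` parameter, and the sample series are illustrative and are not taken from the paper.

```python
from typing import List, Tuple


def summarize_run(rates: List[float], tolerance: float = 0.0) -> Tuple[int, bool]:
    """Return (last improving generation, whether the run ended below its best)."""
    if not rates:
        raise ValueError("rates must be non-empty")
    best = rates[0]
    last_improvement = 0
    for gen, rate in enumerate(rates[1:], start=1):
        # Count a generation as an improvement only if it beats the running best.
        if rate > best + tolerance:
            best = rate
            last_improvement = gen
    regressed = rates[-1] < best - tolerance
    return last_improvement, regressed


# Made-up series: gains stop after generation 2, then the rate falls.
obscured_logs_run = [0.20, 0.28, 0.31, 0.31, 0.27]
# Made-up series: the rate is still rising at the final generation.
clean_logs_run = [0.10, 0.20, 0.30]

obscured_summary = summarize_run(obscured_logs_run)
clean_summary = summarize_run(clean_logs_run)
```

A nonzero `tolerance` would help when success rates are noisy across repeated runs, since small fluctuations would otherwise be counted as improvements or regressions.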

Figures

Figures reproduced from arXiv: 2605.26195 by Changyi Li, Hong Geng, Jiarun Dai, Lichen Xu, Min Yang, Xudong Pan, Yihe Fan.

**Figure 1.** Left: CyberEvolver consistently improves over the seed agent and outperforms existing self-improving methods across benchmarks, with performance continuing to increase over generations. Right: On a 488-point challenge solved by only 4.1% of 1,096 competing teams, the seed agent initially makes little progress, while the evolved agent eventually identifies the blind SQL injection channel, extracts the admin…

**Figure 2.** Overview of CyberEvolver. An evolvable agent A = (LS, LI, LD, LP) interacts with target challenge C and improves itself through a closed-loop evolutionary process. In each iteration, the current agent attempts the challenge to produce a rollout trajectory τ, which is summarized into a compact trajectory record z, diagnosed into structured failure analysis d and progress score s, and used to select promis…

**Figure 3.** Comparison of CyberEvolver with ACE baselines across four benchmarks and four models.

**Figure 4.** Distribution of local sibling edit distances across backbone models and benchmarks.

**Figure 5.** Best reported NYU CTF Bench performance over time. Agent-framework results use the…

**Figure 6.** Best reported AutoPenBench performance over time. The plotted values follow each…

**Figure 7.** Sibling code-diff cosine distance per (backbone, benchmark) cell, computed over 12,031…

**Figure 8.** Layer activation rate per (backbone, benchmark) cell, expressed as the percentage of the…

**Figure 9.** Composition of the 18,624 Domain Knowledge (LD) actions taken across the 12 (backbone, benchmark) cells, normalized to 100% within each cell. Each bar gives the share of three LD action kinds: create adds a new skill module, replace rewrites an existing skill file, and delete removes one. Creates dominate every cell (51–93%); deletes are a thin red sliver at the top of the two Qwen3 cells (AutoPenBench, …

**Figure 10.** HGM evolution tree on CVEBench Zero-Day with Kimi-K2.5. The full run generates…

**Figure 11.** Per-target solve rates and search saturation under HGM on CVEBench Zero-Day with…

**Figure 12.** Evolution tree for cookie_injection. The chosen branch discovers the cookie oracle in Gen 1 and becomes a binary-search extractor in Gen 2.

**Figure 13.** cookie_injection action-score card for Gen 0. The root agent never identifies the trackingId cookie as an injection surface.

**Figure 14.** cookie_injection action-score card for Gen 1, corresponding to node child2 in…

**Figure 15.** cookie_injection action-score card for Gen 2, corresponding to node c2_1 in Figure 12.

**Figure 16.** Evolution tree for securinotes. The chosen branch learns Meteor/DDP in Gen 1, binary-searches the count oracle in Gen 2, and stabilizes a persistent extractor in Gen 3.

**Figure 17.** securinotes action-score card for Gen 0. The root agent finds a WebSocket endpoint but keeps reasoning like an HTTP fuzzer.

**Figure 18.** securinotes action-score card for Gen 1, corresponding to node child2 in…

**Figure 19.** securinotes action-score card for Gen 2, corresponding to node c2_2 in…

**Figure 20.** securinotes action-score card for Gen 3, corresponding to node c2_2_2 in…

**Figure 21.** Evolution tree for unlimited_subway. The winning path converges on exact-offset static analysis plus a single-shot exploit under _alarm(5).

**Figure 22.** unlimited_subway action-score card for Gen 0. The root script is killed by the alarm before useful observations arrive.

**Figure 23.** unlimited_subway action-score card for Gen 1, corresponding to node child2 in…

**Figure 24.** unlimited_subway action-score card for Gen 2, corresponding to node c2_2 in Figure 21.

**Figure 25.** unlimited_subway action-score card for Gen 3, corresponding to node c2_2_2 in…

**Figure 26.** Evolution tree for apb-vm2. The topology is separated from the score/action cards; purple marks the selected branch and green marks the solving path.

**Figure 27.** apb-vm2 action-score card for Gen 0. The top bar gives the tree node ID and score; the body expands the detailed action trace.

**Figure 28.** apb-vm2 action-score card for Gen 1, corresponding to node child2 in…

**Figure 29.** apb-vm2 action-score card for Gen 2, corresponding to node c2_2 in…

**Figure 30.** apb-vm2 action-score card for Gen 3, corresponding to node c2_2_1 in…

**Figure 31.** Evolution tree for apb-vm6. The topology is separated from the score/action cards; purple marks the selected branch and green marks the solving path.

**Figure 32.** apb-vm6 action-score card for Gen 0. The top bar gives the tree node ID and score; the body expands the detailed action trace.

**Figure 33.** apb-vm6 action-score card for Gen 1, corresponding to node child1 in…

**Figure 34.** apb-vm6 action-score card for Gen 2, corresponding to node c1_1 in…

**Figure 35.** apb-vm6 action-score card for Gen 3, corresponding to node c1_1_0 in…

**Figure 36.** Evolution tree for ezrop. Purple marks the surviving branch; the amber leaf at Gen 3 marks the most-explored stall (no flag).

**Figure 37.** ezrop action-score card for Gen 0.

**Figure 38.** ezrop action-score card for Gen 1, node child2.

**Figure 39.** ezrop action-score card for Gen 2, node c2_1.

**Figure 40.** ezrop action-score card for Gen 3, node c2_1_2 (best stall).

**Figure 41.** Evolution tree for no_pass_needed. The chosen path runs through child0 on the left of Gen 1.

**Figure 42.** no_pass_needed action-score card for Gen 0.

**Figure 43.** no_pass_needed action-score card for Gen 1, node child0.

**Figure 44.** no_pass_needed action-score card for Gen 2, node c0_2.

**Figure 45.** no_pass_needed action-score card for Gen 3, node c0_2_1 (best stall).

Original abstract

LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low-diversity updates can cause errors to compound over repeated iterations. CyberEvolver addresses these challenges with a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population-based beam search strategy that preserves diverse agent variants during evolution. We evaluate CyberEvolver on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, CyberEvolver improves the seed agent's success rate by 13.6% on average, and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CyberEvolver gives a structured self-evolution setup for cybersecurity agents, but the 13.6% gain claim hinges on an unverified trace-to-diagnosis step that may not handle obscured logs cleanly.

The paper's main contribution is a four-layer agent architecture plus trace-to-diagnosis conversion and population beam search aimed at letting cybersecurity agents revise their scaffolds from failed runs. It targets the real issue that fixed human scaffolds do not adapt well when targets and failure modes shift.

What stands out as new is the specific combination of those three pieces for the cybersecurity setting. The authors correctly note that scaffold changes are unstructured, feedback is sparse and obscured, and low-diversity updates compound errors. They try to counter that with decomposition, diagnosis from traces, and diversity-preserving search.

The work does a reasonable job framing the problem. It explains why standard self-improvement from other domains falls short here and gives concrete mechanisms instead of vague adaptation.

The soft spots sit in the evaluation and the central mechanism. The abstract reports a 13.6% average lift and outperformance over six human-designed agents plus two adapted baselines, yet supplies no task counts, no statistical tests, and no breakdown of when the method fails. More importantly, the trace-to-diagnosis step must turn noisy, obscured logs into accurate revision signals without injecting errors that the beam search cannot filter. The paper itself flags this risk, but the reported gains assume the conversion works reliably. If it does not, the improvements cannot be attributed to self-evolution.

This is for people building or studying LLM agents for CTF, exploitation, and penetration testing who need agents that change without repeated human redesign. A reader looking for practical architecture ideas in that niche could extract usable structure even if the numbers require checking.

The paper shows clear engagement with domain constraints and deserves a serious referee so the experimental controls and diagnosis validation can be examined.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces CyberEvolver, a self-evolving LLM agent framework for cybersecurity that iteratively revises its scaffold using a four-layer evolvable agent architecture, a trace-to-diagnosis mechanism to convert noisy execution logs into revision signals, and a population-based beam search to maintain diversity. It evaluates the approach on CTF challenges, vulnerability exploitation, and penetration-testing tasks with four open-source LLMs, claiming a 13.6% average success-rate improvement over the seed agent while outperforming six human-designed cybersecurity agents and two adapted self-improvement baselines.

Significance. If the performance claims hold under rigorous controls, the work would offer a concrete demonstration that structured self-evolution can mitigate the limitations of fixed scaffolds in domains with sparse and obscured feedback, advancing adaptive agent design for security applications.

major comments (2)

[Abstract] The central claim of a 13.6% average success-rate lift and outperformance of six human-designed agents plus two baselines is stated without any mention of task counts, run variance, statistical tests, or failure-mode breakdowns, rendering the empirical contribution unverifiable from the given evidence.
[Abstract] The trace-to-diagnosis mechanism is presented as the key step that converts obscured logs into non-compounding revision signals, yet the manuscript supplies no ablation, error analysis, or validation that this conversion step succeeds reliably; because the four-layer architecture and beam search both depend on its output, any systematic mis-diagnosis would invalidate attribution of the reported gains to self-evolution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions focused on the abstract. We agree that the abstract can be strengthened for clarity and verifiability while preserving its concise nature. Below we respond point-by-point and indicate the revisions we will make.

Referee: [Abstract] The central claim of a 13.6% average success-rate lift and outperformance of six human-designed agents plus two baselines is stated without any mention of task counts, run variance, statistical tests, or failure-mode breakdowns, rendering the empirical contribution unverifiable from the given evidence.

Authors: We agree that the abstract should supply sufficient context for the performance claims. In the revised manuscript we will expand the final sentence of the abstract to state the evaluation scope (CTF, exploitation, and penetration-testing tasks across four LLMs), note that the 13.6% figure is an average over multiple independent runs with reported standard deviation, and indicate that statistical significance was assessed with appropriate tests (e.g., paired t-test or Wilcoxon signed-rank). Failure-mode breakdowns and per-task counts remain in the main experimental section (Section 4) but will be briefly referenced in the abstract for completeness. Revision: yes.

Referee: [Abstract] The trace-to-diagnosis mechanism is presented as the key step that converts obscured logs into non-compounding revision signals, yet the manuscript supplies no ablation, error analysis, or validation that this conversion step succeeds reliably; because the four-layer architecture and beam search both depend on its output, any systematic mis-diagnosis would invalidate attribution of the reported gains to self-evolution.

Authors: We acknowledge the referee's concern that the abstract presents the trace-to-diagnosis component without accompanying validation. The current manuscript contains component ablations in Section 5 that quantify the contribution of trace-to-diagnosis relative to the other layers and the beam search; however, a dedicated error analysis of diagnosis accuracy (e.g., manual inspection of a sample of traces and the resulting mis-diagnosis rate) is not present. We will therefore add a short paragraph and accompanying table in the revised manuscript that reports diagnosis reliability on a held-out set of traces and shows that mis-diagnosis events are too infrequent to account for the observed gains. This addition will be summarized concisely in the abstract as well. Revision: yes.
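
To make the kind of paired comparison mentioned in the rebuttal concrete, here is a small sketch that uses only the Python standard library. It computes the paired t statistic on per-task differences and, instead of looking up a t-distribution, an exact two-sided sign-flip permutation p-value; that substitution is a choice made here to keep the example self-contained, not something the paper or rebuttal specifies. The per-task success rates are invented for illustration and are not the paper's data.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence


def paired_differences(before: Sequence[float], after: Sequence[float]) -> List[float]:
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    return [a - b for a, b in zip(after, before)]


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> float:
    # t = mean(d) / (sd(d) / sqrt(n)), where d are the per-task differences.
    diffs = paired_differences(before, after)
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def sign_flip_p_value(before: Sequence[float], after: Sequence[float]) -> float:
    # Exact permutation test: under the null, each difference is equally
    # likely to be positive or negative. Enumerates 2**n sign patterns, so
    # this is only practical for small task sets.
    diffs = paired_differences(before, after)
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Invented per-task success rates for a seed agent and an evolved agent.
seed_rates = [0.20, 0.30, 0.10, 0.40, 0.25, 0.35]
evolved_rates = [0.35, 0.40, 0.30, 0.50, 0.30, 0.50]

t_stat = paired_t_statistic(seed_rates, evolved_rates)
p_value = sign_flip_p_value(seed_rates, evolved_rates)
```

With only six pairs the smallest achievable two-sided p-value under this test is 2/64, which illustrates why task counts matter when judging a reported average gain.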

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper's central claims consist of empirical performance measurements (13.6% average success-rate lift, outperformance of six human-designed agents and two adapted baselines) obtained on external CTF, vulnerability exploitation, and penetration-testing tasks. These results are evaluated against independent benchmarks rather than quantities fitted inside the same experiments or reduced by construction to the paper's own equations or self-citations. The four-layer architecture, trace-to-diagnosis mechanism, and population beam search are presented as design choices whose effectiveness is tested externally; no self-definitional loops, fitted-input predictions, or load-bearing self-citation chains appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Only abstract available; ledger therefore limited to explicitly stated domain assumptions and newly introduced components.

axioms (1)

domain assumption: Execution feedback in cybersecurity is sparse and often obscured by the environment.
Explicitly listed as one of the three core challenges the framework must solve.

invented entities (3)

four-layer evolvable agent architecture (no independent evidence)
purpose: decomposes scaffold optimization into structured components
Newly proposed decomposition to make the search space tractable.

trace-to-diagnosis mechanism (no independent evidence)
purpose: converts noisy execution logs into actionable revision signals
Newly proposed conversion step.

population-based beam search strategy (no independent evidence)
purpose: preserves diverse agent variants during evolution
Newly proposed search strategy to avoid compounding errors.

Reference graph

Works this paper leans on

122 extracted references · 35 canonical work pages · 8 internal anchors

[1]

Abramovich, M

T. Abramovich, M. Udeshi, M. Shao, K. Lieret, H. Xi, K. Milner, S. Jancheska, J. Yang, C. E. Jimenez, F. Khorrami, P. Krishnamurthy, B. Dolan-Gavitt, M. Shafique, K. Narasimhan, R. Karri, and O. Press. EnIGMA: Interactive tools substantially assist LM agents in finding security vulnerabilities, 2024. URLhttps://arxiv.org/abs/2409.16165

work page arXiv 2024
[2]

Agent skills

Anthropic. Agent skills. https://platform.claude.com/docs/en/agents-and-tools/ agent-skills/overview, 2025. Accessed: 2026-05-07

2025
[3]

Applis, Y

L. Applis, Y . Zhang, S. Liang, N. Jiang, L. Tan, and A. Roychoudhury. Unified software engineering agent as AI software engineer, 2025. URL https://arxiv.org/abs/2506. 14683

2025
[4]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei...
[5]

URL https://proceedings.neurips.cc/paper_ files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_ files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

2020
[6]

CSAW CTF Qualification Round 2023

CTFtime. CSAW CTF Qualification Round 2023. https://ctftime.org/event/2087/,

2023
[7]

Scoreboard reports 1,096 teams total
[8]

DeepSeek-V3 technical report, 2024

DeepSeek-AI. DeepSeek-V3 technical report, 2024. URL https://arxiv.org/abs/2412. 19437

2024
[9]

The temperature parameter, n.d

DeepSeek-AI. The temperature parameter, n.d.. URL https://api-docs.deepseek.com/ quick_start/parameter_settings. DeepSeek API Docs; accessed 2026-05-07

2026
[10]

Create chat completion, n.d

DeepSeek-AI. Create chat completion, n.d.. URL https://api-docs.deepseek.com/api/ create-chat-completion. DeepSeek API Docs; accessed 2026-05-07

2026
[11]

G. Deng, Y . Liu, V . Mayoral-Vilches, P. Liu, Y . Li, Y . Xu, T. Zhang, Y . Liu, M. Pinzger, and S. Rass. Pentestgpt: An llm-empowered automatic penetration testing tool.arXiv preprint arXiv:2308.06782, 2023. URLhttps://arxiv.org/abs/2308.06782

work page arXiv 2023
[12]

X. Deng, J. Da, E. Pan, Y . Y . He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, K. Sampath, M. Krishnan, S. Kundurthy, S. Hendryx, Z. Wang, V . Bharadwaj, J. Holm, R. Aluri, C. B. C. Zhang, N. Jacobson, B. Liu, and B. Kenstler. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025. URL https://scale.com/research/ ...

2025
[13]

Dodge, M

J. Dodge, M. Sap, A. Marasovi ´c, W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, and M. Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286–1305, Onl...

work page doi:10.18653/v1/2021.emnlp-main.98 2021
[14]

Fernando, D

C. Fernando, D. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. InInternational Conference on Machine Learning, 2024. 10

2024
[15]

C. Fernando, D. S. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. In Proceedings of the 41st International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=HKkiX32Zw1
[16]

L. Gioacchini, M. Mellia, I. Drago, A. Delsanto, G. Siracusano, and R. Bifulco. AutoPenBench: Benchmarking generative agents for penetration testing, 2024. URL https://arxiv.org/abs/2410.03225
[17]

GitHub. Welcome to the eternal september of open source. Here’s what we plan to do for maintainers, 2026. URL https://github.blog/open-source/maintainers/welcome-to-the-eternal-september-of-open-source-heres-what-we-plan-to-do-for-maintainers/. GitHub Blog
[18]

Global. Bug bounty policy, n.d. URL https://global.com/bug-bounty-policy/. Example bug bounty policy requiring actionable proof-of-concept evidence
[19]

S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. In International Conference on Learning Representations, 2025
[20]

H. Huang, J. Shi, J. Chen, T. Zhang, Y. Li, C. Yang, E. L. Ouh, L. K. Shar, and D. Lo. PenForge: On-the-fly expert agent construction for automated penetration testing. arXiv preprint arXiv:2601.06910, 2026. URL https://arxiv.org/abs/2601.06910
[21]

ITK Discourse Contributors. AI-generated pull requests overwhelming, hard to review carefully, 2026. URL https://discourse.itk.org/t/ai-generated-pull-requests-overwhelming-hard-to-review-carefully/7728. Community discussion on maintainer burden from AI-generated pull requests
[22]

Z. Ji, D. Wu, W. Jiang, P. Ma, Z. Li, and S. Wang. Measuring and augmenting large language models for solving capture-the-flag challenges, 2025. URL https://arxiv.org/abs/2506.17644
[23]

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66
[24]

Kimi Team. Kimi K2.5: Visual agentic intelligence, 2026. URL https://arxiv.org/abs/2602.02276
[25]

Kimi Team. moonshotai/Kimi-K2.5. https://huggingface.co/moonshotai/Kimi-K2.5. Hugging Face model card; accessed 2026-05-07
[28]

URL https://arxiv.org/abs/2508.07382
[29]

H. Kong, D. Hu, J. Ge, L. Li, T. Li, and B. Wu. VulnBot: Autonomous penetration testing for a multi-agent collaborative framework, 2025. URL https://arxiv.org/abs/2501.13411
[30]

D. Lee, G. eun Bae, and I. yun. CTFusion: A CTF-based benchmark for LLM agent evaluation. URL https://openreview.net/forum?id=2zQJHLbyqM
[32]

H. Lee, Z. Zhang, H. Lu, and L. Zhang. SEC-bench: Automated benchmarking of LLM agents on real-world software security tasks. In Advances in Neural Information Processing Systems. URL https://openreview.net/forum?id=QQhQIqons0
[34]

J. Lin, Y. Guo, Y. Han, S. Hu, Z. Ni, L. Wang, M. Chen, H. Liu, R. Chen, Y. He, D. Jiang, B. Jiao, C. Hu, and H. Wang. SE-Agent: Self-evolution trajectory optimization in multi-step reasoning with LLM-based agents, 2025. URL https://arxiv.org/abs/2508.02085
[35]

J. W. Lin, E. K. Jones, D. J. Jasper, E. J.-s. Ho, A. Wu, A. T. Yang, N. Perry, A. Zou, M. Fredrikson, J. Z. Kolter, P. Liang, D. Boneh, and D. E. Ho. Comparing AI agents to cybersecurity professionals in real-world penetration testing, 2025. URL https://arxiv.org/abs/2512.09882.
[36]

A. Z. Liu, J. Choi, S. Sohn, Y. Fu, J. Kim, D.-K. Kim, X. Wang, J. Yoo, and H. Lee. SkillAct: Using skill abstractions improves LLM agents. In ICML 2024 Workshop on LLMs and Cognition, 2024. URL https://openreview.net/forum?id=6LG3cIRrF4
[38]

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, 2023
[39]

W. Mai, G. Hong, Q. Liu, J. Chen, J. Dai, X. Pan, Y. Zhang, and M. Yang. Shell or nothing: Real-world benchmarks and memory-activated agents for automated penetration testing. arXiv preprint arXiv:2509.09207, 2025. URL https://arxiv.org/abs/2509.09207
[40]

MiniMax. MiniMax M2.5: Built for real-world productivity, 2026. URL https://www.minimax.io/news/minimax-m25. MiniMax official blog
[41]

MiniMaxAI. MiniMaxAI/MiniMax-M2.5. https://huggingface.co/MiniMaxAI/MiniMax-M2.5, 2026. Hugging Face model card; accessed 2026-05-07
[42]

A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025
[43]

NYU CTF Bench. CSAW Quals 2023: web/cookie-injection challenge metadata. https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/blob/main/test/2023/CSAW-Quals/web/cookie-injection/challenge.json, 2024. Challenge metadata lists the dynamic scoring parameters and final point value of 488
[44]

OpenAI. GPT-5.3-Codex system card. OpenAI, Feb. 2026. URL https://openai.com/index/gpt-5-3-codex-system-card/
[45]

S. Ouyang, J. Yan, I.-H. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. Le, S. Daruki, X. Tang, V. Tirumalashetty, G. Lee, M. Rofouei, H. Lin, J. Han, C.-Y. Lee, and T. Pfister. ReasoningBank: Scaling agent self-evolving with reasoning memory. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? ...
[46]

OWASP Foundation. Vulnerability disclosure cheat sheet, 2024. URL https://cheatsheetseries.owasp.org/cheatsheets/Vulnerability_Disclosure_Cheat_Sheet.html. OWASP Cheat Sheet Series
[47]

Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
[48]

Qwen Team. Qwen/Qwen3-235B-A22B-Instruct-2507. https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507, 2025. Hugging Face model card; accessed 2026-05-07
[49]

M. Robeyns, M. Szummer, and L. Aitchison. A self-improving coding agent. In Scaling Self-Improving Foundation Models without Human Supervision, 2025. URL https://openreview.net/forum?id=rShJCyLsOr
[50]

A. Sajadi, T. Nguyen, K. Damevski, and P. Chatterjee. AXE: An agentic exploit engine for confirming zero-day vulnerability reports. arXiv preprint arXiv:2602.14345, 2026. URL https://arxiv.org/abs/2602.14345
[51]

A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959. doi: 10.1147/rd.33.0210
[52]

J. Schmidhuber. Gödel machines: Fully self-referential optimal universal self-improvers. In Artificial General Intelligence, pages 199–226. Springer, 2007. doi: 10.1007/978-3-540-68677-4_7
[53]

Y. Shang, Y. Li, K. Zhao, L. Ma, J. Liu, F. Xu, and Y. Li. AgentSquare: Automatic LLM agent search in modular design space. In International Conference on Learning Representations, 2025.
[54]

M. Shao, S. Jancheska, M. Udeshi, B. Dolan-Gavitt, H. Xi, K. Milner, B. Chen, M. Yin, S. Garg, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique. NYU CTF Bench: A scalable open-source benchmark dataset for evaluating LLMs in offensive security, 2024. URL https://arxiv.org/abs/2406.05590
[55]

M. Shao, H. Xi, N. Rani, M. Udeshi, V. S. C. Putrevu, K. Milner, B. Dolan-Gavitt, S. K. Shukla, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique. CRAKEN: Cybersecurity LLM agent with knowledge-based execution, 2025. URL https://arxiv.org/abs/2505.17107
[56]

N. Shinn, F. Cassano, A. Gopinath, K. R. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems. URL https://openreview.net/forum?id=vAElhFcKW6
[58]

B. Singer, K. Lucas, L. Adiga, M. Jain, L. Bauer, and V. Sekar. Incalmo: An autonomous LLM-assisted system for red teaming multi-host networks, 2025. URL https://arxiv.org/abs/2501.16466
[59]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition. URL https://mitpress.mit.edu/9780262352703/reinforcement-learning/
[61]

M. Udeshi, M. Shao, H. Xi, N. Rani, K. Milner, V. S. C. Putrevu, B. Dolan-Gavitt, S. K. Shukla, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique. D-CIPHER: Dynamic collaborative intelligent multi-agent system with planner and heterogeneous executors for offensive security. URL https://arxiv.org/abs/2502.10931
[63]

G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=ehfRiF0R3a
[64]

W. Wang, P. Piekos, L. Nanbo, F. Laakom, Y. Chen, M. Ostaszewski, M. Zhuge, and J. Schmidhuber. Huxley-Gödel Machine: Human-level coding agent development by an approximation of the optimal self-improving machine, 2025. URL https://arxiv.org/abs/2510.21614
[65]

Z. Wang, T. Shi, J. He, M. Cai, J. Zhang, and D. Song. CyberGym: Evaluating AI agents’ real-world cybersecurity capabilities at scale, 2025. URL https://arxiv.org/abs/2506.02548
[66]

Z. Z. Wang, J. Mao, D. Fried, and G. Neubig. Agent workflow memory, 2024. URL https://arxiv.org/abs/2409.07429
[67]

C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992. doi: 10.1007/BF00992698
[68]

C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as optimizers. In International Conference on Learning Representations, 2024
[69]

J. Yang, A. Prabhakar, K. R. Narasimhan, and S. Yao. InterCode: Standardizing and benchmarking interactive coding with execution feedback. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=fvKaLF1ns8
[70]

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=mXpq6ut8J3
[71]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X
[72]

A. K. Zhang, J. Ji, C. Menders, R. Dulepet, T. Qin, R. Y. Wang, J. Wu, K. Liao, J. Li, J. Hu, et al. BountyBench: Dollar impact of AI agent attackers and defenders on real-world cybersecurity systems. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=pIsP4lMlFd.
[73]

A. K. Zhang, N. Perry, R. Dulepet, J. Ji, C. Menders, J. W. Lin, E. Jones, G. Hussein, S. Liu, D. J. Jasper, et al. Cybench: A framework for evaluating cybersecurity capabilities and risks of language models. In The Thirteenth International Conference on Learning Representations. URL https://openreview.net/forum?id=tc90LV0yRL
[75]

J. Zhang, S. Hu, C. Lu, R. Lange, and J. Clune. Darwin Godel Machine: Open-ended evolution of self-improving agents, 2025. URL https://arxiv.org/abs/2505.22954
[76]

J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, B. Zheng, B. Liu, Y. Luo, and C. Wu. AFlow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=z5uVAKwmjf
[77]

Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li, U. Thakker, J. Zou, and K. Olukotun. Agentic context engineering: Evolving contexts for self-improving language models, 2025. URL https://arxiv.org/abs/2510.04618
[78]

A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024. doi: 10.1609/aaai.v38i17.29936. URL https://doi.org/10.1609/aaai.v38i17.29936
[79]

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents, 2023. URL https://arxiv.org/abs/2307.13854
[80]

Y. Zhu, T. Jin, Y. Pruksachatkun, A. Zhang, S. Liu, S. Cui, S. Kapoor, S. Longpre, K. Meng, R. Weiss, F. Barez, R. Gupta, J. Dhamala, J. Merizian, M. Giulianelli, H. Coppock, C. Ududec, J. Sekhon, J. Steinhardt, A. Kellermann, S. Schwettmann, M. Zaharia, I. Stoica, P. Liang, and D. Kang. Establishing best practices for building rigorous agentic benchmar...
[81]

Y. Zhu, A. Kellermann, D. Bowman, P. Li, A. Gupta, A. Danda, R. Fang, C. Jensen, E. Ihli, J. Benn, J. Geronimo, A. Dhir, S. Rao, K. Yu, T. Stone, and D. Kang. CVE-Bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities. In Proceedings of the 42nd International Conference on Machine Learning, pages 79850–79867, 2025....

