Towards Cybersecurity SuperIntelligence (CSI): What's the best harness for cybersecurity?

Daniel Sánchez Prieto; Davide Quarta; Francesco Balassone; María Sanz-Gómez; Marina Oteiza Álvarez; Martin Pinzger; Paul Zabalegui Landa; Víctor Mayoral-Vilches

arxiv: 2605.28334 · v2 · pith:ILHYYUXLnew · submitted 2026-05-27 · 💻 cs.CR


Pith reviewed 2026-06-29 11:57 UTC · model grok-4.3

classification 💻 cs.CR

keywords: cybersecurity, AI agents, multi-agent systems, blackboard architecture, LLM scaffolds, agent harnesses, meta-scaffold


The pith

A blackboard that lets different AI scaffolds share findings solves more cybersecurity challenges than any one scaffold alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to determine the best execution harness for LLM-based cybersecurity agents and concludes that no individual scaffold works best across all tasks. It introduces CSI, a meta-scaffold that runs five structurally different agent harnesses in parallel and lets them exchange results on a shared blackboard. On the 33 cybench challenges the blackboard combination reaches 19 solves while the strongest single scaffold reaches only 15, and it does so faster at similar cost. A reader would care because current cybersecurity AI work is converging on single iterative loops, yet the results indicate that deliberate heterogeneity plus shared memory produces measurable gains in coverage.

Core claim

No single scaffold is the best harness; the combination of structurally heterogeneous scaffolds inside a blackboard-based multi-agent architecture produces the highest coverage, solving 19 of 33 cybench challenges versus 15 of 33 for the strongest individual scaffold at 25 percent less time and comparable cost.

What carries the argument

CSI's blackboard-based multi-agent architecture, in which scaffold-specialised agents run in parallel and exchange intermediate findings via a shared substrate.
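The pattern named above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Blackboard` class, its `post` and `read` methods, the toy agents, and the in-memory thread-safe store are hypothetical and are not taken from the paper, which (per the abstract) states only that agents run in parallel and exchange intermediate findings via a shared substrate.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Blackboard:
    """Hypothetical shared substrate: a thread-safe, append-only list of findings."""

    def __init__(self):
        self._lock = threading.Lock()
        self._findings = []

    def post(self, agent, finding):
        # Record which agent contributed which finding.
        with self._lock:
            self._findings.append((agent, finding))

    def read(self):
        # Return a snapshot so readers never iterate over a list being mutated.
        with self._lock:
            return list(self._findings)


def run_agent(name, own_hint, board):
    """Toy agent: publishes its own hint, then returns the findings visible to it."""
    board.post(name, own_hint)
    return {finding for _, finding in board.read()}


def run_parallel(hints):
    """Run one toy agent per entry in `hints` and return the shared blackboard."""
    board = Blackboard()
    with ThreadPoolExecutor(max_workers=len(hints)) as pool:
        futures = [pool.submit(run_agent, n, h, board) for n, h in hints.items()]
        for future in futures:
            future.result()  # propagate any agent exception
    return board


board = run_parallel({"agent_a": "port 8080 open", "agent_b": "weak RSA modulus"})
print(sorted(finding for _, finding in board.read()))
```

In a real harness each agent would be a full scaffold loop that polls the blackboard between steps; the sketch only shows the shared-memory contract.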

If this is right

Union of four scaffolds already reaches 17 solves, with the fifth adding one exclusive solve.
Blackboard use yields a 27 percent relative gain over the best individual scaffold.
No scaffold dominates every challenge type, so coverage improves only when heterogeneous designs are combined.
The blackboard approach maintains comparable cost while reducing total runtime by about 25 percent.
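The percentages in this list can be rechecked from the figures quoted in the abstract. The short sketch below does only that arithmetic; it assumes the cost pair ($5,480 vs. $5,122) is listed in the same order as the time pair (blackboard first, CSI::Claude second), which the abstract implies but does not state outright.

```python
# Figures quoted in the abstract.
TOTAL = 33
best_single, union4, blackboard = 15, 17, 19
hours_blackboard, hours_claude = 20.2, 26.8
# Assumption: costs are listed blackboard first, like the hours.
cost_blackboard, cost_claude = 5480, 5122

relative_gain = (blackboard - best_single) / best_single
time_reduction = (hours_claude - hours_blackboard) / hours_claude
cost_increase = (cost_blackboard - cost_claude) / cost_claude

print(f"solve rates: {best_single / TOTAL:.1%}, {union4 / TOTAL:.1%}, {blackboard / TOTAL:.1%}")
print(f"relative gain over best single scaffold: {relative_gain:.1%}")
print(f"wall-time reduction: {time_reduction:.1%}")
print(f"cost increase: {cost_increase:.1%}")
```

Under that assumption the relative gain is 26.7% (reported as 27%), the time reduction is 24.6% (reported as 25%), and the blackboard run costs about 7.0% more, which is what "comparable cost" covers.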

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same blackboard pattern could be tested on real-world incident response logs rather than benchmark challenges.
Adding further scaffolds or refining the sharing rules on the blackboard might increase the number of unique solves beyond 19.
The result suggests that progress toward more capable cybersecurity AI may depend more on orchestration diversity than on improving any one harness.

Load-bearing premise

The 33 cybench challenges form a representative sample of cybersecurity tasks and the five scaffolds are different enough that parallel execution and blackboard sharing produce non-redundant solves.

What would settle it

Repeating the benchmark on a fresh collection of cybersecurity tasks outside the cybench set and finding that the blackboard no longer exceeds the best single scaffold.

Figures

Figures reproduced from arXiv: 2605.28334 by Daniel Sánchez Prieto, Davide Quarta, Francesco Balassone, María Sanz-Gómez, Marina Oteiza Álvarez, Martin Pinzger, Paul Zabalegui Landa, Víctor Mayoral-Vilches.

**Figure 1.** Per-scaffold and architecture-level solves on the 33-challenge cybench subset, holding the model fixed at alias2-mini. The five coloured bars are the per-scaffold solves (independent runs); CSI::Mistral is an independent complementary scaffold tested separately. The hatched teal bar is the four-scaffold union ceiling (17/33); the striped bar is the four-scaffold parallel race (17/33); the solid teal bar is…

**Figure 2.** CSI architecture. The csi wrapper dispatches to one of four scaffold backends. Every request issued by every backend transits the local routing proxy, which performs wire-protocol translation across upstream providers (Anthropic Messages, OpenAI Chat Completions, OpenAI Responses), enforces a non-API-path block list, and writes a unified JSONL ledger with per-request cost. Telemetry suppression operates in…

**Figure 3.** Per-scaffold comparison across seven normalised axes. Each axis is scaled so that 1.0 corresponds to the best scaffold on that axis (lower-is-better metrics are inverted before normalisation).

**Figure 4.** Distribution of cybench challenges by the number of scaffolds (out of four) that solve them. The 16 challenges in the k=0 bar are the hard ceiling of alias2-mini on this suite. The k=1 bar (3 challenges) is the empirical evidence of complementarity: every bar to the left of k=4 is a challenge that some scaffold misses.

**Figure 5.** Marginal contribution per scaffold, namely the number of challenges that the indicated scaffold solves and no other scaffold does. Three scaffolds each contribute exactly one exclusive solve (CSI::Claude: were pickle phreaks revenge, CSI::Codex: noisier crc, CSI::CAI: back to the past), while CSI::GCAI contributes none. The full breakdown by exact subset is given in…

**Figure 6.** UpSet plot of solve-set co-occurrence. Each column is one non-empty exclusive subset (filled dots indicate membership); the bar above is the count of challenges solved by exactly that subset. Total = 17 (union ceiling). Named challenges per subset in Appendix A.2.

**Figure 7.** Pair-wise solve-set intersection |S_a ∩ S_b|. … co-solve set (14), while CSI::CAI and CSI::GCAI occupy the most distant positions (|S_CAI ∩ S_GCAI| = 4). Jaccard similarity is in Appendix B.1.

**Figure 8.** Left: ensemble coverage curve. Each point is the largest union attainable from the best subset of size k. The gap between k=1 and k=2 (+1) and between k=2 and k=3 (+1) demonstrates the marginal value of each additional scaffold; the gap between k=3 and k=4 (+0) is the redundancy of the dominated scaffold. Right: cost-vs-coverage Pareto frontier over all 15 non-empty scaffold subsets.

**Figure 9.** Solves per scaffold by cybench difficulty tier. The total challenges per tier appear above each cluster. … contributes +1 at k=4, and CSI::GCAI contributes +0 at k=5. The marginal gain from a fifth scaffold is at most one challenge, and the greedy path does not improve by replacing any of the four primary scaffolds with Mistral. Mistral strengthens the heterogeneity argument without altering the Pareto front…

**Figure 10.** Per-request input tokens on flecks of gold (reverse engineering, 60 min budget, unsolved by all five scaffolds, ×). Legend shows cumulative input tokens. Claude (18.2 M) compacts three times, peaking at 211 K before each reset. Codex (5.3 M) grows monotonically to 81 K. Mistral (14.4 M) compacts once at 200 K. GCAI (18.1 M) shows a sawtooth from retry-restart cycles across 404 turns. CAI (7.9 M) grows ste…

**Figure 11.** Aggregate per-scaffold bar charts. Top row: solve rate (%), flags captured, total API cost (USD). Middle row: cost per solve (USD), cumulative wall time (minutes), total commands. Bottom row: command error rate (%), total input tokens (M), total output tokens (M). All values match…

**Figure 12.** Cybench solve-rate progression over time by model series. The x-axis shows model launch dates, the y-axis the solved percentage of CAIBench-Jeopardy CTFs. Each experiment was run for a maximum of 300 agent interactions, 245 minutes per challenge, $40 per challenge on API expenses, and at pass@3. The alias series is highlighted in teal; alias2-mini (this paper’s fixed model) sits in the mid-capability band…

**Figure 13.** Full comparison of all evaluated models on the CAIBench-Jeopardy CTFs (cybench) benchmark, complementing the temporal progression in…

**Figure 14.** Token profiles (Very Easy tier, all 5 scaffolds): dynastic, it has begun, loot stash, packed away, urgent, delulu. Dashed (✓) = solved, solid (×) = unsolved. Legend: Claude, Codex, Mistral, GCAI, CAI.

**Figure 15.** Token profiles (Easy tier, all 5 scaffolds): back to the past, crushing, eval me, glacier exchange, missing bits, primary knowledge. Dashed (✓) = solved, solid (×) = unsolved. Legend: Claude, Codex, Mistral, GCAI, CAI.

**Figure 16.** Token profiles (Easy tier, all 5 scaffolds): unbreakable. Dashed (✓) = solved, solid (×) = unsolved. Legend: Claude, Codex, Mistral, GCAI, CAI.

**Figure 17.** Token profiles (Medium tier, all 5 scaffolds): rpgo, skilift, sop, flecks of gold, lock talk. Dashed (✓) = solved, solid (×) = unsolved. Legend: Claude, Codex, Mistral, GCAI, CAI.

**Figure 18.** Token profiles (Hard and Very Hard tiers, all 5 scaffolds): avatar, data siege, diffecient, ezmaze, failproof, noisier crc. Dashed (✓) = solved, solid (×) = unsolved. Legend: Claude, Codex, Mistral, GCAI, CAI.

**Figure 19.** Token profiles (Hard and Very Hard tiers, all 5 scaffolds): shuffled aes, slcg. Dashed (✓) = solved, solid (×) = unsolved. Legend: Claude, Codex, Mistral, GCAI, CAI.
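The quantities behind the captions of Figures 5 and 8 (exclusive solves per scaffold, the largest union for each subset size k, and the 15 non-empty subsets of four scaffolds) can be computed from per-scaffold solve sets. The sketch below shows one way to do so. The scaffold names `A` to `D` and the challenge IDs are placeholders chosen to mimic the reported pattern; they are not the paper's data.

```python
from itertools import combinations

# Hypothetical solve sets (placeholder challenge IDs, not the paper's results).
solves = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 6},
    "D": {1, 2, 3},
}


def best_union(solves, k):
    """Return (size, subset) for the largest union over all subsets of k scaffolds."""
    return max(
        (len(set().union(*(solves[s] for s in combo))), combo)
        for combo in combinations(sorted(solves), k)
    )


def exclusive(solves, name):
    """Challenges solved by `name` and by no other scaffold."""
    others = set().union(*(v for n, v in solves.items() if n != name))
    return solves[name] - others


# Coverage curve: best attainable union size for k = 1..4.
curve = [best_union(solves, k)[0] for k in range(1, len(solves) + 1)]
# Number of non-empty subsets of four scaffolds (2**4 - 1).
n_subsets = sum(1 for k in range(1, len(solves) + 1) for _ in combinations(solves, k))
exclusives = {name: sorted(exclusive(solves, name)) for name in solves}

print(curve, n_subsets, exclusives)
```

With these toy sets, three scaffolds each hold one exclusive solve and the fourth is dominated, so the curve flattens at the last step, the same shape the Figure 8 caption describes.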

Original abstract

What is the best harness for cybersecurity AI? Cybersecurity systems are converging on a single execution scaffold per agent, an iterative shell loop driven by a Large Language Model (LLM). However, scaffolds are not interchangeable, rarely interoperable, and no single scaffold dominates across all challenge types. In our path towards researching Cybersecurity SuperIntelligence (CSI), we present a meta-scaffold that unifies heterogeneous agent harnesses under a common orchestration layer, enabling any LLM-driven scaffold to be deployed, benchmarked, and composed within the same infrastructure. Using CSI, we benchmark five scaffolds (CSI::Claude, CSI::Codex, CSI::GCAI, CSI::Mistral, CSI::CAI) on the 33 cybench challenges, holding the model fixed at alias2-mini. The best individual scaffolds solve 15/33 (45.5%); the four-scaffold union solves 17/33 (51.5%), with the fifth (CSI::Mistral, 10/33) contributing one exclusive solve. We find that no single scaffold is the best harness: it is the combination of structurally heterogeneous scaffolds that yields the highest coverage. We validate this through CSI's blackboard-based multi-agent architecture, in which scaffold-specialised agents run in parallel and exchange intermediate findings via a shared substrate (a blackboard). The blackboard solves 19/33 (57.6%), a 27% relative gain over CSI::Claude, one of the best individual scaffolds (15/33, 45.5%), 25% faster (20.2 h vs. 26.8 h), at comparable cost ($5,480 vs. $5,122).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The blackboard meta-scaffold delivers a clear lift on Cybench (19/33 vs 15/33) by running heterogeneous harnesses in parallel, but the evaluation stays narrow and lacks controls.


The paper's core result is that no single scaffold wins across the board and that a blackboard letting five different ones share intermediate results solves 19 of the 33 Cybench challenges while the best individual one solves 15. It also reports the blackboard run finishing in 20.2 hours versus 26.8 for the top single scaffold at roughly the same cost. That is a concrete, measurable demonstration that structural diversity helps coverage.

The implementation itself is the main new piece: a meta-layer that can plug in any LLM-driven harness, run the harnesses together, and use the blackboard as shared memory. The numbers line up internally: the union of four scaffolds already reaches 17, and the fifth adds one more exclusive solve, so the arithmetic supports the claim that heterogeneity matters.

The main limitation is the test set. Thirty-three challenges is a small sample, and the abstract gives no variance across runs, no statistical tests, and no breakdown of which problems each scaffold actually solved. Without those details it is hard to judge how well the 27% relative gain would hold up on a broader or different set of tasks. Keeping the underlying model fixed is reasonable for isolating the scaffold effect, but it also leaves open whether stronger models would shrink the gap.
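To make the sample-size concern concrete, the sketch below computes a 95% Wilson score interval for each of the two solve rates. This is an illustration added here, not an analysis from the paper. It treats each challenge as an independent binomial trial, which is a simplifying assumption: both systems were run on the same 33 challenges, so a paired test such as McNemar's would be more appropriate but needs per-challenge outcomes for the blackboard run.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


single = wilson_interval(15, 33)  # best individual scaffold
board = wilson_interval(19, 33)   # blackboard architecture

print(f"best single: {single[0]:.3f} to {single[1]:.3f}")
print(f"blackboard:  {board[0]:.3f} to {board[1]:.3f}")
print("intervals overlap:", board[0] < single[1])
```

The two intervals (roughly 0.30 to 0.62 and 0.41 to 0.73) overlap substantially, which is why the note asks for variance across runs before the four-solve difference is read as robust.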

This is useful reading for anyone building or evaluating LLM agents for cybersecurity. It supplies a working architecture and head-to-head numbers rather than just another single-scaffold loop. The work is grounded enough in its own benchmark to warrant peer review; the referee can ask for the missing protocol details and a larger test suite.

Referee Report

2 major / 1 minor

Summary. The paper introduces CSI, a meta-scaffold unifying heterogeneous LLM-driven agent harnesses for cybersecurity tasks. It benchmarks five scaffolds (CSI::Claude, CSI::Codex, CSI::GCAI, CSI::Mistral, CSI::CAI) on the 33 Cybench challenges with fixed model alias2-mini. Key results: best single scaffold solves 15/33 (45.5%), four-scaffold union solves 17/33 (51.5%), and blackboard multi-agent architecture solves 19/33 (57.6%), achieving a 27% relative gain over the best single scaffold, 25% faster (20.2h vs 26.8h) at comparable cost ($5,480 vs $5,122). The central claim is that no single scaffold dominates and that structurally heterogeneous scaffolds combined via blackboard yield highest coverage.

Significance. If the empirical results hold under scrutiny, the work demonstrates that multi-harness orchestration leveraging scaffold heterogeneity can improve coverage on cybersecurity benchmarks at comparable cost ($5,480 vs. $5,122), providing a concrete step toward Cybersecurity SuperIntelligence. It merits credit for using a public benchmark suite and reporting concrete solve counts, timing, and cost metrics. However, the absence of detailed methods substantially limits verifiability and immediate impact.

major comments (2)

[Results section and abstract] The manuscript reports concrete performance claims including 19/33 solves for the blackboard vs. 15/33 for the best single scaffold (CSI::Claude), but provides no experimental protocol, run parameters, success criteria for Cybench challenges, timeout handling, per-challenge attribution, or any statistical tests/error analysis/controls. This is load-bearing for the central claim of a 27% relative gain, as the numbers cannot be reproduced or assessed for robustness without these details.
[Blackboard architecture] The paper states that the blackboard enables parallel execution and exchange of intermediate findings to produce non-redundant solves, but does not specify the exact orchestration rules, conflict resolution, or how scaffold outputs are integrated on the shared substrate. This detail is required to evaluate whether the reported 19/33 count follows from the heterogeneity premise or from unstated implementation choices.

minor comments (1)

[Abstract] The phrasing 'the four-scaffold union solves 17/33 (51.5%), with the fifth (CSI::Mistral, 10/33) contributing one exclusive solve' could be clarified to explicitly state whether the union includes all five or only four, to avoid ambiguity in interpreting the incremental contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional methodological details are required for reproducibility and will revise the manuscript to address both major comments. Point-by-point responses follow.


Referee: [Results section and abstract] The manuscript reports concrete performance claims including 19/33 solves for the blackboard vs. 15/33 for the best single scaffold (CSI::Claude), but provides no experimental protocol, run parameters, success criteria for Cybench challenges, timeout handling, per-challenge attribution, or any statistical tests/error analysis/controls. This is load-bearing for the central claim of a 27% relative gain, as the numbers cannot be reproduced or assessed for robustness without these details.

Authors: We acknowledge that the current version lacks a complete experimental protocol, which limits verifiability of the reported solve counts. In the revised manuscript we will add a dedicated experimental setup subsection (and update the abstract) that specifies: (i) exact run parameters and model configurations for alias2-mini across all five scaffolds, (ii) Cybench success criteria and verification procedure, (iii) timeout and retry handling, (iv) per-challenge solve attribution table, and (v) any statistical controls or error analysis performed. These additions will allow independent reproduction of the 15/33, 17/33, and 19/33 figures while leaving the empirical claims unchanged. revision: yes
Referee: [Blackboard architecture] The paper states that the blackboard enables parallel execution and exchange of intermediate findings to produce non-redundant solves, but does not specify the exact orchestration rules, conflict resolution, or how scaffold outputs are integrated on the shared substrate. This detail is required to evaluate whether the reported 19/33 count follows from the heterogeneity premise or from unstated implementation choices.

Authors: We agree that the orchestration mechanics must be stated explicitly. The revised manuscript will expand the blackboard architecture section to describe: (i) the precise rules governing parallel execution of the five scaffold-specialised agents, (ii) the protocol for posting and reading intermediate findings on the shared substrate, (iii) conflict-resolution logic (priority weighting by per-scaffold historical accuracy plus consensus fallback), and (iv) the integration step that produces the final non-redundant solve set. This will make clear that the additional solves arise from scaffold heterogeneity rather than hidden implementation details. revision: yes
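The conflict-resolution rule named in the simulated response ("priority weighting by per-scaffold historical accuracy plus consensus fallback") is not specified further. One possible reading is sketched below, purely as an assumption: each conflicting finding is scored by the summed historical accuracy of the scaffolds that posted it, and a tie on that score falls back to a plain majority vote. The function name `resolve`, the accuracy values, and the tie rule are all hypothetical.

```python
from collections import Counter, defaultdict


def resolve(candidates, accuracy):
    """Pick one finding from conflicting candidates.

    candidates: list of (scaffold, finding) pairs posted for the same question.
    accuracy:   hypothetical per-scaffold historical accuracy in [0, 1].
    """
    if not candidates:
        raise ValueError("no candidates to resolve")
    scores = defaultdict(float)
    for scaffold, finding in candidates:
        scores[finding] += accuracy.get(scaffold, 0.0)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Priority weighting: accept a strict winner on summed accuracy.
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    # Consensus fallback: on a tie, take the finding most scaffolds agree on.
    votes = Counter(finding for _, finding in candidates)
    return votes.most_common(1)[0][0]


posted = [("x", "flag{a}"), ("y", "flag{b}"), ("z", "flag{b}")]
tie_case = resolve(posted, {"x": 0.5, "y": 0.25, "z": 0.25})
weighted_case = resolve(posted, {"x": 0.9, "y": 0.25, "z": 0.25})
print(tie_case, weighted_case)
```

In the first call the weights tie at 0.5 each, so the majority finding wins; in the second the single high-accuracy scaffold outweighs the other two. Whether the revised manuscript adopts anything like this is unknown.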

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results are self-contained


The paper reports direct empirical measurements of solve rates on the external Cybench benchmark suite (33 challenges) using a fixed model (alias2-mini) across five scaffolds and a blackboard meta-scaffold. The central claims (e.g., best single scaffold at 15/33, blackboard at 19/33) are counts from execution runs, with no equations, fitted parameters, or derivations that reduce the reported deltas to inputs by construction. No self-citations are invoked as load-bearing for uniqueness or ansatzes, and the architecture description does not rename known results or smuggle assumptions via prior work. The derivation chain consists solely of experimental protocol and observed outcomes on an independent test set.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5881 in / 1099 out tokens · 43665 ms · 2026-06-29T11:57:50.609117+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 19 canonical work pages · 6 internal anchors

[1] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. Pentestgpt: Evaluating and harnessing large language models for automated penetration testing. 33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, 2024.

[2] Víctor Mayoral-Vilches. Offensive robot cybersecurity. arXiv preprint arXiv:2506.15343, 2025.

[3] Víctor Mayoral-Vilches, Jasmin Wachter, Cristóbal RJ Veas Chavez, Cathrin Schachner, Luis Javier Navarrete-Lozano, and María Sanz-Gómez. Cai fluency: A framework for cybersecurity ai fluency. arXiv e-prints, pages arXiv–2508, 2025.

[4] Víctor Mayoral-Vilches, María Sanz-Gómez, Francesco Balassone, Stefan Rass, Lidia Salas-Espejo, Benjamin Jablonski, Luis Javier Navarrete-Lozano, Maite del Mundo de Torres, and Cristóbal RJ Chavez. Cybersecurity ai: A game-theoretic ai for guiding attack and defense. arXiv preprint arXiv:2601.05887, 2026.

[5] Zimo Ji, Daoyuan Wu, et al. Measuring and augmenting large language models for solving capture-the-flag challenges. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 603–617, 2025. doi: 10.1145/3719027.3744855

[6] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2203.11171

[7] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pus... doi: 10.1126/science.abq1158, 2022

[8] Anonymous. Ranked voting based self-consistency of large language models. In Findings of the Association for Computational Linguistics (ACL Findings), 2025. URL https://arxiv.org/abs/2505.10772. arXiv:2505.10772

[9] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large scale language model society. Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2303.17760
[10] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024. URL https://arxiv.org/abs/2...

[11] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024. UR...

[12] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2307.07924

[13] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen-Ming Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. In International Conference on Learning Representations (ICLR), 2024. URL ht...

[14] Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling large language model-based multi-agent collaboration. arXiv preprint arXiv:2406.07155, 2025. URL https://arxiv.org/abs/2406.07155. MacNet, multi-agent DAG topology

[15] Lee D. Erman, Frederick Hayes-Roth, Victor R. Lesser, and D. Raj Reddy. The Hearsay-II speech-understanding system: Integrating knowledge to resolve uncertainty. ACM Computing Surveys, 12(2):213–253,

[16] doi: 10.1145/356810.356816

[17] Barbara Hayes-Roth. A blackboard architecture for control. Artificial Intelligence, 26(3):251–321, 1985. doi: 10.1016/0004-3702(85)90063-3

[18] Bowen Han and Gang Zhang. Exploring advanced LLM multi-agent systems based on blackboard architecture. arXiv preprint arXiv:2507.01701, 2025. URL https://arxiv.org/abs/2507.01701

[19] Anonymous. LLM-based multi-agent blackboard system for information discovery in data science. arXiv preprint arXiv:2510.01285, 2024. URL https://arxiv.org/abs/2510.01285
[20] Yifeng He et al. Co-RedTeam: Orchestrated security discovery and exploitation with LLM agents. arXiv preprint arXiv:2602.02164, 2026. URL https://arxiv.org/abs/2602.02164

[21] Anthropic. Claude code: an agentic coding tool that lives in your terminal. https://github.com/anthropics/claude-code, 2025. Pinned to v2.1.87 for the experiments reported here.

[22] Brian Challita and Pierre Parrend. RedTeamLLM: An agentic ai framework for offensive security. arXiv preprint arXiv:2505.06913, 2025. URL https://arxiv.org/abs/2505.06913

[23] OpenAI. Codex CLI: a lightweight coding agent that runs in your terminal. https://github.com/openai/codex, 2025. Pinned to v0.104.0 for the experiments reported here.

[24] Andrej Karpathy. autoresearch: a self-improving ai researcher. https://github.com/karpathy/autoresearch, 2025. Accessed May 2026.

[25] María Sanz-Gómez, Víctor Mayoral-Vilches, Francesco Balassone, Luis Javier Navarrete-Lozano, Cristóbal R. J. Veas Chavez, and Maite del Mundo de Torres. Cybersecurity ai benchmark (caibench): A meta-benchmark for evaluating cybersecurity ai agents, 2025. URL https://arxiv.org/abs/2510.24317

[26] Víctor Mayoral-Vilches, María Sanz-Gómez, and Endika Gil-Uriarte. Towards cybersecurity superintelligence. arXiv preprint, 2026. In preparation. Alias Robotics technical report; figures reproduced with permission of the authors.

| Challenge | Claude | Codex | GCAI | CAI |
|---|---|---|---|---|
| avatar | N | N | N | N |
| back to the past | N | N | N | Y |
| crushing | N | N | N | N |
| data siege | N | N | N | N |
| delulu | N | N | N | N |
| ... | | | | |

[1] [1]

Pentestgpt: Evaluating and harnessing large language models for automated penetration testing.33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, 2024

Gelei Deng, Yi Liu, V´ ıctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. Pentestgpt: Evaluating and harnessing large language models for automated penetration testing.33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, 2024

2024

[2] [2]

Offensive robot cybersecurity

V´ ıctor Mayoral-Vilches. Offensive robot cybersecurity. arXiv preprint arXiv:2506.15343, 2025

work page arXiv 2025

[3] [3]

Cai fluency: A framework for cybersecurity ai fluency.arXiv e-prints, pages arXiv–2508, 2025

V´ ıctor Mayoral-Vilches, Jasmin Wachter, Crist´ obal RJ Veas Chavez, Cathrin Schachner, Luis Javier Navarrete- Lozano, and Mar´ ıa Sanz-G´ omez. Cai fluency: A framework for cybersecurity ai fluency.arXiv e-prints, pages arXiv–2508, 2025

2025

[4] [4]

Cyber- security ai: A game-theoretic ai for guiding attack and defense.arXiv preprint arXiv:2601.05887, 2026

V´ ıctor Mayoral-Vilches, Mar´ ıa Sanz-G´ omez, Francesco Balassone, Stefan Rass, Lidia Salas-Espejo, Ben- jamin Jablonski, Luis Javier Navarrete-Lozano, Maite del Mundo de Torres, and Crist´ obal RJ Chavez. Cyber- security ai: A game-theoretic ai for guiding attack and defense.arXiv preprint arXiv:2601.05887, 2026

work page arXiv 2026

[5] Zimo Ji, Daoyuan Wu, et al. Measuring and augmenting large language models for solving capture-the-flag challenges. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 603–617, 2025. doi: 10.1145/3719027.3744855

[6] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2203.11171

[7] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pus... doi: 10.1126/science.abq1158. 2022.

[8] Anonymous. Ranked voting based self-consistency of large language models. In Findings of the Association for Computational Linguistics (ACL Findings), 2025. URL https://arxiv.org/abs/2505.10772. arXiv:2505.10772

[9] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large scale language model society. Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2303.17760

[10] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024. URL https://arxiv.org/abs/2...

[11] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024. UR...

[12] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2307.07924

[13] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen-Ming Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. In International Conference on Learning Representations (ICLR), 2024. URL ht...

[14] Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling large language model-based multi-agent collaboration. arXiv preprint arXiv:2406.07155, 2025. URL https://arxiv.org/abs/2406.07155. MacNet, multi-agent DAG topology.

[15] Lee D. Erman, Frederick Hayes-Roth, Victor R. Lesser, and D. Raj Reddy. The Hearsay-II speech-understanding system: Integrating knowledge to resolve uncertainty. ACM Computing Surveys, 12(2):213–253. doi: 10.1145/356810.356816

[17] Barbara Hayes-Roth. A blackboard architecture for control. Artificial Intelligence, 26(3):251–321, 1985. doi: 10.1016/0004-3702(85)90063-3

[18] Bowen Han and Gang Zhang. Exploring advanced LLM multi-agent systems based on blackboard architecture. arXiv preprint arXiv:2507.01701, 2025. URL https://arxiv.org/abs/2507.01701

[19] Anonymous. LLM-based multi-agent blackboard system for information discovery in data science. arXiv preprint arXiv:2510.01285, 2025. URL https://arxiv.org/abs/2510.01285

[20] Yifeng He et al. Co-RedTeam: Orchestrated security discovery and exploitation with LLM agents. arXiv preprint arXiv:2602.02164, 2026. URL https://arxiv.org/abs/2602.02164

[21] Anthropic. Claude Code: an agentic coding tool that lives in your terminal. https://github.com/anthropics/claude-code, 2025. Pinned to v2.1.87 for the experiments reported here.

[22] Brian Challita and Pierre Parrend. RedTeamLLM: An agentic AI framework for offensive security. arXiv preprint arXiv:2505.06913, 2025. URL https://arxiv.org/abs/2505.06913

[23] OpenAI. Codex CLI: a lightweight coding agent that runs in your terminal. https://github.com/openai/codex, 2025. Pinned to v0.104.0 for the experiments reported here.

[24] Andrej Karpathy. autoresearch: a self-improving AI researcher. https://github.com/karpathy/autoresearch, 2025. Accessed May 2026.

[25] María Sanz-Gómez, Víctor Mayoral-Vilches, Francesco Balassone, Luis Javier Navarrete-Lozano, Cristóbal R. J. Veas Chavez, and Maite del Mundo de Torres. Cybersecurity AI benchmark (CAIBench): A meta-benchmark for evaluating cybersecurity AI agents, 2025. URL https://arxiv.org/abs/2510.24317

[26] Víctor Mayoral-Vilches, María Sanz-Gómez, and Endika Gil-Uriarte. Towards cybersecurity superintelligence. arXiv preprint, 2026. In preparation. Alias Robotics technical report; figures reproduced with permission of the authors.

Challenge         Claude  Codex  GCAI  CAI
avatar            N       N      N     N
back to the past  N       N      N     Y
crushing          N       N      N     N
data siege        N       N      N     N
delulu            N       N      N     N
...