Cochise: A Reference Harness for Autonomous Penetration Testing
Pith reviewed 2026-05-13 01:02 UTC · model grok-4.3
The pith
Cochise supplies a minimal 597-line Python harness that separates planning from execution to support reproducible comparisons of LLM agents in penetration testing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cochise implements a Planner-Executor split in which the planner maintains persistent state externally and the ReAct-style executor issues commands over SSH, self-corrects on output, and receives adaptable scenario prompts. The harness totals 597 lines, supports controlled environments reachable from the jump host, and is accompanied by offline visualization tools and a public corpus of JSON logs captured on the GOAD Active Directory testbed so that others can examine behavior without provisioning the full 48-64 GB RAM environment.
What carries the argument
Separated Planner-Executor architecture with external long-term state and SSH command channel, which keeps the LLM context focused on immediate actions while allowing self-correction across steps.
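The information flow of this split can be sketched as a minimal control loop. The class and function names below are illustrative, not taken from the Cochise source, and the SSH channel is replaced by a stub executor so the sketch stays self-contained; in the real harness the executor would run each command on the Linux jump host over SSH.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class PlannerState:
    """Long-term state kept outside the LLM context (hypothetical schema)."""
    facts: list = field(default_factory=list)
    pending_tasks: list = field(default_factory=list)

def run_episode(plan_next: Callable[[PlannerState], Optional[str]],
                execute: Callable[[str], str],
                state: PlannerState,
                max_steps: int = 10) -> list:
    """Planner-Executor loop: the planner reads and writes external state,
    the executor issues one command per step and feeds its output back."""
    trajectory = []
    for step in range(max_steps):
        command = plan_next(state)      # planner consults external state
        if command is None:             # nothing left to try
            break
        output = execute(command)       # would run over SSH in Cochise
        state.facts.append(output)      # self-correction input for next plan
        trajectory.append({"step": step, "command": command, "output": output})
    return trajectory

# Stub standing in for the SSH channel; a real executor would capture
# stdout/stderr from the remote host.
def fake_execute(cmd: str) -> str:
    return f"ran: {cmd}"

state = PlannerState(pending_tasks=["nmap -sV target", "smbclient -L target"])
log = run_episode(lambda s: s.pending_tasks.pop(0) if s.pending_tasks else None,
                  fake_execute, state)
print(json.dumps(log, indent=2))
```

Because the trajectory is a plain list of JSON-serializable dicts, each run can be dumped directly into the kind of log corpus the paper releases.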
If this is right
- Researchers can swap LLM back-ends or executor policies while holding the rest of the experimental setup fixed.
- The released analysis tools quantify cost, token usage, duration, and compromise outcomes directly from captured trajectories.
- The public JSON corpus allows offline study of agent behavior without repeated provisioning of the full GOAD testbed.
- Scenario prompts can be rewritten for new target environments while reusing the same harness and logging layer.
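Assuming each JSON trajectory log stores per-step token counts and durations (the field names and the flat token price below are hypothetical, not Cochise's actual schema or tariff), the kind of aggregation the released analysis tools perform can be approximated as:

```python
import json
import os
import tempfile

def summarize_run(path: str) -> dict:
    """Aggregate tokens, duration, and estimated cost from one trajectory log.
    Assumed schema: a JSON list of steps with 'tokens' and 'seconds' fields."""
    with open(path) as f:
        steps = json.load(f)
    tokens = sum(s.get("tokens", 0) for s in steps)
    seconds = sum(s.get("seconds", 0.0) for s in steps)
    return {
        "steps": len(steps),
        "total_tokens": tokens,
        "duration_s": round(seconds, 1),
        # Hypothetical flat price; real cost depends on the model's tariff.
        "est_cost_usd": round(tokens / 1000 * 0.01, 4),
    }

# Demo with a synthetic two-step log (schema assumed, not Cochise's).
demo = [{"tokens": 1200, "seconds": 3.5}, {"tokens": 800, "seconds": 2.0}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)
    path = f.name
summary = summarize_run(path)
os.unlink(path)
print(summary)
```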
Where Pith is reading between the lines
- If adopted, the harness could reduce duplicated engineering effort across papers that currently each build their own agent scaffolds from scratch.
- The external-state design may make it easier to insert human oversight or logging checkpoints between planner and executor steps.
- Extending the same minimal split to other domains such as web-application testing or cloud configuration auditing would be a direct next measurement.
Load-bearing premise
A minimal separated Planner-Executor architecture connected via SSH to a Linux host is sufficient to enable meaningful comparisons of LLM choices and agent designs across different penetration-testing scenarios.
What would settle it
Running two architecturally distinct agents inside the same Cochise harness on identical GOAD scenarios and observing statistically indistinguishable distributions of compromise success, token cost, and step count would indicate the harness does not surface the differences it is meant to expose.
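The proposed falsification amounts to a two-sample comparison per metric. A pure-Python permutation test on, say, step counts is one way to implement it; the sample data and the 0.05 threshold below are illustrative, not from the paper.

```python
import random

def permutation_test(a, b, n_perm: int = 10000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of means.
    Returns an approximate p-value; a small p means the harness
    distinguishes the two agents' distributions on this metric."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / n_perm

# Made-up step counts for two agents on identical GOAD scenarios.
agent_a = [12, 15, 11, 14, 13]
agent_b = [25, 22, 27, 24, 26]
p = permutation_test(agent_a, agent_b)
```

A p-value well above a chosen threshold (e.g. 0.05) across success, cost, and step-count metrics would be the "statistically indistinguishable" outcome described above.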
Original abstract
Recent work on LLM-driven autonomous penetration testing reports promising results, but existing systems often combine many architectural, prompting, and tool-integration choices, making it difficult to tell what is gained over a simple agent scaffold. We present cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments. Cochise connects an LLM-driven agent to a Linux execution host over SSH and supports controlled target environments reachable from that jump host. The prototype implements a separated Planner-Executor architecture in which long-term state is maintained outside the LLM context, while a ReAct-style executor issues commands over SSH and self-corrects based on command outputs. The scenario prompt can be adapted to different target environments. To demonstrate the efficacy of our minimal harness, we evaluate it against a live third-party testbed called Game of Active Directory (GOAD). Alongside the harness, we release replay and analysis tools: (i) cochise-replay for offline visualization of captured runs, (ii) cochise-analyze-logs and cochise-analyze-graphs for cost, token, duration, and compromise analysis, and (iii) a corpus of JSON trajectory logs from GOAD runs, allowing researchers to study agent behavior without provisioning the 48-64 GB RAM / 190 GB storage testbed themselves. Cochise is intended not as a state-of-the-art pen-testing agent, but as reusable experimental infrastructure for comparing models, agent architectures, and penetration-testing traces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments with LLMs. It implements a separated Planner-Executor architecture in which long-term state is maintained outside the LLM context while a ReAct-style executor issues commands over SSH to a Linux host and self-corrects on outputs. The scenario prompt is adaptable to different targets; the system is demonstrated on the Game of Active Directory (GOAD) testbed. The authors release replay and analysis tools (cochise-replay, cochise-analyze-logs, cochise-analyze-graphs) plus a corpus of JSON trajectory logs, positioning the harness explicitly as reusable experimental infrastructure for comparing models, agent architectures, and traces rather than as a state-of-the-art agent.
Significance. If the released code, tools, and trajectory corpus function as described, the work supplies a minimal, openly available baseline that directly addresses reproducibility and comparability challenges in LLM-driven penetration testing. The explicit separation of planner and executor, the SSH-based execution model, the GOAD evaluation, and the offline analysis artifacts together lower the barrier for controlled experiments; the 597 LOC size and public release of both harness and data corpus are concrete strengths that enable others to study agent behavior without provisioning the full 48-64 GB / 190 GB testbed.
Minor comments (2)
- [Abstract] Abstract: the sentence 'to demonstrate the efficacy of our minimal harness' could be rephrased as 'to demonstrate the functionality' to better match the paper's explicit disclaimer that Cochise is not intended as a competitive agent.
- [Architecture description] The description of the Planner-Executor split would benefit from a short diagram or pseudocode snippet showing the exact information flow between the two components and the SSH channel.
Simulated Author's Rebuttal
We thank the referee for their positive review and recommendation to accept the manuscript. The referee's summary accurately captures the intent of Cochise as a minimal reference harness and correctly highlights the value of the released code, tools, and trajectory corpus for enabling reproducible experiments.
Circularity Check
No significant circularity
Full rationale
The paper presents Cochise as an engineering reference harness (597 LOC Python code, replay tools, and GOAD trajectory corpus) for comparing LLM agents in penetration testing, with no mathematical derivations, equations, fitted parameters, or predictive claims. The central claim—that the minimal Planner-Executor scaffold plus released artifacts enables meaningful comparisons—is directly supported by the open-source release itself rather than any self-referential reduction. No load-bearing steps invoke self-citations, ansatzes, or uniqueness theorems that collapse to the paper's own inputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: SSH provides reliable, observable command execution and output feedback for self-correction.
- Domain assumption: The GOAD testbed constitutes a representative live penetration-testing environment.
Reference graph
Works this paper leans on
- [1] Happe, Andreas and Cito, Jürgen. Can LLMs Hack Enterprise Networks? Autonomous Assumed Breach Penetration-Testing Active Directory Networks. ACM Trans. Softw. Eng. Methodol. doi:10.1145/3766895
- [2] Happe, Andreas and Cito, Jürgen. Can LLMs Hack Enterprise Networks? — RCR Report. ACM Trans. Softw. Eng. Methodol. doi:10.1145/3800584
- [3] What Makes a Good LLM Agent for Real-world Penetration Testing? 2026.
- [4] CAI: An Open, Bug Bounty-Ready Cybersecurity AI. 2025.
- [5] Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks. 2025.
- [6] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. 2024.
- [7] ReAct: Synergizing Reasoning and Acting in Language Models. 2023.
- [8] Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems.