Cochise: A Reference Harness for Autonomous Penetration Testing
Pith reviewed 2026-05-13 01:02 UTC · model grok-4.3
The pith
Cochise supplies a minimal 597-line Python harness that separates planning from execution to support reproducible comparisons of LLM agents in penetration testing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cochise implements a Planner-Executor split in which the planner maintains persistent state externally and the ReAct-style executor issues commands over SSH, self-corrects on output, and receives adaptable scenario prompts. The harness totals 597 lines, supports controlled environments reachable from the jump host, and is accompanied by offline visualization tools and a public corpus of JSON logs captured on the GOAD Active Directory testbed so that others can examine behavior without provisioning the full 48-64 GB RAM environment.
What carries the argument
Separated Planner-Executor architecture with external long-term state and SSH command channel, which keeps the LLM context focused on immediate actions while allowing self-correction across steps.
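The information flow of this split can be sketched as a minimal control loop. The class and function names below are illustrative, not taken from the Cochise source, and the SSH channel is replaced by a stub executor so the sketch stays self-contained; in the real harness the executor would run each command on the Linux jump host over SSH.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class PlannerState:
    """Long-term state kept outside the LLM context (hypothetical schema)."""
    facts: list = field(default_factory=list)
    pending_tasks: list = field(default_factory=list)

def run_episode(plan_next: Callable[[PlannerState], Optional[str]],
                execute: Callable[[str], str],
                state: PlannerState,
                max_steps: int = 10) -> list:
    """Planner-Executor loop: the planner reads and writes external state,
    the executor issues one command per step and feeds its output back."""
    trajectory = []
    for step in range(max_steps):
        command = plan_next(state)      # planner consults external state
        if command is None:             # nothing left to try
            break
        output = execute(command)       # would run over SSH in Cochise
        state.facts.append(output)      # self-correction input for next plan
        trajectory.append({"step": step, "command": command, "output": output})
    return trajectory

# Stub standing in for the SSH channel; a real executor would capture
# stdout/stderr from the remote host.
def fake_execute(cmd: str) -> str:
    return f"ran: {cmd}"

state = PlannerState(pending_tasks=["nmap -sV target", "smbclient -L target"])
log = run_episode(lambda s: s.pending_tasks.pop(0) if s.pending_tasks else None,
                  fake_execute, state)
print(json.dumps(log, indent=2))
```

Because the trajectory is a plain list of JSON-serializable dicts, each run can be dumped directly into the kind of log corpus the paper releases.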
If this is right
- Researchers can swap LLM back-ends or executor policies while holding the rest of the experimental setup fixed.
- The released analysis tools quantify cost, token usage, duration, and compromise outcomes directly from captured trajectories.
- The public JSON corpus allows offline study of agent behavior without repeated provisioning of the full GOAD testbed.
- Scenario prompts can be rewritten for new target environments while reusing the same harness and logging layer.
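Assuming each JSON trajectory log stores per-step token counts and durations (the field names and the flat token price below are hypothetical, not Cochise's actual schema or tariff), the kind of aggregation the released analysis tools perform can be approximated as:

```python
import json
import os
import tempfile

def summarize_run(path: str) -> dict:
    """Aggregate tokens, duration, and estimated cost from one trajectory log.
    Assumed schema: a JSON list of steps with 'tokens' and 'seconds' fields."""
    with open(path) as f:
        steps = json.load(f)
    tokens = sum(s.get("tokens", 0) for s in steps)
    seconds = sum(s.get("seconds", 0.0) for s in steps)
    return {
        "steps": len(steps),
        "total_tokens": tokens,
        "duration_s": round(seconds, 1),
        # Hypothetical flat price; real cost depends on the model's tariff.
        "est_cost_usd": round(tokens / 1000 * 0.01, 4),
    }

# Demo with a synthetic two-step log (schema assumed, not Cochise's).
demo = [{"tokens": 1200, "seconds": 3.5}, {"tokens": 800, "seconds": 2.0}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)
    path = f.name
summary = summarize_run(path)
os.unlink(path)
print(summary)
```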
Where Pith is reading between the lines
- If adopted, the harness could reduce duplicated engineering effort across papers that currently each build their own agent scaffolds from scratch.
- The external-state design may make it easier to insert human oversight or logging checkpoints between planner and executor steps.
- Extending the same minimal split to other domains such as web-application testing or cloud configuration auditing would be a direct next measurement.
Load-bearing premise
A minimal separated Planner-Executor architecture connected via SSH to a Linux host is sufficient to enable meaningful comparisons of LLM choices and agent designs across different penetration-testing scenarios.
What would settle it
Running two architecturally distinct agents inside the same Cochise harness on identical GOAD scenarios and observing statistically indistinguishable distributions of compromise success, token cost, and step count would indicate the harness does not surface the differences it is meant to expose.
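The proposed falsification amounts to a two-sample comparison per metric. A pure-Python permutation test on, say, step counts is one way to implement it; the sample data and the 0.05 threshold below are illustrative, not from the paper.

```python
import random

def permutation_test(a, b, n_perm: int = 10000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of means.
    Returns an approximate p-value; a small p means the harness
    distinguishes the two agents' distributions on this metric."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / n_perm

# Made-up step counts for two agents on identical GOAD scenarios.
agent_a = [12, 15, 11, 14, 13]
agent_b = [25, 22, 27, 24, 26]
p = permutation_test(agent_a, agent_b)
```

A p-value well above a chosen threshold (e.g. 0.05) across success, cost, and step-count metrics would be the "statistically indistinguishable" outcome described above.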
Original abstract
Recent work on LLM-driven autonomous penetration testing reports promising results, but existing systems often combine many architectural, prompting, and tool-integration choices, making it difficult to tell what is gained over a simple agent scaffold. We present cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments. Cochise connects an LLM-driven agent to a Linux execution host over SSH and supports controlled target environments reachable from that jump host. The prototype implements a separated Planner-Executor architecture in which long-term state is maintained outside the LLM context, while a ReAct-style executor issues commands over SSH and self-corrects based on command outputs. The scenario prompt can be adapted to different target environments. To demonstrate the efficacy of our minimal harness, we evaluate it against a live third-party testbed called Game of Active Directory (GOAD). Alongside the harness, we release replay and analysis tools: (i) cochise-replay for offline visualization of captured runs, (ii) cochise-analyze-logs and cochise-analyze-graphs for cost, token, duration, and compromise analysis, and (iii) a corpus of JSON trajectory logs from GOAD runs, allowing researchers to study agent behavior without provisioning the 48-64 GB RAM / 190 GB storage testbed themselves. Cochise is intended not as a state-of-the-art pen-testing agent, but as reusable experimental infrastructure for comparing models, agent architectures, and penetration-testing traces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments with LLMs. It implements a separated Planner-Executor architecture in which long-term state is maintained outside the LLM context while a ReAct-style executor issues commands over SSH to a Linux host and self-corrects on outputs. The scenario prompt is adaptable to different targets; the system is demonstrated on the Game of Active Directory (GOAD) testbed. The authors release replay and analysis tools (cochise-replay, cochise-analyze-logs, cochise-analyze-graphs) plus a corpus of JSON trajectory logs, positioning the harness explicitly as reusable experimental infrastructure for comparing models, agent architectures, and traces rather than as a state-of-the-art agent.
Significance. If the released code, tools, and trajectory corpus function as described, the work supplies a minimal, openly available baseline that directly addresses reproducibility and comparability challenges in LLM-driven penetration testing. The explicit separation of planner and executor, the SSH-based execution model, the GOAD evaluation, and the offline analysis artifacts together lower the barrier for controlled experiments; the 597 LOC size and public release of both harness and data corpus are concrete strengths that enable others to study agent behavior without provisioning the full 48-64 GB / 190 GB testbed.
Minor comments (2)
- [Abstract] Abstract: the sentence 'to demonstrate the efficacy of our minimal harness' could be rephrased as 'to demonstrate the functionality' to better match the paper's explicit disclaimer that Cochise is not intended as a competitive agent.
- [Architecture description] The description of the Planner-Executor split would benefit from a short diagram or pseudocode snippet showing the exact information flow between the two components and the SSH channel.
Simulated Author's Rebuttal
We thank the referee for their positive review and recommendation to accept the manuscript. The referee's summary accurately captures the intent of Cochise as a minimal reference harness and correctly highlights the value of the released code, tools, and trajectory corpus for enabling reproducible experiments.
Circularity Check
No significant circularity
Full rationale
The paper presents Cochise as an engineering reference harness (597 LOC Python code, replay tools, and GOAD trajectory corpus) for comparing LLM agents in penetration testing, with no mathematical derivations, equations, fitted parameters, or predictive claims. The central claim—that the minimal Planner-Executor scaffold plus released artifacts enables meaningful comparisons—is directly supported by the open-source release itself rather than any self-referential reduction. No load-bearing steps invoke self-citations, ansatzes, or uniqueness theorems that collapse to the paper's own inputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: SSH provides reliable, observable command execution and output feedback for self-correction.
- Domain assumption: The GOAD testbed constitutes a representative live penetration-testing environment.
Reference graph
Works this paper leans on
- [1] Happe, Andreas and Cito, Jürgen. Can LLMs Hack Enterprise Networks? Autonomous Assumed Breach Penetration-Testing Active Directory Networks. ACM Trans. Softw. Eng. Methodol. doi:10.1145/3766895
- [2] Happe, Andreas and Cito, Jürgen. Can LLMs Hack Enterprise Networks? — RCR Report. ACM Trans. Softw. Eng. Methodol. doi:10.1145/3800584
- [3] What Makes a Good LLM Agent for Real-world Penetration Testing? 2026.
- [4] CAI: An Open, Bug Bounty-Ready Cybersecurity AI. 2025.
- [5] Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks. 2025.
- [6] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. 2024.
- [7] ReAct: Synergizing Reasoning and Acting in Language Models. 2023.
- [8] Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems.