Lessons from Penetration Tests on Large-Scale Agent Systems

Dhilung Kirat; Frederico Araujo; Ian Molloy; Jiyong Jang; Kevin Eykholt; Xiaokui Shu

arxiv: 2605.27042 · v1 · pith:SRFBIFW7new · submitted 2026-05-26 · 💻 cs.CR · cs.AI

Lessons from Penetration Tests on Large-Scale Agent Systems

Kevin Eykholt , Dhilung Kirat , Xiaokui Shu , Jiyong Jang , Frederico Araujo , Ian Molloy This is my paper

Pith reviewed 2026-06-29 16:36 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords AI agentspenetration testingsecurity vulnerabilitiesproprietary systemsopen-source agentscross-layer securityagent frameworks

0 comments

The pith

Proprietary AI agent systems exhibit the same recurring security weaknesses as open-source agents despite stricter development processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from penetration tests performed in 2025 on two proprietary large-scale agent products. It argues that the vulnerabilities uncovered are not new but belong to the same recurring classes long seen in traditional computing systems and in prior open-source agent research. A sympathetic reader would care because execution-capable agents interact across multiple layers of the computing stack, creating a broad attack surface that developers must secure regardless of their internal processes. The work tests whether formal coding standards and review practices have reduced those risks compared with earlier assessments of open-source agents and frameworks.

Core claim

Penetration tests conducted in 2025 against two proprietary agent products show that these systems exhibit similar security weaknesses to those observed in prior open-source agent research, indicating that the security posture of AI agents has not substantially improved despite stricter coding standards and formal review processes.

What carries the argument

Penetration testing applied to proprietary agent products to surface cross-layer weaknesses in unbounded, self-modifying execution-capable AI agents.

If this is right

Developers of execution-capable agents must still reason about and secure complex cross-layer behaviors.
Recurring vulnerability classes persist across both open-source and proprietary development methodologies.
The security burden on agent developers remains significant even under formal review processes.
Prior research on open-source agents provides relevant lessons for proprietary products.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Security engineering for agents may need to shift from coding standards toward new architectural constraints on autonomy and tool use.
Organizations deploying agents at scale could benefit from shared test suites that target the recurring weakness classes identified here.
If the pattern holds, regulatory or certification requirements for agent systems might focus on observable interaction surfaces rather than internal development processes.

Load-bearing premise

The two proprietary products tested in 2025 are representative of the broader class of large-scale proprietary agent systems.

What would settle it

A penetration test on a third proprietary agent product from 2025 that finds no recurring classes of weaknesses previously documented in open-source agents.

Figures

Figures reproduced from arXiv: 2605.27042 by Dhilung Kirat, Frederico Araujo, Ian Molloy, Jiyong Jang, Kevin Eykholt, Xiaokui Shu.

**Figure 1.** Figure 1: Workflow of GitHub Assistant practices typically include stricter coding standards, security reviews, and release processes, reflecting the potential financial, reputational, and legal consequences of security failures. These factors may suggest that proprietary agents would exhibit fewer or more sophisticated vulnerabilities than their open-source counterparts. Our team regularly conducts penetration tes… view at source ↗

**Figure 2.** Figure 2: Potential Threat Actors Editing. If the editing capability is enabled and an editing tag is applied, the localization output is forwarded to the editing stage. One or more agents are initialized and prompted to generate concrete code edits based on the localizer’s recommendations. The resulting candidate patches are evaluated by a judge LLM, which selects the final solution. The selected output is then r… view at source ↗

**Figure 3.** Figure 3: Markdown hiding use markdown link syntax. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

read the original abstract

As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. However, many of these vulnerabilities are not fundamentally novel, but instead reflect recurring classes of weaknesses long observed in prior computing systems. Execution-capable AI agents are effectively unbounded, self-modifying programs that interact extensively with multiple layers of the computing stack. This broad interaction surface imposes a significant security burden on developers, who must reason about and secure complex cross-layer behaviors. Prior research has primarily focused on vulnerabilities in open-source agents and agent frameworks. In contrast, it remains unclear whether proprietary agent systems -- developed under stricter coding standards and formal review processes -- exhibit similar security weaknesses. In this paper, we present findings from two penetration tests conducted in 2025 against proprietary agent products and evaluate whether the security posture of AI agents has improved since these assessments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper runs 2025 penetration tests on two proprietary agent systems and asks if stricter development practices fix the security problems seen in open-source agents.

read the letter

The new part is the 2025 tests on proprietary products rather than open-source frameworks. That setup directly addresses the gap the abstract flags: whether formal review processes change the security posture for execution-capable agents.

The paper states the problem clearly. Agents interact across many layers, act as unbounded programs, and inherit old classes of weaknesses. Framing the question against prior open-source work is useful.

The soft spots are straightforward. The abstract gives the research question and high-level setup but no findings, no methodology details, no data on what was tested or found, and no error analysis. Without those, the claim that proprietary systems show similar weaknesses cannot be checked. The stress-test note is correct on generalizability: two products with no selection criteria, no architectural comparison to other proprietary agents, and no discussion of access limits or bias leave the broader conclusion unsupported. That is the load-bearing issue.

This is for people working on agent security who need to know whether proprietary development changes the risk picture. A reader wanting concrete results or reproducible details will not get much yet.

It deserves peer review if the full paper supplies the test outcomes and addresses how the two systems were chosen. The question is worth asking; the current evidence level is low.

Referee Report

2 major / 1 minor

Summary. The paper presents findings from two penetration tests conducted in 2025 on proprietary large-scale AI agent products. It claims that, despite development under stricter coding standards and formal review processes, these systems exhibit similar recurring classes of security weaknesses to those previously documented in open-source agents and frameworks, indicating that the security posture of execution-capable AI agents has not meaningfully improved.

Significance. If the two tested systems prove representative, the result would demonstrate that cross-layer interaction surfaces in autonomous agents impose persistent security burdens regardless of development paradigm, reinforcing the need for improved developer reasoning about unbounded, self-modifying behaviors. The use of real-world proprietary targets adds practical relevance beyond prior open-source studies.

major comments (2)

[Abstract] Abstract: the central claim that proprietary systems 'exhibit similar security weaknesses' to open-source ones is load-bearing on the two 2025 products being representative of the broader class, yet the abstract supplies no selection criteria, architectural comparison to other proprietary agents, population estimate, or discussion of access restrictions or selection bias.
[Abstract] Abstract: no specific vulnerabilities, methodology details, data, or error analysis are provided, so it is impossible to evaluate whether the observed weaknesses are in fact the same recurring classes or whether the tests support the 'has not improved' conclusion.

minor comments (1)

[Abstract] The phrase 'since these assessments' is undefined; the abstract does not identify the prior open-source studies or time frame being used as baseline.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their review and for highlighting issues of generalizability and transparency in the abstract. We address each major comment below. We agree that the abstract can be strengthened with additional context on selection and methodology where feasible, but confidentiality constraints on the proprietary targets limit what can be disclosed. We propose partial revisions to the abstract accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that proprietary systems 'exhibit similar security weaknesses' to open-source ones is load-bearing on the two 2025 products being representative of the broader class, yet the abstract supplies no selection criteria, architectural comparison to other proprietary agents, population estimate, or discussion of access restrictions or selection bias.

Authors: The two products were selected because they are large-scale, commercially deployed proprietary agent systems for which authorized penetration testing access was obtained under responsible disclosure agreements. No comprehensive public population estimate of such proprietary systems exists, precluding statistical sampling claims. Architectural comparisons to other agents appear in the introduction and related work of the full manuscript. We will revise the abstract to note the selection basis (large-scale commercial deployments) and access restrictions (NDA-bound responsible testing), which directly addresses the representativeness concern without revealing confidential details. revision: partial
Referee: [Abstract] Abstract: no specific vulnerabilities, methodology details, data, or error analysis are provided, so it is impossible to evaluate whether the observed weaknesses are in fact the same recurring classes or whether the tests support the 'has not improved' conclusion.

Authors: The abstract is kept high-level due to length limits and to avoid disclosing vendor-sensitive information. The full manuscript contains dedicated sections on the black-box and gray-box penetration testing methodology, anonymized examples of the recurring weakness classes, direct comparisons to prior open-source findings, and supporting analysis. We will revise the abstract to briefly reference the testing approach and direct readers to the detailed evaluations in the body. Specific vulnerability instances, raw data, and granular error analysis cannot be provided publicly. revision: partial

standing simulated objections not resolved

Disclosure of specific vulnerabilities, raw test data, or granular error analysis from the proprietary targets, which is precluded by non-disclosure agreements and ongoing remediation processes.

Circularity Check

0 steps flagged

No circularity: empirical report with direct observations only

full rationale

The paper is a report of penetration test findings on two specific proprietary agent products. It contains no equations, no fitted parameters, no derivations, and no self-citation chains that reduce any central claim to a prior result by construction. The strongest claim (similar weaknesses in proprietary vs. open-source agents) rests on the observed test outcomes rather than any definitional or predictive loop. Generalizability concerns are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical security assessment paper; no free parameters, axioms, or invented entities are invoked.

pith-pipeline@v0.9.1-grok · 5685 in / 848 out tokens · 35788 ms · 2026-06-29T16:36:44.787304+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references

[1]

Shared Responsibility Model

Amazon. Shared Responsibility Model. https://aws.amazon.com/ compliance/shared-responsibility-model/, 2025

2025
[2]

Trace your LLM application’s runtime using OpenTelemetry-based instrumentation

Arize-ai. Trace your LLM application’s runtime using OpenTelemetry-based instrumentation. https://docs.arize.com/ phoenix/tracing/llm-traces, 2025

2025
[3]

Artificial intelligence (AI) shared respon- sibility model

Microsoft Azure. Artificial intelligence (AI) shared respon- sibility model. https://learn.microsoft.com/en-us/azure/security/ fundamentals/shared-responsibility-ai, 2024

2024
[4]

Systems security foundations for agentic computing

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda, Somesh Jha, Johann Rehberger, and Khawaja Shams. Systems security foundations for agentic computing. Cryptology ePrint Archive, Paper 2025/2173, 2025

2025
[5]

Github flavored markdown spec

GitHub. Github flavored markdown spec. https://github.github.com/ gfm/, 2019

2019
[6]

Project Padawan: Advancing Agentic AI in GitHub Copilot

GitHub. Project Padawan: Advancing Agentic AI in GitHub Copilot. https://github.com/features/copilot/whats-new, 2025

2025
[7]

Securing generative AI

IBM Institute for Business Value. Securing generative AI. https://www.ibm.com/thought-leadership/institute-business-value/ en-us/report/securing-generative-ai, 2024

2024
[8]

Essential Log management for your AI tool belt

Joseph Jang. Essential Log management for your AI tool belt. https://live-d9newrelic.pantheonsite.io/blog/best-practices/ the-eu-artificial-intelligence-act-and-observability?utm source= tldrdevops, 2024

2024
[9]

Introduction to Langfuse Tracing

Langfuse. Introduction to Langfuse Tracing. https://langfuse.com/ docs/tracing, 2025

2025
[10]

1-Click RCE To Steal Your OpenClaw Data and Keys (CVE-2026-25253)

Mav Levin. 1-Click RCE To Steal Your OpenClaw Data and Keys (CVE-2026-25253). https://depthfirst.com/post/ 1-click-rce-to-steal-your-moltbot-data-and-keys, 2026

2026
[11]

LlamaIndex Observability

LlamaIndex. LlamaIndex Observability. https://docs.llamaindex.ai/ en/stable/module guides/observability/, 2025

2025
[12]

OpenClaw — Personal AI Assistant

OpenClaw Contributors. OpenClaw — Personal AI Assistant. https: //github.com/openclaw/openclaw, 2026

2026
[13]

OpenDevin: An Open AI Agent for Soft- ware Engineering

OpenDevin Contributors. OpenDevin: An Open AI Agent for Soft- ware Engineering. https://github.com/AI-App/OpenDevin, 2025

2025
[14]

High-quality, ubiquitous, and portable telemetry to enable effective observability

OpenTelemetry. High-quality, ubiquitous, and portable telemetry to enable effective observability. https://github.com/open-telemetry, 2025

2025
[15]

Llm01:2025 prompt injection

OW ASP. Llm01:2025 prompt injection. https://genai.owasp.org/ llmrisk/llm01-prompt-injection/, 2025

2025
[16]

RestrictedPython

RestrictedPython Contributors. RestrictedPython. https://github.com/ zopefoundation/RestrictedPython, 2023

2023
[17]

LADYBUG: an LLM agent debugger for data-driven applications

Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, and Jarek Szlichta. LADYBUG: an LLM agent debugger for data-driven applications. In Alkis Simitsis, Bettina Kemme, Anna Queralt, Oscar Romero, and Petar Jovanovic, editors,Proceedings 28th Interna- tional Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 202...

2025
[18]

Know when your LLM app is hallucinating or malfunc- tioning

TraceLoop. Know when your LLM app is hallucinating or malfunc- tioning. https://www.traceloop.com/, 2025

2025
[19]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An Open Platform for AI Soft...

2024
[20]

Pi monorepo

Mario Zechner. Pi monorepo. https://github.com/badlogic/pi-mono, 2026

2026

[1] [1]

Shared Responsibility Model

Amazon. Shared Responsibility Model. https://aws.amazon.com/ compliance/shared-responsibility-model/, 2025

2025

[2] [2]

Trace your LLM application’s runtime using OpenTelemetry-based instrumentation

Arize-ai. Trace your LLM application’s runtime using OpenTelemetry-based instrumentation. https://docs.arize.com/ phoenix/tracing/llm-traces, 2025

2025

[3] [3]

Artificial intelligence (AI) shared respon- sibility model

Microsoft Azure. Artificial intelligence (AI) shared respon- sibility model. https://learn.microsoft.com/en-us/azure/security/ fundamentals/shared-responsibility-ai, 2024

2024

[4] [4]

Systems security foundations for agentic computing

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda, Somesh Jha, Johann Rehberger, and Khawaja Shams. Systems security foundations for agentic computing. Cryptology ePrint Archive, Paper 2025/2173, 2025

2025

[5] [5]

Github flavored markdown spec

GitHub. Github flavored markdown spec. https://github.github.com/ gfm/, 2019

2019

[6] [6]

Project Padawan: Advancing Agentic AI in GitHub Copilot

GitHub. Project Padawan: Advancing Agentic AI in GitHub Copilot. https://github.com/features/copilot/whats-new, 2025

2025

[7] [7]

Securing generative AI

IBM Institute for Business Value. Securing generative AI. https://www.ibm.com/thought-leadership/institute-business-value/ en-us/report/securing-generative-ai, 2024

2024

[8] [8]

Essential Log management for your AI tool belt

Joseph Jang. Essential Log management for your AI tool belt. https://live-d9newrelic.pantheonsite.io/blog/best-practices/ the-eu-artificial-intelligence-act-and-observability?utm source= tldrdevops, 2024

2024

[9] [9]

Introduction to Langfuse Tracing

Langfuse. Introduction to Langfuse Tracing. https://langfuse.com/ docs/tracing, 2025

2025

[10] [10]

1-Click RCE To Steal Your OpenClaw Data and Keys (CVE-2026-25253)

Mav Levin. 1-Click RCE To Steal Your OpenClaw Data and Keys (CVE-2026-25253). https://depthfirst.com/post/ 1-click-rce-to-steal-your-moltbot-data-and-keys, 2026

2026

[11] [11]

LlamaIndex Observability

LlamaIndex. LlamaIndex Observability. https://docs.llamaindex.ai/ en/stable/module guides/observability/, 2025

2025

[12] [12]

OpenClaw — Personal AI Assistant

OpenClaw Contributors. OpenClaw — Personal AI Assistant. https: //github.com/openclaw/openclaw, 2026

2026

[13] [13]

OpenDevin: An Open AI Agent for Soft- ware Engineering

OpenDevin Contributors. OpenDevin: An Open AI Agent for Soft- ware Engineering. https://github.com/AI-App/OpenDevin, 2025

2025

[14] [14]

High-quality, ubiquitous, and portable telemetry to enable effective observability

OpenTelemetry. High-quality, ubiquitous, and portable telemetry to enable effective observability. https://github.com/open-telemetry, 2025

2025

[15] [15]

Llm01:2025 prompt injection

OW ASP. Llm01:2025 prompt injection. https://genai.owasp.org/ llmrisk/llm01-prompt-injection/, 2025

2025

[16] [16]

RestrictedPython

RestrictedPython Contributors. RestrictedPython. https://github.com/ zopefoundation/RestrictedPython, 2023

2023

[17] [17]

LADYBUG: an LLM agent debugger for data-driven applications

Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, and Jarek Szlichta. LADYBUG: an LLM agent debugger for data-driven applications. In Alkis Simitsis, Bettina Kemme, Anna Queralt, Oscar Romero, and Petar Jovanovic, editors,Proceedings 28th Interna- tional Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 202...

2025

[18] [18]

Know when your LLM app is hallucinating or malfunc- tioning

TraceLoop. Know when your LLM app is hallucinating or malfunc- tioning. https://www.traceloop.com/, 2025

2025

[19] [19]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An Open Platform for AI Soft...

2024

[20] [20]

Pi monorepo

Mario Zechner. Pi monorepo. https://github.com/badlogic/pi-mono, 2026

2026