arxiv: 2605.27042 · v1 · pith:SRFBIFW7new · submitted 2026-05-26 · 💻 cs.CR · cs.AI

Lessons from Penetration Tests on Large-Scale Agent Systems

Pith reviewed 2026-06-29 16:36 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords AI agents · penetration testing · security vulnerabilities · proprietary systems · open-source agents · cross-layer security · agent frameworks

The pith

Proprietary AI agent systems exhibit the same recurring security weaknesses as open-source agents despite stricter development processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from penetration tests performed in 2025 on two proprietary large-scale agent products. It argues that the vulnerabilities uncovered are not new but belong to the same recurring classes long seen in traditional computing systems and in prior open-source agent research. A sympathetic reader would care because execution-capable agents interact across multiple layers of the computing stack, creating a broad attack surface that developers must secure regardless of their internal processes. The work tests whether formal coding standards and review practices have reduced those risks compared with earlier assessments of open-source agents and frameworks.

Core claim

Penetration tests conducted in 2025 against two proprietary agent products show that these systems exhibit similar security weaknesses to those observed in prior open-source agent research, indicating that the security posture of AI agents has not substantially improved despite stricter coding standards and formal review processes.

What carries the argument

Penetration testing applied to proprietary agent products to surface cross-layer weaknesses in unbounded, self-modifying execution-capable AI agents.

If this is right

  • Developers of execution-capable agents must still reason about and secure complex cross-layer behaviors.
  • Recurring vulnerability classes persist across both open-source and proprietary development methodologies.
  • The security burden on agent developers remains significant even under formal review processes.
  • Prior research on open-source agents provides relevant lessons for proprietary products.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Security engineering for agents may need to shift from coding standards toward new architectural constraints on autonomy and tool use.
  • Organizations deploying agents at scale could benefit from shared test suites that target the recurring weakness classes identified here.
  • If the pattern holds, regulatory or certification requirements for agent systems might focus on observable interaction surfaces rather than internal development processes.

Load-bearing premise

The two proprietary products tested in 2025 are representative of the broader class of large-scale proprietary agent systems.

What would settle it

A penetration test on a third proprietary agent product from 2025 that finds no recurring classes of weaknesses previously documented in open-source agents.

Figures

Figures reproduced from arXiv: 2605.27042 by Dhilung Kirat, Frederico Araujo, Ian Molloy, Jiyong Jang, Kevin Eykholt, Xiaokui Shu.

Figure 1: Workflow of GitHub Assistant.
    "… practices typically include stricter coding standards, security reviews, and release processes, reflecting the potential financial, reputational, and legal consequences of security failures. These factors may suggest that proprietary agents would exhibit fewer or more sophisticated vulnerabilities than their open-source counterparts. Our team regularly conducts penetration tes…"

Figure 2: Potential Threat Actors.
    "Editing. If the editing capability is enabled and an editing tag is applied, the localization output is forwarded to the editing stage. One or more agents are initialized and prompted to generate concrete code edits based on the localizer's recommendations. The resulting candidate patches are evaluated by a judge LLM, which selects the final solution. The selected output is then r…"

Figure 3: Markdown hiding.
    "… use markdown link syntax."
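
The editing stage described in the text captured alongside Figure 2 (capability and tag check, one or more editing agents, judge selection) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the product's implementation: `Patch`, `run_editing_stage`, the callable signatures, and the stub judge are hypothetical names introduced here, and the LLM calls are replaced by plain functions so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Patch:
    author: str  # hypothetical label for the agent that proposed the edit
    diff: str    # candidate code edit, represented here as plain text


# Hypothetical signatures: an editor maps localizer output to a diff,
# and a judge picks one patch from the list of candidates.
Editor = Callable[[str], str]
Judge = Callable[[List[Patch]], Patch]


def run_editing_stage(
    localization: str,
    editors: Dict[str, Editor],
    judge: Judge,
    editing_enabled: bool,
    has_editing_tag: bool,
) -> Optional[Patch]:
    """Return the judge-selected patch, or None if the stage is skipped."""
    # The stage runs only when the capability is on and the tag is applied.
    if not (editing_enabled and has_editing_tag):
        return None
    # Each agent proposes a concrete edit from the localizer's recommendations.
    candidates = [Patch(name, edit(localization)) for name, edit in editors.items()]
    if not candidates:
        return None
    # A judge (an LLM in the paper, a stub here) selects the final solution.
    return judge(candidates)


# Stub judge standing in for an LLM call: prefers the shortest diff.
def shortest_patch_judge(candidates: List[Patch]) -> Patch:
    return min(candidates, key=lambda p: len(p.diff))


# Stub editors standing in for prompted agents.
editors = {
    "agent_a": lambda loc: f"fix {loc} with a bounds check",
    "agent_b": lambda loc: f"patch {loc}",
}

selected = run_editing_stage("utils.py:42", editors, shortest_patch_judge, True, True)
skipped = run_editing_stage("utils.py:42", editors, shortest_patch_judge, True, False)
```

Passing the editors and the judge in as callables is a choice made for this sketch: it lets tests swap model calls for deterministic stubs, which is one way to exercise the control flow without touching a live agent.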
Original abstract

As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. However, many of these vulnerabilities are not fundamentally novel, but instead reflect recurring classes of weaknesses long observed in prior computing systems. Execution-capable AI agents are effectively unbounded, self-modifying programs that interact extensively with multiple layers of the computing stack. This broad interaction surface imposes a significant security burden on developers, who must reason about and secure complex cross-layer behaviors. Prior research has primarily focused on vulnerabilities in open-source agents and agent frameworks. In contrast, it remains unclear whether proprietary agent systems -- developed under stricter coding standards and formal review processes -- exhibit similar security weaknesses. In this paper, we present findings from two penetration tests conducted in 2025 against proprietary agent products and evaluate whether the security posture of AI agents has improved since these assessments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents findings from two penetration tests conducted in 2025 on proprietary large-scale AI agent products. It claims that, despite development under stricter coding standards and formal review processes, these systems exhibit similar recurring classes of security weaknesses to those previously documented in open-source agents and frameworks, indicating that the security posture of execution-capable AI agents has not meaningfully improved.

Significance. If the two tested systems prove representative, the result would demonstrate that cross-layer interaction surfaces in autonomous agents impose persistent security burdens regardless of development paradigm, reinforcing the need for improved developer reasoning about unbounded, self-modifying behaviors. The use of real-world proprietary targets adds practical relevance beyond prior open-source studies.

major comments (2)
  1. [Abstract] The central claim that proprietary systems 'exhibit similar security weaknesses' to open-source ones is load-bearing on the two 2025 products being representative of the broader class, yet the abstract supplies no selection criteria, architectural comparison to other proprietary agents, population estimate, or discussion of access restrictions or selection bias.
  2. [Abstract] No specific vulnerabilities, methodology details, data, or error analysis are provided, so it is impossible to evaluate whether the observed weaknesses are in fact the same recurring classes or whether the tests support the 'has not improved' conclusion.
minor comments (1)
  1. [Abstract] The phrase 'since these assessments' is undefined; the abstract does not identify the prior open-source studies or time frame being used as baseline.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their review and for highlighting issues of generalizability and transparency in the abstract. We address each major comment below. We agree that the abstract can be strengthened with additional context on selection and methodology where feasible, but confidentiality constraints on the proprietary targets limit what can be disclosed. We propose partial revisions to the abstract accordingly.

  1. Referee: [Abstract] The central claim that proprietary systems 'exhibit similar security weaknesses' to open-source ones is load-bearing on the two 2025 products being representative of the broader class, yet the abstract supplies no selection criteria, architectural comparison to other proprietary agents, population estimate, or discussion of access restrictions or selection bias.

    Authors: The two products were selected because they are large-scale, commercially deployed proprietary agent systems for which authorized penetration testing access was obtained under responsible disclosure agreements. No comprehensive public population estimate of such proprietary systems exists, precluding statistical sampling claims. Architectural comparisons to other agents appear in the introduction and related work of the full manuscript. We will revise the abstract to note the selection basis (large-scale commercial deployments) and access restrictions (NDA-bound responsible testing), which directly addresses the representativeness concern without revealing confidential details.

    Revision: partial

  2. Referee: [Abstract] No specific vulnerabilities, methodology details, data, or error analysis are provided, so it is impossible to evaluate whether the observed weaknesses are in fact the same recurring classes or whether the tests support the 'has not improved' conclusion.

    Authors: The abstract is kept high-level due to length limits and to avoid disclosing vendor-sensitive information. The full manuscript contains dedicated sections on the black-box and gray-box penetration testing methodology, anonymized examples of the recurring weakness classes, direct comparisons to prior open-source findings, and supporting analysis. We will revise the abstract to briefly reference the testing approach and direct readers to the detailed evaluations in the body. Specific vulnerability instances, raw data, and granular error analysis cannot be provided publicly.

    Revision: partial

standing simulated objections not resolved
  • Disclosure of specific vulnerabilities, raw test data, or granular error analysis from the proprietary targets, which is precluded by non-disclosure agreements and ongoing remediation processes.

Circularity Check

0 steps flagged

No circularity: empirical report with direct observations only

The paper is a report of penetration test findings on two specific proprietary agent products. It contains no equations, no fitted parameters, no derivations, and no self-citation chains that reduce any central claim to a prior result by construction. The strongest claim (similar weaknesses in proprietary vs. open-source agents) rests on the observed test outcomes rather than any definitional or predictive loop. Generalizability concerns are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical security assessment paper; no free parameters, axioms, or invented entities are invoked.

pith-pipeline@v0.9.1-grok · 5685 in / 848 out tokens · 35788 ms · 2026-06-29T16:36:44.787304+00:00 · methodology

