
arxiv: 2606.16287 · v2 · pith:LJF2W5WMnew · submitted 2026-06-15 · 💻 cs.CR

Dynamic Malicious Skills in Agentic AI

Pith reviewed 2026-06-27 04:01 UTC · model grok-4.3

classification 💻 cs.CR
keywords dynamic malicious skills, agentic AI security, runtime code injection, skill documentation attacks, AI agent vulnerabilities, kernel read-only mounts, OpenHands, Claude Code

The pith

Malicious instructions hidden in skill documentation can cause agents to inject harmful logic into benign skills at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that embedding instructions in files like SKILL.md lets an attacker make an agent rewrite a skill's code during execution to add malicious behavior. This works on frameworks such as OpenHands and Claude Code and produces several types of malicious behavior with non-trivial success rates. The authors also present a defense that uses operating-system kernel read-only mounts on skill directories to stop any runtime changes while leaving normal skill use intact. If the attack holds, documentation files become a direct path for compromising agent capabilities without any malicious content in the original skill code.

Core claim

Dynamic malicious skills arise when an attacker places natural-language instructions inside skill documentation; the agent then follows those instructions to alter the skill's executable code at runtime, enabling behaviors such as data exfiltration or unauthorized actions. The attack succeeds across multiple agentic frameworks without requiring changes to the original skill code. A kernel-enforced read-only mount on the skill directory prevents the modification while preserving the functionality of unmodified skills.

What carries the argument

Embedding malicious instructions in natural-language documentation files (SKILL.md) that the agent reads and applies to rewrite skill code during execution.

If this is right

  • Agents can be made to perform malicious actions such as data theft or command execution by altering skills only at runtime.
  • The attack requires no modification to the original skill code and succeeds with non-trivial success rates on OpenHands and Claude Code.
  • Kernel-enforced read-only mounts block the runtime modification while leaving benign skill execution unchanged.
  • Documentation files become an attack surface when agents treat their contents as actionable instructions for code changes.
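
The read-only-mount defense in the bullets above can be sketched in Python as a command builder for Bubblewrap (`bwrap`), which appears in the paper's reference list. This is a minimal sketch under assumptions, not the authors' configuration: the function name `build_bwrap_command`, the directory layout, and the choice to expose the rest of the host filesystem read-only are all hypothetical. The function only constructs the argument list; launching the sandbox requires `bwrap` on a Linux host.

```python
from pathlib import Path
from typing import List, Sequence


def build_bwrap_command(
    skill_dir: str,
    workspace_dir: str,
    agent_cmd: Sequence[str],
) -> List[str]:
    """Build a bwrap argv that mounts skill_dir read-only for the agent.

    Assumptions (hypothetical, not from the paper): the agent needs the host
    filesystem for reading, one writable workspace, and /dev and /proc.
    """
    skills = str(Path(skill_dir).resolve())
    workspace = str(Path(workspace_dir).resolve())
    return [
        "bwrap",
        "--ro-bind", "/", "/",            # host filesystem, read-only
        "--dev", "/dev",                  # minimal device nodes
        "--proc", "/proc",                # fresh procfs
        "--bind", workspace, workspace,   # the only writable location
        "--ro-bind", skills, skills,      # skills read-only, listed last
        "--die-with-parent",              # stop the sandbox if the launcher exits
        *agent_cmd,
    ]


if __name__ == "__main__":
    # Hypothetical paths and agent command, printed for inspection only.
    print(build_bwrap_command("./skills", "./workspace", ["python", "agent.py"]))
```

The skills bind is listed after the workspace bind on the assumption that `bwrap` applies mount operations in the order given, so the skills directory stays read-only even if it sits inside the writable workspace.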

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Skill systems may need explicit separation between human-readable docs and executable code to limit this vector.
  • Similar documentation-driven modification risks could appear in other agent frameworks that allow runtime skill updates.
  • Developers could add integrity checks on skill code before any execution triggered by documentation content.
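
The integrity-check idea in the last bullet could look roughly like the following Python sketch, which pins a SHA-256 digest of a skill directory at install time and refuses to run the skill if the directory has changed. The function names `digest_skill_dir` and `verify_before_execute` are hypothetical, and nothing here comes from the paper; it is one possible shape for such a check.

```python
import hashlib
from pathlib import Path


def digest_skill_dir(skill_dir: str) -> str:
    """Return a SHA-256 digest over the relative path and bytes of every file."""
    root = Path(skill_dir)
    hasher = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so different file layouts cannot
        # produce the same byte stream.
        hasher.update(len(rel).to_bytes(8, "big"))
        hasher.update(rel)
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return hasher.hexdigest()


def verify_before_execute(skill_dir: str, pinned_digest: str) -> None:
    """Raise RuntimeError if the skill directory no longer matches the pin."""
    if digest_skill_dir(skill_dir) != pinned_digest:
        raise RuntimeError(f"skill directory {skill_dir!r} changed since it was pinned")
```

A check like this only detects changes made after the digest was pinned. It does not inspect what SKILL.md says, and a modification that lands between the check and the execution would still get through, which is one reason a mount-level defense is the stronger guarantee.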

Load-bearing premise

Agent frameworks will read instructions from documentation files and use them to modify skill code at runtime without verification or isolation.

What would settle it

A test in which an agent given malicious instructions in SKILL.md never alters the skill code, or in which the read-only mount still permits the code change to occur.

Figures

Figures reproduced from arXiv: 2606.16287 by Neil Zhenqiang Gong, Tianhao Chen, Yebei Gou, Yuepeng Hu, Zhengyuan Jiang.

Figure 1: Illustration of DyMalSkill.
    Adjacent text (truncated): "… analysis, as explicitly malicious behaviors, such as deleting local files or transmitting sensitive information to external endpoints, can be identified. In this work, we introduce a new security threat to the agentic AI ecosystem, termed dynamic malicious skills. A dynamic malicious skill satisfies two key conditions: (1) the original code of the skill is benign at the time of di…"

Figure 2: Impact of injection location across malicious behaviors.

Figure 3: FNR and FPR of prompt-injection detection methods in identifying dynamic malicious skills generated by DyMalSkill.
    Adjacent text (truncated): "(iii) The permission-based defense does not disrupt the performance of benign skills. As analyzed in Section 6.1, benign skills do not require dynamic code modification to achieve their intended functionality. Consequently, the permission-based defense does not interfere with their correct ex…"

Figure 4: Impact of injected content.
    Adjacent text (truncated): "A Verifier Details. We adopt the runtime verifier from MalTool [Hu et al., 2026], and refer the reader to that work for the full implementation. We summarize here only the design properties that are relevant for interpreting the ASR numbers in Section 5. The verifier counts a trial as a successful attack only when the agent-modified skill, re-executed in a controlled sandbox, pro…"

Figure 5: ASR@k, which is defined as the probability that a dynamic malicious skill successfully …
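
The FNR and FPR in the Figure 3 caption are the standard detection error rates. As a reading aid, the short Python sketch below computes them from per-skill labels, treating "malicious" as the positive class. The function name `fnr_fpr` and the example data are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def fnr_fpr(is_malicious: Sequence[bool], flagged: Sequence[bool]) -> Tuple[float, float]:
    """Return (FNR, FPR) for a detector, with 'malicious' as the positive class."""
    if len(is_malicious) != len(flagged):
        raise ValueError("label and prediction sequences must have equal length")
    tp = fn = fp = tn = 0
    for truth, pred in zip(is_malicious, flagged):
        if truth and pred:
            tp += 1
        elif truth and not pred:
            fn += 1  # malicious skill the detector missed
        elif not truth and pred:
            fp += 1  # benign skill the detector flagged
        else:
            tn += 1
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return fnr, fpr


if __name__ == "__main__":
    # Made-up labels: four malicious skills, two benign ones.
    truth = [True, True, True, True, False, False]
    preds = [True, False, False, True, True, False]
    print(fnr_fpr(truth, preds))
```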
original abstract

Skills are a key enabling component of agentic AI. While they enhance agents' capabilities, they also introduce new attack surfaces. In this work, we investigate one such attack surface by demonstrating dynamic malicious skills. By embedding malicious instructions in natural-language documentation (e.g., SKILL.md), an attacker can induce an agent to dynamically inject malicious logic into an otherwise benign skill during execution. We evaluate this attack across agentic frameworks such as OpenHands and Claude Code, showing that dynamic malicious skills can successfully introduce a range of malicious behaviors at runtime with non-trivial success rates. To mitigate this vulnerability, we propose a system-level defense that prevents dynamic modification of skills using operating system kernel-enforced read-only mounts. Our evaluation demonstrates that this defense effectively blocks dynamic malicious skills while preserving the functionality of benign skills.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that embedding malicious instructions in natural-language documentation files (e.g., SKILL.md) allows an attacker to induce agentic frameworks such as OpenHands and Claude Code to dynamically rewrite otherwise benign skill code at runtime, achieving non-trivial success rates across a range of malicious behaviors; it further proposes and evaluates an OS-level defense using kernel-enforced read-only mounts that blocks the attack while preserving benign skill functionality.

Significance. If the empirical results hold with full methodological transparency, the work identifies a previously under-examined attack surface in which documentation can drive runtime code modification in agentic systems, which is relevant to the security of tool-using AI agents. The proposed defense is concrete and leverages existing OS primitives.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the claim of 'non-trivial success rates' is asserted without any reported quantitative metrics, success criteria, number of trials, or error analysis, preventing assessment of whether the attack is reproducible or framework-inherent.
  2. [Evaluation] Evaluation section: no description is given of the system prompts, skill-loading code, or control conditions used for OpenHands and Claude Code; without these details it is impossible to distinguish a novel dynamic-skill attack surface from standard prompt injection that depends on the evaluation harness.
  3. [Defense / Evaluation] Defense evaluation: the read-only mount defense is stated to preserve benign functionality, but no quantitative comparison (e.g., success rate of benign tasks before/after the mount) or description of how skill execution paths were verified is supplied.
minor comments (2)
  1. [Introduction] Define 'dynamic malicious skill' precisely and distinguish it from static prompt injection or tool misuse.
  2. [Attack Description] Provide the exact format and location of the SKILL.md files used in the experiments.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving transparency and reproducibility in the evaluation sections. We address each major comment below and commit to revisions that incorporate the requested details without altering the core claims of the work.

point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the claim of 'non-trivial success rates' is asserted without any reported quantitative metrics, success criteria, number of trials, or error analysis, preventing assessment of whether the attack is reproducible or framework-inherent.

    Authors: We agree that the abstract and evaluation sections would benefit from explicit quantitative support for the 'non-trivial success rates' claim. The revised manuscript will include specific metrics such as success percentages across defined numbers of trials, clear success criteria for each malicious behavior, and basic error analysis (e.g., variance across runs) to allow assessment of reproducibility. revision: yes

  2. Referee: [Evaluation] Evaluation section: no description is given of the system prompts, skill-loading code, or control conditions used for OpenHands and Claude Code; without these details it is impossible to distinguish a novel dynamic-skill attack surface from standard prompt injection that depends on the evaluation harness.

    Authors: We acknowledge that the current evaluation description is insufficient for distinguishing the dynamic skill attack from harness-dependent prompt injection. The revised version will add the exact system prompts, relevant excerpts of skill-loading code, and descriptions of control conditions (e.g., baseline runs without malicious documentation) to the evaluation section or an appendix. revision: yes

  3. Referee: [Defense / Evaluation] Defense evaluation: the read-only mount defense is stated to preserve benign functionality, but no quantitative comparison (e.g., success rate of benign tasks before/after the mount) or description of how skill execution paths were verified is supplied.

    Authors: We agree that the defense evaluation requires quantitative backing to substantiate preservation of benign functionality. The revised manuscript will include before/after success rates for a set of benign tasks, along with a description of how skill execution paths were verified (e.g., via logging or path tracing) to confirm the mounts do not interfere with legitimate operations. revision: yes
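
The success rates and before/after comparisons promised in responses 1 and 3 are proportions over a finite number of trials, so each could be reported with a confidence interval. The Python sketch below uses the Wilson score interval as one common choice. It is not the authors' stated method, the function name `wilson_interval` is hypothetical, and the counts in the usage line are made up for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Made-up counts: 50 successful attacks out of 100 trials.
    low, high = wilson_interval(50, 100)
    print(f"ASR = 0.50, 95% interval = [{low:.3f}, {high:.3f}]")
```

The same function applies to benign-task success rates with and without the read-only mount. Largely overlapping intervals would be consistent with the claim that the defense does not degrade benign functionality, though they would not prove it.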

Circularity Check

0 steps flagged

Empirical attack demonstration contains no derivations or load-bearing self-references

full rationale

The paper is an empirical security study demonstrating an attack via natural-language instructions in SKILL.md files and proposing a read-only mount defense. It reports experimental success rates on OpenHands and Claude Code but includes no equations, fitted parameters, uniqueness theorems, or derivation chains. All claims rest on direct experimental observations rather than any reduction to prior self-citations or self-definitions, satisfying the criteria for a self-contained empirical result with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical security demonstration paper rather than a theoretical or mathematical work, resulting in a minimal axiom ledger with no free parameters or invented entities.

axioms (1)
  • domain assumption: Agentic AI frameworks will parse and execute instructions contained in natural-language skill documentation files such as SKILL.md during skill loading and runtime.
    This assumption is required for the described attack to succeed as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5666 in / 1155 out tokens · 79405 ms · 2026-06-27T04:01:11.737910+00:00 · methodology


Reference graph

Works this paper leans on

24 extracted references · 3 linked inside Pith

  1. Prompt Injection Attack to Tool Selection in LLM Agents. Network and Distributed System Security (NDSS) Symposium.
  2. Optimization-based Prompt Injection Attack to LLM-as-a-Judge. ACM SIGSAC Conference on Computer and Communications Security.
  3. Skillject: Automating Stealthy Skill-based Prompt Injection for Coding Agents with Trace-driven Closed-loop Refinement. arXiv preprint arXiv:2602.14211.
  4. MalTool: Malicious Tool Attacks on LLM Agents. arXiv preprint arXiv:2602.12194.
  5. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
  6. ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations.
  7. Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems.
  8. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. International Conference on Learning Representations.
  9. Executable Code Actions Elicit Better LLM Agents. International Conference on Machine Learning.
  10. Jimenez, Carlos E.; Yang, John; Wettig, Alexander; Yao, Shunyu; Pei, Kexin; Press, Ofir; Narasimhan, Karthik R. [title not extracted]
  11. Yang, John; Jimenez, Carlos E.; Wettig, Alexander; Lieret, Kilian; Yao, Shunyu; Narasimhan, Karthik; Press, Ofir. [title not extracted]
  12. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. International Conference on Learning Representations.
  13. Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. ACM Workshop on Artificial Intelligence and Security.
  14. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. USENIX Security Symposium.
  15. DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks. IEEE Symposium on Security and Privacy.
  16. Identifying the Risks of LM Agents with an LM-Emulated Sandbox. International Conference on Learning Representations.
  17. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. Advances in Neural Information Processing Systems.
  18. PromptArmor: Simple yet Effective Prompt Injection Defenses. arXiv preprint arXiv:2507.15219.
  19. Claude Code Overview. 2026.
  20. Model Context Protocol. 2024.
  21. [title not extracted]. 2024.
  22. The Protection of Information in Computer Systems. Proceedings of the IEEE.
  23. Secure Hash Standard (SHS).
  24. Bubblewrap. 2026.