pith. machine review for the scientific record.

arxiv: 2604.15415 · v1 · submitted 2026-04-16 · 💻 cs.CR · cs.AI


HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

Michael Backes, Xinyue Shen, Yage Zhang, Yang Zhang, Yukun Jiang


Pith reviewed 2026-05-10 10:28 UTC · model grok-4.3

classification: 💻 cs.CR · cs.AI
keywords: LLM agents · harmful skills · agent safety · skill ecosystems · benchmarks · refusal rates · LLM safety

The pith

Pre-installed harmful skills substantially increase LLM agents' compliance with dangerous requests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures the prevalence of harmful skills in open agent ecosystems, finding that 4.93% of 98,440 skills are harmful. It builds HarmfulSkillBench, a benchmark of 200 skills across 20 categories, to test how pre-installed skills affect agent behavior on harmful tasks. Evaluations show that presenting a task through a pre-installed skill raises the average harm score across six LLMs from 0.27 to 0.47, and to 0.76 when the intent is implicit rather than explicit. This exposes a vulnerability in which agents can be weaponized through their skill dependencies.

Core claim

The authors claim that 4.93% of skills in two major registries are harmful and that presenting a harmful task through a pre-installed skill substantially lowers refusal rates across all models, with the average harm score rising from 0.27 without the skill to 0.47 with it, and further to 0.76 when the harmful intent is implicit rather than stated as an explicit user request.

What carries the argument

HarmfulSkillBench, a benchmark of 200 harmful skills across 20 categories and four evaluation conditions that isolate the effect of skill mediation and implicit intent.
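The benchmark's core design, scoring each model under each evaluation condition and averaging over tasks, can be sketched as a small harness. Everything here is a hypothetical stand-in: the condition labels, the `run_agent` and `judge_harm` callables, and the stub scores are illustrative assumptions, not the paper's API or numbers.

```python
# Illustrative sketch of a conditioned evaluation harness; condition names
# and helper callables are hypothetical stand-ins for the paper's setup.
from statistics import mean

# Hypothetical labels; the paper names its conditions A-D.
CONDITIONS = ["no_skill", "skill_explicit", "skill_implicit", "baseline_variant"]

def evaluate(models, tasks, run_agent, judge_harm):
    """Return the mean harm score in [0, 1] for each (model, condition) pair."""
    results = {}
    for model in models:
        for cond in CONDITIONS:
            scores = [judge_harm(run_agent(model, task, cond)) for task in tasks]
            results[(model, cond)] = mean(scores)
    return results

# Toy usage with stubbed components (no real model calls).
models = ["model_a", "model_b"]
tasks = ["task_1", "task_2"]
stub_scores = {"no_skill": 0.2, "skill_explicit": 0.5,
               "skill_implicit": 0.8, "baseline_variant": 0.3}
res = evaluate(models, tasks,
               run_agent=lambda m, t, c: c,           # stand-in: echoes the condition
               judge_harm=lambda resp: stub_scores[resp])
```

The point of the structure is that the skill-mediation effect is read off as a difference of means between conditions for the same model and task set.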

If this is right

  • ClawHub shows a higher harmful skill rate of 8.84% compared to 3.49% on Skills.Rest.
  • All six evaluated LLMs exhibit increased harm scores when tasks are routed through pre-installed skills.
  • Implicit harmful intents via skills produce the highest average harm score of 0.76.
  • The released benchmark enables consistent testing of future agent safety measures against skill-based attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Agent platforms could reduce risk by vetting skills before allowing installation or by isolating their execution.
  • Users may not realize they have equipped their agents with harmful capabilities if skills are added automatically or from untrusted sources.
  • Malicious actors could develop and distribute skills that enable specific harmful actions without direct prompting.
  • This setup suggests testing similar skill-based attacks in other AI systems that use external tools or plugins.

Load-bearing premise

The LLM-driven scoring system accurately classifies skills as harmful without substantial bias or false positives that would distort the rates and benchmark.

What would settle it

A large-scale manual audit of the skills flagged as harmful revealing that most do not match the taxonomy would invalidate the 4.93% finding and the benchmark results.

Figures

Figures reproduced from arXiv: 2604.15415 by Michael Backes, Xinyue Shen, Yage Zhang, Yang Zhang, Yukun Jiang.

Figure 1. Illustration of how harmful skills affect agent re… [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]
Figure 2. Overview of the four-stage methodology. Stages 1–2 (data collection and harmful skill identification) address RQ1. Stage 3… [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]
Figure 3. Structure of an agent skill. The directory contains a… [PITH_FULL_IMAGE:figures/full_fig_p003_3.png]
Figure 4. Threat model comparison of prior work and our… [PITH_FULL_IMAGE:figures/full_fig_p003_4.png]
Figure 5. Taxonomy of 21 harmful categories organized into Tier 1 Prohibited Use (P1–P14) and Tier 2 High-Risk Use (H1–H7). Each… [PITH_FULL_IMAGE:figures/full_fig_p005_5.png]
Figure 6. Distribution of harmful skills across taxonomy by reg… [PITH_FULL_IMAGE:figures/full_fig_p005_6.png]
Figure 7. Category co-occurrence among harmful skills. [PITH_FULL_IMAGE:figures/full_fig_p006_7.png]
Figure 8. Engagement metrics (downloads and stars) on the… [PITH_FULL_IMAGE:figures/full_fig_p006_8.png]
Figure 9. Lorenz curve of harmful skill production across… [PITH_FULL_IMAGE:figures/full_fig_p007_9.png]
Figure 10. Overview of HARMFULSKILLBENCH.
Figure 11. Harm score heatmaps across all 20 harmful skill categories under Conditions A, B, and D. Each cell shows the mean score… [PITH_FULL_IMAGE:figures/full_fig_p011_11.png]
Figure 12. Runtime skill integration in LLM-based agents. [PITH_FULL_IMAGE:figures/full_fig_p018_12.png]
Figure 13. Risk score prompt used for LLM-based skill identi… [PITH_FULL_IMAGE:figures/full_fig_p018_13.png]
Figure 14. Complete judge prompt used for evaluating agent responses. The judge receives the system message and user message as… [PITH_FULL_IMAGE:figures/full_fig_p023_14.png]
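To simulate a realistic agent context, the paper constructs a message structure inspired by AgentDojo [14]: a system prompt defining the agent's role and tools (adapted from AgentHarm [2]), a user message asking the agent to read an available skill, and a pre-filled assistant message containing a read_skill call. A minimal sketch follows; the dict encoding, the role names, and the final two parts (tool result carrying the skill contents, then the task) are assumptions beyond what the source spells out.

```python
def build_messages(system_prompt: str, skill_name: str,
                   skill_body: str, task: str) -> list[dict]:
    """Sketch of a five-part agent context; field names are assumed, not the paper's."""
    return [
        # (1) System prompt defining the agent's role and available tools.
        {"role": "system", "content": system_prompt},
        # (2) User message requesting the agent to read an available skill.
        {"role": "user", "content": f"Please read the available skill '{skill_name}'."},
        # (3) Pre-filled assistant message containing a read_skill call.
        {"role": "assistant", "content": f"read_skill('{skill_name}')"},
        # (4) Assumed: tool result carrying the skill's contents.
        {"role": "tool", "content": skill_body},
        # (5) Assumed: the task itself, now mediated by the installed skill.
        {"role": "user", "content": task},
    ]

msgs = build_messages("You are an agent with tools.", "demo-skill",
                      "SKILL.md contents", "Do the task.")
```

Pre-filling the assistant turn is what makes the skill read look like the agent's own prior action rather than a user injection.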
read the original abstract

Large language models (LLMs) have evolved into autonomous agents that rely on open skill ecosystems (e.g., ClawHub and Skills.Rest), hosting numerous publicly reusable skills. Existing security research on these ecosystems mainly focuses on vulnerabilities within skills, such as prompt injection. However, there is a critical gap regarding skills that may be misused for harmful actions (e.g., cyber attacks, fraud and scams, privacy violations, and sexual content generation), namely harmful skills. In this paper, we present the first large-scale measurement study of harmful skills in agent ecosystems, covering 98,440 skills across two major registries. Using an LLM-driven scoring system grounded in our harmful skill taxonomy, we find that 4.93% of skills (4,858) are harmful, with ClawHub exhibiting an 8.84% harmful rate compared to 3.49% on Skills.Rest. We then construct HarmfulSkillBench, the first benchmark for evaluating agent safety against harmful skills in realistic agent contexts, comprising 200 harmful skills across 20 categories and four evaluation conditions. By evaluating six LLMs on HarmfulSkillBench, we find that presenting a harmful task through a pre-installed skill substantially lowers refusal rates across all models, with the average harm score rising from 0.27 without the skill to 0.47 with it, and further to 0.76 when the harmful intent is implicit rather than stated as an explicit user request. We responsibly disclose our findings to the affected registries and release our benchmark to support future research (see https://github.com/TrustAIRLab/HarmfulSkillBench).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale measurement of harmful skills across 98,440 skills in two agent skill registries (ClawHub and Skills.Rest), using an LLM-driven scorer based on a custom harmful skill taxonomy to identify 4.93% (4,858) as harmful. It then releases HarmfulSkillBench, a benchmark of 200 skills across 20 categories and four conditions, and evaluates six LLMs to show that pre-installed harmful skills raise average harm scores from 0.27 (no skill) to 0.47 (with skill) and 0.76 (implicit intent), substantially lowering refusal rates.

Significance. If the LLM scorer is reliable, the work provides the first quantitative prevalence data on harmful skills in open agent ecosystems and a concrete benchmark demonstrating how skill reuse can bypass safety alignments. The release of the benchmark and responsible disclosure to registries are positive contributions that enable follow-on safety research.

major comments (2)
  1. [Methods / LLM-driven scoring system] The central quantitative claims (4.93% harmful rate, harm-score deltas of 0.27→0.47→0.76) rest entirely on an LLM-driven scoring system used both to filter the 98,440 skills and to label model outputs. The abstract and methods description provide no accuracy metrics, human inter-rater agreement, false-positive analysis, or validation against a held-out set. Without these, both the prevalence statistic and the reported effect sizes could be artifacts of scorer bias rather than evidence of weaponization.
  2. [HarmfulSkillBench construction] HarmfulSkillBench selects 200 skills from the filtered set and defines four evaluation conditions, yet the paper does not report how the 20 categories were chosen or whether the selection process introduces selection bias that inflates the observed refusal-rate differences. The claim that pre-installed skills “substantially lower refusal rates across all models” therefore lacks a clear control for category or skill difficulty.
minor comments (2)
  1. [Taxonomy] The taxonomy is described as “grounded in” prior work but the paper does not include a table mapping taxonomy categories to the 20 benchmark categories or to the four evaluation conditions.
  2. [Evaluation results] Figure captions and axis labels for the harm-score plots should explicitly state the number of runs per model and whether error bars represent standard error or 95% CI.
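The validation asked for in major comment 1 is typically reported as agreement between human annotators and the LLM scorer on a held-out sample; Cohen's kappa is one standard statistic for that. A self-contained sketch over made-up binary labels (the data is illustrative, not from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items: observed
    agreement corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: human annotator vs LLM scorer on 8 skills (1 = harmful).
human = [1, 0, 0, 1, 1, 0, 0, 0]
llm   = [1, 0, 0, 1, 0, 0, 0, 0]
kappa = cohens_kappa(human, llm)
```

Reporting kappa alongside raw accuracy matters here because with a ~5% harmful base rate, a scorer that labels everything benign already achieves ~95% raw agreement.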

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our current results.

read point-by-point responses
  1. Referee: [Methods / LLM-driven scoring system] The central quantitative claims (4.93% harmful rate, harm-score deltas of 0.27→0.47→0.76) rest entirely on an LLM-driven scoring system used both to filter the 98,440 skills and to label model outputs. The abstract and methods description provide no accuracy metrics, human inter-rater agreement, false-positive analysis, or validation against a held-out set. Without these, both the prevalence statistic and the reported effect sizes could be artifacts of scorer bias rather than evidence of weaponization.

    Authors: We agree that explicit validation metrics for the LLM-driven scorer are essential to support the prevalence claim and effect sizes. The current manuscript grounds the scorer in the harmful skill taxonomy but does not report accuracy, inter-rater agreement, or false-positive rates. In the revised version we will add a validation subsection that includes human annotation results on a held-out sample of skills, reporting agreement rates and error analysis to demonstrate that the 4.93% rate and subsequent harm-score deltas are not artifacts of scorer bias. We will also clarify that the benchmark harm scoring uses a separate judge prompt from the initial filtering step. revision: yes

  2. Referee: [HarmfulSkillBench construction] HarmfulSkillBench selects 200 skills from the filtered set and defines four evaluation conditions, yet the paper does not report how the 20 categories were chosen or whether the selection process introduces selection bias that inflates the observed refusal-rate differences. The claim that pre-installed skills “substantially lower refusal rates across all models” therefore lacks a clear control for category or skill difficulty.

    Authors: We appreciate the referee's point on potential selection bias. The 20 categories were derived directly from the harmful skill taxonomy to ensure broad coverage of harm types, with 10 skills randomly sampled per category from the filtered harmful set. The four conditions isolate the effect of skill installation. In the revision we will explicitly document this taxonomy-derived selection and randomization process in the methods section. We will also add per-category harm-score breakdowns to show that the refusal-rate reductions hold consistently across categories, providing a control for category-specific effects. revision: yes
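The selection procedure described in this response, 10 skills randomly sampled from each of 20 taxonomy-derived categories, is stratified sampling. A minimal sketch with a fixed seed; the data shape and grouping key are assumptions for illustration:

```python
import random

def stratified_sample(skills, per_category=10, seed=0):
    """skills: iterable of (skill_id, category) pairs.
    Returns `per_category` randomly chosen skill ids per category."""
    rng = random.Random(seed)  # fixed seed for reproducible selection
    by_cat = {}
    for sid, cat in skills:
        by_cat.setdefault(cat, []).append(sid)
    return {cat: rng.sample(ids, min(per_category, len(ids)))
            for cat, ids in sorted(by_cat.items())}

# Toy pool: 3 hypothetical categories with 15 candidate skills each.
pool = [(f"{cat}-{i}", cat) for cat in ("fraud", "cyber", "privacy") for i in range(15)]
bench = stratified_sample(pool, per_category=10)
```

Stratifying by category is what lets the promised per-category harm-score breakdowns act as a control: each category contributes equally to the aggregate effect.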

Circularity Check

0 steps flagged

No circularity: direct empirical measurement without self-referential derivations

full rationale

The paper performs a measurement study: an LLM-driven scorer (grounded in a taxonomy) classifies skills for prevalence (4.93%) and benchmark selection, then applies scoring to agent responses under controlled conditions to report refusal rates and harm deltas (0.27 → 0.47 → 0.76). No equations, fitted parameters, or predictions reduce by construction to the inputs; the scorer is a consistent external measurement instrument rather than a tautological loop. The central claims are observational comparisons on held-out model outputs and do not rely on self-citation chains or ansatzes that smuggle in results. This is self-contained empirical work measured against its stated methodology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The measurement and benchmark rest on an unvalidated LLM classifier and a custom harmful-skill taxonomy whose construction details are not visible in the abstract.

free parameters (1)
  • Harm threshold in LLM scoring system
    The 4.93% harmful rate depends on whatever cutoff the authors chose for the automated classifier.
axioms (1)
  • domain assumption LLM-based classifier produces reliable harmful-skill labels when grounded in the authors' taxonomy
    The entire 98k-skill measurement and the 200-skill benchmark selection depend on this assumption.
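Since the ledger's one free parameter is the harm cutoff, the 4.93% figure can be stress-tested by sweeping the threshold across the scorer's risk scores and reading off prevalence at each cutoff. An illustrative sketch on synthetic scores; the score range and cutoff values are assumptions, not the paper's:

```python
def prevalence(scores, threshold):
    """Fraction of skills whose risk score meets or exceeds the cutoff."""
    return sum(s >= threshold for s in scores) / len(scores)

# Synthetic risk scores in [0, 1] for 20 skills (illustrative only).
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6,
          0.62, 0.65, 0.7, 0.72, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]

# Prevalence as a function of the cutoff: the headline rate is only as
# meaningful as this curve is flat around the chosen threshold.
curve = {t: prevalence(scores, t) for t in (0.5, 0.7, 0.9)}
```

A steep curve around the chosen cutoff would mean the reported rate is sensitive to an essentially arbitrary choice; a plateau would support it.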

pith-pipeline@v0.9.0 · 5601 in / 1295 out tokens · 52668 ms · 2026-05-10T10:28:23.139172+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sealing the Audit-Runtime Gap for LLM Skills

    cs.CR · 2026-05 · unverdicted · novelty 7.0

    SIGIL cryptographically seals the audit-runtime gap for LLM skills via an on-chain registry with four publication types, DAO vetting, and a runtime verification loader that enforces integrity and permissions.

Reference graph

Works this paper leans on

82 extracted references · 28 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1] Amazon. Amazon.de Conditions of Use and Sale. https://www.amazon.de/gp/help/customer/display.html?nodeId=GLSBYFE9MGKKQXXM.
  2. [2] Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. In International Conference on Learning Representations (ICLR), 2025.
  3. [3] Anthropic. Usage Policy. https://www.anthropic.com/legal/aup.
  4. [4] Anthropic. Using Agents According to Our Usage Policy. https://support.claude.com/en/articles/12005017-using-agents-according-to-our-usage-policy.
  5. [5] AutoGPT. AutoGPT. https://agpt.co/.
  6. [6] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, et al. Constitutional AI: Harmlessness from AI Feedback.
  7. [7] Bilibili. Bilibili User Agreement. https://www.bilibili.com/protocal/licence.html.
  8. [8] Nicholas Carlini, Adrienne Porter Felt, and David Wagner. An Evaluation of the Google Chrome Extension Security Architecture. In USENIX Security Symposium (USENIX Security), pages 97–111. USENIX, 2012.
  9. [9] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. CoRR abs/2404.01318, 2024.
  10. [10] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries. CoRR abs/2310.08419, 2023.
  11. [11] Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases. In Annual Conference on Neural Information Processing Systems (NeurIPS), pages 130185–130213. NeurIPS.
  12. [12] Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 21538–21566. ACL, 2025.
  13. [13] ClawHub. ClawHub, The Skill Dock for Sharp Agents. https://clawhub.ai/.
  14. [14] Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. CoRR abs/2406.13352, 2024.
  15. [15] Google DeepMind. Gemini 3 Flash. https://deepmind.google/models/gemini/flash/.
  16. [16] DeepSeek-AI. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. CoRR abs/2512.02556.
  17. [17] Discord. Discord's Terms of Service. https://discord.com/terms.
  18. [18] Andrew Dugan. How to Run OpenClaw with DigitalOcean. https://www.digitalocean.com/community/tutorials/how-to-run-openclaw.
  19. [19] Steve Durbin. Understanding The Weaponization Of Agentic AI. https://www.securityforum.org/in-the-news/understanding-the-weaponization-of-agentic-ai/.
  20. [20] Facebook. Terms of Service. https://www.facebook.com/legal/terms.
  21. [21] Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K. Gupta, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Imprompter: Tricking LLM Agents into Improper Tool Use. CoRR abs/2410.14923, 2024.
  22. [22] GitHub. GitHub Terms of Service. https://docs.github.com/en/site-policy/github-terms/github-terms-of-service.
  23. [23] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. In International Conference on Learning Representations (ICLR), 2024.
  24. [24] Zihan Guo, Zhiyu Chen, Xiaohang Nie, Jianghao Lin, Yuanjian Zhou, and Weinan Zhang. SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration. CoRR abs/2603.21019, 2026.
  25. [25] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. CoRR abs/2310.06987, 2023.
  26. [26] Instagram. Terms and Imprint. https://www.instagram.com/legal/terms/.
  27. [27] Microsoft Threat Intelligence. AI as tradecraft: How threat actors operationalize AI. https://www.microsoft.com/en-us/security/blog/2026/03/06/ai-as-tradecraft-how-threat-actors-operationalize-ai/.
  28. [28] Umar Iqbal, Tadayoshi Kohno, and Franziska Roesner. LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins. In AAAI/ACM Conference on AI, Ethics, and Society (AIES), pages 611–623. ACM, 2024.
  29. [29] Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, 1991.
  30. [30] Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, and Yang Zhang. Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs. CoRR abs/2602.08621, 2026.
  31. [31] Yukun Jiang, Mingjie Li, Michael Backes, and Yang Zhang. Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2025.
  32. [32] Yukun Jiang, Yage Zhang, Xinyue Shen, Michael Backes, and Yang Zhang. Humans Welcome to Observe: A First Look at the Agent Social Network Moltbook. CoRR abs/2602.10127, 2026.
  33. [33] Yigitcan Kaya, Anton Landerer, Stijn Pletinckx, Michelle Zimmermann, Christopher Kruegel, and Giovanni Vigna. When AI Meets the Web: Prompt Injection Risks in Third-Party AI Chatbot Plugins. In IEEE Symposium on Security and Privacy (S&P). IEEE.
  34. [34] Kimi. Kimi K2: Open Agentic Intelligence. CoRR abs/2507.20534, 2025.
  35. [35] Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, and Shuyue Hu. Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale. CoRR abs/2603.02176.
  36. [36] Zekun Li, Baolin Peng, Pengcheng He, and Xifeng Yan. Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 557–568. ACL, 2024.
  37. [37] Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage. In International Conference on Learning Representations (ICLR), 2025.
  38. [38] George Ling, Shanshan Zhong, and Richard Huang. Agent Skills: A Data-Driven Analysis of Claude Skills for Extending Large Language Model Functionality. CoRR abs/2602.08004, 2026.
  39. [39] LinkedIn. User Agreement. https://www.linkedin.com/legal/user-agreement.
  40. [40] Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Ying Zhang, and Leo Yu Zhang. Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study. CoRR abs/2602.06547, 2026.
  41. [41] Henry B. Mann and Donald R. Whitney. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics, 1947.
  42. [42] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, and Dan Hendrycks. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. CoRR abs/2402.04249, 2024.
  43. [43] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. CoRR abs/2312.02119.
  44. [44] Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, and Yisen Wang. TrustLDM: Benchmarking Trustworthiness in Language Diffusion Model. In ICLR 2026 Workshop on Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities, 2026.
  45. [45] NanoClaw. NanoClaw. https://github.com/qwibitai/nanoclaw.
  46. [46] OpenAI. Codex. https://github.com/openai/codex.
  47. [47] OpenAI. GPT-5.4 mini Model. https://developers.openai.com/api/docs/models/gpt-5.4-mini.
  48. [48] OpenAI. Usage Policies. https://openai.com/policies/usage-policies/.
  49. [49] OpenAI. GPT-4o System Card. CoRR abs/2410.21276.
  50. [50] OpenClaw. Install. https://docs.openclaw.ai/install.
  51. [51] OpenClaw. Introducing OpenClaw. https://openclaw.ai/blog/introducing-openclaw.
  52. [52] OpenClaw. OpenClaw. https://openclaw.ai/.
  53. [53] OpenClaw. OpenClaw Partners with VirusTotal for Skill Security. https://openclaw.ai/blog/virustotal-partnership.
  54. [54] Qwen. Qwen3 Technical Report. CoRR abs/2505.09388, 2025.
  55. [55] Reddit. Reddit Rules. https://redditinc.com/policies/reddit-rules.
  56. [56] Rafal Rohozinski and Chris Spirito. Weaponising AI: The New Cyber Attack Surface. https://www.iiss.org/online-analysis/survival-online/2026/02/weaponising-ai-the-new-cyber-attack-surface/.
  57. [57] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2024.
  58. [58] Xinyue Shen, Yun Shen, Michael Backes, and Yang Zhang. GPTracker: A Large-Scale Measurement of Misused GPTs. In IEEE Symposium on Security and Privacy (S&P). IEEE, 2025.
  59. [59] Skills.Rest. Agent Skill Library. https://skills.rest/.
  60. [60] Skills.Rest. Contribute a Skill. https://skills.rest/contribute.
  61. [61] Skills.Rest. Terms of Use. https://skills.rest/terms.
  62. [62] Slack. Main Services Agreement. https://slack.com/terms-of-service.
  63. [63] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for Empty Jailbreaks. CoRR abs/2402.10260, 2024.
  64. [64] Merlin Stein. How are AI agents used? Evidence from 177,000 MCP tools. CoRR abs/2603.23802, 2026.
  65. [65] Tencent. WeChat - Terms of Service. https://www.wechat.com/en/service_terms.html.
  66. [66] TikTok. Terms of Service. https://www.tiktok.com/legal/terms-of-service.
  67. [67] Tinder. Tinder Terms of Use. https://policies.tinder.com/terms.
  68. [68] VirusTotal. VirusTotal. https://www.virustotal.com/gui/home/upload.
  69. [69] Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 9811–9827. ACL, 2024.
  70. [70] Fangzhou Wu, Shutong Wu, Yulong Cao, and Chaowei Xiao. WIPI: A New Web Threat for LLM-Driven Web Agents. CoRR abs/2402.16965, 2024.
  71. [71] Yuhao Wu, Evin Jaff, Ke Yang, Ning Zhang, and Umar Iqbal. An In-Depth Investigation of Data Collection in LLM App Ecosystems. CoRR abs/2408.13247, 2025.
  72. [72] X. Terms of Service. https://x.com/en/tos.
  73. [73] Renjun Xu and Yang Yan. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward. CoRR abs/2602.12430, 2026.
  74. [74] Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents. CoRR abs/2402.11208, 2024.
  75. [75] YouTube. Terms of Service. https://www.youtube.com/t/terms.
  76. [76] Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, and Gongshen Liu. R-Judge: Benchmarking Safety Risk Awareness for LLM Agents. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1467–1490. ACL, 2024.
  77. [77] Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. CoRR abs/2403.02691, 2024.
  78. [78] Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents. In International Conference on Learning Representations (ICLR), 2025.
  79. [79] Yage Zhang, Yukun Jiang, Zeyuan Chen, Michael Backes, Xinyue Shen, and Yang Zhang. Real Money, Fake Models: Deceptive Model Claims in Shadow APIs. CoRR abs/2603.01919, 2026.
  80. [80] Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-SafetyBench: Evaluating the Safety of LLM Agents. CoRR abs/2412.14470, 2024.

Showing first 80 references.