pith. machine review for the scientific record.

arxiv: 2604.15415 · v1 · submitted 2026-04-16 · 💻 cs.CR · cs.AI


HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

Michael Backes, Xinyue Shen, Yage Zhang, Yang Zhang, Yukun Jiang


Pith reviewed 2026-05-10 10:28 UTC · model grok-4.3

classification: 💻 cs.CR · cs.AI
keywords: LLM agents · harmful skills · agent safety · skill ecosystems · benchmarks · refusal rates · LLM safety

The pith

Pre-installed harmful skills substantially increase LLM agents' compliance with dangerous requests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures the prevalence of harmful skills in open agent ecosystems, finding that 4.93% of 98,440 skills are harmful. It builds HarmfulSkillBench, a benchmark of 200 skills across 20 categories, to test how pre-installed skills affect agent behavior on harmful tasks. Evaluations show that presenting a task through a pre-installed skill raises the average harm score across six LLMs from 0.27 to 0.47, and to 0.76 when the intent is implicit rather than explicit. This exposes a vulnerability in which agents can be weaponized through their skill dependencies.

Core claim

The authors claim that 4.93% of skills in two major registries are harmful and that presenting a harmful task through a pre-installed skill substantially lowers refusal rates across all models, with the average harm score rising from 0.27 without the skill to 0.47 with it, and further to 0.76 when the harmful intent is implicit rather than stated as an explicit user request.

What carries the argument

HarmfulSkillBench, a benchmark of 200 harmful skills across 20 categories and four evaluation conditions that isolate the effect of skill mediation and implicit intent.
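The benchmark's core design, scoring each model under each evaluation condition and averaging over tasks, can be sketched as a small harness. Everything here is a hypothetical stand-in: the condition labels, the `run_agent` and `judge_harm` callables, and the stub scores are illustrative assumptions, not the paper's API or numbers.

```python
# Illustrative sketch of a conditioned evaluation harness; condition names
# and helper callables are hypothetical stand-ins for the paper's setup.
from statistics import mean

# Hypothetical labels; the paper names its conditions A-D.
CONDITIONS = ["no_skill", "skill_explicit", "skill_implicit", "baseline_variant"]

def evaluate(models, tasks, run_agent, judge_harm):
    """Return the mean harm score in [0, 1] for each (model, condition) pair."""
    results = {}
    for model in models:
        for cond in CONDITIONS:
            scores = [judge_harm(run_agent(model, task, cond)) for task in tasks]
            results[(model, cond)] = mean(scores)
    return results

# Toy usage with stubbed components (no real model calls).
models = ["model_a", "model_b"]
tasks = ["task_1", "task_2"]
stub_scores = {"no_skill": 0.2, "skill_explicit": 0.5,
               "skill_implicit": 0.8, "baseline_variant": 0.3}
res = evaluate(models, tasks,
               run_agent=lambda m, t, c: c,           # stand-in: echoes the condition
               judge_harm=lambda resp: stub_scores[resp])
```

The point of the structure is that the skill-mediation effect is read off as a difference of means between conditions for the same model and task set.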

If this is right

  • ClawHub shows a higher harmful skill rate of 8.84% compared to 3.49% on Skills.Rest.
  • All six evaluated LLMs exhibit increased harm scores when tasks are routed through pre-installed skills.
  • Implicit harmful intents via skills produce the highest average harm score of 0.76.
  • The released benchmark enables consistent testing of future agent safety measures against skill-based attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Agent platforms could reduce risk by vetting skills before allowing installation or by isolating their execution.
  • Users may not realize they have equipped their agents with harmful capabilities if skills are added automatically or from untrusted sources.
  • Malicious actors could develop and distribute skills that enable specific harmful actions without direct prompting.
  • This setup suggests testing similar skill-based attacks in other AI systems that use external tools or plugins.

Load-bearing premise

The LLM-driven scoring system accurately classifies skills as harmful without substantial bias or false positives that would distort the rates and benchmark.

What would settle it

A large-scale manual audit of the skills flagged as harmful revealing that most do not match the taxonomy would invalidate the 4.93% finding and the benchmark results.

Figures

Figures reproduced from arXiv: 2604.15415 by Michael Backes, Xinyue Shen, Yage Zhang, Yang Zhang, Yukun Jiang.

Figure 1. Illustration of how harmful skills affect agent re… [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]
Figure 2. Overview of the four-stage methodology. Stages 1–2 (data collection and harmful skill identification) address RQ1. Stage 3… [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]
Figure 3. Structure of an agent skill. The directory contains a… [PITH_FULL_IMAGE:figures/full_fig_p003_3.png]
Figure 4. Threat model comparison of prior work and our… [PITH_FULL_IMAGE:figures/full_fig_p003_4.png]
Figure 5. Taxonomy of 21 harmful categories organized into Tier 1 Prohibited Use (P1–P14) and Tier 2 High-Risk Use (H1–H7). Each… [PITH_FULL_IMAGE:figures/full_fig_p005_5.png]
Figure 6. Distribution of harmful skills across taxonomy by reg… [PITH_FULL_IMAGE:figures/full_fig_p005_6.png]
Figure 7. Category co-occurrence among harmful skills. [PITH_FULL_IMAGE:figures/full_fig_p006_7.png]
Figure 8. Engagement metrics (downloads and stars) on the… [PITH_FULL_IMAGE:figures/full_fig_p006_8.png]
Figure 9. Lorenz curve of harmful skill production across… [PITH_FULL_IMAGE:figures/full_fig_p007_9.png]
Figure 10. Overview of HARMFULSKILLBENCH.
Figure 11. Harm score heatmaps across all 20 harmful skill categories under Conditions A, B, and D. Each cell shows the mean score… [PITH_FULL_IMAGE:figures/full_fig_p011_11.png]
Figure 12. Runtime skill integration in LLM-based agents. [PITH_FULL_IMAGE:figures/full_fig_p018_12.png]
Figure 13. Risk score prompt used for LLM-based skill identi… [PITH_FULL_IMAGE:figures/full_fig_p018_13.png]
Figure 14. Complete judge prompt used for evaluating agent responses. The judge receives the system message and user message as… [PITH_FULL_IMAGE:figures/full_fig_p023_14.png]
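To simulate a realistic agent context, the paper constructs a message structure inspired by AgentDojo [14]: a system prompt defining the agent's role and tools (adapted from AgentHarm [2]), a user message asking the agent to read an available skill, and a pre-filled assistant message containing a read_skill call. A minimal sketch follows; the dict encoding, the role names, and the final two parts (tool result carrying the skill contents, then the task) are assumptions beyond what the source spells out.

```python
def build_messages(system_prompt: str, skill_name: str,
                   skill_body: str, task: str) -> list[dict]:
    """Sketch of a five-part agent context; field names are assumed, not the paper's."""
    return [
        # (1) System prompt defining the agent's role and available tools.
        {"role": "system", "content": system_prompt},
        # (2) User message requesting the agent to read an available skill.
        {"role": "user", "content": f"Please read the available skill '{skill_name}'."},
        # (3) Pre-filled assistant message containing a read_skill call.
        {"role": "assistant", "content": f"read_skill('{skill_name}')"},
        # (4) Assumed: tool result carrying the skill's contents.
        {"role": "tool", "content": skill_body},
        # (5) Assumed: the task itself, now mediated by the installed skill.
        {"role": "user", "content": task},
    ]

msgs = build_messages("You are an agent with tools.", "demo-skill",
                      "SKILL.md contents", "Do the task.")
```

Pre-filling the assistant turn is what makes the skill read look like the agent's own prior action rather than a user injection.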
read the original abstract

Large language models (LLMs) have evolved into autonomous agents that rely on open skill ecosystems (e.g., ClawHub and Skills.Rest), hosting numerous publicly reusable skills. Existing security research on these ecosystems mainly focuses on vulnerabilities within skills, such as prompt injection. However, there is a critical gap regarding skills that may be misused for harmful actions (e.g., cyber attacks, fraud and scams, privacy violations, and sexual content generation), namely harmful skills. In this paper, we present the first large-scale measurement study of harmful skills in agent ecosystems, covering 98,440 skills across two major registries. Using an LLM-driven scoring system grounded in our harmful skill taxonomy, we find that 4.93% of skills (4,858) are harmful, with ClawHub exhibiting an 8.84% harmful rate compared to 3.49% on Skills.Rest. We then construct HarmfulSkillBench, the first benchmark for evaluating agent safety against harmful skills in realistic agent contexts, comprising 200 harmful skills across 20 categories and four evaluation conditions. By evaluating six LLMs on HarmfulSkillBench, we find that presenting a harmful task through a pre-installed skill substantially lowers refusal rates across all models, with the average harm score rising from 0.27 without the skill to 0.47 with it, and further to 0.76 when the harmful intent is implicit rather than stated as an explicit user request. We responsibly disclose our findings to the affected registries and release our benchmark to support future research (see https://github.com/TrustAIRLab/HarmfulSkillBench).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale measurement of harmful skills across 98,440 skills in two agent skill registries (ClawHub and Skills.Rest), using an LLM-driven scorer based on a custom harmful skill taxonomy to identify 4.93% (4,858) as harmful. It then releases HarmfulSkillBench, a benchmark of 200 skills across 20 categories and four conditions, and evaluates six LLMs to show that pre-installed harmful skills raise average harm scores from 0.27 (no skill) to 0.47 (with skill) and 0.76 (implicit intent), substantially lowering refusal rates.

Significance. If the LLM scorer is reliable, the work provides the first quantitative prevalence data on harmful skills in open agent ecosystems and a concrete benchmark demonstrating how skill reuse can bypass safety alignments. The release of the benchmark and responsible disclosure to registries are positive contributions that enable follow-on safety research.

major comments (2)
  1. [Methods / LLM-driven scoring system] The central quantitative claims (4.93% harmful rate, harm-score deltas of 0.27→0.47→0.76) rest entirely on an LLM-driven scoring system used both to filter the 98,440 skills and to label model outputs. The abstract and methods description provide no accuracy metrics, human inter-rater agreement, false-positive analysis, or validation against a held-out set. Without these, both the prevalence statistic and the reported effect sizes could be artifacts of scorer bias rather than evidence of weaponization.
  2. [HarmfulSkillBench construction] HarmfulSkillBench selects 200 skills from the filtered set and defines four evaluation conditions, yet the paper does not report how the 20 categories were chosen or whether the selection process introduces selection bias that inflates the observed refusal-rate differences. The claim that pre-installed skills “substantially lower refusal rates across all models” therefore lacks a clear control for category or skill difficulty.
minor comments (2)
  1. [Taxonomy] The taxonomy is described as “grounded in” prior work but the paper does not include a table mapping taxonomy categories to the 20 benchmark categories or to the four evaluation conditions.
  2. [Evaluation results] Figure captions and axis labels for the harm-score plots should explicitly state the number of runs per model and whether error bars represent standard error or 95% CI.
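The validation asked for in major comment 1 is typically reported as agreement between human annotators and the LLM scorer on a held-out sample; Cohen's kappa is one standard statistic for that. A self-contained sketch over made-up binary labels (the data is illustrative, not from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items: observed
    agreement corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: human annotator vs LLM scorer on 8 skills (1 = harmful).
human = [1, 0, 0, 1, 1, 0, 0, 0]
llm   = [1, 0, 0, 1, 0, 0, 0, 0]
kappa = cohens_kappa(human, llm)
```

Reporting kappa alongside raw accuracy matters here because with a ~5% harmful base rate, a scorer that labels everything benign already achieves ~95% raw agreement.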

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our current results.

read point-by-point responses
  1. Referee: [Methods / LLM-driven scoring system] The central quantitative claims (4.93% harmful rate, harm-score deltas of 0.27→0.47→0.76) rest entirely on an LLM-driven scoring system used both to filter the 98,440 skills and to label model outputs. The abstract and methods description provide no accuracy metrics, human inter-rater agreement, false-positive analysis, or validation against a held-out set. Without these, both the prevalence statistic and the reported effect sizes could be artifacts of scorer bias rather than evidence of weaponization.

    Authors: We agree that explicit validation metrics for the LLM-driven scorer are essential to support the prevalence claim and effect sizes. The current manuscript grounds the scorer in the harmful skill taxonomy but does not report accuracy, inter-rater agreement, or false-positive rates. In the revised version we will add a validation subsection that includes human annotation results on a held-out sample of skills, reporting agreement rates and error analysis to demonstrate that the 4.93% rate and subsequent harm-score deltas are not artifacts of scorer bias. We will also clarify that the benchmark harm scoring uses a separate judge prompt from the initial filtering step. revision: yes

  2. Referee: [HarmfulSkillBench construction] HarmfulSkillBench selects 200 skills from the filtered set and defines four evaluation conditions, yet the paper does not report how the 20 categories were chosen or whether the selection process introduces selection bias that inflates the observed refusal-rate differences. The claim that pre-installed skills “substantially lower refusal rates across all models” therefore lacks a clear control for category or skill difficulty.

    Authors: We appreciate the referee's point on potential selection bias. The 20 categories were derived directly from the harmful skill taxonomy to ensure broad coverage of harm types, with 10 skills randomly sampled per category from the filtered harmful set. The four conditions isolate the effect of skill installation. In the revision we will explicitly document this taxonomy-derived selection and randomization process in the methods section. We will also add per-category harm-score breakdowns to show that the refusal-rate reductions hold consistently across categories, providing a control for category-specific effects. revision: yes
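The selection procedure described in this response, 10 skills randomly sampled from each of 20 taxonomy-derived categories, is stratified sampling. A minimal sketch with a fixed seed; the data shape and grouping key are assumptions for illustration:

```python
import random

def stratified_sample(skills, per_category=10, seed=0):
    """skills: iterable of (skill_id, category) pairs.
    Returns `per_category` randomly chosen skill ids per category."""
    rng = random.Random(seed)  # fixed seed for reproducible selection
    by_cat = {}
    for sid, cat in skills:
        by_cat.setdefault(cat, []).append(sid)
    return {cat: rng.sample(ids, min(per_category, len(ids)))
            for cat, ids in sorted(by_cat.items())}

# Toy pool: 3 hypothetical categories with 15 candidate skills each.
pool = [(f"{cat}-{i}", cat) for cat in ("fraud", "cyber", "privacy") for i in range(15)]
bench = stratified_sample(pool, per_category=10)
```

Stratifying by category is what lets the promised per-category harm-score breakdowns act as a control: each category contributes equally to the aggregate effect.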

Circularity Check

0 steps flagged

No circularity: direct empirical measurement without self-referential derivations

full rationale

The paper performs a measurement study: an LLM-driven scorer (grounded in a taxonomy) classifies skills for prevalence (4.93%) and benchmark selection, then applies scoring to agent responses under controlled conditions to report refusal rates and harm deltas (0.27 → 0.47 → 0.76). No equations, fitted parameters, or predictions reduce by construction to the inputs; the scorer is a consistent external measurement instrument rather than a tautological loop. The central claims are observational comparisons on held-out model outputs and do not rely on self-citation chains or ansatzes that smuggle in results. This is self-contained empirical work measured against its stated methodology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The measurement and benchmark rest on an unvalidated LLM classifier and a custom harmful-skill taxonomy whose construction details are not visible in the abstract.

free parameters (1)
  • Harm threshold in LLM scoring system
    The 4.93% harmful rate depends on whatever cutoff the authors chose for the automated classifier.
axioms (1)
  • domain assumption LLM-based classifier produces reliable harmful-skill labels when grounded in the authors' taxonomy
    The entire 98k-skill measurement and the 200-skill benchmark selection depend on this assumption.
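Since the ledger's one free parameter is the harm cutoff, the 4.93% figure can be stress-tested by sweeping the threshold across the scorer's risk scores and reading off prevalence at each cutoff. An illustrative sketch on synthetic scores; the score range and cutoff values are assumptions, not the paper's:

```python
def prevalence(scores, threshold):
    """Fraction of skills whose risk score meets or exceeds the cutoff."""
    return sum(s >= threshold for s in scores) / len(scores)

# Synthetic risk scores in [0, 1] for 20 skills (illustrative only).
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6,
          0.62, 0.65, 0.7, 0.72, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]

# Prevalence as a function of the cutoff: the headline rate is only as
# meaningful as this curve is flat around the chosen threshold.
curve = {t: prevalence(scores, t) for t in (0.5, 0.7, 0.9)}
```

A steep curve around the chosen cutoff would mean the reported rate is sensitive to an essentially arbitrary choice; a plateau would support it.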

pith-pipeline@v0.9.0 · 5601 in / 1295 out tokens · 52668 ms · 2026-05-10T10:28:23.139172+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sealing the Audit-Runtime Gap for LLM Skills

    cs.CR · 2026-05 · unverdicted · novelty 7.0

    SIGIL cryptographically seals the audit-runtime gap for LLM skills via an on-chain registry with four publication types, DAO vetting, and a runtime verification loader that enforces integrity and permissions.

Reference graph

Works this paper leans on

82 extracted references · 28 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1] Amazon. Amazon.de Conditions of Use and Sale. https://www.amazon.de/gp/help/customer/display.html?nodeId=GLSBYFE9MGKKQXXM.
  2. [2] Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. In International Conference on Learning Representations (ICLR), 2025.
  3. [3] Anthropic. Usage Policy. https://www.anthropic.com/legal/aup.
  4. [4] Anthropic. Using Agents According to Our Usage Policy. https://support.claude.com/en/articles/12005017-using-agents-according-to-our-usage-policy.
  5. [5] AutoGPT. AutoGPT. https://agpt.co/.
  6. [6] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, et al. Constitutional AI: Harmlessness from AI Feedback.
  7. [7] Bilibili. Bilibili User Agreement. https://www.bilibili.com/protocal/licence.html.
  8. [8] Nicholas Carlini, Adrienne Porter Felt, and David Wagner. An Evaluation of the Google Chrome Extension Security Architecture. In USENIX Security Symposium (USENIX Security), pages 97–111. USENIX, 2012.
  9. [9] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. CoRR abs/2404.01318, 2024.
  10. [10] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries. CoRR abs/2310.08419, 2023.
  11. [11] Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases. In Annual Conference on Neural Information Processing Systems (NeurIPS), pages 130185–130213. NeurIPS.
  12. [12] Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 21538–21566. ACL, 2025.
  13. [13] ClawHub. ClawHub, The Skill Dock for Sharp Agents. https://clawhub.ai/.
  14. [14] Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. CoRR abs/2406.13352, 2024.
  15. [15] Google DeepMind. Gemini 3 Flash. https://deepmind.google/models/gemini/flash/.
  16. [16] DeepSeek-AI. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. CoRR abs/2512.02556.
  17. [17] Discord. Discord's Terms of Service. https://discord.com/terms.
  18. [18] Andrew Dugan. How to Run OpenClaw with DigitalOcean. https://www.digitalocean.com/community/tutorials/how-to-run-openclaw.
  19. [19] Steve Durbin. Understanding The Weaponization Of Agentic AI. https://www.securityforum.org/in-the-news/understanding-the-weaponization-of-agentic-ai/.
  20. [20] Facebook. Terms of Service. https://www.facebook.com/legal/terms.
  21. [21] Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K. Gupta, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Imprompter: Tricking LLM Agents into Improper Tool Use. CoRR abs/2410.14923, 2024.
  22. [22] GitHub. GitHub Terms of Service. https://docs.github.com/en/site-policy/github-terms/github-terms-of-service.
  23. [23] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. In International Conference on Learning Representations (ICLR), 2024.
  24. [24] Zihan Guo, Zhiyu Chen, Xiaohang Nie, Jianghao Lin, Yuanjian Zhou, and Weinan Zhang. SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration. CoRR abs/2603.21019, 2026.
  25. [25] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. CoRR abs/2310.06987, 2023.
  26. [26] Instagram. Terms and Imprint. https://www.instagram.com/legal/terms/.
  27. [27] Microsoft Threat Intelligence. AI as tradecraft: How threat actors operationalize AI. https://www.microsoft.com/en-us/security/blog/2026/03/06/ai-as-tradecraft-how-threat-actors-operationalize-ai/.
  28. [28] Umar Iqbal, Tadayoshi Kohno, and Franziska Roesner. LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins. In AAAI/ACM Conference on AI, Ethics, and Society (AIES), pages 611–623. ACM, 2024.
  29. [29] Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, 1991.
  30. [30] Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, and Yang Zhang. Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs. CoRR abs/2602.08621, 2026.
  31. [31] Yukun Jiang, Mingjie Li, Michael Backes, and Yang Zhang. Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2025.
  32. [32] Yukun Jiang, Yage Zhang, Xinyue Shen, Michael Backes, and Yang Zhang. Humans Welcome to Observe: A First Look at the Agent Social Network Moltbook. CoRR abs/2602.10127, 2026.
  33. [33] Yigitcan Kaya, Anton Landerer, Stijn Pletinckx, Michelle Zimmermann, Christopher Kruegel, and Giovanni Vigna. When AI Meets the Web: Prompt Injection Risks in Third-Party AI Chatbot Plugins. In IEEE Symposium on Security and Privacy (S&P). IEEE.
  34. [34] Kimi. Kimi K2: Open Agentic Intelligence. CoRR abs/2507.20534, 2025.
  35. [35] Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, and Shuyue Hu. Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale. CoRR abs/2603.02176.
  36. [36] Zekun Li, Baolin Peng, Pengcheng He, and Xifeng Yan. Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 557–568. ACL, 2024.
  37. [37] Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage. In International Conference on Learning Representations (ICLR), 2025.
  38. [38] George Ling, Shanshan Zhong, and Richard Huang. Agent Skills: A Data-Driven Analysis of Claude Skills for Extending Large Language Model Functionality. CoRR abs/2602.08004, 2026.
  39. [39] LinkedIn. User Agreement. https://www.linkedin.com/legal/user-agreement.
  40. [40] Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Ying Zhang, and Leo Yu Zhang. Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study. CoRR abs/2602.06547, 2026.
  41. [41] Henry B. Mann and Donald R. Whitney. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics, 1947.
  42. [42] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, and Dan Hendrycks. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. CoRR abs/2402.04249, 2024.
  43. [43] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. CoRR abs/2312.02119.
  44. [44] Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, and Yisen Wang. TrustLDM: Benchmarking Trustworthiness in Language Diffusion Model. In ICLR 2026 Workshop on Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities, 2026.
  45. [45] NanoClaw. NanoClaw. https://github.com/qwibitai/nanoclaw.
  46. [46] OpenAI. Codex. https://github.com/openai/codex.
  47. [47] OpenAI. GPT-5.4 mini Model. https://developers.openai.com/api/docs/models/gpt-5.4-mini.
  48. [48] OpenAI. Usage Policies. https://openai.com/policies/usage-policies/.
  49. [49] OpenAI. GPT-4o System Card. CoRR abs/2410.21276.
  50. [50] OpenClaw. Install. https://docs.openclaw.ai/install.
  51. [51] OpenClaw. Introducing OpenClaw. https://openclaw.ai/blog/introducing-openclaw.
  52. [52] OpenClaw. OpenClaw. https://openclaw.ai/.
  53. [53] OpenClaw. OpenClaw Partners with VirusTotal for Skill Security. https://openclaw.ai/blog/virustotal-partnership.
  54. [54] Qwen. Qwen3 Technical Report. CoRR abs/2505.09388, 2025.
  55. [55] Reddit. Reddit Rules. https://redditinc.com/policies/reddit-rules.
  56. [56] Rafal Rohozinski and Chris Spirito. Weaponising AI: The New Cyber Attack Surface. https://www.iiss.org/online-analysis/survival-online/2026/02/weaponising-ai-the-new-cyber-attack-surface/.
  57. [57] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2024.
  58. [58] Xinyue Shen, Yun Shen, Michael Backes, and Yang Zhang. GPTracker: A Large-Scale Measurement of Misused GPTs. In IEEE Symposium on Security and Privacy (S&P). IEEE, 2025.
  59. [59] Skills.Rest. Agent Skill Library. https://skills.rest/.
  60. [60] Skills.Rest. Contribute a Skill. https://skills.rest/contribute.
  61. [61] Skills.Rest. Terms of Use. https://skills.rest/terms.
  62. [62] Slack. Main Services Agreement. https://slack.com/terms-of-service.
  63. [63] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for Empty Jailbreaks. CoRR abs/2402.10260, 2024.
  64. [64] Merlin Stein. How are AI agents used? Evidence from 177,000 MCP tools. CoRR abs/2603.23802, 2026.
  65. [65] Tencent. WeChat - Terms of Service. https://www.wechat.com/en/service_terms.html.
  66. [66] TikTok. Terms of Service. https://www.tiktok.com/legal/terms-of-service.
  67. [67] Tinder. Tinder Terms of Use. https://policies.tinder.com/terms.
  68. [68] VirusTotal. VirusTotal. https://www.virustotal.com/gui/home/upload.
  69. [69] Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 9811–9827. ACL, 2024.
  70. [70] Fangzhou Wu, Shutong Wu, Yulong Cao, and Chaowei Xiao. WIPI: A New Web Threat for LLM-Driven Web Agents. CoRR abs/2402.16965, 2024.
  71. [71] Yuhao Wu, Evin Jaff, Ke Yang, Ning Zhang, and Umar Iqbal. An In-Depth Investigation of Data Collection in LLM App Ecosystems. CoRR abs/2408.13247, 2025.
  72. [72] X. Terms of Service. https://x.com/en/tos.
  73. [73] Renjun Xu and Yang Yan. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward. CoRR abs/2602.12430, 2026.
  74. [74] Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents. CoRR abs/2402.11208, 2024.
  75. [75] YouTube. Terms of Service. https://www.youtube.com/t/terms.
  76. [76] Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, and Gongshen Liu. R-Judge: Benchmarking Safety Risk Awareness for LLM Agents. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1467–1490. ACL, 2024.
  77. [77] Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. CoRR abs/2403.02691, 2024.
  78. [78] Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents. In International Conference on Learning Representations (ICLR), 2025.
  79. [79] Yage Zhang, Yukun Jiang, Zeyuan Chen, Michael Backes, Xinyue Shen, and Yang Zhang. Real Money, Fake Models: Deceptive Model Claims in Shadow APIs. CoRR abs/2603.01919, 2026.
  80. [80] Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-SafetyBench: Evaluating the Safety of LLM Agents. CoRR abs/2412.14470, 2024.

Showing first 80 references.