Recognition: no theorem link
Behavioral Integrity Verification for AI Agent Skills
Pith reviewed 2026-05-13 05:51 UTC · model grok-4.3
The pith
Behavioral integrity verification shows that 80.0% of AI agent skills deviate from their declared capabilities, mostly due to developer oversight rather than malice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that behavioral integrity verification can be formalized as a typed set comparison between declared and actual capabilities over a shared taxonomy. The framework pairs deterministic code analysis with LLM-assisted capability extraction to generate structured evidence. On 49,943 skills, this reveals 80.0% deviation from declared behavior, four novel compound-threat categories, root causes split as 81.1% oversight and 18.9% adversarial intent, and 5.0% of skills with predicted multi-stage attack chains. Malicious skill detection reaches an F1 of 0.946 on 906 skills, outperforming rule-based and single-pass LLM baselines.
What carries the argument
The BIV framework, which performs a typed set comparison between declared and actual capabilities over a shared taxonomy by pairing deterministic code analysis with LLM-assisted capability extraction.
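As a rough illustration only (the taxonomy categories, type names, and report fields below are invented, not the paper's), the typed set comparison might be sketched as:

```python
from dataclasses import dataclass
from enum import Enum, auto


class Capability(Enum):
    """Hypothetical shared taxonomy spanning code, instructions, and metadata."""
    FILESYSTEM_READ = auto()
    FILESYSTEM_WRITE = auto()
    NETWORK_CALL = auto()
    SHELL_EXECUTION = auto()
    CREDENTIAL_ACCESS = auto()


@dataclass(frozen=True)
class DeviationReport:
    undeclared: frozenset    # observed in the implementation but absent from the declaration
    unexercised: frozenset   # declared but never observed
    matched: frozenset

    @property
    def deviates(self) -> bool:
        return bool(self.undeclared or self.unexercised)


def compare(declared: set, actual: set) -> DeviationReport:
    """Typed set comparison between declared and actual capability sets."""
    return DeviationReport(
        undeclared=frozenset(actual - declared),
        unexercised=frozenset(declared - actual),
        matched=frozenset(declared & actual),
    )


# Usage: a skill that declares filesystem reads but also makes network calls.
report = compare(
    declared={Capability.FILESYSTEM_READ},
    actual={Capability.FILESYSTEM_READ, Capability.NETWORK_CALL},
)
print(report.deviates, sorted(c.name for c in report.undeclared))  # True ['NETWORK_CALL']
```

In the paper's setup, the deterministic code analysis and the LLM-assisted extraction would each contribute evidence toward the actual set from different artifact types; only the final comparison step is shown here.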
If this is right
- The deviation taxonomy surfaces four novel compound-threat categories for classifying complex risks.
- Root-cause classification shows the majority of issues trace to oversight and can be addressed through improved development practices.
- 5.0% of skills carry predicted multi-stage attack chains that warrant targeted scrutiny.
- Malicious skill detection achieves an F1 of 0.946 and outperforms existing rule-based and single-pass LLM baselines.
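For reference, F1 is the harmonic mean of precision and recall; the per-class precision and recall behind the 0.946 figure are not restated here.

```latex
F_1 = \frac{2\,\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}},
\qquad
\mathrm{precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{recall} = \frac{TP}{TP + FN}
```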
Where Pith is reading between the lines
- Agent skill platforms could require BIV checks before publishing tools to reduce unsafe capabilities reaching users.
- The taxonomy could inform developer guidelines that minimize accidental description-implementation gaps.
- Adapting the extraction approach to monitor skills at runtime might catch behavioral changes after initial approval.
- Repeating the audit on skills from other registries would indicate whether the 80% deviation rate is widespread.
Load-bearing premise
The LLM-assisted capability extraction accurately and consistently identifies actual skill capabilities from code, instructions, and metadata without substantial errors or biases that would invalidate the deviation taxonomy or detection results.
What would settle it
A large-scale manual review of extracted capabilities on a random sample of registry skills: frequent mismatches between the human labels and the automated extraction would falsify the reported deviation rates and undermine the detection results.
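As a sketch of what such a review could measure, assuming per-skill capability sets are available from both the automated extractor and a human annotator (the sample size, seed, and function names are illustrative, not the paper's protocol):

```python
import random


def sample_and_score(skills, extract_capabilities, human_label, n=200, seed=0):
    """Compare automated capability extraction against human labels on a random sample.

    `skills` is any sequence of skill records; `extract_capabilities` is the automated
    extractor and `human_label` the manual annotation, both returning sets of taxonomy
    categories. Returns the per-skill exact-match rate and the mean Jaccard overlap.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(skills), k=min(n, len(skills)))
    exact, jaccard = 0, 0.0
    for skill in sample:
        auto, manual = set(extract_capabilities(skill)), set(human_label(skill))
        exact += auto == manual
        union = auto | manual
        jaccard += len(auto & manual) / len(union) if union else 1.0
    return exact / len(sample), jaccard / len(sample)
```

Low exact-match or Jaccard scores on such a sample would constitute the mismatch evidence described above.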
Original abstract
Agent skills extend LLM agents with privileged third-party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM-assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root-cause classification, and malicious-skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description-implementation gap: 80.0% of skills deviate from declared behavior, with four novel compound-threat categories surfaced. Root-cause classification finds that deviations are mostly oversight, not malice: 81.1% trace to developer oversight and 18.9% to adversarial intent, with 5.0% of skills carrying predicted multi-stage attack chains. On a 906-skill malicious-skill detection benchmark, BIV reaches an F1 of 0.946, outperforming state-of-the-art rule-based and single-pass LLM baselines. These results demonstrate behavioral integrity auditing for agent skills at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the behavioral integrity verification (BIV) problem for AI agent skills, formalizing it as a typed set comparison between declared and actual capabilities using a shared taxonomy. It instantiates BIV via deterministic code analysis paired with LLM-assisted capability extraction, then applies the framework to three analyses: a deviation taxonomy on 49,943 OpenClaw skills (reporting 80.0% deviation with four novel compound-threat categories), root-cause classification (81.1% oversight vs. 18.9% adversarial intent, 5.0% multi-stage chains), and malicious-skill detection on a 906-skill benchmark (F1=0.946, outperforming rule-based and single-pass LLM baselines).
Significance. If the extraction step holds, the work provides the first large-scale empirical audit of the description-implementation gap in agent skills, demonstrating that most deviations stem from oversight rather than malice and offering a practical detection method that improves on existing baselines. The scale (nearly 50k skills) and concrete metrics are strengths; the framework could support ongoing registry auditing if validated.
major comments (1)
- [Methods/Evaluation] Methods and Evaluation sections: the LLM-assisted capability extraction step that produces the 'actual' capability sets is central to all headline results (80.0% deviation rate, 81.1%/18.9% split, 5.0% multi-stage chains, and F1=0.946), yet no large-scale human ground-truth validation, inter-annotator agreement, or error-rate measurement is reported on the OpenClaw corpus. Systematic extraction errors (e.g., missed implicit calls or metadata misclassification) would directly propagate into the taxonomy and detection claims without being detectable from the reported numbers.
minor comments (2)
- [Abstract/§4] Abstract and §4: clarify how the 906-skill malicious-skill benchmark was constructed and labeled, including any overlap with the 49,943-skill corpus.
- [Results] Figure 3 or equivalent: add error bars or confidence intervals to the reported percentages and F1 scores.
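For the requested uncertainty on the proportion estimates, a Wilson score interval is one standard choice; the sketch below treats the 80.0% deviation rate over 49,943 skills as a binomial proportion with independent draws, which is an assumption, not something the paper states.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: 80.0% of 49,943 skills flagged as deviating.
lo, hi = wilson_interval(int(0.800 * 49_943), 49_943)
print(f"[{lo:.4f}, {hi:.4f}]")  # half-width of roughly 0.0035 around 0.80 under the independence assumption
```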
Simulated Author's Rebuttal
We thank the referee for their constructive comments and recommendation. We address the major comment below and will revise the manuscript accordingly to strengthen the validation of our methods.
Point-by-point responses
- Referee: [Methods/Evaluation] Methods and Evaluation sections: the LLM-assisted capability extraction step that produces the 'actual' capability sets is central to all headline results (80.0% deviation rate, 81.1%/18.9% split, 5.0% multi-stage chains, and F1=0.946), yet no large-scale human ground-truth validation, inter-annotator agreement, or error-rate measurement is reported on the OpenClaw corpus. Systematic extraction errors (e.g., missed implicit calls or metadata misclassification) would directly propagate into the taxonomy and detection claims without being detectable from the reported numbers.
Authors: We agree that the reliability of the LLM-assisted capability extraction is central to all reported results and that the absence of large-scale human validation is a limitation. The original manuscript focused on the deterministic code analysis component for grounding but did not report human ground-truth metrics on the full corpus. In the revised manuscript, we will add a new subsection in the Evaluation section describing a human validation study on a stratified sample of 500 skills. This will report inter-annotator agreement, extraction error rates, and an analysis of potential systematic issues such as missed implicit calls. We will also expand the description of the LLM prompt engineering and few-shot examples used.
Revision: yes
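As an illustration of what the promised agreement numbers could look like, assuming two independent annotators each assign a binary deviating/non-deviating label to every sampled skill (the label scheme and toy data below are placeholders, not the authors' design):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences (e.g., 0/1 per skill)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] / n * counts_b[c] / n
                   for c in set(labels_a) | set(labels_b))
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


# Toy usage: two annotators over ten skills (1 = deviating, 0 = not).
a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
b = [1, 0, 0, 1, 0, 0, 1, 1, 1, 1]
print(round(cohens_kappa(a, b), 3))  # 0.583
```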
Circularity Check
No significant circularity; empirical results derive from external registry and separate benchmark without reduction to self-referential definitions or fitted inputs.
Full rationale
The paper defines BIV as a typed set comparison between declared and actual capabilities, instantiated via deterministic code analysis plus LLM-assisted extraction. Headline statistics (80% deviation rate, root-cause splits, 5% multi-stage chains, F1=0.946) are computed directly from applying this comparison to the external OpenClaw registry (49,943 skills) and a separate 906-skill benchmark. No equations or steps reduce a claimed prediction or taxonomy to quantities defined in terms of the same fitted parameters or self-citations. The LLM extraction step is a methodological component whose accuracy is assumed rather than derived from the results themselves; this is a validation gap, not a circular reduction. The derivation chain is grounded in the stated external data sources rather than in self-referential quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A shared taxonomy exists that can bridge code, instructions, and metadata for capability comparison.
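To make the assumption concrete: the taxonomy must map heterogeneous evidence from all three artifact types into one capability vocabulary. The toy lookup below is illustrative only; every category name and evidence pattern is invented.

```python
# Toy mapping from artifact-level evidence to shared capability categories.
CODE_SIGNALS = {          # deterministic code analysis: API call -> category
    "subprocess.run": "shell_execution",
    "open": "filesystem_access",
    "requests.get": "network_call",
}
INSTRUCTION_SIGNALS = {   # LLM-assisted extraction over natural-language instructions
    "upload the results": "network_call",
    "read the user's files": "filesystem_access",
}
METADATA_SIGNALS = {      # declared manifest fields
    "permissions.network": "network_call",
    "permissions.fs": "filesystem_access",
}


def to_taxonomy(evidence: dict[str, list[str]]) -> set[str]:
    """Map raw evidence items from each artifact type onto shared taxonomy categories."""
    tables = {"code": CODE_SIGNALS, "instructions": INSTRUCTION_SIGNALS,
              "metadata": METADATA_SIGNALS}
    return {tables[kind][item]
            for kind, items in evidence.items() if kind in tables
            for item in items if item in tables[kind]}


# Actual capabilities observed in code and instructions vs. the declared metadata.
actual = to_taxonomy({"code": ["subprocess.run", "requests.get"],
                      "instructions": ["upload the results"]})
declared = to_taxonomy({"metadata": ["permissions.fs"]})
print(actual - declared)  # undeclared capabilities: shell_execution, network_call
```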