pith · machine review for the scientific record

arXiv: 2605.11770 · v1 · submitted 2026-05-12 · 💻 cs.CR · cs.AI · cs.SY · eess.SY

Recognition: no theorem link

Behavioral Integrity Verification for AI Agent Skills

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:51 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SY · eess.SY

keywords behavioral integrity verification · AI agent skills · deviation taxonomy · malicious skill detection · capability extraction · LLM security · root cause analysis · agent safety

The pith

Behavioral integrity verification shows 80% of AI agent skills deviate from declared capabilities, mostly due to oversight.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes the problem of verifying whether AI agent skills actually perform as their descriptions claim, since these skills grant privileged access to systems like file operations and network calls. It introduces a framework that compares declared and actual capabilities using code analysis combined with language model assistance to build evidence for further checks. When applied to nearly 50,000 skills, the analysis finds a large mismatch rate, with most issues traced to developer mistakes instead of intentional harm. The same framework also identifies malicious skills more accurately than prior methods on a dedicated benchmark.

Core claim

The central claim is that behavioral integrity verification can be formalized as a typed set comparison between declared and actual capabilities over a shared taxonomy. The framework pairs deterministic code analysis with LLM-assisted capability extraction to generate structured evidence. On 49,943 skills, this reveals 80.0% deviation from declared behavior, four novel compound-threat categories, root causes split as 81.1% oversight and 18.9% adversarial intent, and 5.0% of skills with predicted multi-stage attack chains. Malicious skill detection reaches an F1 of 0.946 on 906 skills, outperforming rule-based and single-pass LLM baselines.

What carries the argument

The BIV framework, which performs a typed set comparison between declared and actual capabilities over a shared taxonomy by pairing deterministic code analysis with LLM-assisted capability extraction.
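The comparison at the heart of this machinery can be sketched directly. The following is a minimal illustration, not the paper's implementation: the capability labels and the three deviation types are invented stand-ins for the paper's 29-capability taxonomy.

```python
# Minimal sketch of BIV's typed set comparison. Capability names here
# ("fs.read", "shell.exec", ...) are illustrative labels, not the
# paper's actual taxonomy entries.

def behavioral_deviations(declared: set[str], actual: set[str]) -> dict[str, set[str]]:
    """Compare declared vs. actual capability sets over a shared taxonomy."""
    return {
        "undeclared": actual - declared,     # implemented but never disclosed
        "unimplemented": declared - actual,  # promised but absent from the code
        "consistent": declared & actual,     # behavior matches the description
    }

declared = {"fs.read", "net.http"}                  # from the skill's metadata
actual = {"fs.read", "net.http", "shell.exec"}      # from code + instructions

dev = behavioral_deviations(declared, actual)
print(dev["undeclared"])  # {'shell.exec'} -> a behavioral deviation is flagged
```

Everything downstream in the paper (the deviation taxonomy, root-cause classification, detection) operates on structured evidence of this shape; the hard part the framework addresses is producing the two sets reliably, not comparing them.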

If this is right

  • The deviation taxonomy surfaces four novel compound-threat categories for classifying complex risks.
  • Root-cause classification shows the majority of issues trace to oversight and can be addressed through improved development practices.
  • 5.0% of skills carry predicted multi-stage attack chains that warrant targeted scrutiny.
  • Malicious skill detection achieves an F1 of 0.946 and outperforms existing rule-based and single-pass LLM baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent skill platforms could require BIV checks before publishing tools to reduce unsafe capabilities reaching users.
  • The taxonomy could inform developer guidelines that minimize accidental description-implementation gaps.
  • Adapting the extraction approach to monitor skills at runtime might catch behavioral changes after initial approval.
  • Repeating the audit on skills from other registries would indicate whether the 80% deviation rate is widespread.

Load-bearing premise

The LLM-assisted capability extraction accurately and consistently identifies actual skill capabilities from code, instructions, and metadata without substantial errors or biases that would invalidate the deviation taxonomy or detection results.

What would settle it

A large-scale manual review of extracted capabilities on a random sample of skills from the registry that shows frequent mismatches with the automated taxonomy would falsify the deviation rates and detection performance.
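One hedged sketch of what such a review would compute: draw a random sample of skills, have annotators label their capabilities, and measure how often the automated extraction disagrees. All skill names and labels below are synthetic.

```python
# Rough sketch of the falsification test: compare automated capability
# extraction against human labels on a sampled set of skills and report
# the per-skill mismatch rate. Data here is wholly synthetic.

def mismatch_rate(auto: dict, human: dict, sample: list) -> float:
    """Fraction of sampled skills whose automated capability set
    disagrees with the human-annotated one."""
    mismatched = sum(1 for s in sample if auto[s] != human[s])
    return mismatched / len(sample)

auto = {"skill_a": {"fs.read"},
        "skill_b": {"net.http"},
        "skill_c": {"shell.exec"}}
human = {"skill_a": {"fs.read"},
         "skill_b": {"net.http", "fs.write"},  # annotator found an extra capability
         "skill_c": {"shell.exec"}}

rate = mismatch_rate(auto, human, ["skill_a", "skill_b", "skill_c"])
print(round(rate, 2))  # 0.33: one of three sampled skills disagrees
```

A mismatch rate high enough to plausibly flip the 80.0% deviation figure, or concentrated in the adversarial-intent categories, would undercut the headline claims.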

Figures

Figures reproduced from arXiv: 2605.11770 by Hongliang Liu, Tung-Ling Li, Yuhao Wu.

Figure 1. BIV processes a third-party skill s = (M, C, I) along symmetric declared-behavior (metadata M) and actual-behavior (code C + instructions I) tracks. Each track runs parallel deterministic and LLM extractors that converge on a 29-capability taxonomy; the typed mismatch becomes structured Behavioral Deviations powering three demonstrated downstream analyses: deviation taxonomy (§4.1), root-cause classific… view at source ↗
Figure 2. UMAP of clustered deviation embeddings: nine compact, well-separated representatives. view at source ↗
Figure 3. Per-category deviation profile. Left: total deviation volume by category, split by direc… view at source ↗
Figure 4. The 8-branch / 36-leaf intent taxonomy. Adversarial branches (A–F) in warm tones, non… view at source ↗
Figure 5. Intent classification of 163,754 clustered deviations. Left: non-adversarial root causes… view at source ↗
Figure 6. Real-world trace from the OpenClaw scan. The manifest declares… view at source ↗
read the original abstract

Agent skills extend LLM agents with privileged third-party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM-assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root-cause classification, and malicious-skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description-implementation gap: 80.0% of skills deviate from declared behavior, with four novel compound-threat categories surfaced. Root-cause classification finds that deviations are mostly oversight, not malice: 81.1% trace to developer oversight and 18.9% to adversarial intent, with 5.0% of skills carrying predicted multi-stage attack chains. On a 906-skill malicious-skill detection benchmark, BIV reaches an F1 of 0.946, outperforming state-of-the-art rule-based and single-pass LLM baselines. These results demonstrate behavioral integrity auditing for agent skills at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the behavioral integrity verification (BIV) problem for AI agent skills, formalizing it as a typed set comparison between declared and actual capabilities using a shared taxonomy. It instantiates BIV via deterministic code analysis paired with LLM-assisted capability extraction, then applies the framework to three analyses: a deviation taxonomy on 49,943 OpenClaw skills (reporting 80.0% deviation with four novel compound-threat categories), root-cause classification (81.1% oversight vs. 18.9% adversarial intent, 5.0% multi-stage chains), and malicious-skill detection on a 906-skill benchmark (F1=0.946, outperforming rule-based and single-pass LLM baselines).

Significance. If the extraction step holds, the work provides the first large-scale empirical audit of the description-implementation gap in agent skills, demonstrating that most deviations stem from oversight rather than malice and offering a practical detection method that improves on existing baselines. The scale (nearly 50k skills) and concrete metrics are strengths; the framework could support ongoing registry auditing if validated.

major comments (1)
  1. [Methods/Evaluation] Methods and Evaluation sections: the LLM-assisted capability extraction step that produces the 'actual' capability sets is central to all headline results (80.0% deviation rate, 81.1%/18.9% split, 5.0% multi-stage chains, and F1=0.946), yet no large-scale human ground-truth validation, inter-annotator agreement, or error-rate measurement is reported on the OpenClaw corpus. Systematic extraction errors (e.g., missed implicit calls or metadata misclassification) would directly propagate into the taxonomy and detection claims without being detectable from the reported numbers.
minor comments (2)
  1. [Abstract/§4] Abstract and §4: clarify how the 906-skill malicious-skill benchmark was constructed and labeled, including any overlap with the 49,943-skill corpus.
  2. [Results] Figure 3 or equivalent: add error bars or confidence intervals to the reported percentages and F1 scores.
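The second minor comment is mechanical to satisfy: a percentile bootstrap over the benchmark labels yields an interval for F1. The sketch below uses synthetic labels, not the paper's 906-skill benchmark.

```python
# Percentile-bootstrap confidence interval for F1, as the referee requests.
# Labels are synthetic; this is an illustration, not the paper's evaluation.
import random

def f1(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Resample (label, prediction) pairs with replacement and take
    the alpha/2 and 1 - alpha/2 percentiles of the F1 scores."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    return scores[int(alpha / 2 * n_boot)], scores[int((1 - alpha / 2) * n_boot) - 1]

# Synthetic benchmark: 100 skills, every 10th prediction flipped.
y_true = [i < 50 for i in range(100)]
y_pred = [t if i % 10 else (not t) for i, t in enumerate(y_true)]
lo, hi = bootstrap_f1_ci(y_true, y_pred, n_boot=500)
print(round(f1(y_true, y_pred), 2))  # 0.9 point estimate; report alongside (lo, hi)
```

The same resampling applies to the deviation-rate percentages, where the large corpus would make the intervals narrow but still worth reporting.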

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments and recommendation. We address the major comment below and will revise the manuscript accordingly to strengthen the validation of our methods.

read point-by-point responses
  1. Referee: [Methods/Evaluation] Methods and Evaluation sections: the LLM-assisted capability extraction step that produces the 'actual' capability sets is central to all headline results (80.0% deviation rate, 81.1%/18.9% split, 5.0% multi-stage chains, and F1=0.946), yet no large-scale human ground-truth validation, inter-annotator agreement, or error-rate measurement is reported on the OpenClaw corpus. Systematic extraction errors (e.g., missed implicit calls or metadata misclassification) would directly propagate into the taxonomy and detection claims without being detectable from the reported numbers.

    Authors: We agree that the reliability of the LLM-assisted capability extraction is central to all reported results and that the absence of large-scale human validation is a limitation. The original manuscript focused on the deterministic code analysis component for grounding but did not report human ground-truth metrics on the full corpus. In the revised manuscript, we will add a new subsection in the Evaluation section describing a human validation study on a stratified sample of 500 skills. This will report inter-annotator agreement, extraction error rates, and an analysis of potential systematic issues such as missed implicit calls. We will also expand the description of the LLM prompt engineering and few-shot examples used. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical results derive from an external registry and a separate benchmark, without reduction to self-referential definitions or fitted inputs.

full rationale

The paper defines BIV as a typed set comparison between declared and actual capabilities, instantiated via deterministic code analysis plus LLM-assisted extraction. Headline statistics (80% deviation rate, root-cause splits, 5% multi-stage chains, F1=0.946) are computed directly from applying this comparison to the external OpenClaw registry (49,943 skills) and a separate 906-skill benchmark. No equations or steps reduce a claimed prediction or taxonomy to quantities defined in terms of the same fitted parameters or self-citations. The LLM extraction step is a methodological component whose accuracy is assumed rather than derived from the results themselves; this is a validation gap, not a circular reduction. The derivation chain remains self-contained against the stated external data sources.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the existence of a complete shared taxonomy and the reliability of LLM extraction; no explicit free parameters or new invented entities are described in the abstract.

axioms (1)
  • domain assumption A shared taxonomy exists that can bridge code, instructions, and metadata for capability comparison.
    Invoked as the basis for the typed set comparison that defines the BIV problem.

pith-pipeline@v0.9.0 · 5539 in / 1332 out tokens · 37025 ms · 2026-05-13T05:51:55.104019+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 4 internal anchors

  1. [1]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  2. [2]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023

  3. [3]

    TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones

    William Enck, Peter Gilbert, Seungyeop Han, Vasant Tendulkar, Byung-Gon Chun, Landon P. Cox, Jaeyeon Jung, Patrick McDaniel, and Anmol N. Sheth. TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 393–407. US...

  4. [4]

    DREBIN: Effective and explainable detection of Android malware in your pocket

    Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, and Konrad Rieck. DREBIN: Effective and explainable detection of Android malware in your pocket. In Proceedings of the 2014 Network and Distributed System Security Symposium (NDSS). Internet Society, 2014

  5. [5]

    Trends and lessons from three years fighting malicious extensions

    Nav Jagpal, Eric Dingle, Jean-Philippe Gravel, Panayiotis Mavrommatis, Niels Provos, Moheeb Abu Rajab, and Kurt Thomas. Trends and lessons from three years fighting malicious extensions. In 24th USENIX Security Symposium, pages 579–593. USENIX Association, 2015

  6. [6]

    Hulk: Eliciting malicious behavior in browser extensions

    Alexandros Kapravelos, Chris Grier, Neha Chachra, Christopher Kruegel, Giovanni Vigna, and Vern Paxson. Hulk: Eliciting malicious behavior in browser extensions. In 23rd USENIX Security Symposium, pages 641–654. USENIX Association, 2014

  7. [7]

    Small world with high risks: A study of security threats in the npm ecosystem

    Markus Zimmermann, Cristian-Alexandru Staicu, Cam Tenny, and Michael Pradel. Small world with high risks: A study of security threats in the npm ecosystem. In 28th USENIX Security Symposium, pages 995–1010. USENIX Association, 2019

  8. [8]

    Towards measuring supply chain attacks on package managers for interpreted languages

    Ruian Duan, Omar Alrawi, Ranjita Pai Kasturi, Ryan Elder, Brendan Saltaformaggio, and Wenke Lee. Towards measuring supply chain attacks on package managers for interpreted languages. In Proceedings of the 2021 Network and Distributed System Security Symposium (NDSS). Internet Society, 2021

  9. [9]

    Tensor Trust: Interpretable prompt injection attacks from an online game

    Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. Tensor Trust: Interpretable prompt injection attacks from an online game. In International Conference on Learning Representations (ICLR), 2024

  10. [10]

    WHYPER: Towards automating risk assessment of mobile applications

    Rahul Pandita, Xusheng Xiao, Wei Yang, William Enck, and Tao Xie. WHYPER: Towards automating risk assessment of mobile applications. In 22nd USENIX Security Symposium, pages 527–542. USENIX Association, 2013

  11. [11]

    Checking app behavior against app descriptions

    Alessandra Gorla, Ilaria Tavecchia, Florian Gross, and Andreas Zeller. Checking app behavior against app descriptions. In Proceedings of the 36th International Conference on Software Engineering (ICSE), pages 1025–1035. ACM, 2014

  12. [12]

    SoK: Taxonomy of attacks on open-source software supply chains

    Piergiorgio Ladisa, Henrik Plate, Matias Martinez, and Olivier Barais. SoK: Taxonomy of attacks on open-source software supply chains. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1509–1526. IEEE, 2023

  13. [13]

    Ad injection at scale: Assessing deceptive advertisement modifications

    Kurt Thomas, Elie Bursztein, Chris Grier, Grant Ho, Nav Jagpal, Alexandros Kapravelos, Damon McCoy, Antonio Nappa, Vern Paxson, Niels Provos, Moheeb Abu Rajab, and Giovanni Vigna. Ad injection at scale: Assessing deceptive advertisement modifications. In 2015 IEEE Symposium on Security and Privacy, pages 151–167. IEEE, 2015

  14. [14]

    You’ve changed: Detecting malicious browser extension updates

    Nikolaos Pantelaios, Nick Nikiforakis, and Alexandros Kapravelos. You’ve changed: Detecting malicious browser extension updates. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 477–491. ACM, 2020

  15. [15]

    An in-depth investigation of data collection in LLM app ecosystems

    Yuhao Wu, Evin Jaff, Ke Yang, Ning Zhang, and Umar Iqbal. An in-depth investigation of data collection in LLM app ecosystems. In Proceedings of the ACM Internet Measurement Conference (IMC), 2025. arXiv preprint arXiv:2408.13247

  16. [16]

    A measurement study of model context protocol (MCP) ecosystem

    Hechuan Guo, Yongle Hao, Yue Zhang, Minghui Xu, Peizhuo Lv, Jiezhi Chen, and Xiuzhen Cheng. A measurement study of model context protocol (MCP) ecosystem. arXiv preprint arXiv:2509.25292, 2025

  17. [17]

    Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem

    Shuli Zhao, Qinsheng Hou, Zihan Zhan, Yanhao Wang, Yuchong Xie, Yu Guo, Libo Chen, Shenghong Li, and Zhi Xue. Parasites in the toolchain: A large-scale analysis of attacks on the MCP ecosystem. arXiv preprint arXiv:2509.06572, 2025

  18. [18]

    Don’t believe everything you read: Understanding and measuring MCP behavior under misleading tool descriptions

    Zhihao Li, Boyang Ma, Xuelong Dai, Minghui Xu, Yue Zhang, Biwei Yan, and Kun Li. Don’t believe everything you read: Understanding and measuring MCP behavior under misleading tool descriptions. arXiv preprint arXiv:2602.03580, 2026

  19. [19]

    From component manipulation to system compromise: Understanding and detecting malicious MCP servers

    Yiheng Huang, Zhijia Zhao, Bihuan Chen, and Susheng Wu. From component manipulation to system compromise: Understanding and detecting malicious MCP servers. arXiv preprint arXiv:2604.01905, 2026

  20. [20]

    SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

    Yinghan Hou and Zongyou Yang. SkillSieve: A hierarchical triage framework for detecting malicious AI agent skills. arXiv preprint arXiv:2604.06550, 2026

  21. [21]

    “Elementary, my dear Watson”: Detecting malicious skills via neuro-symbolic reasoning across heterogeneous artifacts

    Shenao Wang, Junjie He, Yanjie Zhao, Yayi Wang, Kan Yu, and Haoyu Wang. “Elementary, my dear Watson”: Detecting malicious skills via neuro-symbolic reasoning across heterogeneous artifacts. arXiv preprint arXiv:2603.27204, 2026

  22. [22]

    Formal analysis and supply chain security for agentic AI skills

    Varun Pratap Bhardwaj. Formal analysis and supply chain security for agentic AI skills. arXiv preprint arXiv:2603.00195, 2026

  23. [23]

    Modeling and discovering vulnerabilities with code property graphs

    Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy, pages 590–604. IEEE, 2014

  24. [24]

    FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps

    Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementat...

  25. [25]

    Finding security vulnerabilities in Java applications with static analysis

    V. Benjamin Livshits and Monica S. Lam. Finding security vulnerabilities in Java applications with static analysis. In 14th USENIX Security Symposium, pages 271–286. USENIX Association, 2005

  26. [26]

    BadAgent: Inserting and activating backdoor attacks in LLM agents

    Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. BadAgent: Inserting and activating backdoor attacks in LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9811–9827. Association for Computational Linguistics, 2024

  27. [27]

    Skill-Inject: Measuring agent vulnerability to skill file attacks

    D. Schmotz, L. Beurer-Kellner, S. Abdelnabi, and M. Andriushchenko. Skill-Inject: Measuring agent vulnerability to skill file attacks. arXiv preprint arXiv:2602.20156, 2026

  28. [28]

    AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases

    Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  29. [29]

    Prompt injection attack to tool selection in LLM agents

    Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. Prompt injection attack to tool selection in LLM agents. In Network and Distributed System Security Symposium (NDSS), 2026. arXiv preprint arXiv:2504.19793

  30. [30]

    Prompt flow integrity to prevent privilege escalation in LLM agents

    Juhee Kim, Woohyuk Choi, and Byoungyoung Lee. Prompt flow integrity to prevent privilege escalation in LLM agents. arXiv preprint arXiv:2503.15547, 2025

  31. [31]

    Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 2023 Workshop on Artificial Intelligence and Security (AISec), co-located with ACM CCS, pages 79–90. ACM, 2023

  32. [32]

    AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track, 2024

  33. [33]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  34. [34]

    R-Judge: Benchmarking safety risk awareness for LLM agents

    Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, and Gongshen Liu. R-Judge: Benchmarking safety risk awareness for LLM agents. In Findings of the Association for Computational Linguistics (EMNLP), 2024

  35. [35]

    Llama Guard: LLM-based input-output safeguard for human-AI conversations, 2023

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations, 2023

  36. [36]

    Watch out for your agents! investigating backdoor threats to LLM-based agents

    Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! investigating backdoor threats to LLM-based agents. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  37. [37]

    IsolateGPT: An execution isolation architecture for LLM-based agentic systems

    Yuhao Wu, Franziska Roesner, Tadayoshi Kohno, Ning Zhang, and Umar Iqbal. IsolateGPT: An execution isolation architecture for LLM-based agentic systems. In Network and Distributed System Security Symposium (NDSS), 2025. arXiv preprint arXiv:2403.04960; originally titled SecGPT

  38. [38]

    SmoothLLM: Defending large language models against jailbreaking attacks

    Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. SmoothLLM: Defending large language models against jailbreaking attacks. Transactions on Machine Learning Research (TMLR)

  39. [39]

    arXiv preprint arXiv:2310.03684

  40. [40]

    Towards automating data access permissions in AI agents

    Yuhao Wu, Ke Yang, Franziska Roesner, Tadayoshi Kohno, Ning Zhang, and Umar Iqbal. Towards automating data access permissions in AI agents. In 2026 IEEE Symposium on Security and Privacy (SP), 2026. arXiv preprint arXiv:2511.17959

  41. [41]

    The protection of information in computer systems

    Jerome H. Saltzer and Michael D. Schroeder. The protection of information in computer systems. Proceedings of the IEEE, 63(9):1278–1308, 1975

  42. [42]

    Sentence-BERT: Sentence embeddings using siamese BERT-networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3982–3992. Association for Computational Linguistics, 2019

  43. [43]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018

  44. [44]

    Density-based clustering based on hierarchical density estimates

    Ricardo J. G. B. Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. In Advances in Knowledge Discovery and Data Mining (PAKDD), pages 160–172. Springer, 2013

  45. [45]

    Malicious agent skills in the wild: A large-scale security empirical study

    Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Ying Zhang, and Leo Yu Zhang. Malicious agent skills in the wild: A large-scale security empirical study. arXiv preprint arXiv:2602.06547, 2026. Dataset at https://huggingface.co/datasets/ProtectSkills/MaliciousAgentSkillsBench

  46. [46]

    SkillJect: Automating stealthy skill-based prompt injection for coding agents with trace-driven closed-loop refinement

    Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, and Philip Torr. SkillJect: Automating stealthy skill-based prompt injection for coding agents with trace-driven closed-loop refinement. arXiv preprint arXiv:2602.14211, 2026

  47. [47]

    Cisco AI Defense skill scanner

    Cisco AI Defense. Cisco AI Defense skill scanner. https://github.com/cisco-ai-defense/skill-scanner, 2025. Apache 2.0 License