SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3
The pith
A three-layer triage framework detects malicious skills in AI agent marketplaces by filtering benign ones cheaply before applying targeted LLM analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 uses regex, AST, and metadata checks with an XGBoost scorer to filter out most benign skills in milliseconds at zero API cost. Layer 2 sends remaining skills to an LLM, splitting the analysis across four parallel sub-tasks covering intent alignment, permission justification, covert behavior detection, and cross-file consistency. Layer 3 routes high-risk items to a jury of three different LLMs that vote independently and debate disagreements before issuing a final verdict. On a 400-skill labeled benchmark drawn from real marketplace data, the system reaches higher detection performance than the ClawVet baseline (0.800 F1 versus 0.421) at an average cost of $0.006 per skill.
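The control flow this claim describes is compact enough to sketch. The snippet below is a minimal illustration only: the function names, thresholds, and toy scorers are assumptions standing in for the paper's actual regex/AST features, LLM prompts, and trained XGBoost model.

```python
# Hypothetical sketch of the three-layer triage flow; all names, thresholds,
# and toy scorers are illustrative stand-ins, not the paper's implementation.
import re
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str   # "benign" or "malicious"
    layer: int   # layer that produced the decision
    risk: float  # risk score in [0, 1]

def layer1_score(skill_text: str) -> float:
    # Stand-in for regex/AST/metadata features fed to an XGBoost scorer.
    hits = len(re.findall(r"curl|base64|subprocess|eval\(", skill_text))
    return min(1.0, hits / 4)

def layer2_subtasks(skill_text: str) -> float:
    # Stand-in for four parallel LLM sub-task scores (intent, permissions,
    # covert behavior, cross-file consistency), averaged.
    subtask_scores = [0.6, 0.7, 0.8, 0.5]  # would come from four LLM calls
    return sum(subtask_scores) / len(subtask_scores)

def layer3_jury(skill_text: str) -> str:
    # Stand-in for three independent LLM votes with a debate round on splits.
    votes = ["malicious", "malicious", "benign"]  # would come from three LLMs
    return max(set(votes), key=votes.count)

def triage(skill_text: str, benign_cutoff=0.2, escalate_cutoff=0.7) -> Verdict:
    s1 = layer1_score(skill_text)
    if s1 < benign_cutoff:            # most skills exit here at zero API cost
        return Verdict("benign", 1, s1)
    s2 = layer2_subtasks(skill_text)
    if s2 < escalate_cutoff:          # Layer 2 settles clear-cut cases
        return Verdict("malicious" if s2 >= 0.5 else "benign", 2, s2)
    return Verdict(layer3_jury(skill_text), 3, s2)  # jury confirms high risk

print(triage("echo hello"))  # exits at Layer 1 as benign
```

The design point the sketch makes concrete: cost is incurred only on escalation, so the cheap Layer-1 exit dominates the average.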
What carries the argument
The three-layer hierarchical triage that starts with lightweight code and metadata filters, moves to structured multi-prompt LLM subtasks for deeper inspection, and ends with LLM jury voting for confirmation on uncertain cases.
If this is right
- Most benign skills are discarded in under 40 milliseconds using only local checks with no API cost.
- Splitting analysis into four parallel subtasks allows separate checks for intent, permissions, covert actions, and file consistency (see the fan-out sketch after this list).
- Jury voting among different LLMs resolves disagreements on high-risk skills before a final decision.
- The complete pipeline can process the full 49,000-skill corpus on a single low-power ARM board at low average cost per skill.
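A fan-out of the four sub-tasks, referenced in the second bullet above, could be as simple as the following. The prompt wording, the call_llm() helper, the JSON schema, and the max-aggregation across sub-tasks are all assumptions for illustration; the paper's actual prompts and output formats are not reproduced here.

```python
# Hypothetical parallel fan-out of the four Layer-2 sub-tasks.
import json
from concurrent.futures import ThreadPoolExecutor

SUBTASKS = {
    "intent_alignment": "Does the skill's code match its stated purpose?",
    "permission_justification": "Is each requested permission justified?",
    "covert_behavior": "Any hidden exfiltration, persistence, or obfuscation?",
    "cross_file_consistency": "Do SKILL.md claims match the bundled files?",
}

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call returning structured JSON.
    return json.dumps({"risk": 0.1, "rationale": "stub"})

def analyze(skill_text: str) -> dict:
    def run(item):
        name, question = item
        prompt = (f"{question}\n\nSKILL CONTENTS:\n{skill_text}\n"
                  "Answer as JSON with a 'risk' field in [0, 1].")
        return name, json.loads(call_llm(prompt))
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = dict(pool.map(run, SUBTASKS.items()))
    # Taking the max is one conservative way to aggregate; the paper may differ.
    results["aggregate_risk"] = max(r["risk"] for r in results.values())
    return results

print(analyze("name: hello\ncommand: echo hi")["aggregate_risk"])
```

Narrow, separately prompted questions also keep each sub-task's verdict auditable on its own.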
Where Pith is reading between the lines
- Staged filtering methods like this could be adapted to other AI security tasks where full analysis of every item would be too expensive.
- Dividing detection into several narrow questions may reduce the risk that one broad query overlooks subtle problems.
- Adding human review for cases where the LLM jury disagrees could strengthen trust in the automated output.
Load-bearing premise
The 400-skill labeled benchmark together with the five tested adversarial evasion samples accurately represent the malicious skills present in large real-world marketplaces, and the LLM subtasks plus jury voting can reliably separate malicious intent from complex but benign natural-language instructions.
What would settle it
A new collection of malicious skills that pass the initial filters and cause the LLM subtasks and jury to classify them as benign, or a large set of benign skills that the system consistently flags as malicious.
Original abstract
OpenClaw's ClawHub marketplace hosts over 13,000 community-contributed agent skills, and between 13% and 26% of them contain security vulnerabilities according to recent audits. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural language instructions in SKILL.md files where prompt injection and social engineering attacks hide. Neither approach handles both modalities. SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer, filtering roughly 86% of benign skills in under 40ms on average at zero API cost. Layer 2 sends suspicious skills to an LLM, but instead of asking one broad question, it splits the analysis into four parallel sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency), each with its own prompt and structured output. Layer 3 puts high-risk skills before a jury of three different LLMs that vote independently and, if they disagree, debate before reaching a verdict. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the full pipeline on a $40 ARM single-board computer. On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1, outperforming ClawVet's 0.421, at an average cost of $0.006 per skill. Code, data, and benchmark are open-sourced.
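To make Layer 1 concrete: below is a minimal sketch of the kind of regex/AST/metadata feature extraction the abstract describes, with a weighted sum standing in for the trained XGBoost scorer. The specific features and weights are assumptions for illustration, not the paper's feature set.

```python
# Hypothetical Layer-1 features over a skill's Python payload and SKILL.md.
import ast
import re

def extract_features(py_source: str, skill_md: str) -> dict:
    tree = ast.parse(py_source)
    call_names = [n.func.id for n in ast.walk(tree)
                  if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]
    return {
        "dangerous_calls": sum(c in {"eval", "exec", "compile"}
                               for c in call_names),
        "imports_subprocess": int(any(
            isinstance(n, ast.Import)
            and any(a.name == "subprocess" for a in n.names)
            for n in ast.walk(tree))),
        "b64_blobs": len(re.findall(r"[A-Za-z0-9+/]{40,}={0,2}", py_source)),
        "urls_in_md": len(re.findall(r"https?://\S+", skill_md)),
    }

def layer1_risk(features: dict) -> float:
    # A trained XGBoost model would produce this score; a weighted sum
    # keeps the sketch self-contained.
    weights = {"dangerous_calls": 0.4, "imports_subprocess": 0.3,
               "b64_blobs": 0.2, "urls_in_md": 0.1}
    return min(1.0, sum(weights[k] * v for k, v in features.items()))

feats = extract_features("import subprocess\nsubprocess.run(['ls'])",
                         "# My skill\nSee https://example.com")
print(feats, layer1_risk(feats))  # prints the feature dict and a risk of 0.4
```

Because these checks are local and cheap, a threshold on this score is what lets roughly 86% of skills exit before any API call.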
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SkillSieve, a three-layer hierarchical triage framework for detecting malicious AI agent skills in marketplaces such as ClawHub. Layer 1 applies fast regex, AST, and metadata checks via an XGBoost feature scorer to filter the majority of benign skills at near-zero cost. Layer 2 decomposes analysis of remaining skills into four parallel LLM sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency). Layer 3 escalates high-risk cases to a jury of three LLMs that vote and debate if needed. The system is evaluated on the full 49,592-skill ClawHub corpus plus adversarial samples, reporting 0.800 F1 on a 400-skill labeled benchmark (vs. ClawVet at 0.421 F1) at an average cost of $0.006 per skill, with deployment tested on low-power ARM hardware. Code, data, and benchmark are open-sourced.
Significance. If the empirical results hold, SkillSieve provides a practical, cost-efficient solution to a real security gap: natural-language prompt-injection and social-engineering attacks embedded in community-contributed agent skills that neither regex scanners nor formal static analyzers can reliably catch. The hierarchical design and multi-LLM jury mechanism represent a concrete advance over single-pass LLM or baseline scanners. The open-sourcing of code, data, and the 400-skill benchmark is a clear strength that supports reproducibility and future work.
major comments (1)
- [Abstract] The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.
minor comments (1)
- [Abstract] The phrases 'five adversarial evasion samples' and 'five evasion techniques' are mentioned without even a one-sentence characterization of the techniques; adding this would help readers assess the robustness claim.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The concern about insufficient detail on benchmark labeling is valid and directly impacts the interpretability of our primary result. We address it below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Abstract] The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.
Authors: We agree that the manuscript provides insufficient detail on how the 400-skill benchmark was constructed and labeled, limiting the ability to interpret the F1 score as evidence of the framework's effectiveness rather than labeling artifacts. In the revised manuscript we will add a dedicated subsection in the Evaluation section describing: (1) the stratified sampling method from the 49,592-skill ClawHub corpus, (2) the explicit criteria for malicious vs. benign labels based on our threat model (prompt injection, unauthorized permissions, covert behavior, social engineering), (3) the annotation protocol including annotator expertise in AI security, (4) inter-annotator agreement, and (5) the generation and inclusion of the five adversarial evasion samples. The open-sourced benchmark release will include the full annotation guidelines. These changes will allow readers to assess label reliability.
Revision promised: yes.
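One standard way to report the promised inter-annotator agreement is Cohen's kappa for two annotators; below is a minimal sketch with hypothetical labels (the benchmark's real annotation data is not part of the excerpted text).

```python
# Cohen's kappa for two annotators; labels here are invented for illustration.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    assert len(a) == len(b) and a, "need two equal-length label lists"
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[lab] / n) * (cb[lab] / n)               # chance agreement
             for lab in set(ca) | set(cb))
    return (po - pe) / (1 - pe)

# Hypothetical verdicts from two annotators over ten skills (1 = malicious).
ann1 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
ann2 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
print(round(cohens_kappa(ann1, ann2), 3))  # ~0.783 for these made-up labels
```

If both annotators assigned identical label distributions with perfect agreement, pe reaches 1 and the statistic is undefined; real reporting would note such degenerate cases.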
Circularity Check
No circularity in derivation or evaluation chain
Full rationale
The paper describes a hierarchical detection framework evaluated empirically on an external 400-skill labeled benchmark drawn from the ClawHub corpus, reporting F1 scores and costs without any equations, derivations, fitted parameters renamed as predictions, or self-citations that bear the load of the central claims. The methodology (regex/AST/XGBoost filtering, four LLM subtasks, jury voting) is defined independently of the benchmark outcomes, and performance is presented as measured against that benchmark rather than constructed from it. No self-definitional loops, ansatzes via prior author work, or renaming of known results appear in the provided text.
Axiom & Free-Parameter Ledger
free parameters (2)
- XGBoost decision thresholds and feature weights
- Layer escalation risk thresholds
axioms (1)
- Domain assumption: LLMs given structured prompts on intent alignment, permission justification, covert behavior, and cross-file consistency can produce reliable signals for malicious skills.
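These two threshold families set the escalation rates, and the escalation rates drive the average cost the paper reports. A back-of-envelope check under assumed per-layer prices follows; the paper gives only the $0.006 average, so the breakdown below is invented purely to show the arithmetic.

```python
# Expected cost per skill under a staged-escalation model. Layer 1 is local
# and free; the per-layer prices and escalation rates below are assumptions.
def avg_cost(p_to_l2: float, p_to_l3: float,
             cost_l2: float, cost_l3: float) -> float:
    return p_to_l2 * (cost_l2 + p_to_l3 * cost_l3)

# E.g. ~14% of skills reach Layer 2 (86% filtered) and ~10% of those reach
# the jury; with these invented prices the average lands near $0.006.
print(avg_cost(0.14, 0.10, cost_l2=0.035, cost_l3=0.08))  # -> 0.00602
```

The same function makes the sensitivity visible: loosening the Layer-1 filter multiplies cost linearly through p_to_l2.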
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "three-layer detection framework that applies progressively deeper analysis only where needed... Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer... Layer 2 splits the analysis into four parallel sub-tasks... Layer 3 puts high-risk skills before a jury of three different LLMs"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1... at an average cost of $0.006 per skill"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry. Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.
- Exploiting LLM Agent Supply Chains via Payload-less Skills. Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces. SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
- Behavioral Integrity Verification for AI Agent Skills. BIV audits AI agent skills at scale, finding 80% deviate from declared behavior on 49,943 skills and achieving 0.946 F1 for malicious skill detection.
- From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills. SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
Reference graph
Works this paper leans on
- [1] OpenClaw. OpenClaw: Your own personal AI assistant. https://github.com/openclaw/openclaw, 2026.
- [2] OpenClaw. ClawHub: Skill directory for OpenClaw. https://github.com/openclaw/clawhub, 2026.
- [3] OpenClaw. Skill format specification. https://github.com/openclaw/clawhub/blob/main/docs/skill-format.md, 2026.
- [4] Snyk Labs. ToxicSkills: Malicious AI agent skills in ClawHub. https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/, February 2026.
- [5] Koi Security. ClawHavoc: 341 malicious skills found by the bot they were targeting. https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting, February 2026.
- [6] Liu, Y., Wang, W., Feng, R., Zhang, Y., Xu, G., Deng, G., Li, Y., and Zhang, L. Agent skills in the wild: An empirical study of security vulnerabilities at scale. arXiv:2601.10338, January 2026.
- [7] Liu, Y., Chen, Z., Zhang, Y., Deng, G., Li, Y., Ning, J., Zhang, Y., and Zhang, L. Y. Malicious agent skills in the wild: A large-scale security empirical study. arXiv:2602.06547, February 2026.
- [8] Bhardwaj, V. P. Formal analysis and supply chain security for agentic AI skills. arXiv:2603.00195, February 2026.
- [9] Shaikh, M. ClawVet: Skill vetting & supply chain security for the OpenClaw ecosystem. https://github.com/MohibShaikh/clawvet, 2026.
- [10] VirusTotal. From automation to infection: How OpenClaw agent skills are being weaponized. https://blog.virustotal.com/2026/02/from-automation-to-infection-how.html, February 2026.
- [11] Guo, Z., Chen, Z., Nie, X., Lin, J., Zhou, Y., and Zhang, W. SkillProbe: Security auditing for emerging agent skill marketplaces via multi-agent collaboration. arXiv:2603.21019, March 2026.
- [12] Xu, R. and Yan, Y. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv:2602.12430, February 2026.
- [13] AuthMind. OpenClaw's 230 malicious skills: What agentic AI supply chains teach us about the need to evolve identity security. https://www.authmind.com/blogs/openclaw-malicious-skills-agentic-ai-supply-chain, 2026.
- [14] 1Password. From magic to malware: How OpenClaw's agent skills become an attack surface. https://1password.com/blog/from-magic-to-malware-how-openclaws-agent-skills-become-an-attack-surface, 2026.
- [15] HKCERT. OpenClaw's rapid adoption exposes skills supply chain and fake installer risks in a high-privilege AI agent platform. https://www.hkcert.org/blog/openclaw-s-rapid-adoption-exposes-skills-supply-chain-and-fake-installer-risks-in-a-high-privilege-ai-agent-platform, March 2026.
- [16] Trend Micro. Malicious OpenClaw skills used to distribute Atomic macOS Stealer. https://www.trendmicro.com/en_us/research/26/b/openclaw-skills-used-to-distribute-atomic-macos-stealer.html, February 2026.
- [17] Paubox. Malicious crypto skills compromise OpenClaw AI assistant users. https://www.paubox.com/blog/malicious-crypto-skills-compromise-openclaw-ai-assistant-users, 2026.
- [18] OWASP. OWASP Agentic Skills Top 10. https://owasp.org/www-project-agentic-skills-top-10/, 2026.
- [19] Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In KDD, 2016.
- [20] Tree-sitter. Official documentation / project page. https://tree-sitter.github.io/tree-sitter/.
- [21] Ohm, M. et al. Backstabber's knife collection: A review of open source software supply chain attacks. In DIMVA, 2020.
- [22] Zhu, J., Zhang, L., Guo, W., and Liu, Y. SkillClone: Multi-modal clone detection and clone propagation analysis in the agent skill ecosystem. arXiv:2603.22447, March 2026.
- [23] Wang, L., Wang, Z., and Xu, A. SkillTester: Benchmarking utility and security of agent skills. arXiv:2603.28815, March 2026.
- [24] Jia, X., Liao, J., Qin, S., Gu, J., Ren, W., Cao, X., Liu, Y., and Torr, P. SkillJect: Automating stealthy skill-based prompt injection for coding agents with trace-driven closed-loop refinement. arXiv:2602.14211, February 2026.
- [25] Zhang, H., Nian, Y., and Zhao, Y. Agent Audit: A security analysis system for LLM agent applications. arXiv:2603.22853, March 2026.
- [26] Rondanini, C., Carminati, B., Ferrari, E., Gaudiano, A., and Kundu, A. Malware detection at the edge with lightweight LLMs: A performance evaluation. arXiv:2503.04302, March 2025.
- [27] Rondanini, C., Carminati, B., Ferrari, E., Lardo, N., and Kundu, A. LoRA-based parameter-efficient LLMs for continuous learning in edge-based malware detection. arXiv:2602.11655, February 2026.
- [28] OWASP. Top 10 for Agentic Applications for 2026. https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/, December 2025.
- [29] JFrog. OpenClaw can be hazardous to your software supply chain. https://jfrog.com/blog/giving-openclaw-the-keys-to-your-kingdom-read-this-first/, 2026.
- [30] Semgrep. OpenClaw security engineer's cheat sheet. https://semgrep.dev/blog/2026/openclaw-security-engineers-cheat-sheet/, 2026.