SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3
The pith
A three-layer triage framework detects malicious skills in AI agent marketplaces by filtering benign ones cheaply before applying targeted LLM analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 uses regex, AST, and metadata checks with an XGBoost scorer to filter out most benign skills in milliseconds at zero API cost. Layer 2 sends remaining skills to an LLM, splitting the analysis across four parallel sub-tasks covering intent alignment, permission justification, covert behavior detection, and cross-file consistency. Layer 3 routes high-risk items to a jury of three different LLMs that vote independently and debate disagreements before issuing a final verdict. On a 400-skill labeled benchmark drawn from real marketplace data, the system reaches higher detection performance than the ClawVet baseline (0.800 F1 versus 0.421) at an average cost of $0.006 per skill.
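The control flow this claim describes is compact enough to sketch. The snippet below is a minimal illustration only: the function names, thresholds, and toy scorers are assumptions standing in for the paper's actual regex/AST features, LLM prompts, and trained XGBoost model.

```python
# Hypothetical sketch of the three-layer triage flow; all names, thresholds,
# and toy scorers are illustrative stand-ins, not the paper's implementation.
import re
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str   # "benign" or "malicious"
    layer: int   # layer that produced the decision
    risk: float  # risk score in [0, 1]

def layer1_score(skill_text: str) -> float:
    # Stand-in for regex/AST/metadata features fed to an XGBoost scorer.
    hits = len(re.findall(r"curl|base64|subprocess|eval\(", skill_text))
    return min(1.0, hits / 4)

def layer2_subtasks(skill_text: str) -> float:
    # Stand-in for four parallel LLM sub-task scores (intent, permissions,
    # covert behavior, cross-file consistency), averaged.
    subtask_scores = [0.6, 0.7, 0.8, 0.5]  # would come from four LLM calls
    return sum(subtask_scores) / len(subtask_scores)

def layer3_jury(skill_text: str) -> str:
    # Stand-in for three independent LLM votes with a debate round on splits.
    votes = ["malicious", "malicious", "benign"]  # would come from three LLMs
    return max(set(votes), key=votes.count)

def triage(skill_text: str, benign_cutoff=0.2, escalate_cutoff=0.7) -> Verdict:
    s1 = layer1_score(skill_text)
    if s1 < benign_cutoff:            # most skills exit here at zero API cost
        return Verdict("benign", 1, s1)
    s2 = layer2_subtasks(skill_text)
    if s2 < escalate_cutoff:          # Layer 2 settles clear-cut cases
        return Verdict("malicious" if s2 >= 0.5 else "benign", 2, s2)
    return Verdict(layer3_jury(skill_text), 3, s2)  # jury confirms high risk

print(triage("echo hello"))  # exits at Layer 1 as benign
```

The design point the sketch makes concrete: cost is incurred only on escalation, so the cheap Layer-1 exit dominates the average.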
What carries the argument
The three-layer hierarchical triage that starts with lightweight code and metadata filters, moves to structured multi-prompt LLM subtasks for deeper inspection, and ends with LLM jury voting for confirmation on uncertain cases.
If this is right
- Most benign skills are discarded in under 40 milliseconds using only local checks with no API cost.
- Splitting analysis into four parallel subtasks allows separate checks for intent, permissions, covert actions, and file consistency (see the fan-out sketch after this list).
- Jury voting among different LLMs resolves disagreements on high-risk skills before a final decision.
- The complete pipeline can process the full 49,000-skill corpus on a single low-power ARM board at low average cost per skill.
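A fan-out of the four sub-tasks, referenced in the second bullet above, could be as simple as the following. The prompt wording, the call_llm() helper, the JSON schema, and the max-aggregation across sub-tasks are all assumptions for illustration; the paper's actual prompts and output formats are not reproduced here.

```python
# Hypothetical parallel fan-out of the four Layer-2 sub-tasks.
import json
from concurrent.futures import ThreadPoolExecutor

SUBTASKS = {
    "intent_alignment": "Does the skill's code match its stated purpose?",
    "permission_justification": "Is each requested permission justified?",
    "covert_behavior": "Any hidden exfiltration, persistence, or obfuscation?",
    "cross_file_consistency": "Do SKILL.md claims match the bundled files?",
}

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call returning structured JSON.
    return json.dumps({"risk": 0.1, "rationale": "stub"})

def analyze(skill_text: str) -> dict:
    def run(item):
        name, question = item
        prompt = (f"{question}\n\nSKILL CONTENTS:\n{skill_text}\n"
                  "Answer as JSON with a 'risk' field in [0, 1].")
        return name, json.loads(call_llm(prompt))
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = dict(pool.map(run, SUBTASKS.items()))
    # Taking the max is one conservative way to aggregate; the paper may differ.
    results["aggregate_risk"] = max(r["risk"] for r in results.values())
    return results

print(analyze("name: hello\ncommand: echo hi")["aggregate_risk"])
```

Narrow, separately prompted questions also keep each sub-task's verdict auditable on its own.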
Where Pith is reading between the lines
- Staged filtering methods like this could be adapted to other AI security tasks where full analysis of every item would be too expensive.
- Dividing detection into several narrow questions may reduce the risk that one broad query overlooks subtle problems.
- Adding human review for cases where the LLM jury disagrees could strengthen trust in the automated output.
Load-bearing premise
The 400-skill labeled benchmark together with the five tested adversarial evasion samples accurately represent the malicious skills present in large real-world marketplaces, and the LLM subtasks plus jury voting can reliably separate malicious intent from complex but benign natural-language instructions.
What would settle it
A new collection of malicious skills that pass the initial filters and cause the LLM subtasks and jury to classify them as benign, or a large set of benign skills that the system consistently flags as malicious.
Original abstract
OpenClaw's ClawHub marketplace hosts over 13,000 community-contributed agent skills, and between 13% and 26% of them contain security vulnerabilities according to recent audits. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural language instructions in SKILL.md files where prompt injection and social engineering attacks hide. Neither approach handles both modalities. SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer, filtering roughly 86% of benign skills in under 40ms on average at zero API cost. Layer 2 sends suspicious skills to an LLM, but instead of asking one broad question, it splits the analysis into four parallel sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency), each with its own prompt and structured output. Layer 3 puts high-risk skills before a jury of three different LLMs that vote independently and, if they disagree, debate before reaching a verdict. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the full pipeline on a $40 ARM single-board computer. On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1, outperforming ClawVet's 0.421, at an average cost of $0.006 per skill. Code, data, and benchmark are open-sourced.
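To make Layer 1 concrete: below is a minimal sketch of the kind of regex/AST/metadata feature extraction the abstract describes, with a weighted sum standing in for the trained XGBoost scorer. The specific features and weights are assumptions for illustration, not the paper's feature set.

```python
# Hypothetical Layer-1 features over a skill's Python payload and SKILL.md.
import ast
import re

def extract_features(py_source: str, skill_md: str) -> dict:
    tree = ast.parse(py_source)
    call_names = [n.func.id for n in ast.walk(tree)
                  if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]
    return {
        "dangerous_calls": sum(c in {"eval", "exec", "compile"}
                               for c in call_names),
        "imports_subprocess": int(any(
            isinstance(n, ast.Import)
            and any(a.name == "subprocess" for a in n.names)
            for n in ast.walk(tree))),
        "b64_blobs": len(re.findall(r"[A-Za-z0-9+/]{40,}={0,2}", py_source)),
        "urls_in_md": len(re.findall(r"https?://\S+", skill_md)),
    }

def layer1_risk(features: dict) -> float:
    # A trained XGBoost model would produce this score; a weighted sum
    # keeps the sketch self-contained.
    weights = {"dangerous_calls": 0.4, "imports_subprocess": 0.3,
               "b64_blobs": 0.2, "urls_in_md": 0.1}
    return min(1.0, sum(weights[k] * v for k, v in features.items()))

feats = extract_features("import subprocess\nsubprocess.run(['ls'])",
                         "# My skill\nSee https://example.com")
print(feats, layer1_risk(feats))  # prints the feature dict and a risk of 0.4
```

Because these checks are local and cheap, a threshold on this score is what lets roughly 86% of skills exit before any API call.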
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SkillSieve, a three-layer hierarchical triage framework for detecting malicious AI agent skills in marketplaces such as ClawHub. Layer 1 applies fast regex, AST, and metadata checks via an XGBoost feature scorer to filter the majority of benign skills at near-zero cost. Layer 2 decomposes analysis of remaining skills into four parallel LLM sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency). Layer 3 escalates high-risk cases to a jury of three LLMs that vote and debate if needed. The system is evaluated on the full 49,592-skill ClawHub corpus plus adversarial samples, reporting 0.800 F1 on a 400-skill labeled benchmark (vs. ClawVet at 0.421 F1) at an average cost of $0.006 per skill, with deployment tested on low-power ARM hardware. Code, data, and benchmark are open-sourced.
Significance. If the empirical results hold, SkillSieve provides a practical, cost-efficient solution to a real security gap: natural-language prompt-injection and social-engineering attacks embedded in community-contributed agent skills that neither regex scanners nor formal static analyzers can reliably catch. The hierarchical design and multi-LLM jury mechanism represent a concrete advance over single-pass LLM or baseline scanners. The open-sourcing of code, data, and the 400-skill benchmark is a clear strength that supports reproducibility and future work.
major comments (1)
- [Abstract] The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.
minor comments (1)
- [Abstract] The phrases 'five adversarial evasion samples' and 'five evasion techniques' are mentioned without even a one-sentence characterization of the techniques; adding this would help readers assess the robustness claim.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The concern about insufficient detail on benchmark labeling is valid and directly impacts the interpretability of our primary result. We address it below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Abstract] The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.
Authors: We agree that the manuscript provides insufficient detail on how the 400-skill benchmark was constructed and labeled, limiting the ability to interpret the F1 score as evidence of the framework's effectiveness rather than labeling artifacts. In the revised manuscript we will add a dedicated subsection in the Evaluation section describing: (1) the stratified sampling method from the 49,592-skill ClawHub corpus, (2) the explicit criteria for malicious vs. benign labels based on our threat model (prompt injection, unauthorized permissions, covert behavior, social engineering), (3) the annotation protocol including annotator expertise in AI security, (4) inter-annotator agreement, and (5) the generation and inclusion of the five adversarial evasion samples. The open-sourced benchmark release will include the full annotation guidelines. These changes will allow readers to assess label reliability.
Revision promised: yes.
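One standard way to report the promised inter-annotator agreement is Cohen's kappa for two annotators; below is a minimal sketch with hypothetical labels (the benchmark's real annotation data is not part of the excerpted text).

```python
# Cohen's kappa for two annotators; labels here are invented for illustration.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    assert len(a) == len(b) and a, "need two equal-length label lists"
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[lab] / n) * (cb[lab] / n)               # chance agreement
             for lab in set(ca) | set(cb))
    return (po - pe) / (1 - pe)

# Hypothetical verdicts from two annotators over ten skills (1 = malicious).
ann1 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
ann2 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
print(round(cohens_kappa(ann1, ann2), 3))  # ~0.783 for these made-up labels
```

If both annotators assigned identical label distributions with perfect agreement, pe reaches 1 and the statistic is undefined; real reporting would note such degenerate cases.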
Circularity Check
No circularity in derivation or evaluation chain
Full rationale
The paper describes a hierarchical detection framework evaluated empirically on an external 400-skill labeled benchmark drawn from the ClawHub corpus, reporting F1 scores and costs without any equations, derivations, fitted parameters renamed as predictions, or self-citations that bear the load of the central claims. The methodology (regex/AST/XGBoost filtering, four LLM subtasks, jury voting) is defined independently of the benchmark outcomes, and performance is presented as measured against that benchmark rather than constructed from it. No self-definitional loops, ansatzes via prior author work, or renaming of known results appear in the provided text.
Axiom & Free-Parameter Ledger
free parameters (2)
- XGBoost decision thresholds and feature weights
- Layer escalation risk thresholds
axioms (1)
- Domain assumption: LLMs given structured prompts on intent alignment, permission justification, covert behavior, and cross-file consistency can produce reliable signals for malicious skills.
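These two threshold families set the escalation rates, and the escalation rates drive the average cost the paper reports. A back-of-envelope check under assumed per-layer prices follows; the paper gives only the $0.006 average, so the breakdown below is invented purely to show the arithmetic.

```python
# Expected cost per skill under a staged-escalation model. Layer 1 is local
# and free; the per-layer prices and escalation rates below are assumptions.
def avg_cost(p_to_l2: float, p_to_l3: float,
             cost_l2: float, cost_l3: float) -> float:
    return p_to_l2 * (cost_l2 + p_to_l3 * cost_l3)

# E.g. ~14% of skills reach Layer 2 (86% filtered) and ~10% of those reach
# the jury; with these invented prices the average lands near $0.006.
print(avg_cost(0.14, 0.10, cost_l2=0.035, cost_l3=0.08))  # -> 0.00602
```

The same function makes the sensitivity visible: loosening the Layer-1 filter multiplies cost linearly through p_to_l2.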
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "three-layer detection framework that applies progressively deeper analysis only where needed... Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer... Layer 2 splits the analysis into four parallel sub-tasks... Layer 3 puts high-risk skills before a jury of three different LLMs"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1... at an average cost of $0.006 per skill"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry. Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.
- Exploiting LLM Agent Supply Chains via Payload-less Skills. Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces. SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
- Behavioral Integrity Verification for AI Agent Skills. BIV audits AI agent skills at scale, finding 80% deviate from declared behavior on 49,943 skills and achieving 0.946 F1 for malicious skill detection.
- From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills. SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
Reference graph
Works this paper leans on
- [1] OpenClaw. OpenClaw: Your own personal AI assistant. https://github.com/openclaw/openclaw, 2026.
- [2] OpenClaw. ClawHub: Skill directory for OpenClaw. https://github.com/openclaw/clawhub, 2026.
- [3] OpenClaw. Skill format specification. https://github.com/openclaw/clawhub/blob/main/docs/skill-format.md, 2026.
- [4] Snyk Labs. ToxicSkills: Malicious AI agent skills in ClawHub. https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/, February 2026.
- [5] Koi Security. ClawHavoc: 341 malicious skills found by the bot they were targeting. https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting, February 2026.
- [6] Liu, Y., Wang, W., Feng, R., Zhang, Y., Xu, G., Deng, G., Li, Y., and Zhang, L. Agent skills in the wild: An empirical study of security vulnerabilities at scale. arXiv:2601.10338, January 2026.
- [7] Liu, Y., Chen, Z., Zhang, Y., Deng, G., Li, Y., Ning, J., Zhang, Y., and Zhang, L. Y. Malicious agent skills in the wild: A large-scale security empirical study. arXiv:2602.06547, February 2026.
- [8] Bhardwaj, V. P. Formal analysis and supply chain security for agentic AI skills. arXiv:2603.00195, February 2026.
- [9] Shaikh, M. ClawVet: Skill vetting & supply chain security for the OpenClaw ecosystem. https://github.com/MohibShaikh/clawvet, 2026.
- [10] VirusTotal. From automation to infection: How OpenClaw agent skills are being weaponized. https://blog.virustotal.com/2026/02/from-automation-to-infection-how.html, February 2026.
- [11] Guo, Z., Chen, Z., Nie, X., Lin, J., Zhou, Y., and Zhang, W. SkillProbe: Security auditing for emerging agent skill marketplaces via multi-agent collaboration. arXiv:2603.21019, March 2026.
- [12] Xu, R. and Yan, Y. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv:2602.12430, February 2026.
- [13] AuthMind. OpenClaw's 230 malicious skills: What agentic AI supply chains teach us about the need to evolve identity security. https://www.authmind.com/blogs/openclaw-malicious-skills-agentic-ai-supply-chain, 2026.
- [14] 1Password. From magic to malware: How OpenClaw's agent skills become an attack surface. https://1password.com/blog/from-magic-to-malware-how-openclaws-agent-skills-become-an-attack-surface, 2026.
- [15] HKCERT. OpenClaw's rapid adoption exposes skills supply chain and fake installer risks in a high-privilege AI agent platform. https://www.hkcert.org/blog/openclaw-s-rapid-adoption-exposes-skills-supply-chain-and-fake-installer-risks-in-a-high-privilege-ai-agent-platform, March 2026.
- [16] Trend Micro. Malicious OpenClaw skills used to distribute Atomic macOS Stealer. https://www.trendmicro.com/en_us/research/26/b/openclaw-skills-used-to-distribute-atomic-macos-stealer.html, February 2026.
- [17] Paubox. Malicious crypto skills compromise OpenClaw AI assistant users. https://www.paubox.com/blog/malicious-crypto-skills-compromise-openclaw-ai-assistant-users, 2026.
- [18] OWASP. OWASP Agentic Skills Top 10. https://owasp.org/www-project-agentic-skills-top-10/, 2026.
- [19] Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In KDD, 2016.
- [20] Tree-sitter. Official documentation / project page. https://tree-sitter.github.io/tree-sitter/.
- [21] Ohm, M. et al. Backstabber's knife collection: A review of open source software supply chain attacks. In DIMVA, 2020.
- [22] Zhu, J., Zhang, L., Guo, W., and Liu, Y. SkillClone: Multi-modal clone detection and clone propagation analysis in the agent skill ecosystem. arXiv:2603.22447, March 2026.
- [23] Wang, L., Wang, Z., and Xu, A. SkillTester: Benchmarking utility and security of agent skills. arXiv:2603.28815, March 2026.
- [24] Jia, X., Liao, J., Qin, S., Gu, J., Ren, W., Cao, X., Liu, Y., and Torr, P. SkillJect: Automating stealthy skill-based prompt injection for coding agents with trace-driven closed-loop refinement. arXiv:2602.14211, February 2026.
- [25] Zhang, H., Nian, Y., and Zhao, Y. Agent Audit: A security analysis system for LLM agent applications. arXiv:2603.22853, March 2026.
- [26] Rondanini, C., Carminati, B., Ferrari, E., Gaudiano, A., and Kundu, A. Malware detection at the edge with lightweight LLMs: A performance evaluation. arXiv:2503.04302, March 2025.
- [27] Rondanini, C., Carminati, B., Ferrari, E., Lardo, N., and Kundu, A. LoRA-based parameter-efficient LLMs for continuous learning in edge-based malware detection. arXiv:2602.11655, February 2026.
- [28] OWASP. Top 10 for Agentic Applications for 2026. https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/, December 2025.
- [29] JFrog. OpenClaw can be hazardous to your software supply chain. https://jfrog.com/blog/giving-openclaw-the-keys-to-your-kingdom-read-this-first/, 2026.
- [30] Semgrep. OpenClaw security engineer's cheat sheet. https://semgrep.dev/blog/2026/openclaw-security-engineers-cheat-sheet/, 2026.