SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
hub Canonical reference
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Agentic systems increasingly rely on reusable procedural capabilities, \textit{a.k.a., agentic skills}, to execute long-horizon workflows reliably. These capabilities are callable modules that package procedural knowledge with explicit applicability conditions, execution policies, termination criteria, and reusable interfaces. Unlike one-off plans or atomic tool calls, skills operate (and often do well) across tasks. This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies. The first is a system-level set of \textbf{seven design patterns} capturing how skills are packaged and executed in practice, from metadata-driven progressive disclosure and executable code skills to self-evolving libraries and marketplace distribution. The second is an orthogonal \textbf{representation $\times$ scope} taxonomy describing what skills \emph{are} (natural language, code, policy, hybrid) and what environments they operate over (web, OS, software engineering, robotics). We analyze the security and governance implications of skill-based agents, covering supply-chain risks, prompt injection via skill payloads, and trust-tiered execution, grounded by a case study of the ClawHavoc campaign in which nearly 1{,}200 malicious skills infiltrated a major agent marketplace, exfiltrating API keys, cryptocurrency wallets, and browser credentials at scale. We further survey deterministic evaluation approaches, anchored by recent benchmark evidence that curated skills can substantially improve agent success rates while self-generated skills may degrade them. We conclude with open challenges toward robust, verifiable, and certifiable skills for real-world autonomous agents.
hub tools
citation-role summary
citation-polarity summary
years
2026 34roles
background 9polarities
background 9representative citing papers
SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.
Five practical attacks on the x402 agentic payment protocol are demonstrated across authorization, binding, replay protection, and web handling, validated on local chains, Base Sepolia, live endpoints, and three open-source SDKs.
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.
CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
SIGIL cryptographically seals the audit-runtime gap for LLM skills via an on-chain registry with four publication types, DAO vetting, and a runtime verification loader that enforces integrity and permissions.
This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.
Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via a public hub.
The first systematization of blockchain-based agent-to-agent payments organizes designs into discovery, authorization, execution, and accounting stages while identifying trust and security gaps.
SciVisAgentSkills provides reusable agent skills that raise mean task scores on a 108-task SciVis benchmark when paired with Codex and Claude Code agents.
Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.
SkillOpt introduces a controllable text-space optimizer that evolves agent skills via add/delete/replace edits accepted only on strict held-out validation improvement, reporting consistent gains across 52 model-benchmark-harness combinations.
Skill shadowing, not context overhead, is the dominant cause of performance degradation when LLM agent skill libraries expand.
A3S-Bench evaluates LLM agents against temporal, spatial, and semantic evasions, raising average risk trigger rates from 28.3% to 52.6% across 2,254 trajectories and 20 scenarios.
Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
DataCOPE uses verifier-guided contrastive distillation from agent trajectories to discover skills, yielding average gains of 9.71% on report-style and 32.30% on reasoning-style data analysis tasks across four model settings.
SkillVetBench is a two-stage benchmark combining natural-language semantic vetting and instrumented sandbox execution to detect and provide runtime evidence for malicious skills in open agent platforms, with experiments showing static methods miss up to 89% of threats.
Re-analysis of MCP-grounded CTF agent runs shows Skills add only 8.9 pp (p=0.71) when tool feedback is strict and low-latency, supporting the hypothesis that high environment-feedback bandwidth reduces the marginal value of curated procedural knowledge.
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
citing papers explorer
-
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.
-
Uncertainty Propagation in LLM-Based Systems
This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.
-
More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries
Skill shadowing, not context overhead, is the dominant cause of performance degradation when LLM agent skill libraries expand.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.