SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He · 2026 · cs.AI · arXiv 2602.12670

Canonical reference. 76% of citing Pith papers cite this work as background.

92 Pith papers citing it

abstract

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2–3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
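The abstract reports results as pass-rate deltas in percentage points between conditions. The following is a minimal sketch of that arithmetic, not the benchmark's actual harness: the condition labels (`no_skills`, `curated`, `self_generated`), the function names, and the toy trajectories are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from collections import defaultdict


def pass_rates(trajectories):
    """Return {condition: pass rate in percent} from (condition, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for condition, passed in trajectories:
        totals[condition] += 1
        passes[condition] += int(passed)
    return {c: 100.0 * passes[c] / totals[c] for c in totals}


def delta_pp(rates, condition, baseline="no_skills"):
    """Difference in pass rate versus the baseline, in percentage points."""
    return rates[condition] - rates[baseline]


# Hypothetical toy data: four trajectories per condition (not from the paper).
toy = [
    ("no_skills", False), ("no_skills", True),
    ("no_skills", False), ("no_skills", False),
    ("curated", True), ("curated", True),
    ("curated", False), ("curated", True),
    ("self_generated", False), ("self_generated", True),
    ("self_generated", False), ("self_generated", False),
]

rates = pass_rates(toy)
for name in ("curated", "self_generated"):
    print(f"{name}: {delta_pp(rates, name):+.1f}pp vs no_skills")
```

A percentage-point delta is a plain difference of two rates, so a move from 25% to 75% is +50pp rather than a 200% relative gain. Applying the same function per domain or per task would surface the kind of spread and negative deltas the abstract mentions.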

citation-role summary

background: 13 · dataset: 2 · baseline: 1 · other: 1

citation-polarity summary

background: 13 · unclear: 2 · baseline: 1 · use dataset: 1

representative citing papers

Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware

cs.CR · 2026-07-02 · unverdicted · novelty 7.0

SkillCloak evades existing static scanners for agent skill malware at high rates, while SkillDetonate detects 97% of attacks at 2% false-positive rate using sandboxed runtime behavior analysis.

SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces

cs.SE · 2026-07-02 · unverdicted · novelty 7.0

SkillFuzz is an execution-free fuzzing method that extracts skill contracts and applies contract-guided MCTS to discover over 1000 implicit intents in LLM skill marketplaces.

From Registry to Repository: How AI Agent Skills Are Written, Adapted, and Maintained

cs.SE · 2026-07-01 · unverdicted · novelty 7.0

Empirical study of 41k+ AI agent skills finds reuse is mostly one-time verbatim copying with 53% never modified afterward and maintenance focused on additive local adaptations.

Generative Skill Composition for LLM Agents

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

SkillComposer performs task-conditioned skill sequence prediction with a constrained autoregressive decoder to jointly output skill subset, count, and order, raising pass rates by 23.1 and 18.2 percentage points on two production coding agents over no-skill baselines.

SkillAudit: From Fixed-Suite Benchmarking to Skill-Centered Assessment

cs.AI · 2026-06-21 · unverdicted · novelty 7.0

SkillAudit is an automated framework that generates capability-aligned tasks from skill packages, executes them in sandboxes, and produces reports on utility, cost, and safety via baseline comparisons and two-stage risk detection.

Co-Evolving Skill Generation and Policy Optimization

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

AIP: A Graph Representation for Learning and Governing Agent Skills

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

AIP models skills as graphs of discrete steps connected by typed I/O edges under a validated schema, raising agent mean reward from 0.60 to 0.71 and pass rate from 53% to 67% on 27 SkillsBench tasks while enabling node-level fixes.

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

SkillHarm benchmark shows current AI agents are vulnerable to lifecycle-aware skill poisoning with success rates up to 86.3% for fixed-payload attacks and 69.3% for self-mutating attacks.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

Empirical study of EvoMap shows 98% of assets never reused, scores driven by self-reported metadata, and 84% of assets using vacuous validation tests.

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

SkillEvolBench is a new diagnostic benchmark that evaluates the transition from episodic experience to procedural skills in LLM agents using role-conditioned task families and frozen deployment tests.

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

cs.AI · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

The paper diagnoses library drift in self-evolving LLM skill libraries and demonstrates a governance recipe raising pass@1 from 0.258 to 0.584 on MBPP+ hard-100.

ContractBench: Can LLM Agents Preserve Observation Contracts?

cs.SE · 2026-05-17 · conditional · novelty 7.0

ContractBench shows that LLM agents frequently violate observation contracts by using expired artifacts or corrupting their byte integrity, with no model exceeding 80% success and notable scaling irregularities across families.

SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

cs.CR · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.

Counterfactual Trace Auditing of LLM Agent Skills

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Counterfactual Trace Auditing detects 522 behavioral change patterns from skills on 49 tasks where pass rates shift only 0.3 points on average.

SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.

Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

cs.SE · 2026-05-09 · conditional · novelty 7.0

SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.

Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

SkillCom: Decomposing LLM-based Semantic Communication into Task and Channel Aware Skills

eess.SY · 2026-05-04 · unverdicted · novelty 7.0

SkillCom decomposes LLM semantic communication into four skills connected by structured semantic-unit interfaces and outperforms monolithic LLM baselines in robustness on multi-hop QA and dialogue state tracking tasks.

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

citing papers explorer

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

cs.CV · 2026-05-08 · unverdicted · none · ref 37 · internal anchor

SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
