SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
37 Pith papers cite this work. Polarity classification is still indexing.
abstract
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (from +4.5pp in Software Engineering to +51.9pp in Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
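The three-condition protocol reduces to per-condition pass rates and percentage-point deltas against the no-Skills baseline. A minimal sketch of that bookkeeping, assuming one boolean verifier outcome per trajectory; the function and condition names are illustrative, not the SkillsBench API:

```python
from collections import defaultdict

def pass_rates(results):
    """results: iterable of (domain, condition, passed) tuples, one per
    trajectory; condition is one of "none", "curated", "self"."""
    totals = defaultdict(lambda: [0, 0])  # (domain, condition) -> [passes, trials]
    for domain, condition, passed in results:
        totals[(domain, condition)][0] += int(passed)
        totals[(domain, condition)][1] += 1
    return {k: 100.0 * p / n for k, (p, n) in totals.items()}

def skill_delta_pp(rates, domain, condition="curated"):
    # Lift over the no-Skills baseline, in percentage points.
    return rates[(domain, condition)] - rates[(domain, "none")]

results = [
    ("healthcare", "none", False), ("healthcare", "none", False),
    ("healthcare", "curated", True), ("healthcare", "curated", False),
]
rates = pass_rates(results)
print(skill_delta_pp(rates, "healthcare"))  # 50.0 - 0.0 = +50.0pp
```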
citing papers explorer
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
-
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
-
Counterfactual Trace Auditing of LLM Agent Skills
CTA framework detects 522 skill influence patterns in LLM agent traces across 49 tasks where average pass rate shifts only +0.3%, exposing evaluation gaps in behavioral effects like template copying and excess planning.
-
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.
-
Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
-
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
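NDCG@10, the metric SkillRet reports, can be computed as below for binary skill relevance; this is the standard textbook definition, not code from the paper:

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain over the top-k ranked items.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of an ideally ordered ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A retriever that ranks the only relevant skill at position 3:
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 1/log2(4) = 0.5
```

With binary relevance this linear-gain form matches the exponential-gain (2^rel - 1) variant some libraries use.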
-
SkillCom: Decomposing LLM-based Semantic Communication into Task and Channel Aware Skills
SkillCom decomposes LLM semantic communication into four skills connected by structured semantic-unit interfaces and outperforms monolithic LLM baselines in robustness on multi-hop QA and dialogue state tracking tasks.
-
Skill Retrieval Augmentation for Agentic AI
Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.
-
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
-
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
-
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
-
SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses
SkVM uses capability profiling and compiler-style techniques to make skills portable across LLMs and harnesses, raising task completion rates while cutting token use by up to 40% and delivering up to 3.2x speedup.
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
-
Evidence Over Plans: Online Trajectory Verification for Skill Distillation
PDI-guided distillation from environment-verified trajectories yields skills that surpass no-skill baselines and human-written skills across 86 tasks with far lower inference cost.
-
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.
-
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
-
Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries
GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.
-
PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors
PrefixGuard induces typed step adapters from agent traces offline then trains prefix-risk scorers on terminal outcomes, reaching 0.900/0.710/0.533/0.557 AUPRC on four benchmarks and beating raw-text baselines by 0.137 on average.
-
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut costs by 32%.
-
MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills
MedSkillAudit is a new domain-specific audit framework for medical research agent skills that achieved moderate agreement with expert reviews (ICC 0.449), exceeding the human inter-rater baseline (ICC 0.300).
-
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
When Agent Markets Arrive
DIAGON simulation shows agent markets produce 3.2 times more wealth than isolated agents, but institutional choices like transparency and competitive selection can reduce rather than increase performance.
-
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.
-
EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
EvoAgent is an evolvable LLM agent framework using structured skill learning, user-feedback loops, and hierarchical delegation that boosts GPT5.2 performance by about 28% in real-world trade scenarios under LLM-as-Judge evaluation.
-
Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
Bilevel optimization with outer-loop MCTS for skill structure and inner-loop LLM refinement improves agent accuracy on an operations-research question-answering dataset.
-
From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17.7% to 27.14%.
-
SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering
SkillMOO automatically evolves skill bundles for LLM coding agents via LLM-proposed edits and NSGA-II, achieving up to 131% higher pass rates and 32% lower costs on three SkillsBench tasks.
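The NSGA-II selection that SkillMOO builds on centers on non-dominated filtering over the two objectives (pass rate up, cost down). A minimal Pareto-front sketch under those assumed objectives, illustrative rather than SkillMOO's implementation:

```python
def dominates(a, b):
    """a dominates b when it is no worse on every objective and strictly
    better on at least one; all objectives are maximized, so cost is negated."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    # Keep every candidate that no other candidate dominates.
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Candidate skill bundles as (pass_rate, -cost):
bundles = [(0.60, -1.0), (0.75, -1.4), (0.75, -2.0), (0.50, -0.8)]
print(pareto_front(bundles))  # drops (0.75, -2.0): same pass rate, higher cost
```

Full NSGA-II additionally ranks successive fronts and breaks ties by crowding distance; the dominance test above is the core primitive.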
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
Gradual Cognitive Externalization: From Modeling Cognition to Constituting It
Ambient AI systems transition from modeling cognition to constituting part of users' cognitive architectures through sustained causal coupling, under a functionalist view and the no behaviorally invisible residual hypothesis.
-
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
-
Know When to Trust the Skill: Delayed Appraisal and Epistemic Vigilance for Single-Agent LLMs
MESA-S framework translates human metacognitive control into LLMs via delayed procedural probes and Metacognitive Skill Cards to separate parametric certainty from source trust and reduce overthinking.