hub Canonical reference

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He · 2026 · cs.AI · arXiv 2602.12670

Canonical reference. 76% of citing Pith papers cite this work as background.

97 Pith papers citing it

Background 76% of classified citations

open full Pith review browse 97 citing papers arXiv PDF

abstract

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points(pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 13 dataset 2 baseline 1 other 1

citation-polarity summary

background 13 unclear 2 baseline 1 use dataset 1

claims ledger

abstract Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points(pp), but effects vary widely by domain (+4.5pp for Softwar

co-cited works

representative citing papers

Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware

cs.CR · 2026-07-02 · unverdicted · novelty 7.0

SkillCloak evades existing static scanners for agent skill malware at high rates, while SkillDetonate detects 97% of attacks at 2% false-positive rate using sandboxed runtime behavior analysis.

SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces

cs.SE · 2026-07-02 · unverdicted · novelty 7.0

SkillFuzz is an execution-free fuzzing method that extracts skill contracts and applies contract-guided MCTS to discover over 1000 implicit intents in LLM skill marketplaces.

From Registry to Repository: How AI Agent Skills Are Written, Adapted, and Maintained

cs.SE · 2026-07-01 · unverdicted · novelty 7.0

Empirical study of 41k+ AI agent skills finds reuse is mostly one-time verbatim copying with 53% never modified afterward and maintenance focused on additive local adaptations.

Generative Skill Composition for LLM Agents

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

SkillComposer performs task-conditioned skill sequence prediction with a constrained autoregressive decoder to jointly output skill subset, count, and order, raising pass rates by 23.1 and 18.2 percentage points on two production coding agents over no-skill baselines.

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

EnterpriseClawBench is a benchmark for enterprise agents constructed from proprietary real-world sessions, with the reusable contribution being the construction and evaluation protocol rather than the data itself.

SkillAudit: From Fixed-Suite Benchmarking to Skill-Centered Assessment

cs.AI · 2026-06-21 · unverdicted · novelty 7.0

SkillAudit is an automated framework that generates capability-aligned tasks from skill packages, executes them in sandboxes, and produces reports on utility, cost, and safety via baseline comparisons and two-stage risk detection.

Co-Evolving Skill Generation and Policy Optimization

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

AIP: A Graph Representation for Learning and Governing Agent Skills

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

AIP models skills as graphs of discrete steps connected by typed I/O edges under a validated schema, raising agent mean reward from 0.60 to 0.71 and pass rate from 53% to 67% on 27 SkillsBench tasks while enabling node-level fixes.

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

SkillHarm benchmark shows current AI agents are vulnerable to lifecycle-aware skill poisoning with success rates up to 86.3% for fixed-payload attacks and 69.3% for self-mutating attacks.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

Empirical study of EvoMap shows 98% of assets never reused, scores driven by self-reported metadata, and 84% of assets using vacuous validation tests.

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

SkillEvolBench is a new diagnostic benchmark that evaluates the transition from episodic experience to procedural skills in LLM agents using role-conditioned task families and frozen deployment tests.

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

cs.AI · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

The paper diagnoses library drift in self-evolving LLM skill libraries and demonstrates a governance recipe raising pass@1 from 0.258 to 0.584 on MBPP+ hard-100.

ContractBench: Can LLM Agents Preserve Observation Contracts?

cs.SE · 2026-05-17 · conditional · novelty 7.0

ContractBench shows that LLM agents frequently violate observation contracts by using expired artifacts or corrupting their byte integrity, with no model exceeding 80% success and notable scaling irregularities across families.

SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

cs.CR · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.

Counterfactual Trace Auditing of LLM Agent Skills

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Counterfactual Trace Auditing detects 522 behavioral change patterns from skills on 49 tasks where pass rates shift only 0.3 points on average.

SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.

Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

cs.SE · 2026-05-09 · conditional · novelty 7.0

SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.

Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

SkillCom: Decomposing LLM-based Semantic Communication into Task and Channel Aware Skills

eess.SY · 2026-05-04 · unverdicted · novelty 7.0

SkillCom decomposes LLM semantic communication into four skills connected by structured semantic-unit interfaces and outperforms monolithic LLM baselines in robustness on multi-hop QA and dialogue state tracking tasks.

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

citing papers explorer

Showing 47 of 97 citing papers.

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents cs.CL · 2026-05-22 · unverdicted · none · ref 26 · 2 links · internal anchor
OpenSkillEval dynamically builds task instances across five application domains to evaluate 30 open skills with over 600 tests, finding that skill use depends heavily on model and framework and that many popular skills do not beat base agents.
More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries cs.SE · 2026-05-21 · unverdicted · none · ref 4 · internal anchor
Skill shadowing, not context overhead, is the dominant cause of performance degradation when LLM agent skill libraries expand.
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles cs.LG · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents cs.AI · 2026-05-21 · conditional · none · ref 2 · internal anchor
Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.
ParaCell: Paravirtualized Secure Containers with Lightweight Intra-Container Isolation and Intent-Driven Memory Management cs.OS · 2026-05-20 · unverdicted · none · ref 50 · internal anchor
ParaCell reduces latency by up to 88% in nested setups and saves 35.6% memory on agent workloads via MPK XGates and intent-based Pager compared to PVM, RunV, and HyperAlloc.
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? cs.CL · 2026-05-15 · unverdicted · none · ref 34 · internal anchor
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
The Scaling Laws of Skills in LLM Agent Systems cs.CL · 2026-05-15 · unverdicted · none · ref 9 · internal anchor
Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.
SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution cs.CL · 2026-05-11 · unverdicted · none · ref 9 · internal anchor
SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
Evidence Over Plans: Online Trajectory Verification for Skill Distillation cs.AI · 2026-05-09 · unverdicted · none · ref 5 · 2 links · internal anchor
SPARK generates environment-verified trajectories to compute PDI, enabling posterior skill distillation that outperforms no-skill baselines and human-written skills across 86 tasks with up to 1000x cheaper inference.
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents cs.AI · 2026-05-09 · unverdicted · none · ref 12 · 2 links · internal anchor
SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation cs.CV · 2026-05-08 · unverdicted · none · ref 37 · internal anchor
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries cs.CL · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.
PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors cs.AI · 2026-05-07 · unverdicted · none · ref 20 · internal anchor
PrefixGuard induces typed step adapters from agent traces offline then trains prefix-risk scorers on terminal outcomes, reaching 0.900/0.710/0.533/0.557 AUPRC on four benchmarks and beating raw-text baselines by 0.137 on average.
From Context to Skills: Can Language Models Learn from Context Skillfully? cs.AI · 2026-04-30 · unverdicted · none · ref 20 · internal anchor
Ctx2Skill uses a self-evolving multi-agent loop with Challenger, Reasoner, Judge, and Cross-time Replay to discover context-specific skills, improving task-solving rates on CL-bench benchmarks across models.
Skill Retrieval Augmentation for Agentic AI cs.CL · 2026-04-27 · unverdicted · none · ref 15 · 3 links · internal anchor
Introduces SRA paradigm and SRA-Bench benchmark (5,400 tasks, 26,262 skills) showing retrieval improves performance but LLMs fail to selectively incorporate retrieved skills.
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation cs.AI · 2026-04-26 · unverdicted · none · ref 5 · internal anchor
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut costs by 32%.
MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills cs.AI · 2026-04-22 · unverdicted · none · ref 1 · internal anchor
MedSkillAudit is a new domain-specific audit framework for medical research agent skills that achieved moderate agreement with expert reviews (ICC 0.449), exceeding the human inter-rater baseline (ICC 0.300).
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence cs.AI · 2026-04-20 · unverdicted · none · ref 47 · internal anchor
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents cs.SE · 2026-03-20 · unverdicted · none · ref 15 · internal anchor
ContractSkill converts draft web agent skills into explicit executable contracts that enable deterministic verification, fault localization, and minimal local repair, improving stability on benchmarks like VisualWebArena.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents cs.CR · 2026-02-24 · unverdicted · none · ref 32 · internal anchor
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
VIGIL: Runtime Enforcement of Behavioral Specifications in AI Agent Skills cs.CR · 2026-06-25 · unverdicted · none · ref 11 · internal anchor
VIGIL introduces a policy language and symbolic evaluation rules to enforce context-aware behavioral specifications on LLM agent traces, achieving over 95% recall and under 10% false positives on real tasks.
How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs cs.AI · 2026-06-18 · unverdicted · none · ref 2 · internal anchor
Hierarchically grouped demonstrations raise pass rates from 76.7% to 90.7% on 43 vague-description tasks while flat logs show smaller non-significant gains.
What Should a Skill Remember? Quality--Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents cs.CL · 2026-06-08 · unverdicted · none · ref 30 · internal anchor
Empirical study demonstrates that cost-aware skill rewriting for LLM agents can achieve 7% total cost reduction and 6% agent-token cost reduction with preserved quality on SkillsBench.
SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History cs.LG · 2026-06-07 · unverdicted · none · ref 11 · internal anchor
SkillHone introduces a harness that maintains persistent decision histories to support continual evolution of language-model agent skills, reporting 15.8-point gains on GAIA over a commercial deep-research agent.
Pomona: Continuous Code Quality Improvement via Small, Automated Changes at Bloomberg cs.SE · 2026-06-04 · conditional · none · ref 7 · internal anchor
Pomona automates discovery and repair of small code quality issues via agent skills, achieving 15 of 17 PRs merged with median close time under 2 hours in a one-month Bloomberg team deployment.
OpenSkill: Open-World Self-Evolution for LLM Agents cs.AI · 2026-06-04 · unverdicted · none · ref 9 · internal anchor
OpenSkill bootstraps LLM agent self-evolution by pulling grounded knowledge and anchors from open-world sources, synthesizing transferable skills, and refining them on self-generated virtual tasks, achieving top benchmark pass rates without supervision.
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation cs.AI · 2026-05-26 · unverdicted · none · ref 9 · internal anchor
MUSE-Autoskill introduces a skill-centric framework for self-evolving LLM agents through a unified lifecycle of skill creation, memory, management, evaluation, and refinement.
Workflow Closure Is Not Scientific Closure in Auto-Research Systems cs.SE · 2026-05-25 · unverdicted · none · ref 65 · internal anchor
Survey of auto-research systems identifies objective, validation, and acceptance collapses, concluding that workflow closure does not equal scientific closure and advocating non-autonomous epistemic control.
When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity cs.AI · 2026-05-19 · unverdicted · none · ref 4 · 2 links · internal anchor
Re-analysis of MCP-grounded CTF agent runs shows Skills add only 8.9 pp (p=0.71) when tool feedback is strict and low-latency, supporting the hypothesis that high environment-feedback bandwidth reduces the marginal value of curated procedural knowledge.
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution cs.CL · 2026-05-18 · unverdicted · none · ref 23 · internal anchor
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning cs.AI · 2026-05-07 · unverdicted · none · ref 30 · 3 links · internal anchor
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.
Bilevel Optimization of Agent Skills via Monte Carlo Tree Search cs.AI · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
Bilevel optimization with outer-loop MCTS for skill structure and inner-loop LLM refinement improves agent accuracy on an operations-research question-answering dataset.
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering cs.SE · 2026-04-09 · accept · none · ref 81 · internal anchor
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
Gradual Cognitive Externalization: From Modeling Cognition to Constituting It cs.AI · 2026-04-06 · unverdicted · none · ref 6 · internal anchor
Ambient AI systems transition from modeling cognition to constituting part of users' cognitive architectures through sustained causal coupling, under a functionalist view and the no behaviorally invisible residual hypothesis.
Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering cs.SE · 2026-06-16 · unverdicted · none · ref 25 · internal anchor
Coding benchmarks misalign with agentic software engineering because they conflate model and harness, grade against single references, and provide no component-level iteration signals.
Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study cs.CL · 2026-05-29 · unverdicted · none · ref 5 · internal anchor
In a 30-task SkillsBench study, skill availability boosts GPT-5.5 and DeepSeek V4-Flash agent pass rates substantially while presentation-granularity variations yield small uncertain effects.
Responsible Agentic AI Requires Explicit Provenance cs.AI · 2026-05-16 · unverdicted · none · ref 35 · internal anchor
Explicit provenance across the full agentic AI lifecycle is the necessary condition for making responsibility computable and actionable.
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents cs.AI · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence cs.AI · 2026-05-07 · unverdicted · none · ref 50 · 2 links · internal anchor
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
Know When to Trust the Skill: Delayed Appraisal and Epistemic Vigilance for Single-Agent LLMs cs.AI · 2026-04-17 · unverdicted · none · ref 5 · internal anchor
MESA-S framework translates human metacognitive control into LLMs via delayed procedural probes and Metacognitive Skill Cards to separate parametric certainty from source trust and reduce overthinking.
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub cs.CL · 2026-03-19 · unverdicted · none · ref 5 · internal anchor
Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task cs.AI · 2026-06-10 · unverdicted · none · ref 11 · internal anchor
Exploratory human evaluation of skill-augmented AI agents versus native models on an NSCLC transcriptomic task found directional but non-significant quality gains overshadowed by rater noise.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications cs.IR · 2026-05-08 · unverdicted · none · ref 94 · 3 links · internal anchor
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.
EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation cs.AI · 2026-04-22 · unreviewed · ref 7 · internal anchor
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents cs.AI · 2026-04-20 · unreviewed · ref 71 · internal anchor
From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution cs.SE · 2026-04-16 · unreviewed · ref 19 · internal anchor
When Agent Markets Arrive cs.CE · 2026-04-08 · unreviewed · ref 12 · internal anchor

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer