SkillEvolBench is a new diagnostic benchmark that evaluates the transition from episodic experience to procedural skills in LLM agents using role-conditioned task families and frozen deployment tests.
hub Mixed citations
CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
Mixed citation behavior. Most common role is background (50%).
abstract
Anthropic proposes the concept of skills for LLM agents to tackle multi-step professional tasks that simple tool invocations cannot address. A tool is a single, self-contained function, whereas a skill is a structured bundle of interdependent multi-file artifacts. Currently, skill generation is not only label-intensive due to manual authoring, but also may suffer from human--machine cognitive misalignment, which can lead to degraded agent performance, as evidenced by evaluations on SkillsBench. Therefore, we aim to enable agents to autonomously generate skills. However, existing self-evolving methods designed for tools cannot be directly applied to skills due to their increased complexity. To address these issues, we propose CoEvoSkills, a self-evolving skills framework that enables agents to autonomously construct complex, multi-file skill packages. Specifically, CoEvoSkills couples a Skill Generator that iteratively refines skills with a Surrogate Verifier that co-evolves to provide informative and actionable feedback without access to ground-truth test content. On SkillsBench, CoEvoSkills achieves the highest pass rate among five baselines on both Claude Code and Codex, and also exhibits strong generalization capabilities to six additional LLMs.
hub tools
citation-role summary
citation-polarity summary
years
2026 24representative citing papers
Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.
SkillMOO applies LLM-proposed edits and NSGA-II Pareto optimization to skill bundles for SE agents, ranking top in pass rate on most SkillsBench tasks while cutting costs up to 31.7%.
SkillAdaptor introduces step-level failure attribution and targeted skill updates for LLM agents, yielding performance gains on WebShop, PinchBench, and Claw-Eval benchmarks.
Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.
SW-DRSO optimizes a tractable surrogate of worst-case expected loss over plausible inference-time corruptions using a barycentric adversary approximated via simplex weights.
SkillOpt introduces a controllable text-space optimizer that evolves agent skills via add/delete/replace edits accepted only on strict held-out validation improvement, reporting consistent gains across 52 model-benchmark-harness combinations.
A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.
AgentCo-op retrieves and assembles existing agents and tools into interoperable workflows for open-world scientific tasks, showing effectiveness in genomics case studies and competitive benchmark results with lower costs.
SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.
SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.
Ctx2Skill uses a self-evolving multi-agent loop with Challenger, Reasoner, Judge, and Cross-time Replay to discover context-specific skills, improving task-solving rates on CL-bench benchmarks across models.
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut costs by 32%.
GAM decouples event-level memory encoding from topic-level consolidation in LLM agents using hierarchical graphs to reduce interference and improve long-term coherence and retrieval.
ParametricSkills uses a hypernetwork to turn textual skills into LoRA adapters, outperforming in-context learning by 6.44 points on average across six SWE subtasks with higher BERT Score and F1.
EvoRec deploys four collaborating LLM agents that co-evolve recommendation models and their optimization methods, reporting up to 5.54% offline gains and 1.85% revenue lift in an online A/B test.
SkillSmith introduces a synergy-aware skill-tool co-evolution framework with atomic bundles, Lotka-Volterra-inspired interaction modeling, and anti-pattern recording that outperforms baselines on complex tasks.
AtomisticSkills is a new harness framework with 100+ human-curated skills that lets general AI agents perform atomistic research tasks including simulations, screening, and analysis, shown on electrolyte design, CO2 capture, drug screening, and catalyst tasks.
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.
Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to smaller models.
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.
citing papers explorer
-
SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
SkillEvolBench is a new diagnostic benchmark that evaluates the transition from episodic experience to procedural skills in LLM agents using role-conditioned task families and frozen deployment tests.
-
Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents
Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.
-
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
SkillOpt introduces a controllable text-space optimizer that evolves agent skills via add/delete/replace edits accepted only on strict held-out validation improvement, reporting consistent gains across 52 model-benchmark-harness combinations.
-
From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.
-
AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
AgentCo-op retrieves and assembles existing agents and tools into interoperable workflows for open-world scientific tasks, showing effectiveness in genomics case studies and competitive benchmark results with lower costs.
-
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.
-
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill uses a self-evolving multi-agent loop with Challenger, Reasoner, Judge, and Cross-time Replay to discover context-specific skills, improving task-solving rates on CL-bench benchmarks across models.
-
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut costs by 32%.
-
GAM: Hierarchical Graph-based Agentic Memory for LLM Agents
GAM decouples event-level memory encoding from topic-level consolidation in LLM agents using hierarchical graphs to reduce interference and improve long-term coherence and retrieval.
-
SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems
SkillSmith introduces a synergy-aware skill-tool co-evolution framework with atomic bundles, Lotka-Volterra-inspired interaction modeling, and anti-pattern recording that outperforms baselines on complex tasks.
-
Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to smaller models.
- EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation