SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Dimitrios Dimitriadis; Donghao Zhou; Hui Shen; Jiankun Zhang; Jingxuan Zhang; Mi Zhang; Peizhou Huang; Samiul Alam; Tuo Zhang; Xin Wang

arxiv: 2605.24117 · v1 · pith:IZA2VOKSnew · submitted 2026-05-22 · 💻 cs.AI

Yingtie Lei, Zhongwei Wan, Jiankun Zhang, Samiul Alam, Zixuan Zhong, Peizhou Huang, Xin Wang, Jingxuan Zhang, Donghao Zhou, Yunta Hsieh, Zhihao Dou, Hui Shen, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang

Pith reviewed 2026-06-30 16:15 UTC · model grok-4.3

keywords: skill evolution, procedural skills, episodic trajectories, LLM agents, skill library, abstraction, benchmark, context shift

The pith

LLM agents rarely convert episodic trajectories into robust reusable procedural skills, with raw-trajectory reuse outperforming distilled skill libraries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SkillEvolBench to test whether experience accumulated by LLM agents can be distilled into reusable procedural skills rather than remaining as task-specific memory. Across 180 tasks in six environments, agents update an external skill library from acquisition trajectories and then face deployment tasks that shift context, introduce shortcuts, or require composition. Results show that agents adapt locally in some cases but produce skills whose gains are unstable under frozen deployment conditions. Raw reuse of full trajectories consistently outperforms the distilled skill versions, and expanding the library size or update frequency adds clutter without improving robustness. This matters because durable procedural knowledge would let agents handle new tasks without re-solving from scratch each time.

Core claim

SkillEvolBench organizes tasks into role-conditioned families that share latent procedures and compares self-generated skill evolution against curated starts, no-skill baselines, and raw-trajectory controls. It finds that current agents often adapt locally but rarely form robust reusable skills: skill-based conditions can improve acquisition or replay on some axes, yet these gains collapse under frozen deployment, which tests context shift, adversarial shortcuts, and composition. Raw-trajectory reuse frequently outperforms distilled skills, indicating that abstraction procedures discard contextual and procedural cues that remain useful later. Capacity analyses show that writing more skills or larger librar…

What carries the argument

SkillEvolBench, a diagnostic benchmark that separates procedural abstraction from base capability and direct episodic reuse through role-conditioned task families, compacted-trajectory updates with verifier feedback, and frozen deployment axes (context shift, adversarial shortcuts, composition).
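
The protocol named above (acquisition, library update, frozen deployment) can be sketched as a small data model. This is a hypothetical illustration and not the authors' code: the names `Condition`, `DeploymentAxis`, `SkillLibrary`, and `run_family`, the callables `solve` and `distill`, and the way each control condition writes to the library are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Condition(Enum):
    # The four comparison arms named in the abstract.
    NO_SKILL = "no_skill"
    RAW_TRAJECTORY = "raw_trajectory"
    SELF_GENERATED = "self_generated"
    CURATED_START = "curated_start"  # caller pre-populates the library


class DeploymentAxis(Enum):
    CONTEXT_SHIFT = "context_shift"
    ADVERSARIAL_SHORTCUT = "adversarial_shortcut"
    COMPOSITION = "composition"


@dataclass
class SkillLibrary:
    skills: Dict[str, str] = field(default_factory=dict)
    frozen: bool = False

    def update(self, name: str, body: str) -> None:
        # Deployment is frozen: no writes are allowed after acquisition.
        if self.frozen:
            raise RuntimeError("library is frozen during deployment")
        self.skills[name] = body


def run_family(condition, acquisition_tasks, deployment_tasks, solve, distill, library):
    """Run one task family. `solve(task, library)` returns (trajectory, passed);
    `distill(trajectory, passed)` returns (skill_name, skill_body). Both are
    hypothetical stand-ins for the agent harness and the update step."""
    for i, task in enumerate(acquisition_tasks):
        trajectory, passed = solve(task, library)
        if condition is Condition.NO_SKILL:
            continue  # baseline: nothing is stored
        if condition is Condition.RAW_TRAJECTORY:
            library.update(f"trajectory-{i}", trajectory)  # store the trace as-is
        else:
            name, body = distill(trajectory, passed)  # abstraction step
            library.update(name, body)
    library.frozen = True
    results = {axis: [] for axis in DeploymentAxis}
    for axis, task in deployment_tasks:
        _, passed = solve(task, library)
        results[axis].append(passed)
    return results
```

Freezing the library before deployment is the design point worth noting: any gain measured on the three axes then has to come from what was written during acquisition.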

If this is right

Skill-based conditions sometimes improve acquisition or replay but the gains remain unstable when deployment is frozen.
Raw-trajectory reuse outperforms distilled skills, showing that current abstraction discards useful contextual cues.
Writing more skills or larger libraries increases coverage yet also introduces episode-specific drift and procedural clutter.
Individual models can gain on specific deployment axes but these improvements do not generalize across the full test set.
The benchmark positions experience-to-skill conversion as a measurable step separate from base capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the isolation holds, future work could test whether different compaction or verification methods preserve more of the discarded cues that raw trajectories retain.
The finding that library size alone fails suggests examining whether retrieval mechanisms, rather than skill count, determine whether procedural knowledge transfers.
One extension would be to measure whether the same patterns appear when agents operate without external verifiers or when task families share fewer latent procedures.
The results imply that progress on reusable skills may require changes to how trajectories are selected for distillation rather than simply increasing update volume.

Load-bearing premise

The chosen task families, update procedure, and deployment axes isolate procedural abstraction from base capability, curated priors, and direct episodic reuse.

What would settle it

A run in which distilled skill libraries produce higher success rates than raw-trajectory controls on the frozen deployment tasks across the ten model configurations and three agent harnesses.
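
One way to make this criterion checkable is a per-cell comparison of success rates, sketched below. The record layout `(model, harness, condition, passed)`, the condition labels `"distilled"` and `"raw"`, and the function names are assumptions made for illustration; the paper's actual result format is not shown here.

```python
from collections import defaultdict


def success_rates(records):
    """Map (model, harness, condition) to the fraction of frozen-deployment tasks passed."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passes, total]
    for model, harness, condition, passed in records:
        cell = counts[(model, harness, condition)]
        cell[0] += int(passed)
        cell[1] += 1
    return {key: passes / total for key, (passes, total) in counts.items()}


def distilled_beats_raw(records):
    """For each (model, harness) cell with both conditions present, report
    whether the distilled-skill rate is strictly higher than the raw-trajectory rate."""
    rates = success_rates(records)
    cells = {(model, harness) for (model, harness, _) in rates}
    return {
        (m, h): rates[(m, h, "distilled")] > rates[(m, h, "raw")]
        for (m, h) in cells
        if (m, h, "distilled") in rates and (m, h, "raw") in rates
    }
```

Under this reading, the claim would be overturned only if every cell in the returned mapping were `True`; a strict comparison of point estimates is a simplification, and a real analysis would also account for run-to-run variance.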

Original abstract

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SkillEvolBench shows agents rarely form stable reusable skills from experience, with raw trajectories often outperforming distilled ones.

This paper's core finding is that current LLM agents adapt locally to tasks but rarely produce reusable procedural skills that survive changes in context or composition, and that raw trajectory reuse beats attempts to distill skills.

The new element is the benchmark itself. SkillEvolBench uses 180 tasks across six environments, grouped into role-conditioned families that share latent procedures. Agents learn on acquisition tasks, update an external skill library from compacted trajectories plus verifier feedback, then face frozen deployment tests on context shift, adversarial shortcuts, and composition. Controls compare self-generated evolution, curated starts, no-skill baselines, and direct raw-trajectory reuse.

It does well by running the protocol across ten model configurations and three harnesses, and by including capacity analyses that show larger libraries add drift and clutter without fixing the problem. The separation of abstraction from episodic reuse is a clear step beyond standard agent benchmarks.

Soft spots are moderate. The abstract leaves out full statistical details and exact update rules, so the strength of the instability claims is hard to judge without the methods section. The task families are presented as isolating procedural abstraction, but that rests on an assumption about shared latent procedures that the paper should validate more explicitly.

This is for researchers building LLM agents and skill libraries who need tests that go past one-off success. A reader focused on evaluation protocols for durable knowledge would get direct value from the multi-axis frozen deployment design.

The work shows clear thinking in its controls and empirical pattern. It deserves a serious referee to examine the implementation and stats.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SkillEvolBench, a benchmark with 180 tasks across six real-world agent environments organized into role-conditioned task families sharing latent procedures. Agents acquire experience on initial tasks, update an external skill library via compacted trajectories and verifier feedback, then undergo frozen deployment testing on axes of context shift, adversarial shortcuts, and composition. Comparisons against no-skill and raw-trajectory controls are used to separate procedural abstraction from base capability and direct episodic reuse. Empirical results across ten model configurations and three harnesses indicate that agents typically adapt locally but rarely form robust reusable skills, that raw-trajectory reuse often outperforms distilled skills, and that expanding skill libraries does not reliably improve outcomes due to drift and clutter.

Significance. If the benchmark construction holds, the work supplies a structured diagnostic for when episodic experience yields durable procedural knowledge rather than task-local memory. The inclusion of explicit controls for raw-trajectory reuse and the multi-harness, multi-model evaluation provide a concrete basis for the headline claims about instability of skill gains under frozen deployment. The capacity analyses further quantify that simply scaling skill libraries is insufficient, offering a falsifiable direction for future abstraction methods.

major comments (2)

[Benchmark Design] Section on benchmark construction (task families and deployment axes): the claim that the chosen task families, compacted-trajectory updates, and frozen deployment axes isolate procedural abstraction from base capability and direct episodic reuse rests on the assumption that shared latent procedures do not leak into raw-trajectory performance; this separation is load-bearing for the central empirical finding yet receives only descriptive justification rather than an explicit control or ablation showing that task similarity alone cannot explain the raw-trajectory advantage.
[Empirical Evaluation] Results across ten model configurations: the abstract states that gains are 'unstable under frozen deployment' and that raw-trajectory reuse 'frequently outperforms,' but the reported findings lack per-axis statistical tests, run-to-run variance, or explicit data-exclusion rules; without these, the support for the instability claim cannot be fully verified and weakens the cross-harness generalization.

minor comments (2)

[Abstract] The term 'compacted-trajectory update procedure' is used in the abstract without a one-sentence definition; adding a brief gloss at first use would aid readers unfamiliar with the update mechanics.
[Figures] Figure captions for the deployment-axis results should explicitly state the number of runs and error bars used, consistent with the multi-configuration evaluation described in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below.

Referee: [Benchmark Design] Section on benchmark construction (task families and deployment axes): the claim that the chosen task families, compacted-trajectory updates, and frozen deployment axes isolate procedural abstraction from base capability and direct episodic reuse rests on the assumption that shared latent procedures do not leak into raw-trajectory performance; this separation is load-bearing for the central empirical finding yet receives only descriptive justification rather than an explicit control or ablation showing that task similarity alone cannot explain the raw-trajectory advantage.

Authors: The raw-trajectory control is the explicit mechanism isolating direct episodic reuse from abstracted skills, with comparisons to no-skill baselines further separating base capability. Task families are role-conditioned to share latent procedures while varying surface forms, contexts, and adversarial elements by design. We can expand the benchmark construction section with additional justification for why surface similarity alone cannot account for the observed raw-trajectory patterns, but a new ablation experiment is not required given the existing controls. (Revision: partial)

Referee: [Empirical Evaluation] Results across ten model configurations: the abstract states that gains are 'unstable under frozen deployment' and that raw-trajectory reuse 'frequently outperforms,' but the reported findings lack per-axis statistical tests, run-to-run variance, or explicit data-exclusion rules; without these, the support for the instability claim cannot be fully verified and weakens the cross-harness generalization.

Authors: We agree that per-axis statistical tests, run-to-run variance reporting, and explicit data-exclusion rules would strengthen verifiability of the instability and outperformance claims. The current results emphasize aggregate trends across ten model configurations and three harnesses, but we will add the requested statistical details and clarifications in the revision. (Revision: yes)
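
The statistics requested in the second comment could take roughly the following form. This is a minimal sketch under stated assumptions: per-run success rates are available per deployment axis, runs are paired between the raw-trajectory and distilled conditions, and a percentile bootstrap is an acceptable interval method. The function names and input layout are hypothetical and do not come from the paper.

```python
import random
from statistics import mean, stdev


def per_axis_summary(runs):
    """runs: dict mapping axis name -> list of per-run success rates.
    Returns axis -> (mean, sample standard deviation); the spread is 0.0
    when only one run is available."""
    summary = {}
    for axis, rates in runs.items():
        spread = stdev(rates) if len(rates) > 1 else 0.0
        summary[axis] = (mean(rates), spread)
    return summary


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference a[i] - b[i]."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With `a` holding raw-trajectory rates and `b` holding distilled-skill rates for the same runs, an interval that excludes zero on a given axis would support the "frequently outperforms" wording for that axis.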

Circularity Check

0 steps flagged

No significant circularity

The paper is a benchmark proposal whose core contribution consists of task families, update procedures, and empirical comparisons across models, harnesses, and controls (no-skill, raw-trajectory). No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; all claims are externally falsifiable via the reported runs on the 180 tasks and deployment axes. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claims rest on the unstated premise that the six environments and 180 tasks adequately represent real-world agent challenges and that the skill-update mechanism using compacted trajectories plus verifier feedback is a faithful model of procedural skill formation. No explicit free parameters, axioms, or invented entities are described.

Reference graph

Works this paper leans on

56 extracted references · 13 canonical work pages · 8 internal anchors

[1]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023

2023
[2]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023. 14

2023
[3]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. InThe TwelfthInternational Conference on Learning Representations, 2024

2024
[4]

SWE-bench: Can language models resolve real-world github issues? InThe TwelfthInternational Conference on Learning Representations, 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? InThe TwelfthInternational Conference on Learning Representations, 2024

2024
[5]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

2023
[6]

Expel: Llm agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

2024
[7]

Synapse: Trajectory-as-exemplar prompting with memory for computer control

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. InThe TwelfthInternational Conference on Learning Representations, 2024

2024
[8]

Agent workflow memory

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. InForty-second International Conference on Machine Learning, 2025

2025
[9]

Agent skills specification.https://agentskills.io/specification

Agent Skills. Agent skills specification.https://agentskills.io/specification
[10]

Equipping agents for the real world with agent skills.https://www.anthropic.com/engineering/ equipping-agents-for-the-real-world-with-agent-skills, October 2025

Anthropic. Equipping agents for the real world with agent skills.https://www.anthropic.com/engineering/ equipping-agents-for-the-real-world-with-agent-skills, October 2025

2025
[11]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Appworld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), p...

2024
[13]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

2023
[15]

Mind2web 2: Evaluating agentic search with agent-as-a-judge

Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jimenez Gutierrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, TIANSHU ZHANG, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, and Yu Su. Mind2web 2: Eval...

2026
[16]

OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer envi- ronments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer envi- ronments. InThe Thirty-eight Conference on...

2024
[17]

{$\tau$}-bench: A benchmark for \underline{T}ool-\underline{A}gent-\underline{U}ser interaction in real-world domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik R Narasimhan. {$\tau$}-bench: A benchmark for \underline{T}ool-\underline{A}gent-\underline{U}ser interaction in real-world domains. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[18]

Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Zhiruo Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Melroy Maben, Raj Mehta, Wayne Chi, Lawrence Keunho Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. Theagentcompany: 15 Benchmarking LLM agents on consequential re...

2026
[19]

Creator: Tool creation for disentangling abstract and concrete reasoning of large language models

Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 6922–6939, 2023

2023
[20]

Large language models as tool makers

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. In The TwelfthInternational Conference on Learning Representations, 2024

2024
[21]

Voyager: An open-ended embodied agent with large language models.Transactionson Machine Learning Research, 2024

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.Transactionson Machine Learning Research, 2024. ISSN 2835-8856

2024
[22]

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Autoskill: Experience-driven lifelong learning via skill self-evolution,

Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. Autoskill: Experience-driven lifelong learning via skill self-evolution.arXiv preprint arXiv:2603.01145, 2026

work page arXiv 2026
[24]

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents.arXiv preprint arXiv:2602.02474, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Memento-skills: Let agents design agents

Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents.arXiv preprint arXiv:2603.18743, 2026

work page arXiv 2026
[26]

CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Evoskills: Self-evolving agent skills via co-evolutionary verification.arXiv preprint arXiv:2604.01687, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Skillcraft: Can LLM agents learn to use tools skillfully?

Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, et al. Skillcraft: Can llm agents learn to use tools skillfully? arXiv preprint arXiv:2603.00718, 2026

work page arXiv 2026
[29]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, et al. Skillx: Automatically constructing skill knowledge bases for agents.arXiv preprint arXiv:2604.04804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Swe-skills-bench: Do agent skills actually help in real-world software engineering?arXiv preprint arXiv:2603.15401, 2026

Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, and Lijie Hu. Swe-skills-bench: Do agent skills actually help in real-world software engineering?arXiv preprint arXiv:2603.15401, 2026

work page arXiv 2026
[32]

Pinchbench: Real-world benchmarks for ai coding agents.https://github.com/pinchbench/skill, 2026

Kilo AI. Pinchbench: Real-world benchmarks for ai coding agents.https://github.com/pinchbench/skill, 2026

2026
[33]

Skills: Public repository for agent skills.https://github.com/anthropics/skills, 2025

Anthropic. Skills: Public repository for agent skills.https://github.com/anthropics/skills, 2025

2025
[34]

Skill creator.https://claude.com/plugins/skill-creator, 2026

Anthropic. Skill creator.https://claude.com/plugins/skill-creator, 2026

2026
[35]

Claude code overview.https://code.claude.com/docs/en/overview, 2026

Anthropic. Claude code overview.https://code.claude.com/docs/en/overview, 2026

2026
[36]

Codex cli.https://developers.openai.com/codex/cli, 2026

OpenAI. Codex cli.https://developers.openai.com/codex/cli, 2026

2026
[37]

fix the bug

Google Cloud. Gemini cli.https://docs.cloud.google.com/gemini/docs/codeassist/gemini-cli, 2026. 16 A Complete Family Catalog Thisappendixcatalogsthefullsetofenvironment-levelskillfamiliesinSkillEvolBench. Eachfamilycorresponds to one procedural skill and contains six role-instantiated tasks. Table 3Complete skill-family catalog. Descriptions are taken fro...

work page arXiv 2026
[38]

Examples:`Dockerfile`,`Express middleware`,`JSON schema`, `LaTeX table`,`pytest fixture`,`verifier output`,`npm package`, `bearer token`,`trajectory.json`

**Task object(s)** -- the concrete things the skill operates on. Examples:`Dockerfile`,`Express middleware`,`JSON schema`, `LaTeX table`,`pytest fixture`,`verifier output`,`npm package`, `bearer token`,`trajectory.json`. List 2-4
[39]

Examples:`inspect`,`debug`,`validate`,`summarize`,`patch`, `compare`,`extract`,`aggregate`,`refactor`,`migrate`

**Action verb(s)** -- what the skill DOES with those objects. Examples:`inspect`,`debug`,`validate`,`summarize`,`patch`, `compare`,`extract`,`aggregate`,`refactor`,`migrate`. List 2-4
[40]

Use this skill when

**WHEN scenarios** -- 2-3 concrete trigger contexts in "Use this skill when..." form. Include INDIRECT phrasings users actually type ("the test is timing out", "the verifier reports X", "the JSON file in /tmp doesn't validate"). Synonyms count
[41]

Do not use this skill for X / Y

**Optional: boundary conditions** -- "Do not use this skill for X / Y" lines. Reduces false-positive triggers when neighbouring skills exist. ### Style: - **Third-person / imperative voice**: "This skill should be used when..." or "Use when...". NEVER "I help with..." / "you can use this to...". - **<=1024 chars hard cap.** Aim for 4-8 lines. - **Cover IN...
[42]

The PASS branch and the FAIL branch ask for different things; do NOT substitute one for the other

**Apply the diagnosis_rule** provided in the trial-outcome section. The PASS branch and the FAIL branch ask for different things; do NOT substitute one for the other. If passed and the trajectory shows nothing the skill can absorb, return upsert_files containing only the existing SKILL.md text verbatim -- the parser treats that as NoOp
[43]

Use when debugging a software bug

**Respect the layering** (see Progressive Disclosure in the spec): - Don't bloat SKILL.md with material that belongs in`references/` (long enums, edge-case catalogues, vendored API docs). - Don't shrink the`description`for token savings -- discoverability beats a few chars. - Don't change`name`unless operation_type=replace; renaming breaks the folder-slug...
[44]

-`narrow`-- tighten`description`/`## When to use`so the skill no longer triggers on the failing context

**operation_type semantics**: -`revise`-- normal edit; same name, same intent. -`narrow`-- tighten`description`/`## When to use`so the skill no longer triggers on the failing context. -`replace`-- full rewrite; only when minor edits cannot fix the pattern. -`create`-- introduces a new skill (see rule 5; must be in the same family as the current task)
[45]

when to read / run / copy

**Tier-3 files** (scripts/, references/, assets/) on EXISTING skills MUST be cited from SKILL.md with a clear "when to read / run / copy" trigger. Uncited Tier-3 files are dead weight -- the parser rejects them
[46]

Skills in family

**Update existing vs create new (SAME FAMILY ONLY).** When the family already has skills (shown in the "Skills in family" block of the user message), your DEFAULT action is to SELECT ONE OR MORE of them and REFINE them in place. Creating a brand-new sibling is the exception, not the default. ADDITIVE BIAS for revisions: - PRESERVE the existing SKILL.md st...
[47]

- What context (file paths, schema names, exact APIs, conventions) did the agent have to discover at runtime? Encode it directly so future runs do not

**The trajectory is your evidence.** Treat it like a debrief: - Where did the agent stumble, improvise, retry, or miss the obvious? Those moments become Gotchas, Workflow steps, or bundled scripts. - What context (file paths, schema names, exact APIs, conventions) did the agent have to discover at runtime? Encode it directly so future runs do not. - Skip ...
[48]

Use this skill when

**Frontmatter** --`name`and`description`(HARD constraints): `name`MUST equal the slug given in the user message. `description`is the entire trigger surface (Tier-1, always in future agents'context). A weak description = the skill never fires. REQUIREMENTS: a. <=1024 chars, imperative voice ("Use this skill when..."). b. **Keyword density**: list AT LEAST ...
[49]

Gotchas is usually the single highest-value section -- write it when the trajectory shows any near-miss or recovery

**Recommended body sections** (omit ones you don't need): `# <title>`->`## When to use`->`## Workflow`->`## Examples` ->`## Gotchas`->`## Output template`. Gotchas is usually the single highest-value section -- write it when the trajectory shows any near-miss or recovery
[50]

**Pick ONE default path** in the workflow; relegate alternatives to a brief "Alternatives" line
[51]

**Tier-3 files** are optional. Add a `<slug>/scripts/<file>` only when the trajectory shows the agent re-deriving the same logic or the step is fragile enough that a tested script beats freeform generation. Use `<slug>/references/<file>` for long material loaded on demand. Use `<slug>/assets/<file>` for verbatim templates. Every Tier-3 file MUST be cited from S...
[52]

**No invented gotchas, no invented examples.** You have no execution evidence. Inventing project quirks would mislead the agent. Leave `## Gotchas` and `## Examples` for T1 induction to fill in from the real trace
[53]

**Frontmatter** -- `name` and `description` (HARD constraints): `name` MUST equal the slug given in the user message. `description` is the entire trigger surface (Tier-1, always in future agents' context). REQUIREMENTS:
a. <=1024 chars, imperative voice ("Use this skill when...").
b. **Keyword density**: extract AT LEAST 5 concrete keywords from the FAMILY descr...
[54]

**Body** (only the sections you can honestly write without a trace):
- `# <title>` -- one line.
- `## When to use` -- bullet list of trigger conditions (cast wider than the description).
- `## Workflow` -- best-guess numbered procedure based on the family description. Prescriptive on obviously fragile / order-sensitive steps; permissive on creative ones.
- `## O...
[55]

**One default path, no menus.** If multiple approaches are plausible, pick the most common one and move on
[56]

**Keep the body under ~200 lines.** Post-T1 revision will refine it; over-investing now will mostly be rewritten.

## Required JSON output

```
{
  "summary": "<one short sentence>",
  "operation_type": "create",
  "upsert_files": {
    "<slug>/SKILL.md": "<full SKILL.md content, YAML frontmatter first>"
  }
}
```

Legacy single-key form is also accepted for backward...
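A consumer of this output might validate it along the following lines. This is a sketch only: `parse_create_output` is a hypothetical function, the error messages are invented for illustration, and the legacy single-key form is not handled because its description is truncated in the source.

```python
import json

# Keys shown in the required JSON output.
REQUIRED_KEYS = {"summary", "operation_type", "upsert_files"}


def parse_create_output(raw: str, slug: str) -> dict:
    """Parse the model's JSON reply and check the fields the prompt requires."""
    payload = json.loads(raw)
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if payload["operation_type"] != "create":
        raise ValueError("this prompt expects operation_type == 'create'")
    if f"{slug}/SKILL.md" not in payload["upsert_files"]:
        raise ValueError("upsert_files must contain <slug>/SKILL.md")
    return payload
```

Checking for the `<slug>/SKILL.md` key ties the payload back to the slug given in the user message, which the frontmatter rule also depends on.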

[1]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.

[2]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

[3]

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024.

[4]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024.

[5]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.

[6]

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024.

[7]

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In The Twelfth International Conference on Learning Representations, 2024.

[8]

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In Forty-second International Conference on Machine Learning, 2025.

[9]

Agent Skills. Agent skills specification. https://agentskills.io/specification

[10]

Anthropic. Equipping agents for the real world with agent skills. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills, October 2025.

[11]

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026.

[12]

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

2024

[13]

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026.

[14]

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023.

[15]

Mind2web 2: Evaluating agentic search with agent-as-a-judge

Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jimenez Gutierrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, and Yu Su. Mind2web 2: Eval...

2026

[16]

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In The Thirty-eight Conference on...

2024

[17]

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik R Narasimhan. τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. In The Thirteenth International Conference on Learning Representations, 2025.

[18]

Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Zhiruo Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Melroy Maben, Raj Mehta, Wayne Chi, Lawrence Keunho Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. Theagentcompany: Benchmarking LLM agents on consequential re...

2026

[19]

Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6922–6939, 2023.

[20]

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. In The Twelfth International Conference on Learning Representations, 2024.

[21]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856.

[22]

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.

[23]

Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. Autoskill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026.

[24]

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026.

[25]

Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026.

[26]

Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Evoskills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026.

[27]

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026.

[28]

Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, et al. Skillcraft: Can llm agents learn to use tools skillfully? arXiv preprint arXiv:2603.00718, 2026.

[29]

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766, 2026.

[30]

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, et al. Skillx: Automatically constructing skill knowledge bases for agents. arXiv preprint arXiv:2604.04804, 2026.

[31]

Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, and Lijie Hu. Swe-skills-bench: Do agent skills actually help in real-world software engineering? arXiv preprint arXiv:2603.15401, 2026.

[32]

Kilo AI. Pinchbench: Real-world benchmarks for ai coding agents. https://github.com/pinchbench/skill, 2026.

[33]

Anthropic. Skills: Public repository for agent skills. https://github.com/anthropics/skills, 2025.

[34]

Anthropic. Skill creator. https://claude.com/plugins/skill-creator, 2026.

[35]

Anthropic. Claude code overview. https://code.claude.com/docs/en/overview, 2026.

[36]

OpenAI. Codex cli. https://developers.openai.com/codex/cli, 2026.

[37]

Google Cloud. Gemini cli. https://docs.cloud.google.com/gemini/docs/codeassist/gemini-cli, 2026.

A Complete Family Catalog

This appendix catalogs the full set of environment-level skill families in SkillEvolBench. Each family corresponds to one procedural skill and contains six role-instantiated tasks.

Table 3: Complete skill-family catalog. Descriptions are taken fro...

[38]

**Task object(s)** -- the concrete things the skill operates on. Examples: `Dockerfile`, `Express middleware`, `JSON schema`, `LaTeX table`, `pytest fixture`, `verifier output`, `npm package`, `bearer token`, `trajectory.json`. List 2-4

[39]

**Action verb(s)** -- what the skill DOES with those objects. Examples: `inspect`, `debug`, `validate`, `summarize`, `patch`, `compare`, `extract`, `aggregate`, `refactor`, `migrate`. List 2-4

[40]

**WHEN scenarios** -- 2-3 concrete trigger contexts in "Use this skill when..." form. Include INDIRECT phrasings users actually type ("the test is timing out", "the verifier reports X", "the JSON file in /tmp doesn't validate"). Synonyms count

[41]

**Optional: boundary conditions** -- "Do not use this skill for X / Y" lines. Reduces false-positive triggers when neighbouring skills exist.

### Style:
- **Third-person / imperative voice**: "This skill should be used when..." or "Use when...". NEVER "I help with..." / "you can use this to...".
- **<=1024 chars hard cap.** Aim for 4-8 lines.
- **Cover IN...

[42]

**Apply the diagnosis_rule** provided in the trial-outcome section. The PASS branch and the FAIL branch ask for different things; do NOT substitute one for the other. If passed and the trajectory shows nothing the skill can absorb, return upsert_files containing only the existing SKILL.md text verbatim -- the parser treats that as NoOp
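The NoOp convention described above can be expressed as a simple comparison. This is a sketch under an assumption the source does not confirm: that "verbatim" means an exact string match and that no other file may be present in `upsert_files`. The `is_noop` function and its parameter names are hypothetical.

```python
def is_noop(existing_skill_md: str, upsert_files: dict[str, str], skill_path: str) -> bool:
    """True when the update carries only the unchanged SKILL.md text."""
    # Assumption: exact equality, with no whitespace normalisation.
    return set(upsert_files) == {skill_path} and upsert_files[skill_path] == existing_skill_md
```

Under this reading, touching even one character of the file, or adding a second file, would count as a real update rather than a NoOp.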

[43]

Use when debugging a software bug

**Respect the layering** (see Progressive Disclosure in the spec):
- Don't bloat SKILL.md with material that belongs in `references/` (long enums, edge-case catalogues, vendored API docs).
- Don't shrink the `description` for token savings -- discoverability beats a few chars.
- Don't change `name` unless operation_type=replace; renaming breaks the folder-slug...
