SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
Pith reviewed 2026-06-30 16:15 UTC · model grok-4.3
The pith
LLM agents rarely convert episodic trajectories into robust reusable procedural skills, with raw-trajectory reuse outperforming distilled skill libraries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillEvolBench organizes tasks into role-conditioned families sharing latent procedures and compares self-generated skill evolution against curated starts, no-skill baselines, and raw-trajectory controls. It finds that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay on some axes, yet these gains collapse under frozen deployment, which tests context shift, adversarial shortcuts, and composition. Raw-trajectory reuse frequently outperforms distilled skills, indicating that abstraction procedures discard contextual and procedural cues that are still useful later. Capacity analyses show that writing more skills or larger librar...
What carries the argument
SkillEvolBench, a diagnostic benchmark that separates procedural abstraction from base capability and direct episodic reuse through role-conditioned task families, compacted-trajectory updates with verifier feedback, and frozen deployment axes (context shift, adversarial shortcuts, composition).
If this is right
- Skill-based conditions sometimes improve acquisition or replay but the gains remain unstable when deployment is frozen.
- Raw-trajectory reuse outperforms distilled skills, showing that current abstraction discards useful contextual cues.
- Writing more skills or larger libraries increases coverage yet also introduces episode-specific drift and procedural clutter.
- Individual models can gain on specific deployment axes but these improvements do not generalize across the full test set.
- The benchmark positions experience-to-skill conversion as a measurable step separate from base capability.
Where Pith is reading between the lines
- If the isolation holds, future work could test whether different compaction or verification methods preserve more of the discarded cues that raw trajectories retain.
- The finding that library size alone fails suggests examining whether retrieval mechanisms, rather than skill count, determine whether procedural knowledge transfers.
- One extension would be to measure whether the same patterns appear when agents operate without external verifiers or when task families share fewer latent procedures.
- The results imply that progress on reusable skills may require changes to how trajectories are selected for distillation rather than simply increasing update volume.
Load-bearing premise
The chosen task families, update procedure, and deployment axes isolate procedural abstraction from base capability, curated priors, and direct episodic reuse.
What would settle it
A run in which distilled skill libraries produce higher success rates than raw-trajectory controls on the frozen deployment tasks across the ten model configurations and three agent harnesses.
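The settling condition above can be phrased as a simple check over per-configuration results. The sketch below is illustrative only: the `Result` record, the condition labels, and the numbers are hypothetical placeholders, not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str           # hypothetical model-configuration label
    harness: str         # hypothetical agent-harness label
    condition: str       # "skills" (distilled library) or "raw" (raw-trajectory control)
    success_rate: float  # fraction of frozen deployment tasks solved


def skills_beat_raw_everywhere(results):
    """Return True only if distilled skills beat the raw-trajectory control
    in every (model, harness) cell that reports both conditions."""
    cells = {}
    for r in results:
        cells.setdefault((r.model, r.harness), {})[r.condition] = r.success_rate
    paired = [c for c in cells.values() if {"skills", "raw"} <= set(c)]
    return bool(paired) and all(c["skills"] > c["raw"] for c in paired)


# Made-up numbers, for illustration only.
demo = [
    Result("model-a", "harness-1", "skills", 0.41),
    Result("model-a", "harness-1", "raw", 0.47),
    Result("model-b", "harness-1", "skills", 0.52),
    Result("model-b", "harness-1", "raw", 0.50),
]
print(skills_beat_raw_everywhere(demo))  # False: model-a favours raw trajectories
```

A strict "every cell" criterion is one reading of "across the ten model configurations and three agent harnesses"; a majority or averaged criterion would be an equally reasonable alternative.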
Original abstract
Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.
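To make the protocol concrete, here is a minimal sketch of the acquire, update, then freeze-and-deploy loop that the abstract describes. Every name in it (`Condition`, `run_family`, and the `solve` and `distill` callables) is a hypothetical stand-in, and the update step is reduced to a single `distill` call; the paper's actual harness and compaction procedure are not shown in this text.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Condition(Enum):
    NO_SKILL = "no_skill"              # base capability only
    RAW_TRAJECTORY = "raw"             # reuse episodic traces directly
    SELF_GENERATED = "self_generated"  # library starts empty, agent writes skills
    CURATED_START = "curated_start"    # library starts from curated skills


def run_family(
    condition: Condition,
    acquisition_tasks: List[str],
    deployment_tasks: List[str],
    solve: Callable[[str, Tuple[str, ...]], Tuple[bool, str]],
    distill: Callable[[str, bool], str],
    curated: Tuple[str, ...] = (),
) -> float:
    """Return the deployment success rate for one task family."""
    memory: List[str] = list(curated) if condition is Condition.CURATED_START else []
    for task in acquisition_tasks:
        passed, trajectory = solve(task, tuple(memory))
        if condition is Condition.RAW_TRAJECTORY:
            memory.append(trajectory)                   # keep the trace as-is
        elif condition in (Condition.SELF_GENERATED, Condition.CURATED_START):
            memory.append(distill(trajectory, passed))  # abstract into a skill
    frozen = tuple(memory)  # no further updates during deployment
    outcomes = [solve(task, frozen)[0] for task in deployment_tasks]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Stub agent: succeeds whenever it has any memory at all (illustration only).
def stub_solve(task, memory):
    return (len(memory) > 0, f"trace:{task}")


def stub_distill(trajectory, passed):
    return f"skill-from:{trajectory}"


for cond in Condition:
    rate = run_family(cond, ["acq-1"], ["dep-1", "dep-2"], stub_solve, stub_distill)
    print(cond.value, rate)
```

The key design point is the `frozen` tuple: whatever the condition stored during acquisition is fixed before any deployment task runs, which is what lets deployment results be attributed to the stored memory rather than to further adaptation.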
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SkillEvolBench, a benchmark with 180 tasks across six real-world agent environments organized into role-conditioned task families sharing latent procedures. Agents acquire experience on initial tasks, update an external skill library via compacted trajectories and verifier feedback, then undergo frozen deployment testing on axes of context shift, adversarial shortcuts, and composition. Comparisons against no-skill and raw-trajectory controls are used to separate procedural abstraction from base capability and direct episodic reuse. Empirical results across ten model configurations and three harnesses indicate that agents typically adapt locally but rarely form robust reusable skills, that raw-trajectory reuse often outperforms distilled skills, and that expanding skill libraries does not reliably improve outcomes due to drift and clutter.
Significance. If the benchmark construction holds, the work supplies a structured diagnostic for when episodic experience yields durable procedural knowledge rather than task-local memory. The inclusion of explicit controls for raw-trajectory reuse and the multi-harness, multi-model evaluation provide a concrete basis for the headline claims about instability of skill gains under frozen deployment. The capacity analyses further quantify that simply scaling skill libraries is insufficient, offering a falsifiable direction for future abstraction methods.
major comments (2)
- [Benchmark Design] Section on benchmark construction (task families and deployment axes): the claim that the chosen task families, compacted-trajectory updates, and frozen deployment axes isolate procedural abstraction from base capability and direct episodic reuse rests on the assumption that shared latent procedures do not leak into raw-trajectory performance; this separation is load-bearing for the central empirical finding yet receives only descriptive justification rather than an explicit control or ablation showing that task similarity alone cannot explain the raw-trajectory advantage.
- [Empirical Evaluation] Results across ten model configurations: the abstract states that gains are 'unstable under frozen deployment' and that raw-trajectory reuse 'frequently outperforms,' but the reported findings lack per-axis statistical tests, run-to-run variance, or explicit data-exclusion rules; without these, the support for the instability claim cannot be fully verified and weakens the cross-harness generalization.
minor comments (2)
- [Abstract] The term 'compacted-trajectory update procedure' is used in the abstract without a one-sentence definition; adding a brief gloss at first use would aid readers unfamiliar with the update mechanics.
- [Figures] Figure captions for the deployment-axis results should explicitly state the number of runs and error bars used, consistent with the multi-configuration evaluation described in the text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below.
Point-by-point responses
- Referee: [Benchmark Design] Section on benchmark construction (task families and deployment axes): the claim that the chosen task families, compacted-trajectory updates, and frozen deployment axes isolate procedural abstraction from base capability and direct episodic reuse rests on the assumption that shared latent procedures do not leak into raw-trajectory performance; this separation is load-bearing for the central empirical finding yet receives only descriptive justification rather than an explicit control or ablation showing that task similarity alone cannot explain the raw-trajectory advantage.
  Authors: The raw-trajectory control is the explicit mechanism isolating direct episodic reuse from abstracted skills, with comparisons to no-skill baselines further separating base capability. Task families are role-conditioned to share latent procedures while varying surface forms, contexts, and adversarial elements by design. We can expand the benchmark construction section with additional justification for why surface similarity alone cannot account for the observed raw-trajectory patterns, but a new ablation experiment is not required given the existing controls.
  revision: partial
- Referee: [Empirical Evaluation] Results across ten model configurations: the abstract states that gains are 'unstable under frozen deployment' and that raw-trajectory reuse 'frequently outperforms,' but the reported findings lack per-axis statistical tests, run-to-run variance, or explicit data-exclusion rules; without these, the support for the instability claim cannot be fully verified and weakens the cross-harness generalization.
  Authors: We agree that per-axis statistical tests, run-to-run variance reporting, and explicit data-exclusion rules would strengthen verifiability of the instability and outperformance claims. The current results emphasize aggregate trends across ten model configurations and three harnesses, but we will add the requested statistical details and clarifications in the revision.
  revision: yes
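One way to report the run-to-run variance the referee asks for is a per-axis mean and sample standard deviation over repeated runs, plus a paired bootstrap interval on the skills-minus-raw difference. The sketch below uses only the Python standard library. The axis names follow the paper, but the run scores are invented for illustration, and the choice of a percentile bootstrap is an assumption rather than the authors' stated method.

```python
import random
import statistics


def summarize(scores):
    """Mean and sample standard deviation across repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_bootstrap_ci(skills, raw, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (skills - raw)."""
    if len(skills) != len(raw):
        raise ValueError("paired runs must have equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [s - r for s, r in zip(skills, raw)]
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Invented per-run success rates: (skills runs, raw-trajectory runs) per axis.
runs = {
    "context_shift": ([0.40, 0.44, 0.38, 0.42], [0.47, 0.49, 0.45, 0.48]),
    "adversarial_shortcuts": ([0.31, 0.36, 0.29, 0.33], [0.35, 0.34, 0.37, 0.36]),
    "composition": ([0.22, 0.27, 0.20, 0.25], [0.30, 0.28, 0.31, 0.29]),
}
for axis, (skills, raw) in runs.items():
    mean, sd = summarize(skills)
    lo, hi = paired_bootstrap_ci(skills, raw)
    print(f"{axis}: skills mean={mean:.3f} sd={sd:.3f} diff CI=({lo:.3f}, {hi:.3f})")
```

An interval that excludes zero on one axis but not on another would be a direct, per-axis way to express the "unstable under frozen deployment" claim.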
Circularity Check
No significant circularity
full rationale
The paper is a benchmark proposal whose core contribution consists of task families, update procedures, and empirical comparisons across models, harnesses, and controls (no-skill, raw-trajectory). No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; all claims are externally falsifiable via the reported runs on the 180 tasks and deployment axes. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
A Complete Family Catalog
This appendix catalogs the full set of environment-level skill families in SkillEvolBench. Each family corresponds to one procedural skill and contains six role-instantiated tasks.
Table 3: Complete skill-family catalog. Descriptions are taken fro...
- **Task object(s)** -- the concrete things the skill operates on. Examples: `Dockerfile`, `Express middleware`, `JSON schema`, `LaTeX table`, `pytest fixture`, `verifier output`, `npm package`, `bearer token`, `trajectory.json`. List 2-4.
- **Action verb(s)** -- what the skill DOES with those objects. Examples: `inspect`, `debug`, `validate`, `summarize`, `patch`, `compare`, `extract`, `aggregate`, `refactor`, `migrate`. List 2-4.
- **WHEN scenarios** -- 2-3 concrete trigger contexts in "Use this skill when..." form. Include INDIRECT phrasings users actually type ("the test is timing out", "the verifier reports X", "the JSON file in /tmp doesn't validate"). Synonyms count.
- **Optional: boundary conditions** -- "Do not use this skill for X / Y" lines. Reduces false-positive triggers when neighbouring skills exist.
### Style:
- **Third-person / imperative voice**: "This skill should be used when..." or "Use when...". NEVER "I help with..." / "you can use this to...".
- **<=1024 chars hard cap.** Aim for 4-8 lines.
- **Cover IN...
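The style constraints above lend themselves to a mechanical lint. The sketch below checks only the two rules that are fully legible in this text: the 1024-character cap and the ban on first- or second-person phrasing. The function name and the phrase list are assumptions made for illustration, and the truncated rules are deliberately not guessed at.

```python
MAX_DESCRIPTION_CHARS = 1024  # hard cap quoted in the guidance above

# Phrasings the guidance says to avoid; this particular list is illustrative.
BANNED_PHRASES = ("i help with", "you can use this to")


def lint_description(description: str) -> list:
    """Return a list of human-readable problems; empty means none were found."""
    problems = []
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description is {len(description)} chars; cap is {MAX_DESCRIPTION_CHARS}"
        )
    lowered = description.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            problems.append(f"avoid first/second-person phrasing: {phrase!r}")
    return problems


print(lint_description("Use this skill when the verifier reports a JSON schema error."))
print(lint_description("I help with debugging Dockerfiles."))
```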
- **Apply the diagnosis_rule** provided in the trial-outcome section. The PASS branch and the FAIL branch ask for different things; do NOT substitute one for the other. If passed and the trajectory shows nothing the skill can absorb, return upsert_files containing only the existing SKILL.md text verbatim -- the parser treats that as NoOp.
- **Respect the layering** (see Progressive Disclosure in the spec):
  - Don't bloat SKILL.md with material that belongs in `references/` (long enums, edge-case catalogues, vendored API docs).
  - Don't shrink the `description` for token savings -- discoverability beats a few chars.
  - Don't change `name` unless operation_type=replace; renaming breaks the folder-slug...
- **operation_type semantics**:
  - `revise` -- normal edit; same name, same intent.
  - `narrow` -- tighten `description` / `## When to use` so the skill no longer triggers on the failing context.
  - `replace` -- full rewrite; only when minor edits cannot fix the pattern.
  - `create` -- introduces a new skill (see rule 5; must be in the same family as the current task).
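The four operation types and the rename restriction quoted above can be encoded as a small enum plus a guard. This is a sketch under stated assumptions: `OperationType`, `rename_allowed`, and `check_update` are hypothetical names, and the real parser's behaviour is not shown in this text.

```python
from enum import Enum


class OperationType(Enum):
    REVISE = "revise"    # normal edit; same name, same intent
    NARROW = "narrow"    # tighten the trigger surface
    REPLACE = "replace"  # full rewrite
    CREATE = "create"    # new skill in the current task's family


def rename_allowed(op: OperationType) -> bool:
    """Per the layering rule above, `name` may change only on replace."""
    return op is OperationType.REPLACE


def check_update(op_value: str, old_name: str, new_name: str) -> OperationType:
    """Parse the operation type and reject renames outside `replace`.

    `create` is exempt here because a new skill has no previous name;
    that exemption is an assumption of this sketch.
    """
    op = OperationType(op_value)  # raises ValueError on unknown values
    if op is not OperationType.CREATE and old_name != new_name and not rename_allowed(op):
        raise ValueError(f"{op.value!r} must not rename {old_name!r} to {new_name!r}")
    return op


print(check_update("narrow", "debug-docker", "debug-docker"))
```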
- **Tier-3 files** (scripts/, references/, assets/) on EXISTING skills MUST be cited from SKILL.md with a clear "when to read / run / copy" trigger. Uncited Tier-3 files are dead weight -- the parser rejects them.
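A check for uncited Tier-3 files could look like the sketch below. It assumes paths are given relative to the skill folder and treats a plain substring match in SKILL.md as a citation; both are simplifications, and it does not verify that the citation carries a "when to read / run / copy" trigger. The function name is hypothetical.

```python
TIER3_DIRS = ("scripts/", "references/", "assets/")


def uncited_tier3_files(skill_md: str, files: list) -> list:
    """Return Tier-3 paths (relative to the skill folder) that SKILL.md never mentions."""
    tier3 = [f for f in files if f.startswith(TIER3_DIRS)]
    return [f for f in tier3 if f not in skill_md]


skill_md = "Run `scripts/check.py` when the verifier reports a schema error."
files = ["SKILL.md", "scripts/check.py", "references/api.md"]
print(uncited_tier3_files(skill_md, files))  # ['references/api.md']
```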
- **Update existing vs create new (SAME FAMILY ONLY).** When the family already has skills (shown in the "Skills in family" block of the user message), your DEFAULT action is to SELECT ONE OR MORE of them and REFINE them in place. Creating a brand-new sibling is the exception, not the default. ADDITIVE BIAS for revisions:
  - PRESERVE the existing SKILL.md st...
[47]
- What context (file paths, schema names, exact APIs, conventions) did the agent have to discover at runtime? Encode it directly so future runs do not
**The trajectory is your evidence.** Treat it like a debrief: - Where did the agent stumble, improvise, retry, or miss the obvious? Those moments become Gotchas, Workflow steps, or bundled scripts. - What context (file paths, schema names, exact APIs, conventions) did the agent have to discover at runtime? Encode it directly so future runs do not. - Skip ...
-
[48]
Use this skill when
**Frontmatter** --`name`and`description`(HARD constraints): `name`MUST equal the slug given in the user message. `description`is the entire trigger surface (Tier-1, always in future agents'context). A weak description = the skill never fires. REQUIREMENTS: a. <=1024 chars, imperative voice ("Use this skill when..."). b. **Keyword density**: list AT LEAST ...
-
[49]
Gotchas is usually the single highest-value section -- write it when the trajectory shows any near-miss or recovery
**Recommended body sections** (omit ones you don't need): `# <title>`->`## When to use`->`## Workflow`->`## Examples` ->`## Gotchas`->`## Output template`. Gotchas is usually the single highest-value section -- write it when the trajectory shows any near-miss or recovery
-
[50]
Alternatives
**Pick ONE default path** in the workflow; relegate alternatives to a brief "Alternatives" line
-
[51]
summary":
**Tier-3 files** are optional. Add a`<slug>/scripts/<file>`only when the trajectory shows the agent re-deriving the same logic or the step is fragile enough that a tested script beats freeform generation. Use`<slug>/references/<file>`for long material loaded on demand. Use`<slug>/assets/<file>`for verbatim templates. Every Tier-3 file MUST be cited from S...
- **No invented gotchas, no invented examples.** You have no execution evidence. Inventing project quirks would mislead the agent. Leave `## Gotchas` and `## Examples` for T1 induction to fill in from the real trace.
- **Frontmatter** -- `name` and `description` (HARD constraints): `name` MUST equal the slug given in the user message. `description` is the entire trigger surface (Tier-1, always in future agents' context). REQUIREMENTS:
  a. <=1024 chars, imperative voice ("Use this skill when...").
  b. **Keyword density**: extract AT LEAST 5 concrete keywords from the FAMILY descr...
- **Body** (only the sections you can honestly write without a trace):
  - `# <title>` -- one line.
  - `## When to use` -- bullet list of trigger conditions (cast wider than the description).
  - `## Workflow` -- best-guess numbered procedure based on the family description. Prescriptive on obviously fragile / order-sensitive steps; permissive on creative ones.
  - `## O...
- **One default path, no menus.** If multiple approaches are plausible, pick the most common one and move on.
- **Keep the body under ~200 lines.** Post-T1 revision will refine it; over-investing now will mostly be rewritten.
## Required JSON output
```
{
  "summary": "<one short sentence>",
  "operation_type": "create",
  "upsert_files": {
    "<slug>/SKILL.md": "<full SKILL.md content, YAML frontmatter first>"
  }
}
```
Legacy single-key form is also accepted for backward...
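A consumer of the JSON output shown above might validate it along the following lines. This is a sketch, not the benchmark's parser: `parse_skill_update` is a hypothetical name, the set of allowed operation types is borrowed from the operation_type list quoted elsewhere on this page, and the legacy single-key form is not handled because its shape is truncated in the source.

```python
import json

# Assumption: the same four operation types apply to every update prompt.
ALLOWED_OPERATIONS = {"revise", "narrow", "replace", "create"}
REQUIRED_KEYS = ("summary", "operation_type", "upsert_files")


def parse_skill_update(raw: str, slug: str) -> dict:
    """Parse a skill-update payload and check the fields visible in the template."""
    payload = json.loads(raw)
    for key in REQUIRED_KEYS:
        if key not in payload:
            raise ValueError(f"missing key: {key}")
    if payload["operation_type"] not in ALLOWED_OPERATIONS:
        raise ValueError(f"unknown operation_type: {payload['operation_type']!r}")
    files = payload["upsert_files"]
    if not isinstance(files, dict) or f"{slug}/SKILL.md" not in files:
        raise ValueError(f"upsert_files must contain {slug}/SKILL.md")
    return payload


example = json.dumps({
    "summary": "Seed skill for validating JSON schemas.",
    "operation_type": "create",
    "upsert_files": {"validate-json/SKILL.md": "---\nname: validate-json\n---\n"},
})
print(parse_skill_update(example, "validate-json")["operation_type"])  # create
```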