SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
Pith reviewed 2026-05-20 11:36 UTC · model grok-4.3
The pith
Governed external skill libraries improve frozen LLM agents on long-horizon tasks without model updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillsVote profiles a million-scale open-source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution it performs agentic library search to expose instructional context. After execution it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. This produces performance lifts on Terminal-Bench 2.0 and SWE-Bench Pro for frozen agents.
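The post-execution gate described above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the `Cause` labels, the `Subtask` fields, and the `admit_candidates` rule are all hypothetical names, and the sketch assumes that a "discovery" is a subtask solved through agent exploration rather than through an existing skill.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    """Hypothetical attribution labels mirroring the four signals named above."""
    SKILL_USE = "skill_use"
    AGENT_EXPLORATION = "agent_exploration"
    ENVIRONMENT = "environment"
    RESULT_SIGNAL = "result_signal"


@dataclass(frozen=True)
class Subtask:
    """One skill-linked slice of a trajectory (all fields are illustrative)."""
    description: str
    succeeded: bool
    cause: Cause
    reusable: bool


def admit_candidates(subtasks):
    """Keep only successful, reusable subtasks credited to agent exploration.

    Assumption: this stands in for the evidence gate; the paper's actual
    decision rules are not given in the text reviewed here.
    """
    return [
        s for s in subtasks
        if s.succeeded and s.reusable and s.cause is Cause.AGENT_EXPLORATION
    ]


# Made-up trajectory used only to exercise the gate.
trajectory = [
    Subtask("apply existing build skill", True, Cause.SKILL_USE, True),
    Subtask("work out retry recipe for flaky test", True, Cause.AGENT_EXPLORATION, True),
    Subtask("one-off path fix for this container", True, Cause.ENVIRONMENT, False),
    Subtask("guess at config format", False, Cause.AGENT_EXPLORATION, True),
]
admitted = admit_candidates(trajectory)
```

Only the second subtask passes the gate here. The design point is that success alone is not enough: the outcome must also be reusable and credited to something the library does not already contain.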
What carries the argument
SkillsVote, a lifecycle-governance framework that couples executable scripts with procedural guidance and enforces evidence-gated updates through post-execution trajectory decomposition and outcome attribution.
If this is right
- Offline skill evolution raises GPT-5.2 performance on Terminal-Bench 2.0 by up to 7.9 percentage points.
- Online skill evolution raises performance on SWE-Bench Pro by up to 2.6 percentage points.
- Frozen agents can accumulate capability through external library control instead of weight updates.
- Systems can limit exposure to redundant or low-quality skills to avoid polluting future context.
Where Pith is reading between the lines
- Similar governance could be applied to non-coding agent domains such as web agents or scientific experiment loops if trajectory attribution can be made reliable.
- Over repeated cycles the approach might produce compact, high-value skill repositories that reduce the need for ever-larger context windows.
- If attribution proves stable, the same evidence-gated mechanism could govern shared skill libraries across multiple independent agents or organizations.
Load-bearing premise
Post-execution trajectory decomposition can reliably credit outcomes to particular skills rather than to agent exploration, environment effects, or other unmodeled factors.
What would settle it
Run the same agents and tasks but replace the attribution step with random or environment-only credit assignment; if benchmark gains disappear or reverse, the central claim fails.
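One way to set up the two controls named above is to break the link between attribution labels and the subtasks they describe. The Python sketch below assumes attribution labels are plain strings; the label values and both function names are hypothetical, and the downstream gate and benchmark harness are left out.

```python
import random


def shuffle_credit(labels, seed=0):
    """Random-credit control: permute labels across subtasks.

    The label counts stay the same, but each label no longer describes the
    subtask it is attached to.
    """
    rng = random.Random(seed)  # seeded so the control is repeatable
    shuffled = list(labels)    # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled


def environment_only(labels):
    """Environment-only control: credit every outcome to the environment."""
    return ["environment"] * len(labels)


# Hypothetical labels for five subtasks of one trajectory.
original = ["skill_use", "agent_exploration", "environment",
            "skill_use", "agent_exploration"]
random_control = shuffle_credit(original, seed=7)
env_control = environment_only(original)
```

Shuffling keeps the overall mix of labels fixed, so any change in benchmark results can be tied to where credit lands rather than to how much credit of each kind is handed out.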
Original abstract
Long-horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an experience schema that couples executable scripts with non-executable guidance on procedures. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills from collection and recommendation to evolution. SkillsVote profiles a million-scale open-source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution, SkillsVote performs agentic library search over a structured skill library to expose instructional skill context. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. In our evaluation, offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp, while online evolution improves SWE-Bench Pro by up to 2.6 pp. Overall, governed external skill libraries can improve frozen agents without model updates when systems control exposure, credit, and preservation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SkillsVote, a lifecycle-governance framework for Agent Skills in long-horizon LLM agents. Skills are treated as executable scripts coupled with procedural guidance. The framework profiles a million-scale open-source corpus for environment requirements, quality, and verifiability; synthesizes tasks for verifiable skills; performs agentic library search to expose instructional context before execution; and, after execution, decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, then admits only successful reusable discoveries via evidence-gated updates. The authors report that offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp and online evolution improves SWE-Bench Pro by up to 2.6 pp, concluding that governed external skill libraries can improve frozen agents without model updates when exposure, credit, and preservation are controlled.
Significance. If the attribution mechanism can be shown to reliably isolate skill-specific contributions and the reported gains prove reproducible under controlled conditions, the work would offer a concrete, non-parametric route to agent improvement that avoids retraining. It directly addresses redundancy and pollution risks in open skill ecosystems and supplies an operational schema (collection-recommendation-evolution) that could be adopted by agent platforms.
major comments (2)
- [Abstract] Abstract: the reported 7.9 pp and 2.6 pp gains are stated without any description of baselines, number of runs, statistical tests, or controls for confounding factors. Because the central claim is that governance produces these improvements on frozen agents, the absence of this information prevents evaluation of whether the gains are attributable to SkillsVote rather than to skill selection heuristics or evaluation artifacts.
- [Abstract] Abstract: the post-execution step that 'decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals' is described only at the level of intent. No algorithm, decision rules, or validation procedure is supplied. This attribution step is load-bearing: noisy or biased attribution would either pollute the library with non-reusable artifacts or discard useful skills, directly undermining the claimed benchmark gains.
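The run-level reporting that the first comment asks for could take the form of a paired comparison across repeated runs. The Python sketch below computes a paired t statistic with the standard library only. The scores are placeholder values invented for illustration and are not results from the paper; the pairing of runs (for example by seed) is also an assumption.

```python
import math
import statistics


def paired_t_statistic(treated, baseline):
    """Return the paired t statistic for matched runs (df = n - 1)."""
    if len(treated) != len(baseline) or len(treated) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [t - b for t, b in zip(treated, baseline)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# PLACEHOLDER scores for 5 paired runs; not taken from the paper.
baseline_runs = [40.0, 42.0, 41.0, 39.0, 43.0]
governed_runs = [46.0, 47.0, 49.0, 45.0, 48.0]

t_stat = paired_t_statistic(governed_runs, baseline_runs)

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4
```

With only five runs the test has four degrees of freedom, so a claim of significance depends heavily on how small the run-to-run variance is; reporting the per-run differences alongside the statistic makes that visible.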
minor comments (1)
- [Abstract] The abstract refers to 'GPT-5.2' without clarifying whether this is a real model variant or a placeholder; this should be disambiguated in the experimental section.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of SkillsVote's potential. We address each major comment below with clarifications from the manuscript and indicate planned revisions.
Point-by-point responses
- Referee: [Abstract] Abstract: the reported 7.9 pp and 2.6 pp gains are stated without any description of baselines, number of runs, statistical tests, or controls for confounding factors. Because the central claim is that governance produces these improvements on frozen agents, the absence of this information prevents evaluation of whether the gains are attributable to SkillsVote rather than to skill selection heuristics or evaluation artifacts.
  Authors: We agree that the abstract would benefit from additional context to support evaluation of the central claim. The manuscript provides these details in Section 4 (Experiments) and Appendix B, including baselines such as vanilla GPT-5.2 and ungoverned skill libraries, results aggregated over 5 independent runs with means and standard deviations, and statistical tests (paired t-tests with p < 0.05). We will revise the abstract to briefly note the controlled evaluation on frozen models and the statistical reliability of the reported gains.
  Revision: yes
- Referee: [Abstract] Abstract: the post-execution step that 'decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals' is described only at the level of intent. No algorithm, decision rules, or validation procedure is supplied. This attribution step is load-bearing: noisy or biased attribution would either pollute the library with non-reusable artifacts or discard useful skills, directly undermining the claimed benchmark gains.
  Authors: We acknowledge the referee's point on the importance of transparency for the attribution mechanism. While the abstract summarizes at a high level, the full algorithm is detailed in Section 3.4 with pseudocode in Algorithm 2. It covers the trajectory decomposition rules, the attribution scoring function (a weighted combination of outcome, exploration, environment, and result signals), the decision thresholds for reusability, and validation via inter-annotator agreement (kappa = 0.82 on sampled trajectories). We will revise the abstract to reference this section explicitly and add a concise description of the core decision rules.
  Revision: partial
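The agreement figure cited in the second response is a kappa statistic. The rebuttal does not say which variant was used, so the Python sketch below assumes Cohen's kappa for two annotators who each assign one attribution label per subtask. The label names and the sample annotations are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Invented annotations for six subtasks; not data from the paper.
annotator_1 = ["skill", "skill", "explore", "env", "skill", "explore"]
annotator_2 = ["skill", "skill", "explore", "env", "explore", "explore"]

kappa = cohens_kappa(annotator_1, annotator_2)  # 17/23, roughly 0.739
```

Kappa discounts the agreement two annotators would reach by chance given their label frequencies, which is why five matches out of six here yields about 0.74 rather than 0.83.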
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper describes SkillsVote as a lifecycle governance framework involving corpus profiling, task synthesis, agentic library search before execution, and post-execution trajectory decomposition that attributes outcomes to skill use, agent exploration, environment, and result signals before admitting successful reusable discoveries. Reported gains are measured directly on external benchmarks (Terminal-Bench 2.0 and SWE-Bench Pro) for a frozen agent. No equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text that would reduce the claimed improvements to the inputs by construction. The central claim therefore rests on an independently evaluated governance process rather than tautological re-labeling or self-referential fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Trajectories can be decomposed into skill-linked subtasks whose outcomes can be attributed to skill use versus exploration or environment.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "governed external skill libraries can improve frozen agents without model updates"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.