pith. machine review for the scientific record.

arxiv: 2605.13716 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.MA


SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

Hongji Pu, Liang Zhao, Xinyuan Song

Pith reviewed 2026-05-14 17:48 UTC · model grok-4.3

classification 💻 cs.SE cs.MA
keywords: skill, skillops, libraries, agents, language, large, library, maintenance

The pith

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model agents use libraries of skills for complex tasks, but these libraries accumulate defects as skills are added, reused, or linked to dependencies that change. The paper calls this skill technical debt. SkillOps treats each skill as a contract that records its preconditions, operation, artifacts, validators, and failure modes. It builds a graph of how skills depend on each other and checks the whole library for utility, compatibility, risk, and validation problems. The system then produces a cleaned library that existing agents can use without changing their own code. On the ALFWorld benchmark the maintained library alone reaches 79.5 percent success, beating the best prior method by 8.8 points while requiring almost no extra large-language-model calls during library upkeep.
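
The contract idea is easy to make concrete. The sketch below is a hypothetical rendering of the (P, O, A, V, F) typing described in the paper; the field names, the `is_validated` rule, and the example skill are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass

@dataclass
class SkillContract:
    """Hypothetical sketch of a typed Skill Contract (P, O, A, V, F)."""
    name: str
    preconditions: list[str]   # P: what must hold before the skill runs
    operation: str             # O: what the skill does
    artifacts: list[str]       # A: outputs the skill produces
    validators: list[str]      # V: checks that confirm the artifacts
    failure_modes: list[str]   # F: known ways the skill can go wrong

    def is_validated(self) -> bool:
        # A skill that produces artifacts but has no validators is a
        # validation-debt candidate under this toy rule.
        return not self.artifacts or bool(self.validators)

grab = SkillContract(
    name="pick_up_object",
    preconditions=["agent at object location"],
    operation="grasp target object",
    artifacts=["object in inventory"],
    validators=["inventory contains object"],
    failure_modes=["object not reachable"],
)
print(grab.is_validated())  # True: every artifact is covered by a validator
```

Recording failure modes alongside validators is what lets a later maintenance pass reason about a skill without re-running it.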

Core claim

On ALFWorld, SkillOps achieves 79.5 percent task success as a standalone agent, outperforming the strongest baseline by 8.8 percentage points with no additional task-time large language model calls.

Load-bearing premise

That rule-based diagnosis across the four health dimensions (utility, compatibility, risk, validation) can reliably detect and repair library-level defects without task-specific LLM calls or human oversight.
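
As an entirely illustrative rendering of that premise, a rule-based pass over the four dimensions might look like the sketch below; the skill-record fields and the specific rules are assumptions, not the paper's implementation.

```python
# Toy rule-based health check over the four dimensions named in the paper
# (utility, compatibility, risk, validation). No LLM calls are involved.

def diagnose(skill: dict, library: dict[str, dict]) -> dict[str, bool]:
    """Return a per-dimension flag: True means the dimension looks unhealthy."""
    deps = skill.get("deps", [])
    return {
        # utility: a skill never retrieved or executed is dead weight
        "utility": skill.get("uses", 0) == 0,
        # compatibility: a dependency missing from the library breaks composition
        "compatibility": any(d not in library for d in deps),
        # risk: depending on a skill with a failure history propagates risk
        "risk": any(library.get(d, {}).get("failed", False) for d in deps),
        # validation: artifacts without any validator cannot be checked
        "validation": bool(skill.get("artifacts")) and not skill.get("validators"),
    }

lib = {
    "open_drawer": {"uses": 5, "failed": True},
    "take_knife": {"uses": 2, "deps": ["open_drawer"],
                   "artifacts": ["knife in hand"], "validators": []},
}
print(diagnose(lib["take_knife"], lib))
# flags risk (failed dependency) and validation (no validator)
```

The point of the premise is that checks of this shape are cheap enough to run over the whole library on every maintenance pass.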

Figures

Figures reproduced from arXiv: 2605.13716 by Hongji Pu, Liang Zhao, Xinyuan Song.

Figure 1: SkillOps System Architecture. The Hierarchical Skill Ecosystem Graph (HSEG) comprises two levels: (1) an Internal Skill Graph that models each skill as a contract graph over Precondition (P), Operation (O), Artifact (A), Validator (V), and Failure Mode (F) nodes; and (2) an External Graph-of-Graphs connecting skills via typed dependency (dep), compatibility (comp), redundancy (red), and alternative (alt) …
Figure 2: Compact SkillOps algorithms. The Task-Time Loop plans and repairs the current execution, while the Library-Time Loop converts execution traces into persistent skill-library updates. CGPD: ContractGraph-Propagated Diagnosis. Standard health diagnosis evaluates each skill independently; CGPD is an additional advanced component that propagates risk scores along dep edges, enabling preemptive validator ins…
Figure 3: Maintenance cost summary. The library-time maintenance pass uses nearly zero LLM calls at all scales, while task-time token changes are mostly neutral or negative.
Figure 4: Noise-graded library scaling. SkillOps remains stable as the library grows from 200 to 2000 skills, while retrieval-heavy baselines degrade under increasing noise.
Figure 5: Per-task-type SR. Results are reported for the 200-skill library, pooled over 3 seeds.
Figure 6: H1 main results. Task SR on ALFWorld at the 200-skill scale, pooled over 3 seeds. Error bars show Wilson 95% confidence intervals. (Bar values recoverable from the figure: SkillOps-Full 79.5%, NoCGPD 79.0%, NoRetire 73.2%, NoInternalGraph 72.2%, NoLibrary 71.9%, NoMerge 71.9%, NoExternalGraph 64.6%, NoRepair 55.9%, NoValidator 38.0%, NoTask 15.7%, NoAdapter 13.2%.)
Figure 7: H3 ablation visualization. Each bar reports task SR after removing one SkillOps component.
Figure 8: Token scaling with library size. SkillOps emits nearly zero task-time tokens across all evaluated scales, while LLM-based baselines keep nonzero token budgets.
Figure 9: k-sensitivity visualization. Increasing k improves some retrieval baselines, but SkillOps remains clearly stronger across all tested library sizes.
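
Figure 2 credits part of the design to CGPD, which propagates risk scores along dependency edges so validators can be inserted before a downstream failure surfaces. A minimal sketch of such propagation follows; the decay factor, iteration scheme, and example risks are assumptions, not the paper's algorithm.

```python
# Toy risk propagation along dep edges, in the spirit of CGPD: a skill
# inherits (decayed) risk from the skills it depends on.

def propagate_risk(base_risk: dict[str, float],
                   dep_edges: dict[str, list[str]],
                   decay: float = 0.5,
                   rounds: int = 3) -> dict[str, float]:
    """Each round, every skill takes the max of its own risk and the
    decayed risk of its dependencies."""
    risk = dict(base_risk)
    for _ in range(rounds):
        updated = {}
        for skill, deps in dep_edges.items():
            inherited = max((decay * risk.get(d, 0.0) for d in deps), default=0.0)
            updated[skill] = max(risk.get(skill, 0.0), inherited)
        risk.update(updated)
    return risk

# heat_object depends on open_microwave, which has a known failure history
edges = {"heat_object": ["open_microwave"], "open_microwave": []}
risk = propagate_risk({"open_microwave": 0.8}, edges)
print(risk["heat_object"])  # 0.4, inherited before heat_object ever fails
```

Running a few rounds lets risk travel along chains of dependencies, which is what makes preemptive validator insertion possible at all.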
Original abstract

Large language model agents increasingly rely on skill libraries for multi-step tasks, yet these libraries can accumulate persistent defects as skills are added, reused, patched, and linked to changing dependencies. We call this failure mode skill technical debt: library-level defects that may not break a single skill locally but can harm future retrieval, composition, and execution. Existing skill-based agents mainly focus on task-time retrieval, planning, and repair, while library-time maintenance remains underexplored. We propose SkillOps, a method-agnostic plug-in framework for maintaining skill libraries. SkillOps represents each skill as a typed Skill Contract (P, O, A, V, F), organizes skills with a Hierarchical Skill Ecosystem Graph, and diagnoses library health across utility, compatibility, risk, and validation dimensions. Given a raw skill library, SkillOps produces a maintained library that can be used by existing retrieval or planning agents without changing their internal code. On ALFWorld, SkillOps achieves 79.5 percent task success as a standalone agent, outperforming the strongest baseline by 8.8 percentage points with no additional task-time large language model calls. As a plug-in layer, it improves retrieval-heavy baselines by 0.68 to 2.90 percentage points. The current rule-based maintenance implementation uses nearly zero library-time large language model calls or tokens, showing that skill-library maintenance can be added as a low-overhead architectural layer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

No significant circularity; empirical ALFWorld results are external measurements

full rationale

The paper's core claim is an empirical benchmark: SkillOps achieves 79.5% task success on ALFWorld as a standalone agent, outperforming the strongest baseline by 8.8 pp with zero additional task-time LLM calls. This is reported as an external evaluation of the maintained library rather than a quantity computed from internal equations, fitted parameters, or self-referential definitions. The framework description (Skill Contract (P, O, A, V, F), Hierarchical Skill Ecosystem Graph, and four health dimensions) is presented as an architectural layer whose maintenance rules are rule-based and low-overhead, but the success metric is measured against external baselines and does not reduce to those rules by construction. No self-citation chains, ansatzes, or renamings of known results are invoked as load-bearing steps in the provided derivation. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the domain assumption that skill technical debt manifests in measurable utility, compatibility, risk, and validation dimensions that can be diagnosed by rule-based checks without LLM calls.

axioms (1)
  • domain assumption Skill libraries accumulate persistent defects that harm future retrieval and composition even when individual skills appear locally correct.
    Stated as the core motivation for library-time maintenance.
invented entities (2)
  • Skill Contract (P, O, A, V, F) no independent evidence
    purpose: Typed representation of each skill for diagnosis and maintenance
    New data structure introduced to enable library health checks.
  • Hierarchical Skill Ecosystem Graph no independent evidence
    purpose: Organizes skill dependencies for compatibility and risk analysis
    New graph structure for modeling library-level interactions.

pith-pipeline@v0.9.0 · 5559 in / 1380 out tokens · 38572 ms · 2026-05-14T17:48:34.580765+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 25 canonical work pages · 14 internal anchors

  2. [2]

     SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

     Xinyi Li and others. SkillsBench: Benchmarking how well agent skills work across diverse tasks, 2026. URL https://arxiv.org/abs/2602.12670


  24. [29]

    Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales

    Stefan Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922

  25. [30]

    SkillsBench : A benchmark for evaluating LLM agent skills

    BenchFlow AI . SkillsBench : A benchmark for evaluating LLM agent skills. https://github.com/benchflow-ai/skillsbench, 2026. Apache-2.0 License

  26. [31]

    Alessandro Berti, Sebastiaan van Zelst, and Wil M. P. van der Aalst. Process mining for Python (PM4Py): Bridging the gap between process- and data science, 2019. URL https://arxiv.org/abs/1905.06169

  27. [32]

    CUA-Skill: Develop Skills for Computer Using Agent

    Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Suzhen Zheng, Leon Xu, Hao Fan, Pashmina Cameron, Justin Wagle, and Kazuhito Koishida. CUA-Skill: Develop skills for computer using agent, 2026. URL https://arxiv.org/abs/2601.21123

  28. [33]

    The WyCash portfolio management system

    Ward Cunningham. The WyCash portfolio management system. OOPSLA '92 Experience Report, 1992. URL http://c2.com/doc/oopsla92.html. Original coining of the technical debt metaphor

  29. [34]

    LEGOMem : Modular procedural memory for multi-agent LLM systems for workflow automation, 2025

    Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, and Saravan Rajmohan. LEGOMem: Modular procedural memory for multi-agent LLM systems for workflow automation, 2025. URL https://arxiv.org/abs/2510.04851

  30. [35]

    Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

    Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, and Lichao Sun. Graph of Skills : Dependency-aware structural retrieval for massive agent skills, 2026. URL https://arxiv.org/abs/2604.05333

  31. [36]

    Test-Driven Development and LLM-Based Code Generation

    Noble Saji Mathews and Meiyappan Nagappan. Test-Driven Development and LLM-Based Code Generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024. doi:10.1145/3691620.3695527. URL https://arxiv.org/abs/2402.13521

  32. [37]

    Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, and Long T. Le. VeriGuard : Enhancing LLM agent safety via verified code generation, 2025. URL https://arxiv.org/abs/2510.05156

  33. [38]

    GPT-4o System Card

    OpenAI . GPT-4o System Card , 2024. URL https://arxiv.org/abs/2410.21276

  34. [39]

    Gyunam Park and Wil M. P. van der Aalst. Action-oriented process mining: Bridging the gap between insights and actions. Progress in Artificial Intelligence, 2022. doi:10.1007/s13748-022-00281-7

  35. [40]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs . In Advances in Neural Information Processing Systems, volume 37, 2024. doi:10.52202/079017-4020. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html

  36. [41]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM : Facilitating large language models to master 16000+ real-world APIs . In Proceedings of the 12th International Confere...

  37. [42]

    Machine Learning: The High Interest Credit Card of Technical Debt

    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine learning: The high interest credit card of technical debt. In SE4ML : Software Engineering for Machine Learning, NIPS 2014 Workshop , 2014. URL https://research.google/pubs/machine-learning-the-high-interest-credit-card-of-techn...

  38. [43]

    Hidden Technical Debt in Machine Learning Systems

    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems, volume 28, 2015. URL https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-...

  39. [44]

    SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

    Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, and Jian Ma. SKILLFOUNDRY : Building self-evolving agent skill libraries from heterogeneous scientific resources, 2026. URL https://arxiv.org/abs/2604.03964

  40. [45]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT : Solving AI tasks with ChatGPT and its friends in Hugging Face . In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2303.17580

  41. [46]

    ALFRED : A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED : A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. URL https://arxiv.org/abs/1912.01734

  42. [47]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In Proceedings of the 9th International Conference on Learning Representations, 2021. URL https://arxiv.org/abs/2010.03768

  43. [48]

    Wil M. P. van der Aalst. Process Mining: Data Science in Action. Springer Berlin Heidelberg, 2nd edition, 2016. ISBN 978-3-662-49850-7. doi:10.1007/978-3-662-49851-4

  44. [49]

    SkillX: Automatically Constructing Skill Knowledge Bases for Agents

    Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, and Shumin Deng. SkillX : Automatically constructing skill knowledge bases for agents, 2026. URL https://arxiv.org/abs/2604.04804

  45. [50]

    Voyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=ehfRiF0R3a

  46. [51]

    Reinforcement learning for self-improving agent with skill library, 2025

    Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library, 2025. URL https://arxiv.org/abs/2512.17102

  47. [52]

    GraSP: Graph-Structured Skill Compositions for LLM Agents

    Tianle Xia, Lingxiang Hu, Yiding Sun, Ming Xu, Lan Xu, Siying Wang, Wei Xu, and Jie Jiang. GraSP : Graph-structured skill compositions for LLM agents, 2026. URL https://arxiv.org/abs/2604.17870

  48. [53]

    ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models

    Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, and Dongkuan Xu. ReWOO : Decoupling reasoning from observations for efficient augmented language models, 2023. URL https://arxiv.org/abs/2305.18323

  49. [54]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. A-MEM : Agentic memory for LLM agents, 2025. URL https://arxiv.org/abs/2502.12110

  50. [55]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct : Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2210.03629

  51. [56]

    SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

    Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. SkillWeaver : Web agents can self-improve by discovering and honing skills, 2025. URL https://arxiv.org/abs/2504.07079

  52. [57]

    Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao, Leonardo F. R. Ribeiro, Markus Dreyer, Sean Ammirati, and Chenyan Xiong. SkillLearnBench : Benchmarking continual learning methods for agent skill generation on real-world tasks, 2026. URL https://arxiv.org/abs/2604.20087