pith. machine review for the scientific record.

arxiv: 2605.13716 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.MA


SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

Hongji Pu, Liang Zhao, Xinyuan Song

Pith reviewed 2026-05-14 17:48 UTC · model grok-4.3

classification 💻 cs.SE cs.MA
keywords: skill, skillops, libraries, agents, language, large, library, maintenance

The pith

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model agents use libraries of skills for complex tasks, but these libraries accumulate defects as skills are added, reused, or linked to dependencies that change. The paper calls this skill technical debt. SkillOps treats each skill as a contract that records its preconditions, operation, artifacts, validators, and failure modes. It builds a graph of how skills depend on each other and checks the whole library for utility, compatibility, risk, and validation problems. The system then produces a cleaned library that existing agents can use without changing their own code. On the ALFWorld benchmark the maintained library alone reaches 79.5 percent success, beating the best prior method by 8.8 points while requiring almost no extra large-language-model calls during library upkeep.
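
The contract idea is easy to make concrete. The sketch below is a hypothetical rendering of the (P, O, A, V, F) typing described in the paper; the field names, the `is_validated` rule, and the example skill are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass

@dataclass
class SkillContract:
    """Hypothetical sketch of a typed Skill Contract (P, O, A, V, F)."""
    name: str
    preconditions: list[str]   # P: what must hold before the skill runs
    operation: str             # O: what the skill does
    artifacts: list[str]       # A: outputs the skill produces
    validators: list[str]      # V: checks that confirm the artifacts
    failure_modes: list[str]   # F: known ways the skill can go wrong

    def is_validated(self) -> bool:
        # A skill that produces artifacts but has no validators is a
        # validation-debt candidate under this toy rule.
        return not self.artifacts or bool(self.validators)

grab = SkillContract(
    name="pick_up_object",
    preconditions=["agent at object location"],
    operation="grasp target object",
    artifacts=["object in inventory"],
    validators=["inventory contains object"],
    failure_modes=["object not reachable"],
)
print(grab.is_validated())  # True: every artifact is covered by a validator
```

Recording failure modes alongside validators is what lets a later maintenance pass reason about a skill without re-running it.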

Core claim

On ALFWorld, SkillOps achieves 79.5 percent task success as a standalone agent, outperforming the strongest baseline by 8.8 percentage points with no additional task-time large language model calls.

Load-bearing premise

That rule-based diagnosis across the four health dimensions (utility, compatibility, risk, validation) can reliably detect and repair library-level defects without task-specific LLM calls or human oversight.
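
As an entirely illustrative rendering of that premise, a rule-based pass over the four dimensions might look like the sketch below; the skill-record fields and the specific rules are assumptions, not the paper's implementation.

```python
# Toy rule-based health check over the four dimensions named in the paper
# (utility, compatibility, risk, validation). No LLM calls are involved.

def diagnose(skill: dict, library: dict[str, dict]) -> dict[str, bool]:
    """Return a per-dimension flag: True means the dimension looks unhealthy."""
    deps = skill.get("deps", [])
    return {
        # utility: a skill never retrieved or executed is dead weight
        "utility": skill.get("uses", 0) == 0,
        # compatibility: a dependency missing from the library breaks composition
        "compatibility": any(d not in library for d in deps),
        # risk: depending on a skill with a failure history propagates risk
        "risk": any(library.get(d, {}).get("failed", False) for d in deps),
        # validation: artifacts without any validator cannot be checked
        "validation": bool(skill.get("artifacts")) and not skill.get("validators"),
    }

lib = {
    "open_drawer": {"uses": 5, "failed": True},
    "take_knife": {"uses": 2, "deps": ["open_drawer"],
                   "artifacts": ["knife in hand"], "validators": []},
}
print(diagnose(lib["take_knife"], lib))
# flags risk (failed dependency) and validation (no validator)
```

The point of the premise is that checks of this shape are cheap enough to run over the whole library on every maintenance pass.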

Figures

Figures reproduced from arXiv: 2605.13716 by Hongji Pu, Liang Zhao, Xinyuan Song.

Figure 1: SkillOps System Architecture. The Hierarchical Skill Ecosystem Graph (HSEG) comprises two levels: (1) an Internal Skill Graph that models each skill as a contract graph over Precondition (P), Operation (O), Artifact (A), Validator (V), and Failure Mode (F) nodes; and (2) an External Graph-of-Graphs connecting skills via typed dependency (dep), compatibility (comp), redundancy (red), and alternative (alt) …
Figure 2: Compact SkillOps algorithms. The Task-Time Loop plans and repairs the current execution, while the Library-Time Loop converts execution traces into persistent skill-library updates. CGPD: ContractGraph-Propagated Diagnosis. Standard health diagnosis evaluates each skill independently; CGPD is an additional advanced component that propagates risk scores along dep edges, enabling preemptive validator ins…
Figure 3: Maintenance cost summary. The library-time maintenance pass uses nearly zero LLM calls at all scales, while task-time token changes are mostly neutral or negative.
Figure 4: Noise-graded library scaling. SkillOps remains stable as the library grows from 200 to 2000 skills, while retrieval-heavy baselines degrade under increasing noise.
Figure 5: Per-task-type SR. Results are reported for the 200-skill library, pooled over 3 seeds.
Figure 6: H1 main results. Task SR on ALFWorld at the 200-skill scale, pooled over 3 seeds. Error bars show Wilson 95% confidence intervals. (Bar values recoverable from the figure: SkillOps-Full 79.5%, NoCGPD 79.0%, NoRetire 73.2%, NoInternalGraph 72.2%, NoLibrary 71.9%, NoMerge 71.9%, NoExternalGraph 64.6%, NoRepair 55.9%, NoValidator 38.0%, NoTask 15.7%, NoAdapter 13.2%.)
Figure 7: H3 ablation visualization. Each bar reports task SR after removing one SkillOps component.
Figure 8: Token scaling with library size. SkillOps emits nearly zero task-time tokens across all evaluated scales, while LLM-based baselines keep nonzero token budgets.
Figure 9: k-sensitivity visualization. Increasing k improves some retrieval baselines, but SkillOps remains clearly stronger across all tested library sizes.
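
Figure 2 credits part of the design to CGPD, which propagates risk scores along dependency edges so validators can be inserted before a downstream failure surfaces. A minimal sketch of such propagation follows; the decay factor, iteration scheme, and example risks are assumptions, not the paper's algorithm.

```python
# Toy risk propagation along dep edges, in the spirit of CGPD: a skill
# inherits (decayed) risk from the skills it depends on.

def propagate_risk(base_risk: dict[str, float],
                   dep_edges: dict[str, list[str]],
                   decay: float = 0.5,
                   rounds: int = 3) -> dict[str, float]:
    """Each round, every skill takes the max of its own risk and the
    decayed risk of its dependencies."""
    risk = dict(base_risk)
    for _ in range(rounds):
        updated = {}
        for skill, deps in dep_edges.items():
            inherited = max((decay * risk.get(d, 0.0) for d in deps), default=0.0)
            updated[skill] = max(risk.get(skill, 0.0), inherited)
        risk.update(updated)
    return risk

# heat_object depends on open_microwave, which has a known failure history
edges = {"heat_object": ["open_microwave"], "open_microwave": []}
risk = propagate_risk({"open_microwave": 0.8}, edges)
print(risk["heat_object"])  # 0.4, inherited before heat_object ever fails
```

Running a few rounds lets risk travel along chains of dependencies, which is what makes preemptive validator insertion possible at all.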
Original abstract

Large language model agents increasingly rely on skill libraries for multi-step tasks, yet these libraries can accumulate persistent defects as skills are added, reused, patched, and linked to changing dependencies. We call this failure mode skill technical debt: library-level defects that may not break a single skill locally but can harm future retrieval, composition, and execution. Existing skill-based agents mainly focus on task-time retrieval, planning, and repair, while library-time maintenance remains underexplored. We propose SkillOps, a method-agnostic plug-in framework for maintaining skill libraries. SkillOps represents each skill as a typed Skill Contract (P, O, A, V, F), organizes skills with a Hierarchical Skill Ecosystem Graph, and diagnoses library health across utility, compatibility, risk, and validation dimensions. Given a raw skill library, SkillOps produces a maintained library that can be used by existing retrieval or planning agents without changing their internal code. On ALFWorld, SkillOps achieves 79.5 percent task success as a standalone agent, outperforming the strongest baseline by 8.8 percentage points with no additional task-time large language model calls. As a plug-in layer, it improves retrieval-heavy baselines by 0.68 to 2.90 percentage points. The current rule-based maintenance implementation uses nearly zero library-time large language model calls or tokens, showing that skill-library maintenance can be added as a low-overhead architectural layer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

No significant circularity; empirical ALFWorld results are external measurements

full rationale

The paper's core claim is an empirical benchmark: SkillOps achieves 79.5% task success on ALFWorld as a standalone agent, outperforming the strongest baseline by 8.8 pp with zero additional task-time LLM calls. This is reported as an external evaluation of the maintained library rather than a quantity computed from internal equations, fitted parameters, or self-referential definitions. The framework description (Skill Contract (P, O, A, V, F), Hierarchical Skill Ecosystem Graph, and four health dimensions) is presented as an architectural layer whose maintenance rules are rule-based and low-overhead, but the success metric is measured against external baselines and does not reduce to those rules by construction. No self-citation chains, ansatzes, or renamings of known results are invoked as load-bearing steps in the provided derivation. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the domain assumption that skill technical debt manifests in measurable utility, compatibility, risk, and validation dimensions that can be diagnosed by rule-based checks without LLM calls.

axioms (1)
  • domain assumption Skill libraries accumulate persistent defects that harm future retrieval and composition even when individual skills appear locally correct.
    Stated as the core motivation for library-time maintenance.
invented entities (2)
  • Skill Contract (P, O, A, V, F) no independent evidence
    purpose: Typed representation of each skill for diagnosis and maintenance
    New data structure introduced to enable library health checks.
  • Hierarchical Skill Ecosystem Graph no independent evidence
    purpose: Organizes skill dependencies for compatibility and risk analysis
    New graph structure for modeling library-level interactions.

pith-pipeline@v0.9.0 · 5559 in / 1380 out tokens · 38572 ms · 2026-05-14T17:48:34.580765+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 25 canonical work pages · 14 internal anchors

  2. [2]

     SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

     Xinyi Li and others. SkillsBench: Benchmarking how well agent skills work across diverse tasks, 2026. URL https://arxiv.org/abs/2602.12670


  24. [29]

    Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales

    Stefan Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922

  25. [30]

    SkillsBench : A benchmark for evaluating LLM agent skills

    BenchFlow AI . SkillsBench : A benchmark for evaluating LLM agent skills. https://github.com/benchflow-ai/skillsbench, 2026. Apache-2.0 License

  26. [31]

    Alessandro Berti, Sebastiaan van Zelst, and Wil M. P. van der Aalst. Process mining for Python (PM4Py): Bridging the gap between process- and data science, 2019. URL https://arxiv.org/abs/1905.06169

  27. [32]

    CUA-Skill: Develop Skills for Computer Using Agent

    Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Suzhen Zheng, Leon Xu, Hao Fan, Pashmina Cameron, Justin Wagle, and Kazuhito Koishida. CUA-Skill: Develop skills for computer using agent, 2026. URL https://arxiv.org/abs/2601.21123

  28. [33]

    The WyCash portfolio management system

    Ward Cunningham. The WyCash portfolio management system. OOPSLA '92 Experience Report, 1992. URL http://c2.com/doc/oopsla92.html. Original coining of the technical debt metaphor

  29. [34]

    LEGOMem : Modular procedural memory for multi-agent LLM systems for workflow automation, 2025

    Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, and Saravan Rajmohan. LEGOMem: Modular procedural memory for multi-agent LLM systems for workflow automation, 2025. URL https://arxiv.org/abs/2510.04851

  30. [35]

    Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

    Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, and Lichao Sun. Graph of Skills : Dependency-aware structural retrieval for massive agent skills, 2026. URL https://arxiv.org/abs/2604.05333

  31. [36]

    Test-Driven Development and LLM-Based Code Generation

    Noble Saji Mathews and Meiyappan Nagappan. Test-Driven Development and LLM-Based Code Generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024. doi:10.1145/3691620.3695527. URL https://arxiv.org/abs/2402.13521

  32. [37]

    Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, and Long T. Le. VeriGuard : Enhancing LLM agent safety via verified code generation, 2025. URL https://arxiv.org/abs/2510.05156

  33. [38]

    GPT-4o System Card

    OpenAI . GPT-4o System Card , 2024. URL https://arxiv.org/abs/2410.21276

  34. [39]

    Gyunam Park and Wil M. P. van der Aalst. Action-oriented process mining: Bridging the gap between insights and actions. Progress in Artificial Intelligence, 2022. doi:10.1007/s13748-022-00281-7

  35. [40]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs . In Advances in Neural Information Processing Systems, volume 37, 2024. doi:10.52202/079017-4020. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html

  36. [41]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM : Facilitating large language models to master 16000+ real-world APIs . In Proceedings of the 12th International Confere...

  37. [42]

    Machine Learning: The High Interest Credit Card of Technical Debt

    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine learning: The high interest credit card of technical debt. In SE4ML : Software Engineering for Machine Learning, NIPS 2014 Workshop , 2014. URL https://research.google/pubs/machine-learning-the-high-interest-credit-card-of-techn...

  38. [43]

    Hidden Technical Debt in Machine Learning Systems

    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems, volume 28, 2015. URL https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-...

  39. [44]

    SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

    Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, and Jian Ma. SKILLFOUNDRY : Building self-evolving agent skill libraries from heterogeneous scientific resources, 2026. URL https://arxiv.org/abs/2604.03964

  40. [45]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT : Solving AI tasks with ChatGPT and its friends in Hugging Face . In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2303.17580

  41. [46]

    ALFRED : A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED : A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. URL https://arxiv.org/abs/1912.01734

  42. [47]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In Proceedings of the 9th International Conference on Learning Representations, 2021. URL https://arxiv.org/abs/2010.03768

  43. [48]

    Wil M. P. van der Aalst. Process Mining: Data Science in Action. Springer Berlin Heidelberg, 2nd edition, 2016. ISBN 978-3-662-49850-7. doi:10.1007/978-3-662-49851-4

  44. [49]

    SkillX: Automatically Constructing Skill Knowledge Bases for Agents

    Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, and Shumin Deng. SkillX : Automatically constructing skill knowledge bases for agents, 2026. URL https://arxiv.org/abs/2604.04804

  45. [50]

    Voyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=ehfRiF0R3a

  46. [51]

    Reinforcement learning for self-improving agent with skill library, 2025

    Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library, 2025. URL https://arxiv.org/abs/2512.17102

  47. [52]

    GraSP: Graph-Structured Skill Compositions for LLM Agents

    Tianle Xia, Lingxiang Hu, Yiding Sun, Ming Xu, Lan Xu, Siying Wang, Wei Xu, and Jie Jiang. GraSP : Graph-structured skill compositions for LLM agents, 2026. URL https://arxiv.org/abs/2604.17870

  48. [53]

    ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models

    Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, and Dongkuan Xu. ReWOO : Decoupling reasoning from observations for efficient augmented language models, 2023. URL https://arxiv.org/abs/2305.18323

  49. [54]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. A-MEM : Agentic memory for LLM agents, 2025. URL https://arxiv.org/abs/2502.12110

  50. [55]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct : Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2210.03629

  51. [56]

    SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

    Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. SkillWeaver : Web agents can self-improve by discovering and honing skills, 2025. URL https://arxiv.org/abs/2504.07079

  52. [57]

    Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao, Leonardo F. R. Ribeiro, Markus Dreyer, Sean Ammirati, and Chenyan Xiong. SkillLearnBench : Benchmarking continual learning methods for agent skill generation on real-world tasks, 2026. URL https://arxiv.org/abs/2604.20087