SkillOS: Learning Skill Curation for Self-Evolving Agents
Pith reviewed 2026-05-08 09:43 UTC · model grok-4.3
The pith
SkillOS uses RL to train a skill curator that lets LLM agents accumulate and reuse skills across related tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillOS trains a skill curator with reinforcement learning so that an agent can update an external SkillRepo from experience. Earlier trajectories in a dependency-grouped task stream modify the repository; later related tasks measure whether those changes helped. Composite rewards guide the curator toward policies that improve long-term performance. The result is higher success rates, lower token use, and skills that grow into structured Markdown files encoding meta-skills.
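The curate-then-evaluate loop in this claim can be sketched in a few lines. Everything below (the toy executor, the naive curator policy, the reward weighting) is a hypothetical stand-in for illustration, not the paper's actual interfaces:

```python
# Minimal, self-contained sketch of the curate-then-evaluate loop
# described above. All names and data structures are hypothetical
# stand-ins; the paper's actual interfaces are not reproduced here.

def train_on_group(group, repo, execute, curate, token_budget=1000):
    """Earlier tasks in a dependency group update the SkillRepo;
    the last related task measures whether those updates helped."""
    *early, final = group
    trajectories = [(task, execute(task, repo)) for task in early]  # frozen executor
    repo = curate(repo, trajectories)        # trainable curator rewrites the repo
    success, tokens = execute(final, repo)   # delayed, indirect learning signal
    reward = float(success) + max(0.0, 1.0 - tokens / token_budget)
    return repo, reward

# Toy instantiation: the executor succeeds iff the needed skill is in the repo.
def toy_execute(task, repo):
    ok = task["skill"] in repo
    return ok, (200 if ok else 800)

def toy_curate(repo, trajectories):
    # Naive curator policy: add skills the agent lacked on failed attempts.
    return repo | {task["skill"] for task, (ok, _) in trajectories if not ok}
```

With `group = [{"skill": "regex"}] * 2` and an empty repo, the first attempt fails, the curator stores the missing skill, and the later related task succeeds cheaply, yielding a reward of 1.8.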
What carries the argument
A trainable RL skill curator that updates an external SkillRepo using composite rewards on skill-relevant task streams, while the executor remains frozen.
If this is right
- Agents improve both success rate and efficiency on streaming tasks as the SkillRepo grows.
- The learned curator policy transfers to different executor models and task domains.
- Skills evolve from simple notes into higher-level structured Markdown files over successive updates.
- Targeted skill retrieval replaces generic memory retrieval, reducing irrelevant context.
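The last bullet can be made concrete with a toy retriever: score each stored skill against the task and return only high-relevance hits, instead of injecting a generic memory dump. The scoring function, the snake_case naming assumption, and the thresholds are illustrative choices, not the paper's retriever:

```python
# Hypothetical sketch of targeted skill retrieval: keep only skills
# whose relevance to the task clears a threshold, rather than
# returning all stored memory. Not the paper's actual retriever.

def relevance(task_words: set, skill_words: set) -> float:
    """Jaccard overlap between task and skill keyword sets."""
    if not task_words or not skill_words:
        return 0.0
    return len(task_words & skill_words) / len(task_words | skill_words)

def retrieve_skills(task: str, repo: dict, k: int = 2, min_rel: float = 0.2):
    """Return at most k skills with relevance >= min_rel.
    Assumes skill names are snake_case keyword bundles."""
    task_words = set(task.lower().split())
    scored = [(relevance(task_words, set(name.lower().split("_"))), name)
              for name in repo]
    scored.sort(reverse=True)
    return [name for rel, name in scored[:k] if rel >= min_rel]
```

For a repo with skills named `parse_csv`, `regex_match`, and `web_login`, the task "parse the csv file" retrieves only `parse_csv`; an unrelated task retrieves nothing, keeping irrelevant context out of the prompt.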
Where Pith is reading between the lines
- The same separation of curator and executor could be applied to other memory or tool-use systems that suffer from delayed credit assignment.
- If task streams with natural dependencies are unavailable, synthetic grouping methods might still provide usable training signals.
- Over longer horizons the curator might discover when to delete or merge skills, not only add them.
Load-bearing premise
Composite rewards and grouped task streams supply enough learning signal for the curator to learn effective long-term policies from delayed and indirect feedback.
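One plausible shape for such a composite reward mixes delayed downstream success with token efficiency and a penalty for repository bloat. The terms and weights below are illustrative guesses, since the paper's exact formulation is not given in this review:

```python
# Illustrative composite reward; every weight and threshold here is an
# assumption for the sketch, not a value taken from the paper.

def composite_reward(downstream_success: float,
                     tokens_used: int,
                     repo_size: int,
                     token_budget: int = 2000,
                     max_repo: int = 100,
                     w_succ: float = 1.0,
                     w_eff: float = 0.3,
                     w_bloat: float = 0.1) -> float:
    """Reward = success on later related tasks, plus an efficiency
    bonus, minus a penalty once the SkillRepo exceeds a size cap."""
    efficiency = max(0.0, 1.0 - tokens_used / token_budget)
    bloat = max(0.0, repo_size / max_repo - 1.0)
    return w_succ * downstream_success + w_eff * efficiency - w_bloat * bloat
```

A curator that adds useful skills raises the success term on later tasks; one that hoards skills eventually pays through the bloat term, which is the kind of trade-off a long-term policy must learn from delayed feedback.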
What would settle it
An ablation that removes either the grouped task streams or the composite rewards and still matches or exceeds SkillOS performance on the same multi-turn and reasoning benchmarks.
Original abstract
LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex long-term curation policies from indirect and delayed feedback. To tackle this challenge, we propose SkillOS, an experience-driven RL training recipe for learning skill curation in self-evolving agents. SkillOS pairs a frozen agent executor that retrieves and applies skills with a trainable skill curator that updates an external SkillRepo from accumulated experience. To provide learning signals for curation, we design composite rewards and train on grouped task streams based on skill-relevant task dependencies, where earlier trajectories update the SkillRepo, and later related tasks evaluate these updates. Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free and strong memory-based baselines in both effectiveness and efficiency, with the learned skill curator generalizing across different executor backbones and task domains. Further analyses show that the learned curator produces more targeted skill use, while the skills in SkillRepo evolve into more richly structured Markdown files that encode higher-level meta-skills over time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SkillOS, an experience-driven RL training recipe for learning long-term skill curation policies in self-evolving LLM-based agents. It pairs a frozen executor (that retrieves and applies skills from an external SkillRepo) with a trainable curator; composite rewards and grouped task streams (earlier trajectories update the repo, later related tasks provide evaluation signals) supply learning signals from indirect/delayed feedback. The central empirical claim is that the resulting curator outperforms memory-free and strong memory-based baselines in effectiveness and efficiency on both multi-turn agentic tasks and single-turn reasoning tasks, generalizes across executor backbones and task domains, and produces more targeted skill use with increasingly structured meta-skills in the repo.
Significance. If the results hold under rigorous evaluation, the work would be significant for autonomous agent research: it offers a concrete mechanism to move agents beyond one-off problem solving toward cumulative, experience-driven self-evolution via reusable skills. The RL framing for curation policies and the use of grouped streams to handle delayed feedback are technically interesting directions that could influence future designs of lifelong-learning agents.
Major comments (2)
- [Method / Training Procedure] Training recipe (grouped task streams): the description of how 'grouped task streams based on skill-relevant task dependencies' are constructed must explicitly state whether dependency identification is performed online from experience alone or requires pre-known/manual/oracle grouping. If the latter, the composite-reward signal is not purely experience-driven; this directly affects the central claim that SkillOS learns effective long-term curation policies from indirect feedback and undermines generalization to arbitrary, ungrouped streaming tasks.
- [Experiments / Results] Experimental evaluation: the claims of 'consistent outperformance' and 'generalization across different executor backbones and task domains' require detailed quantitative support (specific metrics, error bars, statistical tests, baseline implementations, and ablation on composite rewards vs. grouping). Without these, it is impossible to assess whether the reported gains are load-bearing or sensitive to the grouping construction.
Minor comments (2)
- [Abstract] Abstract would be strengthened by including one or two key quantitative results (e.g., average success-rate improvement or efficiency gain) rather than purely qualitative statements.
- [Method] Notation for SkillRepo, curator, and executor should be introduced once with consistent symbols or acronyms in the method section to improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for recognizing the potential significance of SkillOS for autonomous agent research. We address each major comment below and will revise the manuscript to provide the requested clarifications and additional quantitative details.
Point-by-point responses
Referee: [Method / Training Procedure] Training recipe (grouped task streams): the description of how 'grouped task streams based on skill-relevant task dependencies' are constructed must explicitly state whether dependency identification is performed online from experience alone or requires pre-known/manual/oracle grouping. If the latter, the composite-reward signal is not purely experience-driven; this directly affects the central claim that SkillOS learns effective long-term curation policies from indirect feedback and undermines generalization to arbitrary, ungrouped streaming tasks.
Authors: We agree that explicit clarification is needed to support the experience-driven claim. In SkillOS the grouped task streams are constructed online from experience alone: dependency identification relies on patterns observed in accumulated trajectories, skill-retrieval histories, and task-similarity signals derived directly from the streaming data, without any pre-known, manual, or oracle grouping. We will revise the method section (including the description of grouped task streams and the composite reward design) to state this explicitly and to include algorithmic details or pseudocode for the online identification process.
Revision: yes.
Referee: [Experiments / Results] Experimental evaluation: the claims of 'consistent outperformance' and 'generalization across different executor backbones and task domains' require detailed quantitative support (specific metrics, error bars, statistical tests, baseline implementations, and ablation on composite rewards vs. grouping). Without these, it is impossible to assess whether the reported gains are load-bearing or sensitive to the grouping construction.
Authors: We acknowledge that the current presentation would benefit from more rigorous quantitative reporting. The manuscript already reports comparative results on multi-turn agentic and single-turn reasoning tasks, but we will expand the experimental section to include: concrete metric values with error bars from multiple random seeds, statistical significance tests, full implementation details for all baselines, and a dedicated ablation isolating the contribution of the composite reward versus the grouping mechanism. These additions will appear in the main results and appendix to allow direct assessment of robustness.
Revision: yes.
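As a sketch of the kind of reporting promised here, a mean with a 95% confidence interval over a handful of seeds can be computed with the standard library alone; the per-seed numbers below are hypothetical:

```python
# Illustrative sketch of the promised reporting: mean success rate with
# a 95% t-interval across random seeds, stdlib only. The per-seed
# scores are made-up placeholders, not results from the paper.
import math
import statistics

def mean_with_ci(values, t_crit=2.776):
    """Return (mean, half-width). t_crit = 2.776 is the two-sided 95%
    critical value for 4 degrees of freedom (n = 5 seeds)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return m, t_crit * s / math.sqrt(len(values))

per_seed_success = [0.62, 0.58, 0.65, 0.61, 0.60]  # hypothetical numbers
m, h = mean_with_ci(per_seed_success)
print(f"success rate: {m:.3f} +/- {h:.3f} (95% CI, n=5)")
```

Reporting this per benchmark, alongside the same quantity for each baseline, is the minimum needed to judge whether the claimed gains are robust rather than seed noise.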
Circularity Check
No significant circularity in derivation chain.
Full rationale
The paper presents SkillOS as an empirical RL training procedure for skill curation in agents, using composite rewards and grouped task streams to supply learning signals from delayed feedback. Central claims rest on experimental outperformance and generalization across backbones/domains rather than any closed-form derivation, prediction, or first-principles result. No equations appear that equate outputs to inputs by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked to force the architecture. The training signal derives from external task evaluations on later streams, rendering results independently falsifiable rather than tautological.
Task-pair grouping conditions
For grouped task streams, a candidate pair of tasks (x_s, x_t), each represented by per-dimension feature sets {C, S, R, P, T} (where C are concepts and S are skills) and a scalar d used for curriculum ordering, must satisfy:
1. Shared foundation: m_τ(C_s, C_t) ≥ κ_C and m_τ(S_s, S_t) ≥ κ_S.
2. Shared reasoning: m_τ(R_s, R_t) + m_τ(P_s, P_t) ≥ 1.
3. Not a near-duplicate: SJ_τ(T_s, T_t) ≤ θ_T and the weighted overall similarity Ω(x_s, x_t) ≤ σ_max.
4. Not too unrelated: Ω(x_s, x_t) ≥ σ_min.
5. Progression: x_t introduces at least one new concept or skill, i.e. |C_t| > m_τ(C_s, C_t) or |S_t| > m_τ(S_s, S_t).
6. Curriculum direction: d_t − d_s ≥ δ_min.
Here Ω is a convex combination of per-dimension soft-Jaccard scores across {C, S, R, P, T} with weights listed in Table 5. Conditions (1)–(2) ensure genuine reuse of foundational knowledge and reasoning machinery; (3)–(4) place the pair in a useful "related but not redundant" band; (5) guarantees that x_t carries something new.
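The grouping conditions above can be expressed as a single admissibility predicate. In this sketch, plain set overlap stands in for the soft-Jaccard match m_τ, and every threshold and weight is an illustrative placeholder rather than a value from the paper:

```python
# Sketch of the pair-admissibility check implied by conditions (1)-(6).
# Plain set overlap stands in for the paper's soft-Jaccard match m_tau;
# all thresholds and weights below are illustrative placeholders.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def match(a: set, b: set) -> int:
    """Stand-in for m_tau: number of shared items."""
    return len(a & b)

def admissible(xs, xt, kC=1, kS=1, theta_T=0.8,
               sigma_min=0.2, sigma_max=0.9, delta_min=0):
    """xs, xt: dicts with per-dimension sets under keys C, S, R, P, T
    and a scalar d used for curriculum ordering."""
    # Overall similarity: convex combination of per-dimension overlaps
    # (placeholder weights; the paper lists its weights in Table 5).
    w = {"C": 0.3, "S": 0.3, "R": 0.2, "P": 0.1, "T": 0.1}
    omega = sum(w[k] * jaccard(xs[k], xt[k]) for k in w)
    return (
        match(xs["C"], xt["C"]) >= kC
        and match(xs["S"], xt["S"]) >= kS                               # (1) shared foundation
        and match(xs["R"], xt["R"]) + match(xs["P"], xt["P"]) >= 1      # (2) shared reasoning
        and jaccard(xs["T"], xt["T"]) <= theta_T and omega <= sigma_max # (3) not a near-duplicate
        and omega >= sigma_min                                          # (4) not too unrelated
        and (len(xt["C"]) > match(xs["C"], xt["C"])
             or len(xt["S"]) > match(xs["S"], xt["S"]))                 # (5) progression
        and xt["d"] - xs["d"] >= delta_min                              # (6) curriculum direction
    )
```

A pair that shares a concept and a skill but adds something new passes; a task paired with itself fails the near-duplicate and progression checks, which is exactly the "related but not redundant" band the conditions aim for.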