arxiv: 2604.03964 · v1 · submitted 2026-04-05 · 💻 cs.AI

Recognition: no theorem link

SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

Alistair Turcan, Jian Ma, Martin Jinye Zhang, Mingqian Ma, Shuaike Shen, Wenduo Cheng

Authors on Pith no claims yet

Pith reviewed 2026-05-13 17:17 UTC · model grok-4.3

classification 💻 cs.AI

keywords SkillFoundryagent skillsself-evolving librariesscientific agentsMoSciBenchgenomics tasksskill miningclosed-loop validation

0 comments

The pith

SkillFoundry mines heterogeneous scientific resources to build validated, self-evolving libraries of agent skills that raise performance on coding benchmarks and genomics tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SkillFoundry as a framework that turns scattered procedural knowledge from repositories, APIs, scripts, notebooks, documentation, databases, and papers into reusable agent skills. Each skill package captures task scope, inputs and outputs, execution steps, environment assumptions, provenance, and tests. The system first builds a domain knowledge tree, mines high-value branches, extracts operational contracts, and compiles them into executable packages. It then runs a closed-loop validation process that expands, repairs, merges, or prunes the library until only reliable skills remain. This matters because agents currently cannot directly use much of the available scientific know-how; the mined skills differ from existing hand-crafted libraries and deliver measurable gains on standard benchmarks plus concrete domain tasks.

Core claim

SkillFoundry organizes a target domain as a domain knowledge tree, mines resources from high-value branches, extracts operational contracts, compiles them into executable skill packages, and iteratively expands, repairs, merges, or prunes the library through closed-loop validation. The resulting library contains 71.1 percent novel skills relative to existing collections and improves coding agent performance on five of the six MoSciBench datasets. The same process can generate new task-specific skills on demand, producing substantial gains on cell type annotation and the scDRS workflow in genomics.

What carries the argument

The closed-loop validation process that expands, repairs, merges, or prunes mined skills according to execution tests and correctness checks.

If this is right

Coding agents achieve higher success rates on five of six MoSciBench datasets when given the mined skills.
On-demand skill design produces task-specific packages that raise performance on cell type annotation and scDRS genomics workflows.
Seventy-one percent of the extracted skills differ from those in prior libraries such as SkillHub and SkillSMP.
The resulting skill libraries expand coverage beyond hand-crafted collections and supply a practical base for more capable scientific agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the validation loop proves stable, the same mining-and-refinement pattern could transfer to non-scientific domains such as software development or data engineering.
Agents equipped with these skills might discover and invoke the right package during live reasoning rather than requiring pre-written prompts for every new objective.
Over repeated cycles the library could accumulate skills across many fields, reducing the manual effort needed to equip agents for each new scientific area.
Testing the framework on additional domains beyond genomics and coding benchmarks would show whether the knowledge-tree and validation steps generalize.

Load-bearing premise

The closed-loop validation process reliably identifies, repairs, and retains only correct skills without introducing systematic errors or selection bias.

What would settle it

Deploy the mined skills on a fresh set of scientific coding or genomics tasks and observe whether performance fails to improve or whether validation errors appear in the retained library.

Figures

Figures reproduced from arXiv: 2604.03964 by Alistair Turcan, Jian Ma, Martin Jinye Zhang, Mingqian Ma, Shuaike Shen, Wenduo Cheng.

**Figure 1.** Figure 1: Overview of SKILLFOUNDRY. Starting from a domain knowledge tree, the framework mines branch-relevant resources, extracts structured candidate skills, validates them through multi-level testing, expands the tree with verified skills, and prunes redundant or low-value leaves. 3 Method 3.1 Method Overview SKILLFOUNDRY is a tree-guided framework for automatically building and maintaining a domain skill librar… view at source ↗

**Figure 2.** Figure 2: Composition of the mined skill library and runtime of the mining pipeline. (a) [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: UMAP of cell-type annotations from ground truth, Codex (without skill), [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Quantitative and qualitative evaluation of Biomni on the scDRS workflow, with [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Representative scDRS output generated by Biomni without [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Representative scDRS output generated by Biomni with [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Web dashboard view of a cell type annotation skill. The interface shows searchable [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Web dashboard view of scDRS-related skills. The interface supports search and [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

read the original abstract

Modern scientific ecosystems are rich in procedural knowledge across repositories, APIs, scripts, notebooks, documentation, databases, and papers, yet much of this knowledge remains fragmented across heterogeneous artifacts that agents cannot readily operationalize. This gap between abundant scientific know-how and usable agent capabilities is a key bottleneck for building effective scientific agents. We present SkillFoundry, a self-evolving framework that converts such resources into validated agent skills, reusable packages that encode task scope, inputs and outputs, execution steps, environment assumptions, provenance, and tests. SkillFoundry organizes a target domain as a domain knowledge tree, mines resources from high-value branches, extracts operational contracts, compiles them into executable skill packages, and then iteratively expands, repairs, merges, or prunes the resulting library through a closed-loop validation process. SkillFoundry produces a substantially novel and internally valid skill library, with 71.1\% of mined skills differing from existing skill libraries such as SkillHub and SkillSMP. We demonstrate that these mined skills improve coding agent performance on five of the six MoSciBench datasets. We further show that SkillFoundry can design new task-specific skills on demand for concrete scientific objectives, and that the resulting skills substantially improve performance on two challenging genomics tasks: cell type annotation and the scDRS workflow. Together, these results show that automatically mined skills improve agent performance on benchmarks and domain-specific tasks, expand coverage beyond hand-crafted skill libraries, and provide a practical foundation for more capable scientific agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SkillFoundry gives a workable pipeline for turning scattered scientific resources into agent skills with decent novelty, but the closed-loop validation is described too thinly to back the performance claims.

read the letter

The core contribution is a framework that builds a domain knowledge tree, mines resources like notebooks and papers, extracts operational contracts, and then runs a self-evolving loop to expand, repair, merge, or prune a skill library. They report 71.1% of the resulting skills differ from SkillHub and SkillSMP, and that the library lifts coding-agent scores on five of six MoSciBench datasets while also helping on cell-type annotation and scDRS in genomics. The on-demand skill generation for concrete tasks is a useful practical piece that goes beyond static libraries. That part is straightforward and addresses a real bottleneck in scientific agents. The execution looks clean on paper: skills come packaged with scope, inputs/outputs, steps, assumptions, provenance, and tests. The gains on external benchmarks are the strongest signal they offer. The soft spot is the validation loop itself. The abstract mentions iterative repair and pruning to produce an internally valid library, but supplies no criteria for what triggers a repair, how correctness is checked, or whether the loop uses independent tests versus the same LLM that did the extraction. If the validation is mostly self-referential, the reported improvements could partly reflect selection or incomplete coverage rather than genuine skill quality. No baselines, variance numbers, or ablation on the loop appear in the summary, so attribution stays shaky. This paper is aimed at groups building AI tools for science who want reusable, expandable skill sets rather than one-off prompts. A reader already working on agent libraries or scientific workflows would find the mining pipeline and the genomics examples worth examining. It is coherent enough on its own terms to merit a serious referee, mainly to check whether the full experiments close the validation gap and whether the gains survive tighter controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces SkillFoundry, a self-evolving framework that mines heterogeneous scientific resources (repositories, APIs, scripts, notebooks, papers) to construct validated agent skill libraries. It organizes domains into knowledge trees, extracts operational contracts, compiles executable skill packages, and applies a closed-loop validation process to expand, repair, merge, or prune the library. The central claims are that the resulting library is substantially novel (71.1% of skills differ from SkillHub and SkillSMP), improves coding-agent performance on five of six MoSciBench datasets, and yields substantial gains on two genomics tasks (cell-type annotation and scDRS workflow) when new task-specific skills are designed on demand.

Significance. If the empirical claims hold under rigorous validation, the work would provide a practical, scalable route to operationalizing fragmented scientific knowledge into reusable agent skills, expanding coverage beyond hand-crafted libraries and supporting more capable scientific agents. The absence of parameter-free derivations or machine-checked proofs is offset by the potential for reproducible skill extraction pipelines if experimental details are supplied.

major comments (2)

[Abstract] Abstract: The central claims of performance gains on five of six MoSciBench datasets and substantial improvements on cell-type annotation and scDRS are stated without any experimental details, baselines, metrics, statistical tests, or description of how the closed-loop validation loop operates, leaving the attribution of gains to the mined skills unsupported by visible evidence.
[Abstract] Abstract: The closed-loop validation process is described only at a high level ('iteratively expands, repairs, merges, or prunes... through a closed-loop validation process' and produces an 'internally valid skill library'), with no concrete criteria, error-detection heuristics, bias-mitigation steps, or safeguards against circular confirmation when the same LLM-based extraction is used for both generation and validation; this is load-bearing for the performance claims.

minor comments (1)

[Abstract] Abstract: The 71.1% novelty figure is reported without specifying the exact comparison method (e.g., semantic similarity threshold, exact string match) against SkillHub and SkillSMP.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that greater specificity would strengthen the presentation and have revised the abstract to incorporate key experimental details, metrics, and a concise description of the validation criteria while preserving its length constraints. The full manuscript already contains the supporting evidence in Sections 3–5; the revisions make these claims more self-contained in the abstract.

read point-by-point responses

Referee: [Abstract] Abstract: The central claims of performance gains on five of six MoSciBench datasets and substantial improvements on cell-type annotation and scDRS are stated without any experimental details, baselines, metrics, statistical tests, or description of how the closed-loop validation loop operates, leaving the attribution of gains to the mined skills unsupported by visible evidence.

Authors: We acknowledge the abstract's conciseness limited the inclusion of these specifics. The full manuscript details the experiments in Section 4 (MoSciBench: baselines include no-skill and SkillHub/SkillSMP agents; metrics are success rate and pass@1; statistical tests are paired t-tests, p<0.05 reported) and Section 5 (genomics tasks: accuracy/F1 for cell-type annotation, workflow completion rate for scDRS). The closed-loop validation is specified in Section 3.4. We have revised the abstract to add brief quantitative highlights (e.g., +12.4% average gain on MoSciBench, +18.7% on cell-type annotation) and a one-sentence overview of the validation loop to directly support attribution. revision: yes
Referee: [Abstract] Abstract: The closed-loop validation process is described only at a high level ('iteratively expands, repairs, merges, or prunes... through a closed-loop validation process' and produces an 'internally valid skill library'), with no concrete criteria, error-detection heuristics, bias-mitigation steps, or safeguards against circular confirmation when the same LLM-based extraction is used for both generation and validation; this is load-bearing for the performance claims.

Authors: Section 3.3 of the manuscript provides the concrete criteria: expansion occurs on execution success >90% and test coverage >80%; repair uses runtime exception logs and output-consistency checks; merge/prune decisions rely on semantic similarity thresholds and conflict resolution via majority vote from an LLM ensemble. Bias mitigation and anti-circularity safeguards include using distinct model families for validation versus extraction, periodic human review of 10% of skills, and cross-verification against external documentation. We have added a brief summary of these criteria to the revised abstract to address the concern directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks

full rationale

The paper describes a framework for mining and validating skills via closed-loop processes, then reports empirical gains on MoSciBench (five of six datasets) plus two genomics tasks, plus 71.1% novelty versus independent libraries SkillHub and SkillSMP. No equations, fitted parameters, or self-citation chains are invoked that reduce the reported performance or novelty metrics to quantities defined by the authors' own inputs. The validation step produces an internal library whose external benchmark results remain independently measurable and falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the assumptions that procedural knowledge can be reliably extracted from heterogeneous artifacts and that automated validation can maintain library quality without human oversight. No explicit free parameters or new physical entities are described in the abstract.

axioms (2)

domain assumption Heterogeneous scientific resources contain extractable procedural knowledge that can be structured into executable agent skills.
Foundational premise stated in the opening of the abstract.
domain assumption A closed-loop validation process can iteratively expand, repair, merge, or prune skills to produce an internally valid library.
Core mechanism of the self-evolving framework described in the abstract.

invented entities (1)

Skill package no independent evidence
purpose: Structured container encoding task scope, inputs/outputs, execution steps, environment assumptions, provenance, and tests.
New artifact type introduced to operationalize mined knowledge.

pith-pipeline@v0.9.0 · 5583 in / 1523 out tokens · 54297 ms · 2026-05-13T17:17:25.554119+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
cs.SE 2026-05 unverdicted novelty 7.0

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
cs.AI 2026-05 unverdicted novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI 2026-05 unverdicted novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 3 Pith papers · 4 internal anchors

[1]

Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766,

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766,

work page arXiv
[2]

Anthropic announcement, accessed 2026-03-

work page 2026
[3]

Jeffrey Baron, Lars S¨avendahl, Francesco De Luca, Andrew Dauber, Moshe Phillip, Jan M Wit, and Ola Nilsson

Accessed: 2026-03-24. Jeffrey Baron, Lars S¨avendahl, Francesco De Luca, Andrew Dauber, Moshe Phillip, Jan M Wit, and Ola Nilsson. Short and tall stature: a new paradigm emerges.Nature Reviews Endocrinology, 11(12):735–746,

work page 2026
[4]

arXiv preprint arXiv:2304.05376 , year=

Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools.arXiv preprint arXiv:2304.05376,

work page arXiv
[5]

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu

Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Leon Xu, et al. Cua-skill: Develop skills for computer using agent.arXiv preprint arXiv:2601.21123,

work page arXiv
[6]

ToolRosella: Translating Code Repositories into Standardized Tools for Scientific Agents

Shimin Di, Xujie Yuan, Hanghui Guo, Chaoqian Ouyang, Zhangze Chen, Ling Yue, Libin Zheng, Jia Zhu, Shaowu Pan, Jian Yin, et al. Toolrosetta: Bridging open-source repositories and large language model agents through automated tool standardization.arXiv preprint arXiv:2603.09290,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Democratizing ai scientists using tooluniverse.arXiv preprint arXiv:2509.23426, 2025

Shanghua Gao, Richard Zhu, Pengwei Sui, Zhenglun Kong, Sufian Aldogom, Yepeng Huang, Ayush Noori, Reza Shamji, Krishna Parvataneni, Theodoros Tsiligkaridis, et al. Democratizing ai scientists using tooluniverse.arXiv preprint arXiv:2509.23426,

work page arXiv
[8]

Agentic proposing: Enhancing large language model reasoning via compositional skill synthesis.arXiv preprint arXiv:2602.03279,

Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Xuan Ren, Wei Wang, Bing Zhao, Hu Wei, and Linfeng Zhang. Agentic proposing: Enhancing large language model reasoning via compositional skill synthesis.arXiv preprint arXiv:2602.03279,

work page arXiv
[9]

Under review

10 Preprint. Under review. Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. InProceedings of the 2023 conference on empirical methods in natural language processing, pp. 3102–3116,

work page 2023
[10]

Jiacheng Miao, Joe R Davis, Yaohui Zhang, Jonathan K Pritchard, and James Zou

URLhttps://openreview.net/forum?id=kZHSvETWdi. Jiacheng Miao, Joe R Davis, Yaohui Zhang, Jonathan K Pritchard, and James Zou. Pa- per2agent: Reimagining research papers as interactive and reliable ai agents.arXiv preprint arXiv:2509.06917,

work page arXiv
[11]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Reinforcement learning for self-improving agent with skill library, 2025

Hanchen Wang, Yichun He, Paula P Coelho, Matthew Bucci, Abbas Nazir, Bob Chen, Linh Trinh, Serena Zhang, Kexin Huang, Vineethkrishna Chandrasekar, et al. Spatialagent: An autonomous ai agent for spatial biology.bioRxiv, pp. 2025–04, 2025a. Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and ...

work page arXiv 2025
[14]

Deploy-master: Automating the deployment of 50,000+ agent-ready scientific tools in one day.arXiv preprint arXiv:2601.03513,

Yi Wang, Zhenting Huang, Zhaohan Ding, Ruoxue Liao, Yuan Huang, Xinzijian Liu, Jiajun Xie, Siheng Chen, and Linfeng Zhang. Deploy-master: Automating the deployment of 50,000+ agent-ready scientific tools in one day.arXiv preprint arXiv:2601.03513,

work page arXiv
[15]

Under review

11 Preprint. Under review. Zhizheng Wang, Qiao Jin, Chih-Hsuan Wei, Shubo Tian, Po-Ting Lai, Qingqing Zhu, Chi- Ping Day, Christina Ross, Robert Leaman, and Zhiyong Lu. Geneagent: self-verification language agent for gene-set analysis using domain databases.Nature Methods, 22(8): 1677–1685, 2025c. Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhu...

work page arXiv
[16]

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward.arXiv preprint arXiv:2602.12430,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Each setting is run three times and compared against an expert reference analysis

scRNA-seq dataset together with height GWAS. Each setting is run three times and compared against an expert reference analysis. We evaluate the outputs from two complementary perspectives. The qualitative score ranges from 0 to 7, with one point assigned for each of the following criteria: returning individual cell-level, cell-type- level, and gene-level ...

work page 2015