AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

Jian Ma; Mingqian Ma; Shike Wang; Shuaike Shen; Wenduo Cheng

arxiv: 2605.20425 · v1 · pith:2CFVY7URnew · submitted 2026-05-19 · 💻 cs.AI

Pith reviewed 2026-05-21 07:01 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent workflows · retrieval-based synthesis · agent interoperability · workflow composition · genomics · local repair · scientific agents · typed handoffs

The pith

AgentCo-op assembles independent agents and tools into genomics workflows through typed handoffs and local repair.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that retrieval can synthesize executable multi-agent workflows from existing scientific agents and tool repositories in open domains that lack standard interfaces or evaluation metrics. It connects components via typed artifact handoffs so that data flows correctly between independently built pieces, then applies bounded local repair only to the parts that fail during a run. A reader would care because this sidesteps the usual need to redesign agents or optimize an entire graph topology for each new task. The genomics demonstrations show the method coordinating agents on spatial transcriptomics and single-cell multiome analysis while keeping the resulting workflows auditable. The same framework also improves benchmark performance and lowers per-task cost relative to other multi-agent setups.

Core claim

AgentCo-op is a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs and then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies it assembles independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. One workflow coordinates agents for spatial transcriptomics and gene-set interpretation; the other builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. The method can also …

What carries the argument

Typed artifact handoffs, which specify the data types exchanged between agents to guarantee interoperability, together with bounded self-guided local repair that targets only failing components instead of rewriting the whole workflow.
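
To make the idea of a typed handoff concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Artifact` and `Node` classes, the type labels such as `"count_matrix"` and `"marker_table"`, and the helper names are all hypothetical, chosen only for illustration. The sketch shows the general pattern of checking that what an upstream node emits matches what a downstream node declares as input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A typed piece of data passed between workflow nodes (hypothetical)."""
    name: str
    type_label: str  # illustrative label, e.g. "count_matrix"
    path: str


@dataclass(frozen=True)
class Node:
    """A workflow component that declares the artifact types it consumes and emits."""
    name: str
    consumes: frozenset
    emits: frozenset


def check_handoff(upstream: Node, downstream: Node) -> set:
    """Return the types the downstream node needs that upstream does not emit."""
    return set(downstream.consumes) - set(upstream.emits)


def validate_artifact(artifact: Artifact, downstream: Node) -> bool:
    """Accept an artifact only if the downstream node declares its type as an input."""
    return artifact.type_label in downstream.consumes


# Two toy nodes; names and types are assumptions, not taken from the paper.
marker_finder = Node("marker_finder", frozenset({"count_matrix"}), frozenset({"marker_table"}))
interpreter = Node("gene_set_interpreter", frozenset({"marker_table"}), frozenset({"report"}))

missing = check_handoff(marker_finder, interpreter)
ok = validate_artifact(Artifact("markers", "marker_table", "out/markers.csv"), interpreter)
bad = validate_artifact(Artifact("raw", "count_matrix", "in/counts.h5"), interpreter)
print(missing, ok, bad)
```

An empty `missing` set means the edge can be wired as is; a non-empty set is the kind of evidence that could point a repair step at one specific edge rather than at the whole workflow.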

If this is right

Independently developed scientific agents can be reused across tasks without redesign or interface changes.
Workflows remain auditable because each step uses known, retrieved components rather than opaque generated code.
A previously searched workflow can serve as a structural prior that retrieval then grounds with concrete components and repairs.
Per-task cost drops compared with standard multi-agent baselines while matching or exceeding accuracy on coding, math, and QA benchmarks.
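
The third point, grounding a searched workflow with retrieved components, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `PriorNode` and `Candidate` classes, the tool names, and in particular the scoring rule (word overlap plus a bonus for matching input and output types) are hypothetical stand-ins, not the paper's matching procedure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A retrievable skill or tool entry (hypothetical schema)."""
    name: str
    description: str
    input_type: str
    output_type: str


@dataclass
class PriorNode:
    """An abstract node from a searched workflow used as a structural prior."""
    role: str
    input_type: str
    output_type: str


def score(node: PriorNode, cand: Candidate) -> float:
    """Toy score: fraction of role words found in the description, plus type matches."""
    role_words = set(node.role.lower().split())
    cand_words = set(cand.description.lower().split())
    overlap = len(role_words & cand_words) / max(len(role_words), 1)
    type_match = int(cand.input_type == node.input_type) + int(cand.output_type == node.output_type)
    return overlap + type_match


def ground(node: PriorNode, library: List[Candidate]) -> Optional[Candidate]:
    """Pick the top-scoring candidate for a node, or None if the library is empty."""
    return max(library, key=lambda c: score(node, c), default=None)


library = [
    Candidate("rna_marker_tool", "find marker genes per cluster", "rna_counts", "marker_table"),
    Candidate("atac_peak_tool", "find differential peaks per cluster", "atac_fragments", "peak_table"),
]
node = PriorNode("find marker genes", "rna_counts", "marker_table")
chosen = ground(node, library)
print(chosen.name)
```

The graph shape comes from the prior and only the node contents are filled in by retrieval, which is why no topology search is needed in this sketch.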

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-plus-local-repair pattern could apply to other heterogeneous scientific fields that already have scattered agent repositories.
Keeping workflows built from explicit, typed components may make AI-assisted scientific pipelines easier to reproduce and audit by human experts.
Hybrid systems that alternate between global search and retrieval-based grounding might handle even larger or more open-ended discovery tasks.
If local repair proves insufficient in new domains, the method would need extensions such as automatic interface synthesis to stay fully automatic.

Load-bearing premise

The premise is twofold: existing agents and tools can be made to work together once typed artifact handoffs are defined, and fixing problems locally will usually be enough to produce a working workflow in open scientific settings.
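
A rough sketch of what "bounded local repair" could look like in code follows. The function name `run_with_local_repair`, the `max_repairs` parameter, and the toy repair policy are assumptions introduced here; the page does not state the paper's actual triggers or bounds. The point is only the control flow: a failing step is retried through a repair function a limited number of times, and other steps are left untouched.

```python
from typing import Callable, Dict, List, Tuple


def run_with_local_repair(
    steps: Dict[str, Callable[[], str]],
    repair: Callable[[str, Exception], Callable[[], str]],
    max_repairs: int = 2,
) -> Tuple[Dict[str, str], List[str]]:
    """Run steps in order; on failure, repair only the failing step, up to a bound."""
    outputs: Dict[str, str] = {}
    unresolved: List[str] = []
    for name, step in steps.items():
        attempts = 0
        while True:
            try:
                outputs[name] = step()
                break
            except Exception as err:
                if attempts >= max_repairs:
                    # Bound reached: record the step instead of rewriting the workflow.
                    unresolved.append(name)
                    break
                step = repair(name, err)  # replace only this step's callable
                attempts += 1
    return outputs, unresolved


def load() -> str:
    return "loaded"


def broken() -> str:
    raise ValueError("missing column 'gene'")


def always_broken() -> str:
    raise RuntimeError("tool not installed")


def toy_repair(name: str, err: Exception) -> Callable[[], str]:
    # Toy policy: only the ValueError case is fixable in this example.
    if isinstance(err, ValueError):
        return lambda: "annotated"
    return always_broken


outputs, unresolved = run_with_local_repair(
    {"load": load, "annotate": broken, "export": always_broken}, toy_repair
)
print(outputs, unresolved)
```

For simplicity the sketch keeps running after an unresolved step; a real workflow would likely stop or skip dependent steps. A non-empty `unresolved` list corresponds to the case where local repair is not enough and some other intervention is needed.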

What would settle it

A new genomics task in which repeated local repairs on retrieved components still leave persistent handoff or execution failures that require manual redesign of agent interfaces or the overall topology.

Figures

Figures reproduced from arXiv: 2605.20425 by Jian Ma, Mingqian Ma, Shike Wang, Shuaike Shen, Wenduo Cheng.

**Figure 1.** Overview of AGENTCO-OP. AGENTCO-OP synthesizes multi-agent workflows through five main stages: Planning, Retrieval, Synthesis, Execution, and Review. Given a typed task specification x = (g, c, r, Ω), the system retrieves relevant knowledge, skills, tools, repositories, and datasets, then synthesizes an executable workflow graph G = (V, E). The synthesis stage includes initial graph construction, Dockerfil…

**Figure 3.** Sche builds isolated Docker containers, registers each container …

**Figure 3.** AGENTCO-OP coordinates external tools for cross-modal marker discovery. AGENTCO-OP registers external Seurat and Signac tool nodes, runs parallel RNA and ATAC marker-discovery branches, validates typed artifacts, evaluates marker support against CellMarker 2.0 and PanglaoDB, and integrates the evidence into a final report.
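
The parallel-branch structure described in this caption can be sketched in Python with the standard library. This is an illustration under stated assumptions, not the paper's code: the branch functions are placeholders that return invented gene labels (`GENE_A` and so on), the `reference` set stands in for a lookup against a marker database, and no real Seurat or Signac calls are made.

```python
from concurrent.futures import ThreadPoolExecutor


def rna_branch() -> set:
    # Placeholder for an RNA marker-discovery tool call; labels are invented.
    return {"GENE_A", "GENE_B", "GENE_C"}


def atac_branch() -> set:
    # Placeholder for an ATAC marker-discovery tool call; labels are invented.
    return {"GENE_B", "GENE_C", "GENE_D"}


def integrate(rna: set, atac: set, reference: set) -> dict:
    """Combine per-branch evidence and reference support into one report."""
    report = {}
    for gene in sorted(rna | atac):
        report[gene] = {
            "rna": gene in rna,
            "atac": gene in atac,
            "in_reference": gene in reference,
        }
    return report


# Run the two independent branches concurrently, then join their results.
with ThreadPoolExecutor(max_workers=2) as pool:
    rna_future = pool.submit(rna_branch)
    atac_future = pool.submit(atac_branch)
    rna_markers, atac_markers = rna_future.result(), atac_future.result()

report = integrate(rna_markers, atac_markers, reference={"GENE_B", "GENE_D"})
both = [gene for gene, ev in report.items() if ev["rna"] and ev["atac"]]
print(both)
```

Because the branches share no state, they can run in either order or in parallel, and the integration step only needs their typed outputs.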

Original abstract

Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgentCo-op gives a retrieval-plus-local-repair route to composing existing agents into workflows for open scientific tasks, with decent benchmark numbers, but the genomics claims rest on unshown mechanics for handoffs and repair.

Colleague, the main point is that AgentCo-op retrieves existing agents and tools, wires them together with typed artifact handoffs, and applies bounded local repair when execution shows problems. It avoids both full redesign and global topology search. In the two genomics examples it assembles agents for spatial transcriptomics and gene-set work, plus a parallel workflow for single-cell multiome marker analysis. It also takes a searched workflow as a prior, grounds nodes with retrieved pieces, and repairs locally, which suggests the two styles can be combined. On the six coding/math/QA benchmarks it leads on four and posts the best average under one backbone while cutting per-task cost versus baselines. That part looks reproducible enough from the numbers given.

The soft spot is the open-world case studies. The abstract states that independently developed agents become interoperable and that local repair resolves failures, but it gives no concrete artifact types, matching procedure, repair triggers, or failure traces. Without those, it is hard to judge whether the typed handoffs actually deliver the claimed interoperability or whether the repair stays bounded in real genomics settings.

The paper is aimed at people who want modular agent systems for data-scarce scientific domains rather than pure benchmark optimization. A reader working on practical multi-agent design for biology or similar fields would find the complementarity with search useful. It deserves a serious referee because the core synthesis idea is distinct from standard graph search and the benchmark results are concrete, even if the genomics evidence will need more detail to hold up.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable multi-agent workflows via typed artifact handoffs, followed by bounded self-guided local repair when execution evidence indicates failure. It reports results from two open-world genomics case studies (spatial transcriptomics collaboration and cross-modality marker analysis on single-cell multiome data) in which independently developed agents and external tool repositories are assembled into auditable workflows without redesign or global topology search. The paper also shows that the method can improve a searched workflow by grounding nodes with retrieved components plus local repair, and reports top performance on four of six coding/math/QA benchmarks with reduced per-task cost under a unified backbone.

Significance. If the central claims are supported by the missing implementation details and validation, the work would usefully extend automated agentic workflow design from benchmark-optimized graphs to open scientific settings that lack curated training data, scalar metrics, and standardized interfaces. The explicit demonstration that retrieval-based synthesis and search-based methods are complementary, together with the reported cost reductions, are concrete strengths that could be leveraged by follow-on research.

major comments (2)

[Genomics case studies] Genomics case studies section: the central claim that AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesign or global topology search depends on the sufficiency of typed artifact handoffs plus bounded local self-guided repair. The manuscript supplies no concrete description of the artifact types employed, the discovery or matching procedure to existing agent interfaces, the repair triggers and bounds, whether repair ever escalates to node replacement, or any failure-resolution traces from the two case studies. Without these, the claim that the method works in open scientific settings reduces to an untested assumption about interface compatibility.
[Experimental evaluation] Benchmark results: the abstract states that AgentCo-op achieves the best result on four benchmarks and the best average score, yet the reported evaluation lacks statistical validation, error analysis, or per-run variance, which weakens the support for the claim of consistent superiority and cost reduction relative to multi-agent baselines.

minor comments (2)

[Abstract] The abstract refers to 'six coding, math, and question-answering benchmarks' without naming them; listing the specific benchmarks (e.g., HumanEval, GSM8K, etc.) would improve clarity and reproducibility.
Ensure that any workflow diagrams or tables in the case-study sections are accompanied by explicit legends that define the typed artifacts and repair actions shown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional clarity and validation would strengthen the manuscript. We address each major comment below and will revise the manuscript to incorporate the requested details and improvements.

Referee: [Genomics case studies] Genomics case studies section: the central claim that AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesign or global topology search depends on the sufficiency of typed artifact handoffs plus bounded local self-guided repair. The manuscript supplies no concrete description of the artifact types employed, the discovery or matching procedure to existing agent interfaces, the repair triggers and bounds, whether repair ever escalates to node replacement, or any failure-resolution traces from the two case studies. Without these, the claim that the method works in open scientific settings reduces to an untested assumption about interface compatibility.

Authors: We agree that the manuscript currently lacks sufficient concrete implementation details on these elements, which are necessary to fully substantiate the claims regarding interoperability in open scientific settings. In the revised manuscript, we will add a dedicated subsection (and supporting appendix material) that explicitly describes the artifact types employed in the genomics workflows, the discovery and matching procedure for agent interfaces, the repair triggers and iteration bounds, confirmation that repairs remain strictly local and do not escalate to node replacement, and selected failure-resolution traces from the two case studies. These additions will provide the missing evidence that typed handoffs combined with bounded local repair enable the reported workflows without redesign or global search. revision: yes
Referee: [Experimental evaluation] Benchmark results: the abstract states that AgentCo-op achieves the best result on four benchmarks and the best average score, yet the reported evaluation lacks statistical validation, error analysis, or per-run variance, which weakens the support for the claim of consistent superiority and cost reduction relative to multi-agent baselines.

Authors: We acknowledge that the current benchmark evaluation reports point estimates from single runs without variance measures or statistical tests, which limits the robustness of the superiority and cost-reduction claims. In the revised manuscript, we will augment the experimental evaluation section with results aggregated over multiple independent runs, including mean scores, standard deviations, and appropriate statistical significance tests (e.g., paired comparisons against baselines) to better support the reported performance advantages and cost savings. revision: yes
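
The kind of multi-run reporting promised in this response could look like the following Python sketch. The scores are invented toy numbers, not results from the paper, and the choice of an exact sign-flip permutation test is an assumption made for illustration; the rebuttal only mentions "paired comparisons" without naming a specific test.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores):
    """Return the mean and sample standard deviation over independent runs."""
    return mean(scores), stdev(scores)


def paired_permutation_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired per-run differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    count = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed:
            count += 1
    return count / total


# Toy accuracy percentages over five paired runs (invented for illustration).
method_runs = [80, 82, 81, 83, 84]
baseline_runs = [78, 79, 80, 80, 81]

method_mean, method_sd = summarize(method_runs)
baseline_mean, baseline_sd = summarize(baseline_runs)
p_value = paired_permutation_pvalue(method_runs, baseline_runs)
print(method_mean, round(method_sd, 3), baseline_mean, round(baseline_sd, 3), p_value)
```

With only five runs the smallest achievable two-sided p-value is 2/32, which shows why a handful of runs gives weak evidence even when one method wins every pairing.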

Circularity Check

0 steps flagged

No circularity: empirical results from case studies and benchmarks

full rationale

The paper describes a retrieval-based synthesis method using typed artifact handoffs and bounded local repair, then reports outcomes from two genomics case studies and six benchmarks as experimental findings. No equations, fitted parameters, or derivations are presented that reduce by construction to the inputs. Claims about interoperability and repair are validated externally via application to open-world tasks rather than self-definition or self-citation chains. The derivation chain remains self-contained through method description plus independent evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on domain assumptions about agent interoperability and repair sufficiency, with no explicit free parameters or invented entities detailed in the abstract; the framework itself is the primary contribution.

axioms (1)

domain assumption: Existing scientific agents and tools possess compatible typed interfaces that enable artifact handoffs without redesign.
Invoked to support composition of independently developed components in the genomics case studies.

[1] [1]

Program Synthesis with Large Language Models

Anthropic. Agent skills. https://docs.anthropic.com/en/docs/claude-code/skills, 2025a. Anthropic. Create custom subagents. https://code.claude.com/docs/en/sub-agents, 2025b. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with ...

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Varun Pratap Bhardwaj. Agent skills for large language models: Architecture, acquisition, security, and the path forward.arXiv preprint arXiv:2602.12430,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gard- ner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics,

work page 2019

[6] [6]

Oscar Franzén, Li-Ming Gan, and Johan L. M. Björkegren. PanglaoDB: a web server for exploration of mouse and human single-cell rna sequencing data.Database, 2019:baz046,

work page 2019

[7] [7]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt

doi: 10.1016/j.cell.2021.04.048. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InAdvances in Neural Information Processing Systems,

work page doi:10.1016/j.cell.2021.04.048 2021

[8] [8]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework.arXiv preprint arXiv:2308.00352,

work page internal anchor Pith review Pith/arXiv arXiv

[9] Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, and Cuiyun Gao. Repo2run: Automated building executable environment for code repository at scale. arXiv preprint arXiv:2502.13681, 2025.

[10] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024.

[11] Kaixuan Huang, Yuanhao Qu, Henry Cousins, William A. Johnson, Di Yin, Mihir Shah, Denny Zhou, Russ Altman, Mengdi Wang, and Le Cong. CRISPR-GPT: An LLM agent for automated design of gene-editing experiments. bioRxiv 2024.04.25.591003, 2024.

[12] Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, Di Yin, Shruti Marwaha, Jennefer N. Carter, Xin Zhou, Matthew Wheeler, Jonathan A. Bernstein, Mengdi Wang, Peng He, Jingtian Zhou, Michael Snyder, Le Cong, Aviv Regev, and Jure Leskovec. Biomni: A general-purpose biomedical AI agent. b...

[13] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[14] Ruofan Jin, Zaixi Zhang, Mengdi Tang, Le Cong Wang, and Mengdi Wang. STELLA: Self-evolving LLM agent for biomedical research. arXiv preprint arXiv:2507.02004, 2025.

[15] Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, and Yongfeng Zhang. AutoFlow: Automated workflow generation for large language model agents. arXiv preprint arXiv:2407.12821, 2024.

[16] Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A dynamic LLM-powered agent network for task-oriented agent collaboration. arXiv preprint arXiv:2310.02170, 2023.

[17] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.

[18] Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452, 2023.

[19] Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. AgentSquare: Automatic LLM agent search in modular design space. arXiv preprint arXiv:2410.06153, 2024.

[20] Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, and Jian Ma. SkillFoundry: Building self-evolving agent skill libraries from heterogeneous scientific resources. arXiv preprint arXiv:2604.03964, 2026.

[21] Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of LLMs. arXiv preprint arXiv:2501.06322, 2025.

H. Wang et al. SpatialAgent: An autonomous AI agent for spatial biology. bioRxiv, 2025a. doi: 10.1101/2025.04.03.646459.

[22] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

[23] Yingxu Wang, Yuxuan Liu, Lu Tian, Wenkang Shen, Zixuan Tang, Tianqi Wang, Wenhao Wu, Wenjun Liu, and Quanjia Yu. EvoAgentX: An automated framework for evolving agentic workflows. arXiv preprint arXiv:2507.03616, 2025b.

Z. Wang, Q. Jin, C.-H. Wei, et al. GeneAgent: Self-verification language agent for gene-set analysis using domain databases. Nature Metho...

[24] Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guangshuai Wang, Zhiqiang Gao, Juntai Cao, et al. From AI for science to agentic science: A survey on autonomous scientific discovery. arXiv preprint arXiv:2508.14111, 2025.

[25] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023.

[26] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.

[27] Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang. Multi-agent architecture search via agentic supernet. arXiv preprint arXiv:2502.04180, 2025.

[28] Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Evoskills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026.

[29] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024.

[30] Siwei Zhao, Xinyu Liu, Yifei Zhao, Tianyu Yang, Jiaqi Wang, Yong Bai, and Yang Liu. SEW: Self-evolving agentic workflows for automated code generation. arXiv preprint arXiv:2505.18646, 2025.

[31] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Language agents as optimizable graphs. arXiv preprint arXiv:2402.16823, 2024.

[32]

Matching is performed by scoring candidate skills and tools against the role description and the upstream and downstream artifact types of the node, and selecting the top-ranked entries. As a result, every node carries not only an instruction but also the procedural knowledge and callable operations needed to execute it, which both reduces prompt-engineer...
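The matching step described in this passage can be illustrated with a minimal sketch. It assumes a toy scoring rule (word overlap with the role description plus a count of agreeing artifact types); the `Candidate` class, the `score` and `match` functions, the `type_weight` parameter, and the example library entries are all hypothetical names introduced for illustration, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A skill or tool entry in a hypothetical library."""
    name: str
    description: str
    consumes: frozenset  # artifact types accepted as input
    produces: frozenset  # artifact types emitted as output


def _tokens(text):
    """Lowercased whitespace tokens of a description."""
    return set(text.lower().split())


def score(candidate, role, upstream, downstream, type_weight=1.0):
    """Toy score (an assumption): role word overlap plus artifact-type agreement."""
    role_tokens = _tokens(role)
    overlap = len(role_tokens & _tokens(candidate.description)) / max(len(role_tokens), 1)
    type_hits = len(candidate.consumes & upstream) + len(candidate.produces & downstream)
    return overlap + type_weight * type_hits


def match(candidates, role, upstream, downstream, k=2):
    """Return the k highest-scoring candidates for one workflow node."""
    ranked = sorted(
        candidates,
        key=lambda c: score(c, role, upstream, downstream),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical library entries, for illustration only.
library = [
    Candidate("cluster_cells", "cluster cells in an expression matrix",
              frozenset({"expression_matrix"}), frozenset({"cluster_labels"})),
    Candidate("find_markers", "find marker genes for each cluster",
              frozenset({"cluster_labels"}), frozenset({"gene_set"})),
    Candidate("plot_umap", "plot a umap embedding",
              frozenset({"embedding"}), frozenset({"figure"})),
]

top = match(
    library,
    role="find marker genes per cluster",
    upstream=frozenset({"cluster_labels"}),
    downstream=frozenset({"gene_set"}),
    k=1,
)
print([c.name for c in top])  # prints ['find_markers']
```

In this sketch the artifact-type term dominates the ranking, so a candidate whose inputs and outputs fit the node's neighbors outranks one that only shares vocabulary with the role description. A real system would likely replace the word-overlap term with a learned or embedding-based similarity.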


[33]

After removing the runtime local repair, most benchmarks show a drop in accuracy. The remaining two benchmarks fluctuate within a small margin, which suggests that local repair contributes most when tasks involve longer reasoning chains or precise generation. When we further remove agent skills and tools, the performance on most benchmarks remains close t...
