AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
Pith reviewed 2026-05-21 07:01 UTC · model grok-4.3
The pith
AgentCo-op assembles independent agents and tools into genomics workflows through typed handoffs and local repair.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentCo-op is a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs and then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies it assembles independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. One workflow coordinates agents for spatial transcriptomics and gene-set interpretation; the other builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. The method can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair.
What carries the argument
Typed artifact handoffs, which specify the data types exchanged between agents to guarantee interoperability, together with bounded self-guided local repair that targets only failing components instead of rewriting the whole workflow.
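To make the idea of a typed handoff concrete, here is a minimal Python sketch. The `Node` class, the artifact type names such as `spatial_clusters`, and the `check_handoffs` helper are all hypothetical illustrations, not the paper's implementation; the referee report notes that the manuscript does not specify its actual artifact types.

```python
from dataclasses import dataclass


# Hypothetical illustration: none of these names come from the paper.
@dataclass(frozen=True)
class Node:
    name: str
    consumes: frozenset  # artifact type names this node needs as input
    produces: frozenset  # artifact type names this node emits


def check_handoffs(nodes, initial_types):
    """Walk nodes in order and report inputs that no upstream step provides.

    Returns a list of (node_name, missing_type) pairs; an empty list means
    every handoff in this linear workflow is type-compatible.
    """
    available = set(initial_types)
    problems = []
    for node in nodes:
        for needed in sorted(node.consumes - available):
            problems.append((node.name, needed))
        available |= node.produces
    return problems


workflow = [
    Node("spatial_agent", frozenset({"expression_matrix"}), frozenset({"spatial_clusters"})),
    Node("marker_finder", frozenset({"spatial_clusters"}), frozenset({"gene_set"})),
    Node("gene_set_agent", frozenset({"gene_set"}), frozenset({"interpretation_report"})),
]

# Full chain: every input is produced upstream.
handoff_problems = check_handoffs(workflow, {"expression_matrix"})
# Dropping the middle node leaves gene_set_agent without its input type.
broken_problems = check_handoffs(workflow[::2], {"expression_matrix"})
```

A check of this kind only establishes that declared types line up; whether the data behind a type is actually usable downstream is the part that execution evidence and repair would have to cover.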
If this is right
- Independently developed scientific agents can be reused across tasks without redesign or interface changes.
- Workflows remain auditable because each step uses known, retrieved components rather than opaque generated code.
- A previously searched workflow can serve as a structural prior that retrieval then grounds with concrete components and repairs.
- Per-task cost drops relative to multi-agent baselines, while accuracy is the best on four of six coding, math, and QA benchmarks and the best on average under a unified backbone.
Where Pith is reading between the lines
- The same retrieval-plus-local-repair pattern could apply to other heterogeneous scientific fields that already have scattered agent repositories.
- Keeping workflows built from explicit, typed components may make AI-assisted scientific pipelines easier to reproduce and audit by human experts.
- Hybrid systems that alternate between global search and retrieval-based grounding might handle even larger or more open-ended discovery tasks.
- If local repair proves insufficient in new domains, the method would need extensions such as automatic interface synthesis to stay fully automatic.
Load-bearing premise
Existing agents and tools can be made to work together once typed artifact handoffs are defined, and fixing problems locally will usually be enough to produce a working workflow in open scientific settings.
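The second half of this premise can be pictured as a repair loop with a hard budget. The following Python sketch is an illustration under assumptions: `run_with_local_repair`, the `repair` callback, and the `max_repairs` bound are hypothetical names, and the paper's actual repair triggers and bounds are not described in the material summarized here.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical sketch: a step is a callable that takes the upstream artifact
# and returns the downstream artifact, raising an exception on failure.
Step = Callable[[object], object]


def run_with_local_repair(
    steps: List[Step],
    repair: Callable[[Step, Exception], Step],
    data: object,
    max_repairs: int = 2,
) -> Tuple[Optional[object], List[str]]:
    """Run steps in order; on failure, repair only the failing step.

    Each step gets at most `max_repairs` repair attempts. If the budget is
    exhausted, the run stops and returns None with the log so far.
    """
    log: List[str] = []
    for index, step in enumerate(steps):
        attempts = 0
        while True:
            try:
                data = step(data)
                log.append(f"step {index}: ok")
                break
            except Exception as error:
                if attempts >= max_repairs:
                    log.append(f"step {index}: gave up after {attempts} repairs")
                    return None, log
                attempts += 1
                log.append(f"step {index}: repair {attempts} after {type(error).__name__}")
                step = repair(step, error)  # other steps are left untouched
    return data, log


def parse_count(text):
    return int(text)  # raises ValueError on input such as "21 cells"


def strip_units_repair(step, error):
    # Toy repair: wrap the failing step so it keeps only the first token.
    return lambda text: step(str(text).split()[0])


result, repair_log = run_with_local_repair(
    [parse_count, lambda n: n * 2], strip_units_repair, "21 cells"
)
# A repair that changes nothing exhausts the budget and stops the run.
failed_result, failed_log = run_with_local_repair(
    [parse_count], lambda step, error: step, "x", max_repairs=1
)
```

The case where the budget runs out is the one that "What would settle it" describes: a failure that local repair cannot resolve and that would need changes outside the failing component.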
What would settle it
A new genomics task in which repeated local repairs on retrieved components still leave persistent handoff or execution failures that require manual redesign of agent interfaces or the overall topology.
Original abstract
Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable multi-agent workflows via typed artifact handoffs, followed by bounded self-guided local repair when execution evidence indicates failure. It reports results from two open-world genomics case studies (spatial transcriptomics collaboration and cross-modality marker analysis on single-cell multiome data) in which independently developed agents and external tool repositories are assembled into auditable workflows without redesign or global topology search. The paper also shows that the method can improve a searched workflow by grounding nodes with retrieved components plus local repair, and reports top performance on four of six coding/math/QA benchmarks with reduced per-task cost under a unified backbone.
Significance. If the central claims are supported by the missing implementation details and validation, the work would usefully extend automated agentic workflow design from benchmark-optimized graphs to open scientific settings that lack curated training data, scalar metrics, and standardized interfaces. The explicit demonstration that retrieval-based synthesis and search-based methods are complementary, together with the reported cost reductions, are concrete strengths that could be leveraged by follow-on research.
major comments (2)
- [Genomics case studies] Genomics case studies section: the central claim that AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesign or global topology search depends on the sufficiency of typed artifact handoffs plus bounded local self-guided repair. The manuscript supplies no concrete description of the artifact types employed, the discovery or matching procedure to existing agent interfaces, the repair triggers and bounds, whether repair ever escalates to node replacement, or any failure-resolution traces from the two case studies. Without these, the claim that the method works in open scientific settings reduces to an untested assumption about interface compatibility.
- [Experimental evaluation] Benchmark results: the abstract states that AgentCo-op achieves the best result on four benchmarks and the best average score, yet the reported evaluation lacks statistical validation, error analysis, or per-run variance, which weakens the support for the claim of consistent superiority and cost reduction relative to multi-agent baselines.
minor comments (2)
- [Abstract] The abstract refers to 'six coding, math, and question-answering benchmarks' without naming them; listing the specific benchmarks (e.g., HumanEval, GSM8K) would improve clarity and reproducibility.
- Ensure that any workflow diagrams or tables in the case-study sections are accompanied by explicit legends that define the typed artifacts and repair actions shown.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional clarity and validation would strengthen the manuscript. We address each major comment below and will revise the manuscript to incorporate the requested details and improvements.
- Referee: [Genomics case studies] Genomics case studies section: the central claim that AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesign or global topology search depends on the sufficiency of typed artifact handoffs plus bounded local self-guided repair. The manuscript supplies no concrete description of the artifact types employed, the discovery or matching procedure to existing agent interfaces, the repair triggers and bounds, whether repair ever escalates to node replacement, or any failure-resolution traces from the two case studies. Without these, the claim that the method works in open scientific settings reduces to an untested assumption about interface compatibility.
  Authors: We agree that the manuscript currently lacks sufficient concrete implementation details on these elements, which are necessary to fully substantiate the claims regarding interoperability in open scientific settings. In the revised manuscript, we will add a dedicated subsection (and supporting appendix material) that explicitly describes the artifact types employed in the genomics workflows, the discovery and matching procedure for agent interfaces, the repair triggers and iteration bounds, confirmation that repairs remain strictly local and do not escalate to node replacement, and selected failure-resolution traces from the two case studies. These additions will provide the missing evidence that typed handoffs combined with bounded local repair enable the reported workflows without redesign or global search.
  Revision: yes
- Referee: [Experimental evaluation] Benchmark results: the abstract states that AgentCo-op achieves the best result on four benchmarks and the best average score, yet the reported evaluation lacks statistical validation, error analysis, or per-run variance, which weakens the support for the claim of consistent superiority and cost reduction relative to multi-agent baselines.
  Authors: We acknowledge that the current benchmark evaluation reports point estimates from single runs without variance measures or statistical tests, which limits the robustness of the superiority and cost-reduction claims. In the revised manuscript, we will augment the experimental evaluation section with results aggregated over multiple independent runs, including mean scores, standard deviations, and appropriate statistical significance tests (e.g., paired comparisons against baselines) to better support the reported performance advantages and cost savings.
  Revision: yes
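As a sketch of the kind of analysis promised in this response, the snippet below computes a per-system mean and standard deviation and a paired t statistic over matched runs, using only the Python standard library. The scores are made-up placeholders, not results from the paper, and the function name `paired_t_statistic` is illustrative.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic for matched samples a[i] vs b[i] (df = n - 1)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Placeholder accuracies from five hypothetical runs with shared seeds.
method_runs = [0.82, 0.85, 0.81, 0.84, 0.83]
baseline_runs = [0.80, 0.82, 0.80, 0.81, 0.82]

# (mean, sample standard deviation) for each system.
summary = {
    "method": (statistics.mean(method_runs), statistics.stdev(method_runs)),
    "baseline": (statistics.mean(baseline_runs), statistics.stdev(baseline_runs)),
}
t_value = paired_t_statistic(method_runs, baseline_runs)
```

Under the usual assumptions, `t_value` would be compared against a t distribution with n - 1 degrees of freedom. With only a handful of runs, a nonparametric paired test may be the more cautious choice.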
Circularity Check
No circularity: empirical results from case studies and benchmarks
The paper describes a retrieval-based synthesis method using typed artifact handoffs and bounded local repair, then reports outcomes from two genomics case studies and six benchmarks as experimental findings. No equations, fitted parameters, or derivations are presented that reduce by construction to the inputs. Claims about interoperability and repair are validated externally via application to open-world tasks rather than self-definition or self-citation chains. The derivation chain remains self-contained through method description plus independent evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing scientific agents and tools possess compatible typed interfaces that enable artifact handoffs without redesign.
Excerpts from the paper
Matching is performed by scoring candidate skills and tools against the role description and the upstream and downstream artifact types of the node, and selecting the top-ranked entries. As a result, every node carries not only an instruction but also the procedural knowledge and callable operations needed to execute it, which both reduces prompt-engineer...
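The matching step described above can be sketched as a scoring function over a component library. This is a minimal illustration under assumptions: the word-overlap score, the additive bonus for type agreement, and all names (`Component`, `score`, `top_k`) are hypothetical stand-ins, since the excerpt does not say how candidates are actually scored.

```python
from dataclasses import dataclass


# Hypothetical illustration: names and scoring rule are not from the paper.
@dataclass(frozen=True)
class Component:
    name: str
    description: str
    input_type: str
    output_type: str


def score(component, role, upstream_type, downstream_type):
    """Toy score: word overlap with the role, plus one point per matching type."""
    role_words = set(role.lower().split())
    desc_words = set(component.description.lower().split())
    overlap = len(role_words & desc_words) / max(len(role_words), 1)
    type_bonus = int(component.input_type == upstream_type) + int(
        component.output_type == downstream_type
    )
    return overlap + type_bonus


def top_k(library, role, upstream_type, downstream_type, k=1):
    """Return the k highest-scoring components for a workflow node."""
    ranked = sorted(
        library,
        key=lambda c: score(c, role, upstream_type, downstream_type),
        reverse=True,
    )
    return ranked[:k]


library = [
    Component("enrichment_tool", "gene set enrichment analysis", "gene_set", "report"),
    Component("cluster_tool", "cluster spatial spots", "expression_matrix", "clusters"),
    Component("plot_tool", "plot gene expression", "expression_matrix", "figure"),
]

best = top_k(library, "interpret gene set enrichment", "gene_set", "report")
best_two = top_k(library, "interpret gene set enrichment", "gene_set", "report", k=2)
```

A real system would more plausibly use embedding similarity than word overlap; the point of the sketch is only that the role description and the neighboring artifact types both enter the ranking.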
After removing the runtime local repair, most benchmarks show a drop in accuracy. The remaining two benchmarks fluctuate within a small margin, which suggests that local repair contributes most when tasks involve longer reasoning chains or precise generation. When we further remove agent skills and tools, the performance on most benchmarks remains close t...