AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

arxiv: 2605.20025 · v1 · pith:WLFJUSZO · submitted 2026-05-19 · 💻 cs.AI

Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang, Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie Li, Linjun Zhang, Yuyin Zhou, Sheng Wang, Caiming Xiong, James Zou, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

Pith reviewed 2026-05-20 05:18 UTC · model grok-4.3

classification 💻 cs.AI

keywords autonomous research · multi-agent systems · human-AI collaboration · iterative discovery · self-healing execution · research automation · AI for science


The pith

AutoResearchClaw uses multi-agent debate, self-healing from failures, verifiable reporting, targeted human input, and cross-run learning to improve autonomous research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AutoResearchClaw as a system designed to automate scientific discovery by modeling research as an iterative cycle of hypothesis, experiment, failure, and refinement rather than a single linear pass. It achieves this through five mechanisms that let multiple agents debate ideas and analyze results, recover from execution problems by deciding to pivot or refine, enforce accurate reporting to block made-up data, permit human collaboration in seven different modes, and carry forward lessons from prior runs to avoid repeating mistakes. Experiments on a benchmark covering 25 topics show stronger results than an earlier autonomous system, while tests of the human modes indicate that focused input at key points works better than either complete independence or full step-by-step checking. If the approach holds, AI tools could serve as steady amplifiers that help human researchers iterate more effectively without replacing their judgment.

Core claim

AutoResearchClaw is a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight.

What carries the argument

The five mechanisms of AutoResearchClaw that together support iterative research: multi-agent debate, a self-healing Pivot/Refine executor loop, verifiable result reporting, seven graduated human intervention modes, and cross-run experience transfer.
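To make the Pivot/Refine idea concrete, here is a minimal sketch of a recovery loop. It is not the authors' implementation: the names `Decision`, `Outcome`, `decide`, `run_with_recovery`, and the `max_refines` budget are all hypothetical, and the rule "refine until a per-hypothesis budget is spent, then pivot" is an assumption chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Decision(Enum):
    REFINE = "refine"  # keep the hypothesis, repair the experiment
    PIVOT = "pivot"    # drop the hypothesis, move to the next one


@dataclass
class Outcome:
    ok: bool
    detail: str  # result summary on success, error message on failure


def decide(failures: int, max_refines: int) -> Decision:
    """Hypothetical rule: allow up to max_refines repairs per hypothesis."""
    return Decision.REFINE if failures <= max_refines else Decision.PIVOT


def run_with_recovery(
    hypotheses: List[str],
    run_experiment: Callable[[str, int], Outcome],
    max_refines: int = 2,
) -> Tuple[Optional[str], List[str]]:
    """Try hypotheses in order; every failure is logged instead of halting."""
    lessons: List[str] = []
    for hypothesis in hypotheses:
        failures = 0
        attempt = 0
        while True:
            outcome = run_experiment(hypothesis, attempt)
            if outcome.ok:
                return hypothesis, lessons
            failures += 1
            lessons.append(f"{hypothesis} (attempt {attempt}): {outcome.detail}")
            if decide(failures, max_refines) is Decision.PIVOT:
                break  # pivot: continue with the next hypothesis
            attempt += 1  # refine: retry the same hypothesis
    return None, lessons
```

The point of the sketch is that a failed run produces a recorded lesson and a decision, not a crash; in a real system the decision would presumably depend on the content of the failure rather than a simple counter.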

If this is right

Research pipelines can continue after failed experiments by extracting information and choosing to pivot or refine rather than halting.
Human input improves results most when applied only at high-leverage decision points instead of continuously or not at all.
Lessons from earlier research runs can be stored and applied to reduce repeated errors in later projects.
Built-in verification steps can lower the rate of incorrect numerical results or citations in AI-produced research.
Debate among multiple agents can generate stronger initial hypotheses and more balanced analysis of experimental outcomes.
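One way to picture the verification step in the list above is a check that every number quoted in a report can be traced to a logged measurement. The sketch below is an illustration under stated assumptions, not the paper's mechanism: `unsupported_numbers` is a hypothetical helper, and it treats a reported number as supported when some logged value rounds to it at the reported precision.

```python
import re
from typing import Iterable, List

# Unsigned integers or decimals; signs and scientific notation are ignored
# to keep the sketch small.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(report: str, logged_values: Iterable[float]) -> List[str]:
    """Return the numeric tokens in `report` that no logged value rounds to."""
    logged = list(logged_values)
    flagged: List[str] = []
    for token in NUMBER.findall(report):
        decimals = len(token.split(".")[1]) if "." in token else 0
        # Compare at the precision the report uses, e.g. 0.9134 -> "0.91".
        supported = any(f"{value:.{decimals}f}" == token for value in logged)
        if not supported:
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    text = "Accuracy was 0.91 over 3 seeds; loss fell to 0.12."
    print(unsupported_numbers(text, [0.9134, 3]))
```

In this example only the loss figure has no logged source, so it is the one a reviewer (human or agent) would be asked to justify or remove. The check is deliberately conservative: incidental numbers in prose are flagged too.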

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The cross-run learning mechanism could allow separate research projects to share safeguards and build cumulative capability over time.
Pairing the pipeline with automated lab equipment might enable longer chains of physical experiments with limited human setup.
Testing the system on problems where success criteria are less structured than the current benchmark would show how well the mechanisms generalize.
Repeated use might gradually reduce the amount of human oversight needed as internal safeguards accumulate from past runs.
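The cross-run idea in these extensions can be sketched as a small persistent store of lessons that a later run consults before executing a plan. Everything here is hypothetical: the `LessonStore` class, the JSON file layout, and the substring-based matching are assumptions made for illustration and are not taken from the paper or its repository.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class LessonStore:
    """Persist (trigger, safeguard) pairs so later runs can reuse them."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.lessons: List[Dict[str, str]] = (
            json.loads(path.read_text()) if path.exists() else []
        )

    def record(self, trigger: str, safeguard: str) -> None:
        self.lessons.append({"trigger": trigger, "safeguard": safeguard})
        self.path.write_text(json.dumps(self.lessons, indent=2))

    def safeguards_for(self, plan: str) -> List[str]:
        """Return safeguards whose trigger text appears in the plan."""
        text = plan.lower()
        return [
            lesson["safeguard"]
            for lesson in self.lessons
            if lesson["trigger"].lower() in text
        ]


def demo() -> List[str]:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "lessons.json"
        # Run 1 fails and records what went wrong.
        LessonStore(path).record("learning rate 1.0", "cap the learning rate at 0.1")
        # Run 2 re-opens the same file and checks its plan against it.
        return LessonStore(path).safeguards_for(
            "Sweep with learning rate 1.0 and batch size 32"
        )
```

Because the store is just a file, separate projects could in principle point at the same path and share safeguards, which is the kind of cumulative behavior the first extension above speculates about.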

Load-bearing premise

The ARC-Bench benchmark and its scoring rules accurately reflect the quality and iterative nature of real scientific research, and the measured performance gain is driven by the five listed mechanisms rather than differences in base models, prompt engineering, or implementation details not controlled for in the comparison.

What would settle it

A side-by-side run of AutoResearchClaw and prior systems on a fresh set of research problems outside the original 25 topics, with outcomes judged by independent experts for novelty, correctness, and usefulness.

Original abstract

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming-lab/AutoResearchClaw.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AutoResearchClaw assembles a workable multi-agent pipeline with failure recovery and graded human oversight, but the 54.7% gain over AI Scientist v2 cannot be cleanly attributed to those features without matching controls on models and prompts.

The main thing to know is that this paper describes a concrete system that stitches together multi-agent debate, a Pivot/Refine recovery loop, output verification, seven human intervention levels, and cross-run memory into one running pipeline. The authors also release code, which is useful for anyone who wants to inspect or extend it. The ablation on human modes is the clearest positive result: targeted intervention at key points beats both full autonomy and constant micro-management on their benchmark. That part feels like a practical observation worth noting for tool builders.

The rest of the contribution is mostly integration rather than a single new mechanism. Each piece has earlier examples, but the specific loop and the seven-mode menu are packaged together here in a way that could be copied or tested by others.

The soft spot sits in the central comparison. The abstract reports a 54.7% improvement on ARC-Bench without stating whether the baseline ran on the same model, used equivalent prompts, or followed identical topic sampling and scoring. If those factors differed, the number does not isolate the effect of the five listed mechanisms. The benchmark itself is introduced in the paper, so its coverage of real iterative research remains an open question until more external validation appears.

This work is aimed at groups already building or evaluating autonomous research agents. A reader who needs a ready-to-run example with human-in-the-loop options and wants to see how the pieces interact will find the description and code directly usable. It is not yet a finished theoretical advance, but the implementation is far enough along that a serious referee could usefully press on the evaluation details and the benchmark design. I would send it out for review rather than desk-reject it.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and analysis, a self-healing executor with Pivot/Refine decision loops to convert failures into information, verifiable result reporting to prevent hallucinations, human-in-the-loop collaboration via seven intervention modes, and cross-run evolution to accumulate safeguards from past runs. On the ARC-Bench 25-topic benchmark, it reports a 54.7% outperformance over AI Scientist v2, with an ablation study on the human collaboration modes.

Significance. If the performance gains prove robust under controlled conditions, the work could meaningfully advance autonomous research systems by demonstrating practical benefits of iterative self-correction and calibrated human oversight rather than full autonomy. The open availability of code at the provided GitHub repository is a clear strength that supports reproducibility and community follow-up.

major comments (2)

[Abstract] Abstract and experimental results section: The headline 54.7% outperformance on ARC-Bench is presented without details on whether the AI Scientist v2 baseline used the same base LLM, equivalent prompt templates, matched token budgets, identical topic sampling, or the same scoring rubric. This omission prevents attribution of the delta specifically to the five mechanisms rather than uncontrolled implementation differences.
[Results] Results and ablation sections: No variance across runs, statistical significance tests, or quantified before/after metrics are reported for the self-healing executor or cross-run evolution claims; these rest on design descriptions rather than isolated empirical evidence that would confirm their load-bearing contribution to the overall result.
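The variance reporting requested in the second major comment could take many forms; one common option is a paired percentile bootstrap over per-topic score differences. The sketch below assumes that both systems are scored on the same topics and that per-topic scores are available, neither of which is confirmed by the abstract. The function name `paired_bootstrap_ci` and its parameters are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean paired difference (a - b) with a percentile bootstrap interval."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample topics with replacement and recompute the mean difference.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = resampled[int((alpha / 2) * n_resamples)]
    high = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), low, high
```

If the resulting interval for the per-topic difference excludes zero, the headline gain is unlikely to be an artifact of topic sampling alone. It would still say nothing about whether the baseline used the same base model or prompts, which is the concern of the first major comment.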

minor comments (2)

[System Design] The seven human-in-the-loop intervention modes would be easier to compare if summarized in a table listing autonomy level, typical use case, and example intervention point for each mode.
[Evaluation] ARC-Bench is introduced without a brief description of its topic selection criteria or scoring protocol in the main text; a short paragraph or reference to an appendix would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing AutoResearchClaw. We provide point-by-point responses to the major comments and outline the revisions we will make to strengthen the experimental reporting.

Referee: [Abstract] Abstract and experimental results section: The headline 54.7% outperformance on ARC-Bench is presented without details on whether the AI Scientist v2 baseline used the same base LLM, equivalent prompt templates, matched token budgets, identical topic sampling, or the same scoring rubric. This omission prevents attribution of the delta specifically to the five mechanisms rather than uncontrolled implementation differences.

Authors: We agree that more explicit details on the baseline are needed to attribute the performance gains. We will revise the experimental results section to include a clear description of the AI Scientist v2 implementation, specifying the base LLM, prompt templates, token budgets, topic sampling method, and scoring rubric used. This will allow for better assessment of the contribution of our five mechanisms.

Revision: yes

Referee: [Results] Results and ablation sections: No variance across runs, statistical significance tests, or quantified before/after metrics are reported for the self-healing executor or cross-run evolution claims; these rest on design descriptions rather than isolated empirical evidence that would confirm their load-bearing contribution to the overall result.

Authors: We acknowledge that the evidence for the self-healing executor and cross-run evolution is primarily through their integration in the overall pipeline and the reported 54.7% improvement, rather than through isolated ablations with variance and statistical tests. The manuscript does include an ablation for human collaboration modes. To address this, we will add quantified before/after metrics and variance measures in the revised results section based on our experimental data. We will also include statistical significance tests where applicable.

Revision: yes

Circularity Check

0 steps flagged

No circularity: central claim is empirical benchmark result with no self-referential derivation

The paper presents AutoResearchClaw as a system built on five explicitly listed mechanisms and reports its performance as an observed outcome on the external ARC-Bench benchmark (25 topics). No equations, fitted parameters, or derivations are described that reduce to the inputs by construction. The outperformance figure (54.7% vs AI Scientist v2) is framed as a measured comparison rather than a prediction derived from the architecture itself. Self-citations, if present, are not load-bearing for the headline result, which rests on external evaluation. This matches the default expectation of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The system is built on standard assumptions about large language model capabilities for debate and code execution plus the validity of the ARC-Bench tasks as proxies for research quality; no new physical entities or mathematical axioms are introduced.

free parameters (1)

agent prompts and decision thresholds
Hyperparameters controlling debate structure, pivot/refine logic, and intervention triggers are necessarily tuned during development.

axioms (2)

domain assumption: Multi-agent debate produces higher-quality hypotheses than single-agent reasoning.
Invoked as the basis for the hypothesis generation stage.
domain assumption: Experiment failures contain recoverable information that can be turned into improved future attempts.
Core premise of the self-healing executor and cross-run evolution.
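The first axiom is easier to reason about with a toy version of a debate round in hand. In the sketch below each agent proposes a hypothesis and the other agents score it; proposals are then ranked by mean peer score. This is a simplified stand-in, not the paper's debate protocol: `debate_round`, the `Proposer` and `Critic` types, and the peer-scoring rule are assumptions, and real agents would be LLM calls rather than plain functions.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Proposer = Callable[[str], str]   # topic -> hypothesis
Critic = Callable[[str], float]   # hypothesis -> score in [0, 1]


def debate_round(
    topic: str,
    proposers: Dict[str, Proposer],
    critics: Dict[str, Critic],
) -> List[Tuple[float, str, str]]:
    """Return (mean peer score, author, hypothesis), best first."""
    proposals = {name: propose(topic) for name, propose in proposers.items()}
    ranked: List[Tuple[float, str, str]] = []
    for author, hypothesis in proposals.items():
        # An agent never scores its own proposal.
        votes = [
            critic(hypothesis)
            for name, critic in critics.items()
            if name != author
        ]
        ranked.append((mean(votes) if votes else 0.0, author, hypothesis))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

Whether this kind of peer scoring actually beats a single agent is exactly what the axiom assumes; an ablation that replaces the round with one proposer would be the direct test.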

pith-pipeline@v0.9.0 · 5905 in / 1539 out tokens · 56634 ms · 2026-05-20T05:18:15.065373+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: AutoResearchClaw... built on five mechanisms: structured multi-agent debate... self-healing executor with a Pivot/Refine decision loop... verifiable result reporting... human-in-the-loop collaboration with seven intervention modes... cross-run evolution

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: On ARC-Bench... outperforms AI Scientist v2 by 54.7%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

