AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

Caiming Xiong; Cong Lu; Daizong Liu; Dingjie Song; Guiyao Tie; Jianfeng Gao; Jiawen Shi; Lei Xing; Lichao Sun; Lifang He

arxiv: 2605.23204 · v1 · pith:SYLBMX4Xnew · submitted 2026-05-22 · 💻 cs.AI

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

Guiyao Tie , Jiawen Shi , Dingjie Song , Yixiao Huang , Ziji Sheng , Xueyang Zhou , Daizong Liu , Pan Zhou

show 15 more authors

Yongchao Chen Ran Xu Lifang He Qingsong Wen Manling Li Cong Lu Shuai Li Pengtao Xie Yixuan Yuan Rui Meng Lei Xing Lichao Sun Caiming Xiong Philip S. Yu Jianfeng Gao

This is my paper

Pith reviewed 2026-05-25 04:44 UTC · model grok-4.3

classification 💻 cs.AI

keywords AutoResearchAI-powered research automationscientific workflow automationdomain-conditioned autonomyAI scientist systemsresearch automation surveyworkflow conditionsevaluation dimensions

0 comments

The pith

AutoResearch autonomy is credible only in structured, executable, and rapidly verifiable domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys AI systems advancing from isolated task assistance to full scientific workflows that include literature review, hypothesis generation, experimentation, validation, and reporting. It defines AutoResearch as the spectrum spanning human-steered prompt assistance to AI-led coordination of larger discovery loops. Systems are organized around five workflow conditions and evaluated on novelty, validity, impact, reliability, and provenance. The central finding is that autonomy levels achieved so far remain domain-conditioned rather than general.

Core claim

AutoResearch is defined as the developmental spectrum of AI-powered scientific workflow automation. Within it, Vibe Research covers human-steered prompt-based assistance while AI-led systems coordinate larger portions of the discovery loop. Analysis of how these systems redistribute control, evidence, execution, validation, and accountability shows that autonomy is domain-conditioned: more credible in structured, executable, and rapidly verifiable settings but limited in embodied, delayed, heterogeneous, ethical, or institutionally accountable contexts.

What carries the argument

AutoResearch, the developmental spectrum of AI-powered scientific workflow automation that partitions control across five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication.

If this is right

Systems should be assessed using the five proposed dimensions of novelty, validity, impact, reliability, and provenance rather than isolated task metrics.
Development of AI scientist systems will be more credible when restricted to structured domains with rapid verification mechanisms.
Mixed-initiative frameworks remain necessary in domains involving ethics, embodiment, or institutional accountability.
Challenges in evidence preservation, reproducibility, and provenance tracking persist across all surveyed systems regardless of autonomy level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same domain-conditioning pattern may appear in AI automation of non-scientific workflows such as legal analysis or engineering design.
Cross-domain benchmarks that deliberately vary structure, verification speed, and ethical stakes could quantify the boundaries of current autonomy.
Institutional review boards could serve as an external validation layer to extend AutoResearch into currently limited ethical and accountable contexts.

Load-bearing premise

The surveyed systems and literature can be meaningfully partitioned into the proposed spectrum of autonomy levels and five workflow conditions without significant overlap or omission that would alter the domain-conditioned conclusion.

What would settle it

A documented AI system achieving robust autonomy, with minimal human oversight, in an embodied domain such as physical robotics experimentation or in an ethical context such as human-subjects clinical research would falsify the domain-conditioned claim.

Figures

Figures reproduced from arXiv: 2605.23204 by Caiming Xiong, Cong Lu, Daizong Liu, Dingjie Song, Guiyao Tie, Jianfeng Gao, Jiawen Shi, Lei Xing, Lichao Sun, Lifang He, Manling Li, Pan Zhou, Pengtao Xie, Philip S. Yu, Qingsong Wen, Ran Xu, Rui Meng, Shuai Li, Xueyang Zhou, Yixiao Huang, Yixuan Yuan, Yongchao Chen, Ziji Sheng.

**Figure 2.** Figure 2: Five-level autonomy spectrum of AutoResearch. The figure summarizes the L0–L4 levels by comparing how workflow control, task execution, and validation authority shift from human research to AI-autonomous research. The higher levels define stricter autonomy targets rather than implying that current systems densely populate them. ity, and scientific responsibility. The levels are therefore descriptive alloca… view at source ↗

**Figure 3.** Figure 3: Overview of the AutoResearch survey framework. The figure organizes the survey around a workflowcentered account of AutoResearch by linking five connected components: concept and scope, technical foundations, evaluation, domain-specific realizations, and broader discussion. Rather than treating research automation as a single model class or benchmark trend, it situates the field as a layered landscape spa… view at source ↗

**Figure 4.** Figure 4: Historical overview of AutoResearch. The figure maps representative works, systems, benchmarks, and open-source infrastructures onto the L0–L4 autonomy spectrum, with L2 further divided into single-step execution, interactive workflow automation, and pipeline automation under human verification. through computation, historical or literature-derived knowledge, active learning, and robotic execution. Both wo… view at source ↗

**Figure 5.** Figure 5: Technical workflow stages of AutoResearch. The figure decomposes AutoResearch into five recurring stages, from literature grounding and hypothesis planning to experimentation, validation, and reporting. Each stage lists representative technical operations and the corresponding workflow function. validation constrains outputs through rejection pressure; and reporting constrains communication through provena… view at source ↗

**Figure 6.** Figure 6: summarizes the main grounding regimes that recur across current systems, and the discussion below examines how these regimes differ in evidential strength, persistence, and workflow integration. Current grounding techniques can be organized into four recurring regimes: search-centered, evidence-centered, structure-centered, and literaturememory grounding. This distinction emphasizes not retrieval strength… view at source ↗

**Figure 7.** Figure 7: Hypothesis formation and planning regimes in AutoResearch. The figure compares four recurring regimes for generating and selecting candidate research directions: proposal-centered generation, deliberative multiagent planning, structure-guided planning, and search-based evolutionary planning. The central selection gate summarizes the main constraints used to filter directions before execution, including e… view at source ↗

**Figure 8.** Figure 8: summarizes the main execution regimes that recur across current systems, and the discussion below examines how these regimes differ in execution substrate, action procedure, and workflow control. Current execution techniques can be organized into four recurring regimes: code-native, tool-orchestrated, laboratory-robotic, and human-gated execution. The first three mainly differ by the substrate through whi… view at source ↗

**Figure 9.** Figure 9: Validation and rejection regimes in AutoResearch workflows. The figure compares three recurring regimes through which rejection pressure enters AutoResearch: execution-coupled reruns, critique-mediated review, and expert-grounded evaluation. The resulting signals support keep, revise, or reject decisions. This stage covers a wide range of operations that sit between execution and reporting. It includes con… view at source ↗

**Figure 10.** Figure 10: Reporting and knowledge communication regimes in AutoResearch. The figure compares three recurring regimes for communicating scientific claims: draft-centered reporting, review-centered communication, and artifact-linked reporting. These regimes differ in how claims remain aligned with manuscript text, critique, figures, tables, code, metadata, and provenance. grounding provides the evidential basis of t… view at source ↗

**Figure 11.** Figure 11: Scientific-quality dimensions and evidence instruments in AutoResearch. The figure separates five judgment targets-novelty, validity, impact, reliability, and provenance-from the evidence instruments used to support them, such as benchmarks, expert review, reruns, artifact tracing, and longitudinal follow-up. Scientific quality is developed here through five jointly necessary dimensions: novelty, validity… view at source ↗

**Figure 12.** Figure 12: summarizes this domain-conditioned landscape by positioning major scientific areas along the L0–L4 spectrum. The profile shows a clear gradient from computationally closed domains to empirically and socially constrained domains. Computational and formal sciences occupy the highest current position because their workflows can be represented, executed, verified, and repeated within digital environments. P… view at source ↗

read the original abstract

Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding, hypothesis generation, experimentation, validation, reporting, and revision. This shift marks a transition from task-level AI for science to workflow-level research automation. Yet current systems remain fragmented, differing in autonomy, domain scope, execution environment, validation mechanism, and human oversight, while still struggling with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure. This survey examines these developments through AutoResearch, defined as the developmental spectrum of AI-powered scientific workflow automation. Within it, Vibe Research denotes the human-steered region of prompt-based assistance and human-verified execution, whereas emerging AI-led systems coordinate larger portions of the discovery loop without achieving robust autonomy. We analyze how research systems redistribute control, evidence, execution, validation, and accountability across workflows and organize the field around five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication. We further synthesize AI scientist systems, mixed-initiative co-research frameworks, benchmarks, domain deployments, and open-source infrastructures. Finally, we propose five evaluation dimensions--novelty, validity, impact, reliability, and provenance--and show that AutoResearch autonomy is domain-conditioned, being more credible in structured, executable, and rapidly verifiable settings but limited in embodied, delayed, heterogeneous, ethical, or institutionally accountable contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A survey that maps existing AI-for-science systems into workflow stages and an autonomy spectrum, but the domain-conditioned claim rests on qualitative partitions without sharp boundaries.

read the letter

This paper is a survey that collects and organizes work on AI systems automating pieces of the research process. It introduces the umbrella term AutoResearch for the spectrum from prompt-driven human-steered work (called Vibe Research) to systems that handle larger workflow chunks, breaks research into five stages (literature grounding, hypothesis/planning, experimentation/tool use, validation/review, reporting), and lists five evaluation axes: novelty, validity, impact, reliability, and provenance. The closing observation is that credible autonomy appears more feasible in structured, quickly verifiable domains than in embodied, delayed, or ethically loaded ones.

Referee Report

1 major / 0 minor

Summary. The paper surveys AI systems for automating scientific research workflows, framing the field as 'AutoResearch'—a developmental spectrum from 'Vibe Research' (human-steered, prompt-based assistance with human-verified execution) to emerging AI-led systems that coordinate larger portions of the discovery loop. It organizes the literature around five workflow conditions (literature/research grounding; hypothesis formation/planning; experimentation/tool use; feedback/validation/review; reporting/knowledge communication), synthesizes AI scientist systems, mixed-initiative frameworks, benchmarks, domain deployments, and open-source infrastructures, proposes five evaluation dimensions (novelty, validity, impact, reliability, provenance), and concludes that AutoResearch autonomy is domain-conditioned: more credible in structured, executable, rapidly verifiable settings but limited in embodied, delayed, heterogeneous, ethical, or institutionally accountable contexts.

Significance. If the framework and partitioning hold, this survey offers a coherent synthesis of an emerging area, redistributing control/evidence/validation across workflows and delineating feasible versus limited domains for automation. The proposed evaluation dimensions provide a starting point for assessing such systems, and the domain-conditioned conclusion, if rigorously supported, would help prioritize research directions in structured versus complex settings.

major comments (1)

[Section on workflow conditions and autonomy spectrum] The section organizing the field around the five workflow conditions and the autonomy spectrum (Vibe Research vs. AI-led): these are defined qualitatively without operational decision rules, boundary conditions, or assignment criteria (e.g., thresholds for autonomous steps, handling of iterative loops between experimentation and validation, or percentage-based classification). This partitioning is load-bearing for the central claim that autonomy is domain-conditioned, as non-trivial overlap or omission in system assignments could alter which domains appear limited.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address the major comment below and indicate planned revisions.

read point-by-point responses

Referee: [Section on workflow conditions and autonomy spectrum] The section organizing the field around the five workflow conditions and the autonomy spectrum (Vibe Research vs. AI-led): these are defined qualitatively without operational decision rules, boundary conditions, or assignment criteria (e.g., thresholds for autonomous steps, handling of iterative loops between experimentation and validation, or percentage-based classification). This partitioning is load-bearing for the central claim that autonomy is domain-conditioned, as non-trivial overlap or omission in system assignments could alter which domains appear limited.

Authors: We agree that the workflow conditions and autonomy spectrum (Vibe Research to AI-led) are defined qualitatively in the current manuscript. This choice reflects the emerging nature of the field, where systems are described in the literature using narrative capabilities rather than standardized metrics, making uniform quantitative thresholds (e.g., step percentages or loop-handling rules) premature and potentially arbitrary. The five conditions follow conventional research workflow stages, and system assignments draw from explicit descriptions in cited works. However, the referee's point is valid: clearer criteria would strengthen the domain-conditioned claim. In revision we will add explicit qualitative boundary conditions (e.g., based on degree of closed-loop execution without human intervention, presence of automated validation, and handling of iterative feedback), include a table mapping representative systems to the spectrum with justifications, and discuss iterative loops explicitly. These additions will be conceptual rather than numeric to avoid over-prescription. revision: yes

Circularity Check

0 steps flagged

No circularity: literature survey without derivations or self-referential reductions

full rationale

The paper is a qualitative survey that defines AutoResearch, Vibe Research, AI-led systems, and five workflow conditions (literature grounding; hypothesis/planning; experimentation/tool use; feedback/validation/review; reporting/communication) as organizing categories. It synthesizes existing systems and proposes evaluation dimensions but contains no equations, fitted parameters, predictions, or derivations. No step reduces a claimed result to its own inputs by construction, and no self-citation chain is load-bearing for a mathematical claim. The domain-conditioned autonomy conclusion follows from the surveyed literature rather than from any internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces definitional terms (AutoResearch, Vibe Research) and an organizational framework but does not introduce new physical entities, fitted parameters, or unproven axioms beyond standard assumptions in AI literature surveys.

pith-pipeline@v0.9.0 · 5884 in / 1027 out tokens · 29671 ms · 2026-05-25T04:44:51.622196+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

180 extracted references · 80 canonical work pages · 20 internal anchors

[1]

Llm4sr: A survey on large language models for scientific research.arXiv preprint arXiv:2501.04306, 2025

Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, and Xinya Du. Llm4sr: A survey on large language models for scientific research.arXiv preprint arXiv:2501.04306, 2025

work page arXiv 2025
[2]

Highly accurate protein structure pre- diction with alphafold.nature, 596(7873):583–589, 2021

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure pre- diction with alphafold.nature, 596(7873):583–589, 2021

2021
[3]

Agentic ai for scientific discovery: A survey of progress, challenges, and future directions.arXiv preprint arXiv:2503.08979, 2025

Mourad Gridach, Jay Nanavati, Khaldoun Zine El Abidine, Lenon Mendes, and Christina Mack. Agentic ai for scientific discovery: A survey of progress, challenges, and future directions.arXiv preprint arXiv:2503.08979, 2025

work page arXiv 2025
[4]

From ai for science to agentic science: A survey on autonomous scientific discovery.arXiv preprint arXiv:2508.14111, 2025

Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guang- shuai Wang, Zhiqiang Gao, Juntai Cao, et al. From ai for science to agentic science: A survey on autonomous scientific discovery.arXiv preprint arXiv:2508.14111, 2025

work page arXiv 2025
[5]

Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator

Haoxuan Zhang, Ruochi Li, Yang Zhang, Ting Xiao, Jiangping Chen, Junhua Ding, and Haihua Chen. The evolving role of large language models in scientific innovation: Evaluator, collaborator, and scientist.arXiv preprint arXiv:2507.11810, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

From automation to autonomy: A survey on large language models in scientific discovery

Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. From automation to autonomy: A survey on large language models in scientific discovery. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17744–17761, 2025

2025
[7]

Ai for science: Opportunities, challenges, and future directions.Authorea Preprints, 2025

Valerie Fu. Ai for science: Opportunities, challenges, and future directions.Authorea Preprints, 2025

2025
[8]

A process-centric survey of ai for scientific discovery through the exhyte framework

Md Musaddaqul Hasib, Sumin Jo, Harsh Sinha, Jifeng Song, Arun Das, Zhentao Liu, Hugh Galloway, Huey Huang, Kexun Zhang, Shou-Jiang Gao, et al. A process-centric survey of ai for scientific discovery through the exhyte framework. 2025

2025
[9]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Curie: Toward rigorous and automated scientific experimentation with ai agents.arXiv preprint arXiv:2502.16069, 2025

Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srini- vasa, Myungjin Lee, Mosharaf Chowdhury, and Ang Chen. Curie: Toward rigorous and automated scientific experimentation with ai agents.arXiv preprint arXiv:2502.16069, 2025

work page arXiv 2025
[12]

Piflow: Principle-aware scientific discovery with multi-agent col- laboration.arXiv preprint arXiv:2505.15047, 2025

Yingming Pu, Tao Lin, and Hongyu Chen. Piflow: Principle-aware scientific discovery with multi-agent col- laboration.arXiv preprint arXiv:2505.15047, 2025

work page arXiv 2025
[13]

Large language models for automated scholarly paper review: A survey.Information Fusion, 124:103332, 2025

Zhenzhen Zhuang, Jiandong Chen, Hongfeng Xu, Yuwen Jiang, and Jialiang Lin. Large language models for automated scholarly paper review: A survey.Information Fusion, 124:103332, 2025

2025
[14]

A vision for auto research with llm agents.arXiv preprint arXiv:2504.18765, 2025

Chengwei Liu, Chong Wang, Jiayue Cao, Jingquan Ge, Kun Wang, Lyuye Zhang, Ming-Ming Cheng, Peng- hai Zhao, Tianlin Li, Xiaojun Jia, et al. A vision for auto research with llm agents.arXiv preprint arXiv:2504.18765, 2025

work page arXiv 2025
[15]

Litllm: A toolkit for scientific literature review.arXiv preprint arXiv:2402.01788, 2024

Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H Laradji, Krishnamurthy DJ Dvijotham, Jason Stan- ley, Laurent Charlin, and Christopher Pal. Litllm: A toolkit for scientific literature review.arXiv preprint arXiv:2402.01788, 2024

work page arXiv 2024
[16]

Openscholar.https://github.com/AkariAsai/OpenScholar, 2026

GitHub repository. Openscholar.https://github.com/AkariAsai/OpenScholar, 2026

2026
[17]

Language agents achieve superhuman synthesis of scientific knowledge.arXiv preprint arXiv:2409.13740, 2024

Michael D Skarlinski, Sam Cox, Jon M Laurent, James D Braza, Michaela Hinks, Michael J Hammerling, Man- vitha Ponnapati, Samuel G Rodriques, and Andrew D White. Language agents achieve superhuman synthesis of scientific knowledge.arXiv preprint arXiv:2409.13740, 2024

work page arXiv 2024
[18]

Paperqa2.https://github.com/Future-House/paper-qa, 2026

GitHub repository. Paperqa2.https://github.com/Future-House/paper-qa, 2026

2026
[19]

Openhands: An open platform for ai software developers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. InInternational Conference on Learning Representations, volume 2025, pages 65882–65919, 2025. 41 AUTORESEARCHAI

2025
[20]

Aider.https://github.com/Aider-AI/aider, 2026

GitHub repository. Aider.https://github.com/Aider-AI/aider, 2026

2026
[21]

Swe-agent.https://github.com/SWE-agent/SWE-agent, 2026

GitHub repository. Swe-agent.https://github.com/SWE-agent/SWE-agent, 2026

2026
[22]

Agent laboratory.https://github.com/SamuelSchmidgall/ AgentLaboratory, 2026

GitHub repository. Agent laboratory.https://github.com/SamuelSchmidgall/ AgentLaboratory, 2026

2026
[23]

Ai-researcher.https://github.com/HKUDS/AI-Researcher, 2026

GitHub repository. Ai-researcher.https://github.com/HKUDS/AI-Researcher, 2026

2026
[24]

Auto-claude-code-research-in-sleep (aris).https://github.com/wanshuiyin/ Auto-claude-code-research-in-sleep, 2026

GitHub repository. Auto-claude-code-research-in-sleep (aris).https://github.com/wanshuiyin/ Auto-claude-code-research-in-sleep, 2026

2026
[25]

Nanoresearch, 2026

OpenRaiser. Nanoresearch, 2026

2026
[26]

Airs-bench: a suite of tasks for frontier ai research science agents.arXiv preprint arXiv:2602.06855, 2026

Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, et al. Airs-bench: a suite of tasks for frontier ai research science agents.arXiv preprint arXiv:2602.06855, 2026

work page arXiv 2026
[27]

When ai co-scientists fail: Spot-a benchmark for automated verification of scientific research.arXiv preprint arXiv:2505.11855, 2025

Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, et al. When ai co-scientists fail: Spot-a benchmark for automated verification of scientific research.arXiv preprint arXiv:2505.11855, 2025

work page arXiv 2025
[28]

Controla: Agentic workflow control mechanisms for reliable science

Amal Gueroudji, Tanwi Mallick, Renan Souza, Rafael Ferreira Da Silva, Robert Ross, Matthieu Dorier, Philip Carns, Kyle Chard, and Ian Foster. Controla: Agentic workflow control mechanisms for reliable science. In 2025 IEEE International Conference on eScience (eScience), pages 415–426. IEEE, 2025

2025
[29]

CoRR , volume =

Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Jiahui Zhou, Zilan Mao, Zijie Yang, Linyi Yang, et al. How far are ai scientists from changing the world?arXiv preprint arXiv:2507.23276, 2025

work page arXiv 2025
[30]

A survey of ai scientists.arXiv preprint arXiv:2510.23045, 2025

Guiyao Tie, Pan Zhou, and Lichao Sun. A survey of ai scientists.arXiv preprint arXiv:2510.23045, 2025

work page arXiv 2025
[31]

AI4Research:

Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, et al. Ai4research: A survey of artificial intelligence for scientific research.arXiv preprint arXiv:2507.01903, 2025

work page arXiv 2025
[32]

Routledge, 2005

Karl Popper.The logic of scientific discovery. Routledge, 2005

2005
[33]

University of Chicago press Chicago, 1970

Thomas S Kuhn and Ian Hacking.The structure of scientific revolutions, volume 2. University of Chicago press Chicago, 1970

1970
[34]

University of Chicago press, 1973

Robert K Merton.The sociology of science: Theoretical and empirical investigations. University of Chicago press, 1973

1973
[35]

Peer review of gpt-4 technical report and systems card.PLOS digital health, 3(1):e0000417, 2024

Jack Gallifant, Amelia Fiske, Yulia A Levites Strekalova, Juan S Osorio-Valencia, Rachael Parke, Rogers Mwavu, Nicole Martinez, Judy Wawira Gichoya, Marzyeh Ghassemi, Dina Demner-Fushman, et al. Peer review of gpt-4 technical report and systems card.PLOS digital health, 3(1):e0000417, 2024

2024
[36]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

2025
[37]

Openhands.https://github.com/All-Hands-AI/OpenHands, 2026

GitHub repository. Openhands.https://github.com/All-Hands-AI/OpenHands, 2026

2026
[38]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Build your personalized research group: A multiagent framework for continual and interactive science automation.arXiv preprint arXiv:2510.15624, 2025

Ed Li, Junyu Ren, Xintian Pan, Cat Yan, Chuanhao Li, Dirk Bergemann, and Zhuoran Yang. Build your personalized research group: A multiagent framework for continual and interactive science automation.arXiv preprint arXiv:2510.15624, 2025

work page arXiv 2025
[40]

Evaluating sakana’s ai scientist: Bold claims, mixed results, and a promising future? InACM SIGIR Forum, volume 59, pages 1–20

Joeran Beel, Min-Yen Kan, and Moritz Baumgart. Evaluating sakana’s ai scientist: Bold claims, mixed results, and a promising future? InACM SIGIR Forum, volume 59, pages 1–20. ACM New York, NY , USA, 2025

2025
[41]

Can ai conduct autonomous scientific research? case studies on two real-world tasks.bioRxiv, pages 2026–01, 2026

Shreyansh Agrawal, Harsh B Anadkat, Kiran K Athimoolam, Harsh Bhardwaj, Trishul Chowdhury, Shengtao Gao, Purva K Kamat, Vishwadeepsinh Makwana, Mohammed H Shariff, Amitesh Badkul, et al. Can ai conduct autonomous scientific research? case studies on two real-world tasks.bioRxiv, pages 2026–01, 2026

2026
[42]

The more you automate, the less you see: Hidden pitfalls of ai scientist systems.arXiv preprint arXiv:2509.08713, 2025

Ziming Luo, Atoosa Kasirzadeh, and Nihar B Shah. The more you automate, the less you see: Hidden pitfalls of ai scientist systems.arXiv preprint arXiv:2509.08713, 2025

work page arXiv 2025
[43]

Autonomous ‘self-driving’laboratories: a review of technology and policy implications.Royal Society Open Science, 12(7):250646, 2025

Alexander V Tobias and Adam Wahab. Autonomous ‘self-driving’laboratories: a review of technology and policy implications.Royal Society Open Science, 12(7):250646, 2025. 42 AUTORESEARCHAI

2025
[44]

Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

2024
[45]

Ai-researcher: Autonomous scientific innovation

Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation. Advances in Neural Information Processing Systems, 38:9481–9520, 2026

2026
[46]

Perspective on utilizing foundation models for laboratory automation in materials research.Science and Technology of Advanced Materials: Methods, 5(1):2582379, 2025

Kan Hatakeyama-Sato, Toshihiko Nishida, Kenta Kitamura, Yoshitaka Ushiku, Koichi Takahashi, Yuta Nabae, and Teruaki Hayakawa. Perspective on utilizing foundation models for laboratory automation in materials research.Science and Technology of Advanced Materials: Methods, 5(1):2582379, 2025

2025
[47]

Agents for self-driving laboratories applied to quantum computing.arXiv preprint arXiv:2412.07978, 2024

Shuxiang Cao, Zijian Zhang, Mohammed Alghadeer, Simone D Fasciati, Michele Piscitelli, Mustafa Bakr, Peter Leek, and Alán Aspuru-Guzik. Agents for self-driving laboratories applied to quantum computing.arXiv preprint arXiv:2412.07978, 2024

work page arXiv 2024
[48]

Hy- per: Literature-grounded hypothesis generation and distillation with provenance

Rosni Vasu, Chandrayee Basu, Bhavana Dalvi, Cristina Sarasua, Peter Clark, and Abraham Bernstein. Hy- per: Literature-grounded hypothesis generation and distillation with provenance. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25424–25449, 2025

2025
[49]

Bioplanner: automatic evaluation of llms on protocol planning in biology

Odhran O’Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Ghareeb, and Samuel Ro- driques. Bioplanner: automatic evaluation of llms on protocol planning in biology. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2676–2694, 2023

2023
[50]

Researchagent: Iterative research idea generation over scientific literature with large language models

Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pag...

2025
[51]

Omniscientist: Toward a co-evolving ecosystem of human and ai scientists

Chenyang Shao, Dehao Huang, Yu Li, Keyu Zhao, Weiquan Lin, Yining Zhang, Qingbin Zeng, Zhiyu Chen, Tianxing Li, Yifei Huang, et al. Omniscientist: Toward a co-evolving ecosystem of human and ai scientists. arXiv preprint arXiv:2511.16931, 2025

work page arXiv 2025
[52]

Codescientist: End-to-end semi-automated scien- tific discovery with code-based experimentation

Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bod- hisattwa Prasad Majumder, Daniel S Weld, and Peter Clark. Codescientist: End-to-end semi-automated scien- tific discovery with code-based experimentation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 13370–13467, 2025

2025
[53]

Autonomous Agents for Scientific Discovery: Orchestrating Scientists, Language, Code, and Physics

Lianhao Zhou, Hongyi Ling, Cong Fu, Yepeng Huang, Michael Sun, Wendi Yu, Xiaoxuan Wang, Xiner Li, Xingyu Su, Junkai Zhang, et al. Autonomous agents for scientific discovery: Orchestrating scientists, language, code, and physics.arXiv preprint arXiv:2510.09901, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Auto-bench: An automated benchmark for scientific discovery in llms.arXiv preprint arXiv:2502.15224, 2025

Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, and Dianbo Liu. Auto-bench: An automated benchmark for scientific discovery in llms.arXiv preprint arXiv:2502.15224, 2025

work page arXiv 2025
[55]

Biodsa-1k: Benchmarking data science agents for biomedical research.arXiv preprint arXiv:2505.16100, 2025

Zifeng Wang, Benjamin Danek, and Jimeng Sun. Biodsa-1k: Benchmarking data science agents for biomedical research.arXiv preprint arXiv:2505.16100, 2025

work page arXiv 2025
[56]

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, and Dongzhan Zhou. Researchbench: Benchmarking llms in scientific discovery via inspiration- based task decomposition.arXiv preprint arXiv:2503.21248, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Researcherbench: Evaluating deep ai research systems on the frontiers of scientific inquiry.arXiv preprint arXiv:2507.16280, 2025

Tianze Xu, Pengrui Lu, Lyumanshan Ye, Xiangkun Hu, and Pengfei Liu. Researcherbench: Evaluating deep ai research systems on the frontiers of scientific inquiry.arXiv preprint arXiv:2507.16280, 2025

work page arXiv 2025
[58]

autoresearch.https://github.com/karpathy/autoresearch, 2026

GitHub repository. autoresearch.https://github.com/karpathy/autoresearch, 2026

2026
[59]

Deerflow.https://github.com/bytedance/deer-flow, 2026

GitHub repository. Deerflow.https://github.com/bytedance/deer-flow, 2026

2026
[60]

Open deep research.https://github.com/langchain-ai/open_deep_ research, 2026

GitHub repository. Open deep research.https://github.com/langchain-ai/open_deep_ research, 2026

2026
[61]

Automated scientific discov- ery: from equation discovery to autonomous discovery systems.Machine Learning, 115(5):109, 2026

Stefan Kramer, Mattia Cerrato, Jannis Brugger, Sašo Džeroski, and Ross D King. Automated scientific discov- ery: from equation discovery to autonomous discovery systems.Machine Learning, 115(5):109, 2026

2026
[62]

Agent4s: The transformation of research paradigms from the perspective of large language models.arXiv preprint arXiv:2506.23692, 2025

Boyuan Zheng, Zerui Fang, Zhe Xu, Rui Wang, Yiwen Chen, Cunshi Wang, Mengwei Qu, Lei Lei, Zhen Feng, Yan Liu, et al. Agent4s: The transformation of research paradigms from the perspective of large language models.arXiv preprint arXiv:2506.23692, 2025

work page arXiv 2025
[63]

Columbia university press, 1963

Derek J De Solla Price.Little science, big science. Columbia university press, 1963

1963
[64]

Functional genomic hypothesis generation and experimentation by a robot scientist.Nature, 427(6971):247–252, 2004

Ross D King, Kenneth E Whelan, Ffion M Jones, Philip GK Reiser, Christopher H Bryant, Stephen H Muggle- ton, Douglas B Kell, and Stephen G Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist.Nature, 427(6971):247–252, 2004. 43 AUTORESEARCHAI

2004
[65]

Ai feynman: A physics-inspired method for symbolic regression

Silviu-Marian Udrescu and Max Tegmark. Ai feynman: A physics-inspired method for symbolic regression. Science advances, 6(16):eaay2631, 2020

2020
[66]

Storm.https://github.com/stanford-oval/storm, 2026

GitHub repository. Storm.https://github.com/stanford-oval/storm, 2026

2026
[67]

Scis- age: A multi-agent framework for high-quality scientific survey generation.arXiv preprint arXiv:2506.12689, 2025

Xiaofeng Shi, Qian Kou, Yuduo Li, Ning Tang, Jinxin Xie, Longbin Yu, Songjing Wang, and Hua Zhou. Scis- age: A multi-agent framework for high-quality scientific survey generation.arXiv preprint arXiv:2506.12689, 2025

work page arXiv 2025
[68]

Deep research arena: The first exam of llms’ research abilities via seminar-grounded tasks

Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, et al. Deep research arena: The first exam of llms’ research abilities via seminar-grounded tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33341–33349, 2026

2026
[69]

Gpt researcher.https://github.com/assafelovic/gpt-researcher, 2026

GitHub repository. Gpt researcher.https://github.com/assafelovic/gpt-researcher, 2026

2026
[70]

Tongyi deepresearch.https://github.com/Alibaba-NLP/DeepResearch, 2026

GitHub repository. Tongyi deepresearch.https://github.com/Alibaba-NLP/DeepResearch, 2026

2026
[71]

Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

2023
[72]

An autonomous laboratory for the accelerated synthesis of inorganic materials.Nature, 624(7990):86, 2023

Nathan J Szymanski, Bernardus Rendy, Yuxing Fei, Rishi E Kumar, Tanjin He, David Milsted, Matthew J Mc- Dermott, Max Gallant, Ekin Dogus Cubuk, Amil Merchant, et al. An autonomous laboratory for the accelerated synthesis of inorganic materials.Nature, 624(7990):86, 2023

2023
[73]

Cy- cleresearcher: Improving automated research via automated review

Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cy- cleresearcher: Improving automated research via automated review. InInternational Conference on Learning Representations, volume 2025, pages 3669–3709, 2025

2025
[74]

Mlr-copilot: Autonomous machine learning research based on large language models agents.arXiv preprint arXiv:2408.14033, 2024

Ruochen Li, Teerth Patel, Qingyun Wang, and Xinya Du. Mlr-copilot: Autonomous machine learning research based on large language models agents.arXiv preprint arXiv:2408.14033, 2024

work page arXiv 2024
[75]

R&d-agent: An llm-agent framework towards autonomous data science, 2025

Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, et al. R&d-agent: An llm-agent framework towards autonomous data science, 2025

2025
[76]

Aigs: Generating science from ai-powered automated falsification.arXiv preprint arXiv:2411.11910, 2024

Zijun Liu, Kaiming Liu, Yiqi Zhu, Xuanyu Lei, Zonghan Yang, Zhenhe Zhang, Peng Li, and Yang Liu. Aigs: Generating science from ai-powered automated falsification.arXiv preprint arXiv:2411.11910, 2024

work page arXiv 2024
[77]

The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

2025
[78]

Sciagents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials, 37(22):2413523, 2025

Alireza Ghafarollahi and Markus J Buehler. Sciagents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials, 37(22):2413523, 2025

2025
[79]

Sciscigpt: advancing human–ai collaboration in the science of science.Nature Computational Science, 6(3):301–315, 2026

Erzhuo Shao, Yifang Wang, Yifan Qian, Zhenyu Pan, Han Liu, and Dashun Wang. Sciscigpt: advancing human–ai collaboration in the science of science.Nature Computational Science, 6(3):301–315, 2026

2026
[80]

Szostkiewicz, Dmytro Shved, Gavin J

Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Dmytro Shved, Gavin J. Gyimesi, Jon M. Laurent, Samantha M. Wright, Muhammed T. Razzak, Andrew D. White, Silvia C. Finnemann, Michaela M. Hinks, and Samuel G. Rodriques. A multi-agent system for automating scientific discovery.Nature, May 2026

2026

Showing first 80 references.

[1] [1]

Llm4sr: A survey on large language models for scientific research.arXiv preprint arXiv:2501.04306, 2025

Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, and Xinya Du. Llm4sr: A survey on large language models for scientific research.arXiv preprint arXiv:2501.04306, 2025

work page arXiv 2025

[2] [2]

Highly accurate protein structure pre- diction with alphafold.nature, 596(7873):583–589, 2021

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure pre- diction with alphafold.nature, 596(7873):583–589, 2021

2021

[3] [3]

Agentic ai for scientific discovery: A survey of progress, challenges, and future directions.arXiv preprint arXiv:2503.08979, 2025

Mourad Gridach, Jay Nanavati, Khaldoun Zine El Abidine, Lenon Mendes, and Christina Mack. Agentic ai for scientific discovery: A survey of progress, challenges, and future directions.arXiv preprint arXiv:2503.08979, 2025

work page arXiv 2025

[4] [4]

From ai for science to agentic science: A survey on autonomous scientific discovery.arXiv preprint arXiv:2508.14111, 2025

Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guang- shuai Wang, Zhiqiang Gao, Juntai Cao, et al. From ai for science to agentic science: A survey on autonomous scientific discovery.arXiv preprint arXiv:2508.14111, 2025

work page arXiv 2025

[5] [5]

Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator

Haoxuan Zhang, Ruochi Li, Yang Zhang, Ting Xiao, Jiangping Chen, Junhua Ding, and Haihua Chen. The evolving role of large language models in scientific innovation: Evaluator, collaborator, and scientist.arXiv preprint arXiv:2507.11810, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

From automation to autonomy: A survey on large language models in scientific discovery

Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. From automation to autonomy: A survey on large language models in scientific discovery. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17744–17761, 2025

2025

[7] [7]

Ai for science: Opportunities, challenges, and future directions.Authorea Preprints, 2025

Valerie Fu. Ai for science: Opportunities, challenges, and future directions.Authorea Preprints, 2025

2025

[8] [8]

A process-centric survey of ai for scientific discovery through the exhyte framework

Md Musaddaqul Hasib, Sumin Jo, Harsh Sinha, Jifeng Song, Arun Das, Zhentao Liu, Hugh Galloway, Huey Huang, Kexun Zhang, Shou-Jiang Gao, et al. A process-centric survey of ai for scientific discovery through the exhyte framework. 2025

2025

[9] [9]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Curie: Toward rigorous and automated scientific experimentation with ai agents.arXiv preprint arXiv:2502.16069, 2025

Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srini- vasa, Myungjin Lee, Mosharaf Chowdhury, and Ang Chen. Curie: Toward rigorous and automated scientific experimentation with ai agents.arXiv preprint arXiv:2502.16069, 2025

work page arXiv 2025

[12] [12]

Piflow: Principle-aware scientific discovery with multi-agent col- laboration.arXiv preprint arXiv:2505.15047, 2025

Yingming Pu, Tao Lin, and Hongyu Chen. Piflow: Principle-aware scientific discovery with multi-agent col- laboration.arXiv preprint arXiv:2505.15047, 2025

work page arXiv 2025

[13] [13]

Large language models for automated scholarly paper review: A survey.Information Fusion, 124:103332, 2025

Zhenzhen Zhuang, Jiandong Chen, Hongfeng Xu, Yuwen Jiang, and Jialiang Lin. Large language models for automated scholarly paper review: A survey.Information Fusion, 124:103332, 2025

2025

[14] [14]

A vision for auto research with llm agents.arXiv preprint arXiv:2504.18765, 2025

Chengwei Liu, Chong Wang, Jiayue Cao, Jingquan Ge, Kun Wang, Lyuye Zhang, Ming-Ming Cheng, Peng- hai Zhao, Tianlin Li, Xiaojun Jia, et al. A vision for auto research with llm agents.arXiv preprint arXiv:2504.18765, 2025

work page arXiv 2025

[15] [15]

Litllm: A toolkit for scientific literature review.arXiv preprint arXiv:2402.01788, 2024

Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H Laradji, Krishnamurthy DJ Dvijotham, Jason Stan- ley, Laurent Charlin, and Christopher Pal. Litllm: A toolkit for scientific literature review.arXiv preprint arXiv:2402.01788, 2024

work page arXiv 2024

[16] [16]

Openscholar.https://github.com/AkariAsai/OpenScholar, 2026

GitHub repository. Openscholar.https://github.com/AkariAsai/OpenScholar, 2026

2026

[17] [17]

Language agents achieve superhuman synthesis of scientific knowledge.arXiv preprint arXiv:2409.13740, 2024

Michael D Skarlinski, Sam Cox, Jon M Laurent, James D Braza, Michaela Hinks, Michael J Hammerling, Man- vitha Ponnapati, Samuel G Rodriques, and Andrew D White. Language agents achieve superhuman synthesis of scientific knowledge.arXiv preprint arXiv:2409.13740, 2024

work page arXiv 2024

[18] [18]

Paperqa2.https://github.com/Future-House/paper-qa, 2026

GitHub repository. Paperqa2.https://github.com/Future-House/paper-qa, 2026

2026

[19] [19]

Openhands: An open platform for ai software developers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. InInternational Conference on Learning Representations, volume 2025, pages 65882–65919, 2025. 41 AUTORESEARCHAI

2025

[20] [20]

Aider.https://github.com/Aider-AI/aider, 2026

GitHub repository. Aider.https://github.com/Aider-AI/aider, 2026

2026

[21] [21]

Swe-agent.https://github.com/SWE-agent/SWE-agent, 2026

GitHub repository. Swe-agent.https://github.com/SWE-agent/SWE-agent, 2026

2026

[22] [22]

Agent laboratory.https://github.com/SamuelSchmidgall/ AgentLaboratory, 2026

GitHub repository. Agent laboratory.https://github.com/SamuelSchmidgall/ AgentLaboratory, 2026

2026

[23] [23]

Ai-researcher.https://github.com/HKUDS/AI-Researcher, 2026

GitHub repository. Ai-researcher.https://github.com/HKUDS/AI-Researcher, 2026

2026

[24] [24]

Auto-claude-code-research-in-sleep (aris).https://github.com/wanshuiyin/ Auto-claude-code-research-in-sleep, 2026

GitHub repository. Auto-claude-code-research-in-sleep (aris).https://github.com/wanshuiyin/ Auto-claude-code-research-in-sleep, 2026

2026

[25] [25]

Nanoresearch, 2026

OpenRaiser. Nanoresearch, 2026

2026

[26] [26]

Airs-bench: a suite of tasks for frontier ai research science agents.arXiv preprint arXiv:2602.06855, 2026

Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, et al. Airs-bench: a suite of tasks for frontier ai research science agents.arXiv preprint arXiv:2602.06855, 2026

work page arXiv 2026

[27] [27]

When ai co-scientists fail: Spot-a benchmark for automated verification of scientific research.arXiv preprint arXiv:2505.11855, 2025

Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, et al. When ai co-scientists fail: Spot-a benchmark for automated verification of scientific research.arXiv preprint arXiv:2505.11855, 2025

work page arXiv 2025

[28] [28]

Controla: Agentic workflow control mechanisms for reliable science

Amal Gueroudji, Tanwi Mallick, Renan Souza, Rafael Ferreira Da Silva, Robert Ross, Matthieu Dorier, Philip Carns, Kyle Chard, and Ian Foster. Controla: Agentic workflow control mechanisms for reliable science. In 2025 IEEE International Conference on eScience (eScience), pages 415–426. IEEE, 2025

2025

[29] [29]

CoRR , volume =

Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Jiahui Zhou, Zilan Mao, Zijie Yang, Linyi Yang, et al. How far are ai scientists from changing the world?arXiv preprint arXiv:2507.23276, 2025

work page arXiv 2025

[30] [30]

A survey of ai scientists.arXiv preprint arXiv:2510.23045, 2025

Guiyao Tie, Pan Zhou, and Lichao Sun. A survey of ai scientists.arXiv preprint arXiv:2510.23045, 2025

work page arXiv 2025

[31] [31]

AI4Research:

Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, et al. Ai4research: A survey of artificial intelligence for scientific research.arXiv preprint arXiv:2507.01903, 2025

work page arXiv 2025

[32] [32]

Routledge, 2005

Karl Popper.The logic of scientific discovery. Routledge, 2005

2005

[33] [33]

University of Chicago press Chicago, 1970

Thomas S Kuhn and Ian Hacking.The structure of scientific revolutions, volume 2. University of Chicago press Chicago, 1970

1970

[34] [34]

University of Chicago press, 1973

Robert K Merton.The sociology of science: Theoretical and empirical investigations. University of Chicago press, 1973

1973

[35] [35]

Peer review of gpt-4 technical report and systems card.PLOS digital health, 3(1):e0000417, 2024

Jack Gallifant, Amelia Fiske, Yulia A Levites Strekalova, Juan S Osorio-Valencia, Rachael Parke, Rogers Mwavu, Nicole Martinez, Judy Wawira Gichoya, Marzyeh Ghassemi, Dina Demner-Fushman, et al. Peer review of gpt-4 technical report and systems card.PLOS digital health, 3(1):e0000417, 2024

2024

[36] [36]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

2025

[37] [37]

Openhands.https://github.com/All-Hands-AI/OpenHands, 2026

GitHub repository. Openhands.https://github.com/All-Hands-AI/OpenHands, 2026

2026

[38] [38]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Build your personalized research group: A multiagent framework for continual and interactive science automation.arXiv preprint arXiv:2510.15624, 2025

Ed Li, Junyu Ren, Xintian Pan, Cat Yan, Chuanhao Li, Dirk Bergemann, and Zhuoran Yang. Build your personalized research group: A multiagent framework for continual and interactive science automation.arXiv preprint arXiv:2510.15624, 2025

work page arXiv 2025

[40] [40]

Evaluating sakana’s ai scientist: Bold claims, mixed results, and a promising future? InACM SIGIR Forum, volume 59, pages 1–20

Joeran Beel, Min-Yen Kan, and Moritz Baumgart. Evaluating sakana’s ai scientist: Bold claims, mixed results, and a promising future? InACM SIGIR Forum, volume 59, pages 1–20. ACM New York, NY , USA, 2025

2025

[41] [41]

Can ai conduct autonomous scientific research? case studies on two real-world tasks.bioRxiv, pages 2026–01, 2026

Shreyansh Agrawal, Harsh B Anadkat, Kiran K Athimoolam, Harsh Bhardwaj, Trishul Chowdhury, Shengtao Gao, Purva K Kamat, Vishwadeepsinh Makwana, Mohammed H Shariff, Amitesh Badkul, et al. Can ai conduct autonomous scientific research? case studies on two real-world tasks.bioRxiv, pages 2026–01, 2026

2026

[42] [42]

The more you automate, the less you see: Hidden pitfalls of ai scientist systems.arXiv preprint arXiv:2509.08713, 2025

Ziming Luo, Atoosa Kasirzadeh, and Nihar B Shah. The more you automate, the less you see: Hidden pitfalls of ai scientist systems.arXiv preprint arXiv:2509.08713, 2025

work page arXiv 2025

[43] [43]

Autonomous ‘self-driving’laboratories: a review of technology and policy implications.Royal Society Open Science, 12(7):250646, 2025

Alexander V Tobias and Adam Wahab. Autonomous ‘self-driving’laboratories: a review of technology and policy implications.Royal Society Open Science, 12(7):250646, 2025. 42 AUTORESEARCHAI

2025

[44] [44]

Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

2024

[45] [45]

Ai-researcher: Autonomous scientific innovation

Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation. Advances in Neural Information Processing Systems, 38:9481–9520, 2026

2026

[46] [46]

Perspective on utilizing foundation models for laboratory automation in materials research.Science and Technology of Advanced Materials: Methods, 5(1):2582379, 2025

Kan Hatakeyama-Sato, Toshihiko Nishida, Kenta Kitamura, Yoshitaka Ushiku, Koichi Takahashi, Yuta Nabae, and Teruaki Hayakawa. Perspective on utilizing foundation models for laboratory automation in materials research.Science and Technology of Advanced Materials: Methods, 5(1):2582379, 2025

2025

[47] [47]

Agents for self-driving laboratories applied to quantum computing.arXiv preprint arXiv:2412.07978, 2024

Shuxiang Cao, Zijian Zhang, Mohammed Alghadeer, Simone D Fasciati, Michele Piscitelli, Mustafa Bakr, Peter Leek, and Alán Aspuru-Guzik. Agents for self-driving laboratories applied to quantum computing.arXiv preprint arXiv:2412.07978, 2024

work page arXiv 2024

[48] [48]

Hy- per: Literature-grounded hypothesis generation and distillation with provenance

Rosni Vasu, Chandrayee Basu, Bhavana Dalvi, Cristina Sarasua, Peter Clark, and Abraham Bernstein. Hy- per: Literature-grounded hypothesis generation and distillation with provenance. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25424–25449, 2025

2025

[49] [49]

Bioplanner: automatic evaluation of llms on protocol planning in biology

Odhran O’Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Ghareeb, and Samuel Ro- driques. Bioplanner: automatic evaluation of llms on protocol planning in biology. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2676–2694, 2023

2023

[50] [50]

Researchagent: Iterative research idea generation over scientific literature with large language models

Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pag...

2025

[51] [51]

Omniscientist: Toward a co-evolving ecosystem of human and ai scientists

Chenyang Shao, Dehao Huang, Yu Li, Keyu Zhao, Weiquan Lin, Yining Zhang, Qingbin Zeng, Zhiyu Chen, Tianxing Li, Yifei Huang, et al. Omniscientist: Toward a co-evolving ecosystem of human and ai scientists. arXiv preprint arXiv:2511.16931, 2025

work page arXiv 2025

[52] [52]

Codescientist: End-to-end semi-automated scien- tific discovery with code-based experimentation

Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bod- hisattwa Prasad Majumder, Daniel S Weld, and Peter Clark. Codescientist: End-to-end semi-automated scien- tific discovery with code-based experimentation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 13370–13467, 2025

2025

[53] [53]

Autonomous Agents for Scientific Discovery: Orchestrating Scientists, Language, Code, and Physics

Lianhao Zhou, Hongyi Ling, Cong Fu, Yepeng Huang, Michael Sun, Wendi Yu, Xiaoxuan Wang, Xiner Li, Xingyu Su, Junkai Zhang, et al. Autonomous agents for scientific discovery: Orchestrating scientists, language, code, and physics.arXiv preprint arXiv:2510.09901, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

Auto-bench: An automated benchmark for scientific discovery in llms.arXiv preprint arXiv:2502.15224, 2025

Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, and Dianbo Liu. Auto-bench: An automated benchmark for scientific discovery in llms.arXiv preprint arXiv:2502.15224, 2025

work page arXiv 2025

[55] [55]

Biodsa-1k: Benchmarking data science agents for biomedical research.arXiv preprint arXiv:2505.16100, 2025

Zifeng Wang, Benjamin Danek, and Jimeng Sun. Biodsa-1k: Benchmarking data science agents for biomedical research.arXiv preprint arXiv:2505.16100, 2025

work page arXiv 2025

[56] [56]

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, and Dongzhan Zhou. Researchbench: Benchmarking llms in scientific discovery via inspiration- based task decomposition.arXiv preprint arXiv:2503.21248, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[57] [57]

Researcherbench: Evaluating deep ai research systems on the frontiers of scientific inquiry.arXiv preprint arXiv:2507.16280, 2025

Tianze Xu, Pengrui Lu, Lyumanshan Ye, Xiangkun Hu, and Pengfei Liu. Researcherbench: Evaluating deep ai research systems on the frontiers of scientific inquiry.arXiv preprint arXiv:2507.16280, 2025

work page arXiv 2025

[58] [58]

autoresearch.https://github.com/karpathy/autoresearch, 2026

GitHub repository. autoresearch.https://github.com/karpathy/autoresearch, 2026

2026

[59] [59]

Deerflow.https://github.com/bytedance/deer-flow, 2026

GitHub repository. Deerflow.https://github.com/bytedance/deer-flow, 2026

2026

[60] [60]

Open deep research.https://github.com/langchain-ai/open_deep_ research, 2026

GitHub repository. Open deep research.https://github.com/langchain-ai/open_deep_ research, 2026

2026

[61] [61]

Automated scientific discov- ery: from equation discovery to autonomous discovery systems.Machine Learning, 115(5):109, 2026

Stefan Kramer, Mattia Cerrato, Jannis Brugger, Sašo Džeroski, and Ross D King. Automated scientific discov- ery: from equation discovery to autonomous discovery systems.Machine Learning, 115(5):109, 2026

2026

[62] [62]

Agent4s: The transformation of research paradigms from the perspective of large language models.arXiv preprint arXiv:2506.23692, 2025

Boyuan Zheng, Zerui Fang, Zhe Xu, Rui Wang, Yiwen Chen, Cunshi Wang, Mengwei Qu, Lei Lei, Zhen Feng, Yan Liu, et al. Agent4s: The transformation of research paradigms from the perspective of large language models.arXiv preprint arXiv:2506.23692, 2025

work page arXiv 2025

[63] [63]

Columbia university press, 1963

Derek J De Solla Price.Little science, big science. Columbia university press, 1963

1963

[64] [64]

Functional genomic hypothesis generation and experimentation by a robot scientist.Nature, 427(6971):247–252, 2004

Ross D King, Kenneth E Whelan, Ffion M Jones, Philip GK Reiser, Christopher H Bryant, Stephen H Muggle- ton, Douglas B Kell, and Stephen G Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist.Nature, 427(6971):247–252, 2004. 43 AUTORESEARCHAI

2004

[65] [65]

Ai feynman: A physics-inspired method for symbolic regression

Silviu-Marian Udrescu and Max Tegmark. Ai feynman: A physics-inspired method for symbolic regression. Science advances, 6(16):eaay2631, 2020

2020

[66] [66]

Storm.https://github.com/stanford-oval/storm, 2026

GitHub repository. Storm.https://github.com/stanford-oval/storm, 2026

2026

[67] [67]

Scis- age: A multi-agent framework for high-quality scientific survey generation.arXiv preprint arXiv:2506.12689, 2025

Xiaofeng Shi, Qian Kou, Yuduo Li, Ning Tang, Jinxin Xie, Longbin Yu, Songjing Wang, and Hua Zhou. Scis- age: A multi-agent framework for high-quality scientific survey generation.arXiv preprint arXiv:2506.12689, 2025

work page arXiv 2025

[68] [68]

Deep research arena: The first exam of llms’ research abilities via seminar-grounded tasks

Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, et al. Deep research arena: The first exam of llms’ research abilities via seminar-grounded tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33341–33349, 2026

2026

[69] [69]

Gpt researcher.https://github.com/assafelovic/gpt-researcher, 2026

GitHub repository. Gpt researcher.https://github.com/assafelovic/gpt-researcher, 2026

2026

[70] [70]

Tongyi deepresearch.https://github.com/Alibaba-NLP/DeepResearch, 2026

GitHub repository. Tongyi deepresearch.https://github.com/Alibaba-NLP/DeepResearch, 2026

2026

[71] [71]

Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

2023

[72] [72]

An autonomous laboratory for the accelerated synthesis of inorganic materials.Nature, 624(7990):86, 2023

Nathan J Szymanski, Bernardus Rendy, Yuxing Fei, Rishi E Kumar, Tanjin He, David Milsted, Matthew J Mc- Dermott, Max Gallant, Ekin Dogus Cubuk, Amil Merchant, et al. An autonomous laboratory for the accelerated synthesis of inorganic materials.Nature, 624(7990):86, 2023

2023

[73] [73]

Cy- cleresearcher: Improving automated research via automated review

Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cy- cleresearcher: Improving automated research via automated review. InInternational Conference on Learning Representations, volume 2025, pages 3669–3709, 2025

2025

[74] [74]

Mlr-copilot: Autonomous machine learning research based on large language models agents.arXiv preprint arXiv:2408.14033, 2024

Ruochen Li, Teerth Patel, Qingyun Wang, and Xinya Du. Mlr-copilot: Autonomous machine learning research based on large language models agents.arXiv preprint arXiv:2408.14033, 2024

work page arXiv 2024

[75] [75]

R&d-agent: An llm-agent framework towards autonomous data science, 2025

Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, et al. R&d-agent: An llm-agent framework towards autonomous data science, 2025

2025

[76] [76]

Aigs: Generating science from ai-powered automated falsification.arXiv preprint arXiv:2411.11910, 2024

Zijun Liu, Kaiming Liu, Yiqi Zhu, Xuanyu Lei, Zonghan Yang, Zhenhe Zhang, Peng Li, and Yang Liu. Aigs: Generating science from ai-powered automated falsification.arXiv preprint arXiv:2411.11910, 2024

work page arXiv 2024

[77] [77]

The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

2025

[78] [78]

Sciagents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials, 37(22):2413523, 2025

Alireza Ghafarollahi and Markus J Buehler. Sciagents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials, 37(22):2413523, 2025

2025

[79] [79]

Sciscigpt: advancing human–ai collaboration in the science of science.Nature Computational Science, 6(3):301–315, 2026

Erzhuo Shao, Yifang Wang, Yifan Qian, Zhenyu Pan, Han Liu, and Dashun Wang. Sciscigpt: advancing human–ai collaboration in the science of science.Nature Computational Science, 6(3):301–315, 2026

2026

[80] [80]

Szostkiewicz, Dmytro Shved, Gavin J

Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Dmytro Shved, Gavin J. Gyimesi, Jon M. Laurent, Samantha M. Wright, Muhammed T. Razzak, Andrew D. White, Silvia C. Finnemann, Michaela M. Hinks, and Samuel G. Rodriques. A multi-agent system for automating scientific discovery.Nature, May 2026

2026