pith. sign in

arxiv: 2606.12736 · v1 · pith:77NRCAIXnew · submitted 2026-06-10 · 💻 cs.AI · cs.LG

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Pith reviewed 2026-06-27 09:32 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords AI agentsscientific discoverybenchmarkSciAgentArenadata analysisnovel insightsautonomyopen-ended research
0
0 comments X

The pith

Current AI agents contribute effectively to structured scientific data analysis but struggle to generate novel insights or handle open-ended research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SciAgentArena, a benchmark of roughly 200 tasks drawn from multiple scientific domains, to test AI agents in interactive, stepwise-verified research scenarios. It establishes that agents perform reliably on well-specified data-analysis workflows with clear evaluation criteria, yet their output remains limited when tasks require original ideas, independent exploration, or solutions to loosely defined problems. A reader would care because the work isolates where agents can already assist real laboratory and analysis pipelines and where they still fall short of autonomous scientific contribution.

Core claim

SciAgentArena supplies an interactive, agent-agnostic environment that shows current agents can support well-specified data-analysis workflows when task structure and success metrics are explicit, but performance drops sharply on tasks demanding genuinely novel insights, sustained self-directed exploration, or robust answers to open-ended scientific questions.

What carries the argument

SciAgentArena, a benchmark of approximately 200 tasks equipped with stepwise verification inside an interactive environment that supports diverse agents without favoring any single architecture.

If this is right

  • Agents can already be integrated into pipelines that perform well-specified data analysis with explicit criteria.
  • Common failure modes across agents have been catalogued, supplying concrete targets for reliability and autonomy improvements.
  • The benchmark itself supplies a repeatable method for tracking whether future agents close the gap on novelty and self-directed work.
  • Design of new agents can now be guided by the observed contrast between structured and open-ended scientific tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The multi-domain construction of the benchmark makes it possible to test whether failure patterns are consistent across fields or remain domain-specific.
  • Hybrid agent designs that combine structured analysis modules with separate hypothesis-generation modules could be evaluated directly against the same task set.
  • Extending the benchmark with longer-horizon tasks that span weeks of simulated work would expose whether current limits on self-direction persist at realistic research timescales.

Load-bearing premise

The roughly 200 tasks chosen for SciAgentArena adequately represent the full range of complexity, heterogeneity, and extended reasoning found in actual scientific research.

What would settle it

An agent that produces and verifies a genuinely novel scientific result on a set of tasks constructed independently of the original 200 would contradict the reported limits on novelty and open-ended reasoning.

Figures

Figures reproduced from arXiv: 2606.12736 by Ada Fang, Allen Xin Wang, Antonia Panescu, Arman Cohan, Botao Yu, Haoran Shao, Hongyu Zhao, Hua Xu, James Zou, Jihang Chen, Kaize Ding, Kunyang Sun, Leqi Xu, Lingzhou Xue, Lisa Xinyi Chen, Marinka Zitnik, Qingyu Chen, Rex Ying, Sihan Jiang, Siyi Gu, Siyu Chen, Tianyu Liu, Wangjie Zheng, Wengong Jin, Wenxin Long, Xinyang Hu, Xinyu Wei, Yuanqi Du, Yueqian Jing, Zhiyuan Cao, Zhuoran Yang, Ziqing Wang, Ziyao Zeng.

Figure 1
Figure 1. Figure 1: Overview of SciAgentArena. (a) Limitations of the current AI agent benchmark. (b) Lim [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Key AI agent capacities and benchmarking platform development. (a) Categories of scien [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Summarizing AI Agent performances on the drug-discovery domain. (a) Performances on [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Summarizing AI Agent performances in cell and tissue domains. (a)-(g): single-cell omics; [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Summary of AI agent performance on EHR-based clinical and statistical genetics tasks. (a) [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Summary of our benchmarking studies across different fields: Error types, sources, error [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
read the original abstract

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. Full codes, tasks, and datasets can be accessed via this link: https://sciagentarena.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SciAgentArena, a benchmark with approximately 200 tasks drawn from multiple scientific domains, featuring stepwise verification and an interactive agent-agnostic environment. It evaluates current AI agents and claims they perform effectively on well-specified data-analysis workflows with clear structure and criteria, but exhibit uneven results overall, struggling to generate novel insights, sustain self-directed exploration, or solve open-ended research questions. The work also identifies common failure modes and opportunities for improving agent reliability and scientific reasoning.

Significance. If the benchmark tasks are shown to be faithful proxies for real scientific complexity and extended reasoning, SciAgentArena would offer a practical, reproducible framework for tracking progress in AI for science and guiding agent design. The provision of full codes, tasks, and datasets is a strength that supports reproducibility.

major comments (2)
  1. [abstract and task construction section] Task construction and validation (abstract and §3): The central claim that performance gaps reflect agent limitations rather than benchmark artifacts rests on the ~200 tasks capturing 'complexity, heterogeneity, and extended reasoning.' No concrete criteria for domain sampling, expert validation of realism, or quantitative checks confirming sustained multi-step reasoning chains (vs. static problems) are supplied, undermining the diagnostic value of the reported uneven performance.
  2. [results and evaluation sections] Results and evaluation protocols (§4 and §5): The abstract states high-level findings on agent performance without reporting specific metrics, statistical tests, agent architectures/details, or evaluation protocols. This prevents assessment of whether the data-analysis success vs. open-ended failure distinction is robust or sensitive to task selection.
minor comments (1)
  1. [abstract] The link to codes/tasks/datasets is provided but the manuscript should include a brief summary table of task categories, domains, and verification steps for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of SciAgentArena. We respond to each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [abstract and task construction section] Task construction and validation (abstract and §3): The central claim that performance gaps reflect agent limitations rather than benchmark artifacts rests on the ~200 tasks capturing 'complexity, heterogeneity, and extended reasoning.' No concrete criteria for domain sampling, expert validation of realism, or quantitative checks confirming sustained multi-step reasoning chains (vs. static problems) are supplied, undermining the diagnostic value of the reported uneven performance.

    Authors: We agree that Section 3 would benefit from greater explicitness. In the revision we will add: (i) the precise sampling criteria used to select the ~200 tasks across domains (prioritizing tasks drawn from recent open research questions and expert-identified gaps), (ii) the protocol for expert validation of task realism (including the number of domain scientists consulted and the rubric applied), and (iii) quantitative descriptors of reasoning depth (average verification steps per task, distribution of task lengths, and results from pilot studies confirming that tasks require sustained multi-step interaction rather than single-shot answers). These additions will directly support the claim that observed performance differences arise from agent capabilities. revision: yes

  2. Referee: [results and evaluation sections] Results and evaluation protocols (§4 and §5): The abstract states high-level findings on agent performance without reporting specific metrics, statistical tests, agent architectures/details, or evaluation protocols. This prevents assessment of whether the data-analysis success vs. open-ended failure distinction is robust or sensitive to task selection.

    Authors: Sections 4 and 5 already contain the requested details (per-task success rates, statistical comparisons, agent configurations, and the stepwise verification protocol). To make the abstract self-contained and allow immediate assessment of robustness, we will revise it to include a small number of key quantitative results (e.g., aggregate success rates on structured data-analysis tasks versus open-ended tasks) while retaining the high-level narrative. We view this as a modest but effective clarification. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivations or self-referential steps

full rationale

This paper introduces an empirical benchmark (SciAgentArena) consisting of ~200 tasks for evaluating AI agents on scientific workflows. The abstract and described content contain no equations, fitted parameters, predictions derived from inputs, or derivation chains. Claims about agent performance rest on task construction and reported runs rather than any self-definitional or self-citation load-bearing logic. The assumption that tasks capture real scientific complexity is a design choice open to external validation, not a reduction by construction. No circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction paper rather than a theoretical derivation; the central claims rest on the construction of the task set and the reported agent evaluations rather than on any free parameters, mathematical axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5885 in / 1161 out tokens · 24365 ms · 2026-06-27T09:32:22.904345+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements

    cs.AI 2026-06 unverdicted novelty 6.0

    Closed-loop LM-agent auto research finds some transferable gains on molecular property prediction benchmarks via external data but shows non-transfer for model and feature edits selected on validation.

Reference graph

Works this paper leans on

148 extracted references · 8 linked inside Pith · cited by 1 Pith paper

  1. [1]

    A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023

  2. [2]

    Agenticaiforscientificdiscovery: Asurveyofprogress,challenges,andfuturedirections.arXiv preprint arXiv:2503.08979, 2025

    MouradGridach,JayNanavati,KhaldounZineElAbidine,LenonMendes,andChristinaMack. Agenticaiforscientificdiscovery: Asurveyofprogress,challenges,andfuturedirections.arXiv preprint arXiv:2503.08979, 2025

  3. [3]

    Litllms, llms for literature re- view: Are we there yet?arXiv preprint arXiv:2412.15249, 2024

    Shubham Agarwal*, Gaurav Sahu*, Abhay Puri*, Issam H Laradji, Krishnamurthy DJ Dvi- jotham, Jason Stanley, Laurent Charlin, and Christopher Pal. Litllms, llms for literature re- view: Are we there yet?arXiv preprint arXiv:2412.15249, 2024. 38 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

  4. [4]

    Biomni: A general-purpose biomedical ai agent

    KexinHuang, SerenaZhang, HanchenWang, YuanhaoQu, YingzhouLu, YusufRoohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent. biorxiv, 2025

  5. [5]

    Accelerating scientific discovery with autonomous goal-evolving agents.arXiv preprint arXiv:2512.21782, 2025

    Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G Rittig, Kunyang Sun, Yikun Zhang, Zhangde Song, Bo Zhou, et al. Accelerating scientific discovery with autonomous goal-evolving agents.arXiv preprint arXiv:2512.21782, 2025

  6. [6]

    Deep research: A survey of autonomous research agents.arXiv preprint arXiv:2508.12752, 2025

    Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, and Xiangyu Zhao. Deep research: A survey of autonomous research agents.arXiv preprint arXiv:2508.12752, 2025

  7. [7]

    Deep research, 2026

    OpenAI. Deep research, 2026

  8. [8]

    Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Ar- tiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025

  9. [9]

    Towards end-to-end automation of ai research.Nature, 651(8107):914– 919, 2026

    Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research.Nature, 651(8107):914– 919, 2026

  10. [10]

    Sciarena: An open evaluation platform for foundation models in scientific literature tasks.arXiv preprint arXiv:2507.01001, 2025

    Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, et al. Sciarena: An open evaluation platform for foundation models in scientific literature tasks.arXiv preprint arXiv:2507.01001, 2025

  11. [11]

    Evaluatinglargelanguagemodelsinscientificdiscovery

    Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M Pruyn, Yue Huang, Kehan Guo, XiuzheLuo,YuanhaoQu,YiQu,etal. Evaluatinglargelanguagemodelsinscientificdiscovery. arXiv preprint arXiv:2512.15567, 2025

  12. [12]

    Sciagentgym: Benchmarking multi-step scien- tific tool-use in llm agents.arXiv preprint arXiv:2602.12984, 2026

    Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, et al. Sciagentgym: Benchmarking multi-step scien- tific tool-use in llm agents.arXiv preprint arXiv:2602.12984, 2026

  13. [13]

    Scienceagentbench: Toward rigorous assessment of lan- guage agents for data-driven scientific discovery

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. Scienceagentbench: Toward rigorous assessment of lan- guage agents for data-driven scientific discovery. InThe Thirteenth International Conference on Learning Representations

  14. [14]

    Dsaeval: Evaluating data science agents on a wide range of real-world data science problems.arXiv preprint arXiv:2601.13591, 2026

    Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, and Jian Huang. Dsaeval: Evaluating data science agents on a wide range of real-world data science problems.arXiv preprint arXiv:2601.13591, 2026

  15. [15]

    Towards artificial intelligence research assistant for expert- involved learning.arXiv e-prints, pages arXiv–2505, 2025

    Tianyu Liu, Simeng Han, Xiao Luo, Hanchen Wang, Pan Lu, Biqing Zhu, Yuge Wang, Keyi Li, Jiapeng Chen, Rihao Qu, et al. Towards artificial intelligence research assistant for expert- involved learning.arXiv e-prints, pages arXiv–2505, 2025

  16. [16]

    Ml- gym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499, 2025

    Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al. Ml- gym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499, 2025. 39 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

  17. [17]

    Astabench: Rigorous benchmarking of ai agents with a scientific research suite.arXiv preprint arXiv:2510.21652, 2025

    Jonathan Bragg, Mike D’Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D Hwang, Peter Jansen, Varsha Kishore, et al. Astabench: Rigorous benchmarking of ai agents with a scientific research suite.arXiv preprint arXiv:2510.21652, 2025

  18. [18]

    Math- arena: Evaluating llms on uncontaminated math competitions.Proceedings of the Neural In- formation Processing Systems Track on Datasets and Benchmark, 2025

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Math- arena: Evaluating llms on uncontaminated math competitions.Proceedings of the Neural In- formation Processing Systems Track on Datasets and Benchmark, 2025

  19. [19]

    Folio: Natural language reasoning with first-order logic

    SimengHan,HaileySchoelkopf,YilunZhao,ZhentingQi,MartinRiddell,WenfeiZhou,James Coady, David Peng, Yujie Qiao, Luke Benson, et al. Folio: Natural language reasoning with first-order logic. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pages 22017–22031, 2024

  20. [20]

    Scicode: Aresearchcodingbenchmarkcurated by scientists.Advances in Neural Information Processing Systems, 37:30624–30650, 2024

    Minyang Tian, Luyu Gao, Shizhuo D Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas,PanJi,KittithatKrongchon,YaoLi,etal. Scicode: Aresearchcodingbenchmarkcurated by scientists.Advances in Neural Information Processing Systems, 37:30624–30650, 2024

  21. [21]

    Swe-bench: Can language models resolve real-world github issues? In12th International Conference on Learning Representations, ICLR 2024, 2024

    CarlosEJimenez,JohnYang,AlexanderWettig,ShunyuYao,KexinPei,OfirPress,andKarthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? In12th International Conference on Learning Representations, ICLR 2024, 2024

  22. [22]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst conference on language modeling, 2024

  23. [23]

    Sci- enceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022

    Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Sci- enceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022

  24. [24]

    Bioml-bench: Evaluation of ai agents for end-to-end biomedical ml.bioRxiv, pages 2025–09, 2025

    HenryEMiller,MatthewGreenig,BenjaminTenmann,andBoWang. Bioml-bench: Evaluation of ai agents for end-to-end biomedical ml.bioRxiv, pages 2025–09, 2025

  25. [25]

    Bixbench: a comprehensive benchmark for llm-based agents in computational biology.arXiv preprint arXiv:2503.00096, 2025

    Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani, and Samuel G Rodriques. Bixbench: a comprehensive benchmark for llm-based agents in computational biology.arXiv preprint arXiv:2503.00096, 2025

  26. [26]

    Benchmarking ai scientists in omics data-driven biological research.arXiv preprint arXiv:2505.08341, 2025

    Erpai Luo, Jinmeng Jia, Yifan Xiong, Xiangyu Li, Xiaobo Guo, Baoqi Yu, Lei Wei, and Xuegong Zhang. Benchmarking ai scientists in omics data-driven biological research.arXiv preprint arXiv:2505.08341, 2025

  27. [27]

    Agentic systems are adept at solving well-scoped, verifiable problems in computational biol- ogy.bioRxiv, pages 2026–04, 2026

    SuragNair,LauraGunsalus,BrianOrcutt-Jahns,JordanRossen,AvantikaLal,CarloDeDonno, Muhammed Hasan Celik, Kipper Fletez-Brant, Xiaoman Xie, Hector Corrada Bravo, et al. Agentic systems are adept at solving well-scoped, verifiable problems in computational biol- ogy.bioRxiv, pages 2026–04, 2026

  28. [28]

    Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

  29. [29]

    Gpt-5.2 system card.https://openai.com/index/ gpt-5-system-card-update-gpt-5-2/, 2025

    OpenAI. Gpt-5.2 system card.https://openai.com/index/ gpt-5-system-card-update-gpt-5-2/, 2025. Accessed: 2026-04-18. 40 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

  30. [30]

    Single cell analysis: the new frontier in ‘omics’.Trends in biotechnology, 28(6):281–290, 2010

    Daojing Wang and Steven Bodovitz. Single cell analysis: the new frontier in ‘omics’.Trends in biotechnology, 28(6):281–290, 2010

  31. [31]

    The dawn of spatial omics.Science, 381(6657):eabq4964, 2023

    Dario Bressan, Giorgia Battistoni, and Gregory J Hannon. The dawn of spatial omics.Science, 381(6657):eabq4964, 2023

  32. [32]

    The role of ai in drug discovery: challenges, opportunities, and strategies.Pharmaceuticals, 16(6):891, 2023

    Alexandre Blanco-Gonzalez, Alfonso Cabezon, Alejandro Seco-Gonzalez, Daniel Conde- Torres, Paula Antelo-Riveiro, Angel Pineiro, and Rebeca Garcia-Fandino. The role of ai in drug discovery: challenges, opportunities, and strategies.Pharmaceuticals, 16(6):891, 2023

  33. [33]

    From real-world electronic health record data to real- world results using artificial intelligence.Annals of the Rheumatic Diseases, 82(3):306–311, 2023

    Rachel Knevel and Katherine P Liao. From real-world electronic health record data to real- world results using artificial intelligence.Annals of the Rheumatic Diseases, 82(3):306–311, 2023

  34. [34]

    Engineeringaico-scientistsforstatisticalgeneticsapplications.NatureGenetics, pages 1–4, 2026

    BingxinZhao. Engineeringaico-scientistsforstatisticalgeneticsapplications.NatureGenetics, pages 1–4, 2026

  35. [35]

    Gemini 3 pro.https://storage.googleapis.com/deepmind-media/ Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025

    Google. Gemini 3 pro.https://storage.googleapis.com/deepmind-media/ Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025. Accessed: 2026-04-18

  36. [36]

    Claude sonnet 4.6.https://docs.anthropic.com/, 2025

    Anthropic. Claude sonnet 4.6.https://docs.anthropic.com/, 2025. Accessed: 2026- 04-18

  37. [37]

    Democratizing ai scientists using tooluniverse.arXiv preprint arXiv:2509.23426, 2025

    Shanghua Gao, Richard Zhu, Pengwei Sui, Zhenglun Kong, Sufian Aldogom, Yepeng Huang, Ayush Noori, Reza Shamji, Krishna Parvataneni, Theodoros Tsiligkaridis, et al. Democratizing ai scientists using tooluniverse.arXiv preprint arXiv:2509.23426, 2025

  38. [38]

    ChatGPT Codex.https://chatgpt.com/codex/, 2026

    OpenAI. ChatGPT Codex.https://chatgpt.com/codex/, 2026. Accessed: 2026-04-23

  39. [39]

    Claude Code Overview.https://code.claude.com/docs/en/overview,

    Anthropic. Claude Code Overview.https://code.claude.com/docs/en/overview,

  40. [40]

    Accessed: 2026-04-23

  41. [41]

    Cellforge: agentic design of virtual cell models

    Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, et al. Cellforge: agentic design of virtual cell models. arXiv preprint arXiv:2508.02276, 2025

  42. [42]

    Stella: Towards a biomedical world model with self-evolving multimodal agents.bioRxiv, pages 2025–07, 2025

    Ruofan Jin, Mingyang Xu, Fei Meng, Guancheng Wan, Qingran Cai, Yize Jiang, Jin Han, Yuanyuan Chen, Wanqing Lu, Mengyang Wang, et al. Stella: Towards a biomedical world model with self-evolving multimodal agents.bioRxiv, pages 2025–07, 2025

  43. [43]

    An aiagentforfullyautomatedmulti-omic analyses.Advanced Science, 11(44):2407094, 2024

    Juexiao Zhou, Bin Zhang, Guowei Li, Xiuying Chen, Haoyang Li, Xiaopeng Xu, Siyuan Chen, WenjiaHe, ChenchengXu, LiweiLiu, and XinGao. An aiagentforfullyautomatedmulti-omic analyses.Advanced Science, 11(44):2407094, 2024

  44. [44]

    Txagent: An ai agent for therapeutic reason- ing across a universe of tools, 2025

    Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, and Marinka Zitnik. Txagent: An ai agent for therapeutic reason- ing across a universe of tools, 2025

  45. [45]

    Medea: An omics ai agent for therapeutic discovery.bioRxiv, pages 2026–01, 2026

    Pengwei Sui, Michelle M Li, Shanghua Gao, Wanxiang Shen, Valentina Giunchiglia, Andrew Shen, Yepeng Huang, Zhenglun Kong, and Marinka Zitnik. Medea: An omics ai agent for therapeutic discovery.bioRxiv, pages 2026–01, 2026

  46. [46]

    McNaughton, Gautham Ramalaxmi, Agustin Kruel, Carter R

    Andrew D. McNaughton, Gautham Ramalaxmi, Agustin Kruel, Carter R. Knutson, Rohith A. Varikoti, and Neeraj Kumar. Cactus: Chemistry agent connecting tool-usage to science. 2024. 41 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

  47. [47]

    Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun

    Botao Yu, Frazier N. Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun. Tooling or not tooling? the impact of tools on language agents for chemistry problem solving. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 7635–7655, Albuquerque, N...

  48. [48]

    Dru- gagent: Automating ai-aided drug discovery programming through llm multi-agent collabo- ration, 2025

    Sizhe Liu, Yizhou Lu, Siyu Chen, Xiyang Hu, Jieyu Zhao, Yingzhou Lu, and Yue Zhao. Dru- gagent: Automating ai-aided drug discovery programming through llm multi-agent collabo- ration, 2025

  49. [49]

    LIDDIA:Language-basedintelligent drug discovery agent

    RezaAverly, FrazierN.Baker, IanAWatson, andXiaNing. LIDDIA:Language-basedintelligent drug discovery agent. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12004–12028, Suzhou, China, November 2025. Association for Comput...

  50. [50]

    An auditable agent platform for automated molec- ular optimisation, 2025

    Atabey Ünlü, Phil Rohr, and Ahmet Celebi. An auditable agent platform for automated molec- ular optimisation, 2025

  51. [51]

    Mragent: an llm-based automated agent for causal knowledge discovery in disease via mendelian randomization.Briefings in Bioinformat- ics, 26(2):bbaf140, 03 2025

    Wei Xu, Gang Luo, Weiyu Meng, Xiaobing Zhai, Keli Zheng, Ji Wu, Yanrong Li, Abao Xing, Junrong Li, Zhifan Li, Ke Zheng, and Kefeng Li. Mragent: an llm-based automated agent for causal knowledge discovery in disease via mendelian randomization.Briefings in Bioinformat- ics, 26(2):bbaf140, 03 2025

  52. [52]

    RDKit: Open-source cheminformatics

  53. [53]

    Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development.arXiv preprint arXiv:2102.09548, 2021

    Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Con- nor W Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development.arXiv preprint arXiv:2102.09548, 2021

  54. [54]

    Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, Juan F

    David Mendez, Anna Gaulton, A. Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, Juan F. Mosquera, Prudence Mutowo, Michal Nowotka, María Gordillo-Marañón, FionaHunter, LauraJunco, GraceMugumbate, MilagrosRodriguez-Lopez, Francis Atkinson, Nicolas Bosc, Chris J. Radoux, Aldo Segura-Cabrera, Anne Hersey, and An- drew R. Leac...

  55. [55]

    Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, and Jenny Chong

    Michael K. Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, and Jenny Chong. BindingDB in 2015: A public database for medicinal chemistry, computational chem- istryandsystemspharmacology.NucleicAcidsResearch,44(D1):D1045–D1053,January2016

  56. [56]

    Baell and Georgina A

    Jonathan B. Baell and Georgina A. Holloway. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays.Journal of Medicinal Chemistry, 53(7):2719–2740, April 2010

  57. [57]

    Guacamol: bench- marking models for de novo molecular design.Journal of chemical information and modeling, 59(3):1096–1108, 2019

    Nathan Brown, Marco Fiscato, Marwin HS Segler, and Alain C Vaucher. Guacamol: bench- marking models for de novo molecular design.Journal of chemical information and modeling, 59(3):1096–1108, 2019

  58. [58]

    Clustering with the average silhouette width.Computa- tional Statistics & Data Analysis, 158:107190, 2021

    Fatima Batool and Christian Hennig. Clustering with the average silhouette width.Computa- tional Statistics & Data Analysis, 158:107190, 2021. 42 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

  59. [59]

    Fast, sensitive and accu- rate integration of single-cell data with harmony.Nature methods, 16(12):1289–1296, 2019

    Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accu- rate integration of single-cell data with harmony.Nature methods, 16(12):1289–1296, 2019

  60. [60]

    Deep genera- tive modeling for single-cell transcriptomics.Nature methods, 15(12):1053–1058, 2018

    Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep genera- tive modeling for single-cell transcriptomics.Nature methods, 15(12):1053–1058, 2018

  61. [61]

    Moderated estimation of fold change and dispersion for rna-seq data with deseq2.Genome biology, 15(12):550, 2014

    Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2.Genome biology, 15(12):550, 2014

  62. [62]

    Modelingandpredicting single-cell multi-gene perturbation responses with sclambda.bioRxiv, 2024

    GefeiWang, TianyuLiu, JiaZhao, YoushuCheng, andHongyuZhao. Modelingandpredicting single-cell multi-gene perturbation responses with sclambda.bioRxiv, 2024

  63. [63]

    Ibarra, Olle Holmberg, Isaac Virshup, Mohammad Lotfollahi, Sabrina Richter, and Fabian J

    Giovanni Palla, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, Ignacio L. Ibarra, Olle Holmberg, Isaac Virshup, Mohammad Lotfollahi, Sabrina Richter, and Fabian J. Theis. Squidpy: a scalable framework for spatial omics analysis.Nature Methods, 19(2):171–178, 2 2022

  64. [64]

    Jensen, Lars J

    Peter B. Jensen, Lars J. Jensen, and Søren Brunak. Mining electronic health records: towards better research applications and clinical care.Nature Reviews Genetics, 13(6):395–405, 2012

  65. [65]

    George Hripcsak and David J. Albers. Next-generation phenotyping of electronic health records.Journal of the American Medical Informatics Association, 20(1):117–121, 2013

  66. [66]

    Jason Walonoski, Mike Kramer, Justin Nichols, Anthony Quina, Christian Moesel, Derek Hall, Chris Duffett, Kristi Dube, Tony Gallagher, and Sean McLachlan. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic elec- tronic health care record.Journal of the American Medical Informatics Association, 25(3):23...

  67. [68]

    Polygenic prediction via bayesian regression and continuous shrinkage priors.Nature communications, 10(1):1776, 2019

    Tian Ge, Chia-Yen Chen, Yang Ni, Yen-Chen Anne Feng, and Jordan W Smoller. Polygenic prediction via bayesian regression and continuous shrinkage priors.Nature communications, 10(1):1776, 2019

  68. [69]

    Martin, Shengying Qin, Hail- iang Huang, and Tian Ge

    Yunfeng Ruan, Yen-Feng Lin, Yen-Chen Anne Feng, Chia-Yen Chen, Max Lam, Zhenglin Guo, Stanley Global Asia Initiatives, Lin He, Akira Sawa, Alicia R. Martin, Shengying Qin, Hail- iang Huang, and Tian Ge. Improving polygenic prediction in ancestrally diverse populations. Nature Genetics, 54(5):573–580, May 2022

  69. [70]

    The gtex consortium atlas of genetic regulatory effects across human tissues.Science, 369(6509):1318–1330, 2020

    GTEx Consortium. The gtex consortium atlas of genetic regulatory effects across human tissues.Science, 369(6509):1318–1330, 2020

  70. [71]

    On the art of compilingandusing’drug-like’chemicalfragmentspaces.ChemMedChem, 3(10):1503–1507, October 2008

    Jörg Degen, Christof Wegscheid-Gerlach, Andrea Zaliani, and Matthias Rarey. On the art of compilingandusing’drug-like’chemicalfragmentspaces.ChemMedChem, 3(10):1503–1507, October 2008

  71. [72]

    conda: Asystem-level,binarypackageandenvironmentmanagerrunning on all major operating systems and platforms

    condacontributors. conda: Asystem-level,binarypackageandenvironmentmanagerrunning on all major operating systems and platforms

  72. [73]

    QuantStack and Mamba Contributors. mamba. 43 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

  73. [74]

    Shoemaker, Paul A

    Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A. Shoemaker, Paul A. Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang, and Evan E. Bolton. PubChem in 2021: new data content and improved web interfaces.Nucleic Acids Research, 49(D1):D1388–D1395, January 2021

  74. [75]

    ChEBI in 2016: Improved services and an expanding collection of metabolites.Nucleic Acids Research, 44(D1):D1214–1219, January 2016

    Janna Hastings, Gareth Owen, Adriano Dekker, Marcus Ennis, Namrata Kale, Venkatesh Muthukrishnan, Steve Turner, Neil Swainston, Pedro Mendes, and Christoph Steinbeck. ChEBI in 2016: Improved services and an expanding collection of metabolites.Nucleic Acids Research, 44(D1):D1214–1219, January 2016

  75. [76]

    Irwin and Brian K

    John J. Irwin and Brian K. Shoichet. ZINC - A Free Database of Commercially Available Com- pounds for Virtual Screening.Journal of Chemical Information and Modeling, 45(1):177–182, January 2005

  76. [77]

    A clinical road map for single-cell omics.Cell, 188(14):3633–3647, 2025

    Michael A Skinnider, Gregoire Courtine, Jocelyne Bloch, and Jordan W Squair. A clinical road map for single-cell omics.Cell, 188(14):3633–3647, 2025

  77. [78]

    Single- cell rna sequencing technologies and applications: a brief overview.Clinical and translational medicine, 12(3):e694, 2022

    Dragomirka Jovic, Xue Liang, Hua Zeng, Lin Lin, Fengping Xu, and Yonglun Luo. Single- cell rna sequencing technologies and applications: a brief overview.Clinical and translational medicine, 12(3):e694, 2022

  78. [79]

    Scanpy: large-scale single-cell gene expression data analysis.Genome biology, 19(1):15, 2018

    F Alexander Wolf, Philipp Angerer, and Fabian J Theis. Scanpy: large-scale single-cell gene expression data analysis.Genome biology, 19(1):15, 2018

  79. [80]

    Benchmarking atlas-level data integration in single-cell genomics.Nature methods, 19(1):41–50, 2022

    Malte D Luecken, Maren Büttner, Kridsadakorn Chaichoompu, Anna Danese, Marta Inter- landi, Michaela F Müller, Daniel C Strobl, Luke Zappia, Martin Dugas, Maria Colomé-Tatché, et al. Benchmarking atlas-level data integration in single-cell genomics.Nature methods, 19(1):41–50, 2022

  80. [81]

    Scikit- learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit- learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

Showing first 80 references.