Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Ada Fang; Allen Xin Wang; Antonia Panescu; Arman Cohan; Botao Yu; Haoran Shao; Hongyu Zhao; Hua Xu; James Zou; Jihang Chen

arxiv: 2606.12736 · v1 · pith:77NRCAIXnew · submitted 2026-06-10 · 💻 cs.AI · cs.LG

Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Kunyang Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao

Pith reviewed 2026-06-27 09:32 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords: AI agents, scientific discovery, benchmark, SciAgentArena, data analysis, novel insights, autonomy, open-ended research

The pith

Current AI agents contribute effectively to structured scientific data analysis but struggle to generate novel insights or handle open-ended research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SciAgentArena, a benchmark of roughly 200 tasks drawn from multiple scientific domains, to test AI agents in interactive, stepwise-verified research scenarios. It establishes that agents perform reliably on well-specified data-analysis workflows with clear evaluation criteria, yet their output remains limited when tasks require original ideas, independent exploration, or solutions to loosely defined problems. A reader would care because the work isolates where agents can already assist real laboratory and analysis pipelines and where they still fall short of autonomous scientific contribution.

Core claim

SciAgentArena supplies an interactive, agent-agnostic environment that shows current agents can support well-specified data-analysis workflows when task structure and success metrics are explicit, but performance drops sharply on tasks demanding genuinely novel insights, sustained self-directed exploration, or robust answers to open-ended scientific questions.

What carries the argument

SciAgentArena, a benchmark of approximately 200 tasks equipped with stepwise verification inside an interactive environment that supports diverse agents without favoring any single architecture.
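
One way to picture stepwise verification in an agent-agnostic environment is the minimal Python sketch below. It is not the SciAgentArena API: the names `Step`, `Task`, `run_task`, and `toy_agent` are hypothetical, and the sketch assumes only that an agent can be treated as a callable from a prompt to an answer and that each step carries its own checker.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration; not taken from the SciAgentArena code base.
Agent = Callable[[str], str]  # any agent is treated as: prompt in, answer out


@dataclass
class Step:
    prompt: str
    check: Callable[[str], bool]  # verifier for this step's answer


@dataclass
class Task:
    name: str
    steps: List[Step]


def run_task(agent: Agent, task: Task) -> List[bool]:
    """Run every step in order and record whether each answer passes its check."""
    return [step.check(agent(step.prompt)) for step in task.steps]


def toy_agent(prompt: str) -> str:
    # Stand-in agent: answers one question correctly and nothing else.
    return "4" if "2 + 2" in prompt else "unknown"


demo_task = Task(
    name="toy-analysis",
    steps=[
        Step("What is 2 + 2?", lambda a: a.strip() == "4"),
        Step("Name the top marker gene.", lambda a: a.strip() == "CD3E"),
    ],
)

results = run_task(toy_agent, demo_task)
step_pass_rate = sum(results) / len(results)
```

Because `run_task` depends only on the callable signature, any agent wrapper could be swapped in without changing the task definitions. Whether the real environment continues after a failed step or stops early is not stated in the abstract; this sketch simply runs every step.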

If this is right

Agents can already be integrated into pipelines that perform well-specified data analysis with explicit criteria.
Common failure modes across agents have been catalogued, supplying concrete targets for reliability and autonomy improvements.
The benchmark itself supplies a repeatable method for tracking whether future agents close the gap on novelty and self-directed work.
Design of new agents can now be guided by the observed contrast between structured and open-ended scientific tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The multi-domain construction of the benchmark makes it possible to test whether failure patterns are consistent across fields or remain domain-specific.
Hybrid agent designs that combine structured analysis modules with separate hypothesis-generation modules could be evaluated directly against the same task set.
Extending the benchmark with longer-horizon tasks that span weeks of simulated work would expose whether current limits on self-direction persist at realistic research timescales.
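
The first extension above, checking whether failure patterns are shared across fields, could be approached along the lines of the sketch below. The domain names, failure labels, and counts are invented for illustration and are not results from the paper; total variation distance is one possible comparison, chosen here for simplicity.

```python
from collections import Counter

# Hypothetical failure labels per domain; invented for illustration, not from the paper.
failures = {
    "drug_discovery": ["tool_error", "wrong_metric", "tool_error", "gave_up"],
    "single_cell": ["tool_error", "tool_error", "wrong_metric", "gave_up"],
    "ehr": ["gave_up", "gave_up", "gave_up", "wrong_metric"],
}


def distribution(labels):
    """Turn a list of failure labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical profiles, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


profiles = {domain: distribution(labels) for domain, labels in failures.items()}
tv_similar = total_variation(profiles["drug_discovery"], profiles["single_cell"])
tv_different = total_variation(profiles["drug_discovery"], profiles["ehr"])
```

A small distance between two domains would suggest shared failure modes, while a large one would point to domain-specific problems. With real data, a significance test on the label counts would be needed before drawing conclusions.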

Load-bearing premise

The roughly 200 tasks chosen for SciAgentArena adequately represent the full range of complexity, heterogeneity, and extended reasoning found in actual scientific research.

What would settle it

An agent that produces and verifies a genuinely novel scientific result on a set of tasks constructed independently of the original 200 would contradict the reported limits on novelty and open-ended reasoning.

Figures

Figures reproduced from arXiv: 2606.12736 by Ada Fang, Allen Xin Wang, Antonia Panescu, Arman Cohan, Botao Yu, Haoran Shao, Hongyu Zhao, Hua Xu, James Zou, Jihang Chen, Kaize Ding, Kunyang Sun, Leqi Xu, Lingzhou Xue, Lisa Xinyi Chen, Marinka Zitnik, Qingyu Chen, Rex Ying, Sihan Jiang, Siyi Gu, Siyu Chen, Tianyu Liu, Wangjie Zheng, Wengong Jin, Wenxin Long, Xinyang Hu, Xinyu Wei, Yuanqi Du, Yueqian Jing, Zhiyuan Cao, Zhuoran Yang, Ziqing Wang, Ziyao Zeng.

**Figure 2.** Key AI agent capacities and benchmarking platform development. (a) Categories of scien…

**Figure 3.** Summarizing AI Agent performances on the drug-discovery domain. (a) Performances on…

**Figure 4.** Summarizing AI Agent performances in cell and tissue domains. (a)-(g): single-cell omics;…

**Figure 5.** Summary of AI agent performance on EHR-based clinical and statistical genetics tasks. (a)…

**Figure 6.** Summary of our benchmarking studies across different fields: Error types, sources, error…

Original abstract

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. Full codes, tasks, and datasets can be accessed via this link: https://sciagentarena.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SciAgentArena is a new benchmark with stepwise scientific tasks, but its takeaways on agent limits hinge on whether the tasks match real research complexity.

The main thing to know is that the paper introduces SciAgentArena, a benchmark of roughly 200 interactive tasks with stepwise verification, aimed at evaluating AI agents on scientific work across domains. The authors built an agent-agnostic environment and released the code, tasks, and data.

The paper sets up the benchmark to fill gaps in static scientific tests and general agent evaluations. It reports that agents handle well-specified data analysis when structure and criteria are clear, but do worse on novel insights, self-directed exploration, and open-ended questions, and it flags some shared failure modes. Making the resources public is a practical step that lets others run their own checks.

The soft spot is task construction. The performance differences only diagnose agent shortcomings if the tasks actually require the extended reasoning and heterogeneity of real science. The abstract notes tasks drawn from emerging needs with stepwise checks, yet gives no concrete details on expert validation, domain sampling, or confirmation that problems demand sustained multi-step chains rather than narrower workflows. If the tasks skew toward structured analysis, the gap between data work and open-ended research is expected and not especially revealing.

This is for researchers developing or testing AI agents for discovery. Someone comparing agents or looking for a new testbed would find the setup and failure analysis useful. The work engages the literature on benchmark limitations and tries to move past toy problems, so it deserves a serious referee even with questions on task realism.

I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces SciAgentArena, a benchmark with approximately 200 tasks drawn from multiple scientific domains, featuring stepwise verification and an interactive agent-agnostic environment. It evaluates current AI agents and claims they perform effectively on well-specified data-analysis workflows with clear structure and criteria, but exhibit uneven results overall, struggling to generate novel insights, sustain self-directed exploration, or solve open-ended research questions. The work also identifies common failure modes and opportunities for improving agent reliability and scientific reasoning.

Significance. If the benchmark tasks are shown to be faithful proxies for real scientific complexity and extended reasoning, SciAgentArena would offer a practical, reproducible framework for tracking progress in AI for science and guiding agent design. The provision of full codes, tasks, and datasets is a strength that supports reproducibility.

major comments (2)

[abstract and task construction section] Task construction and validation (abstract and §3): The central claim that performance gaps reflect agent limitations rather than benchmark artifacts rests on the ~200 tasks capturing 'complexity, heterogeneity, and extended reasoning.' No concrete criteria for domain sampling, expert validation of realism, or quantitative checks confirming sustained multi-step reasoning chains (vs. static problems) are supplied, undermining the diagnostic value of the reported uneven performance.
[results and evaluation sections] Results and evaluation protocols (§4 and §5): The abstract states high-level findings on agent performance without reporting specific metrics, statistical tests, agent architectures/details, or evaluation protocols. This prevents assessment of whether the data-analysis success vs. open-ended failure distinction is robust or sensitive to task selection.

minor comments (1)

[abstract] The link to codes/tasks/datasets is provided but the manuscript should include a brief summary table of task categories, domains, and verification steps for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of SciAgentArena. We respond to each major comment below and indicate planned revisions.

Referee: [abstract and task construction section] Task construction and validation (abstract and §3): The central claim that performance gaps reflect agent limitations rather than benchmark artifacts rests on the ~200 tasks capturing 'complexity, heterogeneity, and extended reasoning.' No concrete criteria for domain sampling, expert validation of realism, or quantitative checks confirming sustained multi-step reasoning chains (vs. static problems) are supplied, undermining the diagnostic value of the reported uneven performance.

Authors: We agree that Section 3 would benefit from greater explicitness. In the revision we will add: (i) the precise sampling criteria used to select the ~200 tasks across domains (prioritizing tasks drawn from recent open research questions and expert-identified gaps), (ii) the protocol for expert validation of task realism (including the number of domain scientists consulted and the rubric applied), and (iii) quantitative descriptors of reasoning depth (average verification steps per task, distribution of task lengths, and results from pilot studies confirming that tasks require sustained multi-step interaction rather than single-shot answers). These additions will directly support the claim that observed performance differences arise from agent capabilities. revision: yes
Referee: [results and evaluation sections] Results and evaluation protocols (§4 and §5): The abstract states high-level findings on agent performance without reporting specific metrics, statistical tests, agent architectures/details, or evaluation protocols. This prevents assessment of whether the data-analysis success vs. open-ended failure distinction is robust or sensitive to task selection.

Authors: Sections 4 and 5 already contain the requested details (per-task success rates, statistical comparisons, agent configurations, and the stepwise verification protocol). To make the abstract self-contained and allow immediate assessment of robustness, we will revise it to include a small number of key quantitative results (e.g., aggregate success rates on structured data-analysis tasks versus open-ended tasks) while retaining the high-level narrative. We view this as a modest but effective clarification. revision: yes
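
The quantities named in the two responses above (aggregate success rates by task type and average verification steps per task) could be tabulated as in the following sketch. The record layout, the category labels `structured` and `open_ended`, and every number are assumptions made up for illustration; none of them come from the paper.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration only; these are not results from the paper.
records = [
    {"task": "t1", "category": "structured", "n_steps": 4, "success": True},
    {"task": "t2", "category": "structured", "n_steps": 6, "success": True},
    {"task": "t3", "category": "structured", "n_steps": 5, "success": False},
    {"task": "t4", "category": "open_ended", "n_steps": 9, "success": False},
    {"task": "t5", "category": "open_ended", "n_steps": 7, "success": True},
    {"task": "t6", "category": "open_ended", "n_steps": 8, "success": False},
]


def summarize(rows):
    """Group by category; report task count, success rate, and mean step count."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["category"]].append(row)
    return {
        cat: {
            "n_tasks": len(items),
            "success_rate": sum(r["success"] for r in items) / len(items),
            "mean_steps": mean(r["n_steps"] for r in items),
        }
        for cat, items in groups.items()
    }


summary = summarize(records)
gap = summary["structured"]["success_rate"] - summary["open_ended"]["success_rate"]
```

Reporting `n_tasks` next to each rate matters for the referee's robustness concern: a gap computed over a handful of tasks per category says little about sensitivity to task selection.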

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivations or self-referential steps

This paper introduces an empirical benchmark (SciAgentArena) consisting of ~200 tasks for evaluating AI agents on scientific workflows. The abstract and described content contain no equations, fitted parameters, predictions derived from inputs, or derivation chains. Claims about agent performance rest on task construction and reported runs rather than any self-definitional or self-citation load-bearing logic. The assumption that tasks capture real scientific complexity is a design choice open to external validation, not a reduction by construction. No circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction paper rather than a theoretical derivation; the central claims rest on the construction of the task set and the reported agent evaluations rather than on any free parameters, mathematical axioms, or postulated entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements
cs.AI 2026-06 unverdicted novelty 6.0

Closed-loop LM-agent auto research finds some transferable gains on molecular property prediction benchmarks via external data but shows non-transfer for model and feature edits selected on validation.

Reference graph

Works this paper leans on

148 extracted references · 8 linked inside Pith · cited by 1 Pith paper

[1]

A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023

Pith/arXiv arXiv 2023
[2]

Agenticaiforscientificdiscovery: Asurveyofprogress,challenges,andfuturedirections.arXiv preprint arXiv:2503.08979, 2025

MouradGridach,JayNanavati,KhaldounZineElAbidine,LenonMendes,andChristinaMack. Agenticaiforscientificdiscovery: Asurveyofprogress,challenges,andfuturedirections.arXiv preprint arXiv:2503.08979, 2025

arXiv 2025
[3]

Litllms, llms for literature re- view: Are we there yet?arXiv preprint arXiv:2412.15249, 2024

Shubham Agarwal*, Gaurav Sahu*, Abhay Puri*, Issam H Laradji, Krishnamurthy DJ Dvi- jotham, Jason Stanley, Laurent Charlin, and Christopher Pal. Litllms, llms for literature re- view: Are we there yet?arXiv preprint arXiv:2412.15249, 2024. 38 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

arXiv 2024
[4]

Biomni: A general-purpose biomedical ai agent

KexinHuang, SerenaZhang, HanchenWang, YuanhaoQu, YingzhouLu, YusufRoohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent. biorxiv, 2025

2025
[5]

Accelerating scientific discovery with autonomous goal-evolving agents.arXiv preprint arXiv:2512.21782, 2025

Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G Rittig, Kunyang Sun, Yikun Zhang, Zhangde Song, Bo Zhou, et al. Accelerating scientific discovery with autonomous goal-evolving agents.arXiv preprint arXiv:2512.21782, 2025

arXiv 2025
[6]

Deep research: A survey of autonomous research agents.arXiv preprint arXiv:2508.12752, 2025

Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, and Xiangyu Zhao. Deep research: A survey of autonomous research agents.arXiv preprint arXiv:2508.12752, 2025

arXiv 2025
[7]

Deep research, 2026

OpenAI. Deep research, 2026

2026
[8]

Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Ar- tiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025

Pith/arXiv arXiv 2025
[9]

Towards end-to-end automation of ai research.Nature, 651(8107):914– 919, 2026

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research.Nature, 651(8107):914– 919, 2026

2026
[10]

Sciarena: An open evaluation platform for foundation models in scientific literature tasks.arXiv preprint arXiv:2507.01001, 2025

Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, et al. Sciarena: An open evaluation platform for foundation models in scientific literature tasks.arXiv preprint arXiv:2507.01001, 2025

arXiv 2025
[11]

Evaluatinglargelanguagemodelsinscientificdiscovery

Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M Pruyn, Yue Huang, Kehan Guo, XiuzheLuo,YuanhaoQu,YiQu,etal. Evaluatinglargelanguagemodelsinscientificdiscovery. arXiv preprint arXiv:2512.15567, 2025

Pith/arXiv arXiv 2025
[12]

Sciagentgym: Benchmarking multi-step scien- tific tool-use in llm agents.arXiv preprint arXiv:2602.12984, 2026

Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, et al. Sciagentgym: Benchmarking multi-step scien- tific tool-use in llm agents.arXiv preprint arXiv:2602.12984, 2026

Pith/arXiv arXiv 2026
[13]

Scienceagentbench: Toward rigorous assessment of lan- guage agents for data-driven scientific discovery

Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. Scienceagentbench: Toward rigorous assessment of lan- guage agents for data-driven scientific discovery. InThe Thirteenth International Conference on Learning Representations
[14]

Dsaeval: Evaluating data science agents on a wide range of real-world data science problems.arXiv preprint arXiv:2601.13591, 2026

Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, and Jian Huang. Dsaeval: Evaluating data science agents on a wide range of real-world data science problems.arXiv preprint arXiv:2601.13591, 2026

Pith/arXiv arXiv 2026
[15]

Towards artificial intelligence research assistant for expert- involved learning.arXiv e-prints, pages arXiv–2505, 2025

Tianyu Liu, Simeng Han, Xiao Luo, Hanchen Wang, Pan Lu, Biqing Zhu, Yuge Wang, Keyi Li, Jiapeng Chen, Rihao Qu, et al. Towards artificial intelligence research assistant for expert- involved learning.arXiv e-prints, pages arXiv–2505, 2025

2025
[16]

Ml- gym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499, 2025

Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al. Ml- gym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499, 2025. 39 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

arXiv 2025
[17]

Astabench: Rigorous benchmarking of ai agents with a scientific research suite.arXiv preprint arXiv:2510.21652, 2025

Jonathan Bragg, Mike D’Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D Hwang, Peter Jansen, Varsha Kishore, et al. Astabench: Rigorous benchmarking of ai agents with a scientific research suite.arXiv preprint arXiv:2510.21652, 2025

Pith/arXiv arXiv 2025
[18]

Math- arena: Evaluating llms on uncontaminated math competitions.Proceedings of the Neural In- formation Processing Systems Track on Datasets and Benchmark, 2025

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Math- arena: Evaluating llms on uncontaminated math competitions.Proceedings of the Neural In- formation Processing Systems Track on Datasets and Benchmark, 2025

2025
[19]

Folio: Natural language reasoning with first-order logic

SimengHan,HaileySchoelkopf,YilunZhao,ZhentingQi,MartinRiddell,WenfeiZhou,James Coady, David Peng, Yujie Qiao, Luke Benson, et al. Folio: Natural language reasoning with first-order logic. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pages 22017–22031, 2024

2024
[20]

Scicode: Aresearchcodingbenchmarkcurated by scientists.Advances in Neural Information Processing Systems, 37:30624–30650, 2024

Minyang Tian, Luyu Gao, Shizhuo D Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas,PanJi,KittithatKrongchon,YaoLi,etal. Scicode: Aresearchcodingbenchmarkcurated by scientists.Advances in Neural Information Processing Systems, 37:30624–30650, 2024

2024
[21]

Swe-bench: Can language models resolve real-world github issues? In12th International Conference on Learning Representations, ICLR 2024, 2024

CarlosEJimenez,JohnYang,AlexanderWettig,ShunyuYao,KexinPei,OfirPress,andKarthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? In12th International Conference on Learning Representations, ICLR 2024, 2024

2024
[22]

Gpqa: A graduate-level google-proof q&a benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst conference on language modeling, 2024

2024
[23]

Sci- enceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022

Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Sci- enceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022

2022
[24]

Bioml-bench: Evaluation of ai agents for end-to-end biomedical ml.bioRxiv, pages 2025–09, 2025

HenryEMiller,MatthewGreenig,BenjaminTenmann,andBoWang. Bioml-bench: Evaluation of ai agents for end-to-end biomedical ml.bioRxiv, pages 2025–09, 2025

2025
[25]

Bixbench: a comprehensive benchmark for llm-based agents in computational biology.arXiv preprint arXiv:2503.00096, 2025

Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani, and Samuel G Rodriques. Bixbench: a comprehensive benchmark for llm-based agents in computational biology.arXiv preprint arXiv:2503.00096, 2025

arXiv 2025
[26]

Benchmarking ai scientists in omics data-driven biological research.arXiv preprint arXiv:2505.08341, 2025

Erpai Luo, Jinmeng Jia, Yifan Xiong, Xiangyu Li, Xiaobo Guo, Baoqi Yu, Lei Wei, and Xuegong Zhang. Benchmarking ai scientists in omics data-driven biological research.arXiv preprint arXiv:2505.08341, 2025

arXiv 2025
[27]

Agentic systems are adept at solving well-scoped, verifiable problems in computational biol- ogy.bioRxiv, pages 2026–04, 2026

SuragNair,LauraGunsalus,BrianOrcutt-Jahns,JordanRossen,AvantikaLal,CarloDeDonno, Muhammed Hasan Celik, Kipper Fletez-Brant, Xiaoman Xie, Hector Corrada Bravo, et al. Agentic systems are adept at solving well-scoped, verifiable problems in computational biol- ogy.bioRxiv, pages 2026–04, 2026

2026
[28]

Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

Pith/arXiv arXiv 2026
[29]

Gpt-5.2 system card.https://openai.com/index/ gpt-5-system-card-update-gpt-5-2/, 2025

OpenAI. Gpt-5.2 system card.https://openai.com/index/ gpt-5-system-card-update-gpt-5-2/, 2025. Accessed: 2026-04-18. 40 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

2025
[30]

Single cell analysis: the new frontier in ‘omics’.Trends in biotechnology, 28(6):281–290, 2010

Daojing Wang and Steven Bodovitz. Single cell analysis: the new frontier in ‘omics’.Trends in biotechnology, 28(6):281–290, 2010

2010
[31]

The dawn of spatial omics.Science, 381(6657):eabq4964, 2023

Dario Bressan, Giorgia Battistoni, and Gregory J Hannon. The dawn of spatial omics.Science, 381(6657):eabq4964, 2023

2023
[32]

The role of ai in drug discovery: challenges, opportunities, and strategies.Pharmaceuticals, 16(6):891, 2023

Alexandre Blanco-Gonzalez, Alfonso Cabezon, Alejandro Seco-Gonzalez, Daniel Conde- Torres, Paula Antelo-Riveiro, Angel Pineiro, and Rebeca Garcia-Fandino. The role of ai in drug discovery: challenges, opportunities, and strategies.Pharmaceuticals, 16(6):891, 2023

2023
[33]

From real-world electronic health record data to real- world results using artificial intelligence.Annals of the Rheumatic Diseases, 82(3):306–311, 2023

Rachel Knevel and Katherine P Liao. From real-world electronic health record data to real- world results using artificial intelligence.Annals of the Rheumatic Diseases, 82(3):306–311, 2023

2023
[34]

Engineeringaico-scientistsforstatisticalgeneticsapplications.NatureGenetics, pages 1–4, 2026

BingxinZhao. Engineeringaico-scientistsforstatisticalgeneticsapplications.NatureGenetics, pages 1–4, 2026

2026
[35]

Gemini 3 pro.https://storage.googleapis.com/deepmind-media/ Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025

Google. Gemini 3 pro.https://storage.googleapis.com/deepmind-media/ Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025. Accessed: 2026-04-18

2025
[36]

Claude sonnet 4.6.https://docs.anthropic.com/, 2025

Anthropic. Claude sonnet 4.6.https://docs.anthropic.com/, 2025. Accessed: 2026- 04-18

2025
[37]

Democratizing ai scientists using tooluniverse.arXiv preprint arXiv:2509.23426, 2025

Shanghua Gao, Richard Zhu, Pengwei Sui, Zhenglun Kong, Sufian Aldogom, Yepeng Huang, Ayush Noori, Reza Shamji, Krishna Parvataneni, Theodoros Tsiligkaridis, et al. Democratizing ai scientists using tooluniverse.arXiv preprint arXiv:2509.23426, 2025

arXiv 2025
[38]

ChatGPT Codex.https://chatgpt.com/codex/, 2026

OpenAI. ChatGPT Codex.https://chatgpt.com/codex/, 2026. Accessed: 2026-04-23

2026
[39]

Claude Code Overview.https://code.claude.com/docs/en/overview,

Anthropic. Claude Code Overview.https://code.claude.com/docs/en/overview,
[40]

Accessed: 2026-04-23

2026
[41]

Cellforge: agentic design of virtual cell models

Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, et al. Cellforge: agentic design of virtual cell models. arXiv preprint arXiv:2508.02276, 2025

arXiv 2025
[42]

Stella: Towards a biomedical world model with self-evolving multimodal agents.bioRxiv, pages 2025–07, 2025

Ruofan Jin, Mingyang Xu, Fei Meng, Guancheng Wan, Qingran Cai, Yize Jiang, Jin Han, Yuanyuan Chen, Wanqing Lu, Mengyang Wang, et al. Stella: Towards a biomedical world model with self-evolving multimodal agents.bioRxiv, pages 2025–07, 2025

2025
[43]

An aiagentforfullyautomatedmulti-omic analyses.Advanced Science, 11(44):2407094, 2024

Juexiao Zhou, Bin Zhang, Guowei Li, Xiuying Chen, Haoyang Li, Xiaopeng Xu, Siyuan Chen, WenjiaHe, ChenchengXu, LiweiLiu, and XinGao. An aiagentforfullyautomatedmulti-omic analyses.Advanced Science, 11(44):2407094, 2024

2024
[44]

Txagent: An ai agent for therapeutic reason- ing across a universe of tools, 2025

Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, and Marinka Zitnik. Txagent: An ai agent for therapeutic reason- ing across a universe of tools, 2025

2025
[45]

Medea: An omics ai agent for therapeutic discovery.bioRxiv, pages 2026–01, 2026

Pengwei Sui, Michelle M Li, Shanghua Gao, Wanxiang Shen, Valentina Giunchiglia, Andrew Shen, Yepeng Huang, Zhenglun Kong, and Marinka Zitnik. Medea: An omics ai agent for therapeutic discovery.bioRxiv, pages 2026–01, 2026

2026
[46]

McNaughton, Gautham Ramalaxmi, Agustin Kruel, Carter R

Andrew D. McNaughton, Gautham Ramalaxmi, Agustin Kruel, Carter R. Knutson, Rohith A. Varikoti, and Neeraj Kumar. Cactus: Chemistry agent connecting tool-usage to science. 2024. 41 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

2024
[47]

Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun

Botao Yu, Frazier N. Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun. Tooling or not tooling? the impact of tools on language agents for chemistry problem solving. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 7635–7655, Albuquerque, N...

2025
[48]

Dru- gagent: Automating ai-aided drug discovery programming through llm multi-agent collabo- ration, 2025

Sizhe Liu, Yizhou Lu, Siyu Chen, Xiyang Hu, Jieyu Zhao, Yingzhou Lu, and Yue Zhao. Dru- gagent: Automating ai-aided drug discovery programming through llm multi-agent collabo- ration, 2025

2025
[49]

LIDDIA:Language-basedintelligent drug discovery agent

RezaAverly, FrazierN.Baker, IanAWatson, andXiaNing. LIDDIA:Language-basedintelligent drug discovery agent. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12004–12028, Suzhou, China, November 2025. Association for Comput...

2025
[50]

An auditable agent platform for automated molec- ular optimisation, 2025

Atabey Ünlü, Phil Rohr, and Ahmet Celebi. An auditable agent platform for automated molec- ular optimisation, 2025

2025
[51]

Mragent: an llm-based automated agent for causal knowledge discovery in disease via mendelian randomization.Briefings in Bioinformat- ics, 26(2):bbaf140, 03 2025

Wei Xu, Gang Luo, Weiyu Meng, Xiaobing Zhai, Keli Zheng, Ji Wu, Yanrong Li, Abao Xing, Junrong Li, Zhifan Li, Ke Zheng, and Kefeng Li. Mragent: an llm-based automated agent for causal knowledge discovery in disease via mendelian randomization.Briefings in Bioinformat- ics, 26(2):bbaf140, 03 2025

2025
[52]

RDKit: Open-source cheminformatics
[53]

Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548, 2021

Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548, 2021

arXiv 2021
[54]

Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, Juan F

David Mendez, Anna Gaulton, A. Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, Juan F. Mosquera, Prudence Mutowo, Michal Nowotka, María Gordillo-Marañón, Fiona Hunter, Laura Junco, Grace Mugumbate, Milagros Rodriguez-Lopez, Francis Atkinson, Nicolas Bosc, Chris J. Radoux, Aldo Segura-Cabrera, Anne Hersey, and Andrew R. Leac...

2019
[55]

BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology

Michael K. Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, and Jenny Chong. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Research, 44(D1):D1045–D1053, January 2016

2016
[56]

New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays

Jonathan B. Baell and Georgina A. Holloway. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays. Journal of Medicinal Chemistry, 53(7):2719–2740, April 2010

2010
[57]

Guacamol: benchmarking models for de novo molecular design. Journal of chemical information and modeling, 59(3):1096–1108, 2019

Nathan Brown, Marco Fiscato, Marwin HS Segler, and Alain C Vaucher. Guacamol: benchmarking models for de novo molecular design. Journal of chemical information and modeling, 59(3):1096–1108, 2019

2019
[58]

Clustering with the average silhouette width. Computational Statistics & Data Analysis, 158:107190, 2021

Fatima Batool and Christian Hennig. Clustering with the average silhouette width. Computational Statistics & Data Analysis, 158:107190, 2021.

2021
[59]

Fast, sensitive and accurate integration of single-cell data with harmony. Nature methods, 16(12):1289–1296, 2019

Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with harmony. Nature methods, 16(12):1289–1296, 2019

2019
[60]

Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058, 2018

Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058, 2018

2018
[61]

Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome biology, 15(12):550, 2014

Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome biology, 15(12):550, 2014

2014
[62]

Modeling and predicting single-cell multi-gene perturbation responses with sclambda. bioRxiv, 2024

Gefei Wang, Tianyu Liu, Jia Zhao, Youshu Cheng, and Hongyu Zhao. Modeling and predicting single-cell multi-gene perturbation responses with sclambda. bioRxiv, 2024

2024
[63]

Squidpy: a scalable framework for spatial omics analysis

Giovanni Palla, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, Ignacio L. Ibarra, Olle Holmberg, Isaac Virshup, Mohammad Lotfollahi, Sabrina Richter, and Fabian J. Theis. Squidpy: a scalable framework for spatial omics analysis. Nature Methods, 19(2):171–178, 2 2022

2022
[64]

Mining electronic health records: towards better research applications and clinical care

Peter B. Jensen, Lars J. Jensen, and Søren Brunak. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395–405, 2012

2012
[65]

George Hripcsak and David J. Albers. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association, 20(1):117–121, 2013

2013
[66]

Jason Walonoski, Mike Kramer, Justin Nichols, Anthony Quina, Christian Moesel, Derek Hall, Chris Duffett, Kristi Dube, Tony Gallagher, and Sean McLachlan. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association, 25(3):23...

2018
[68]

Polygenic prediction via bayesian regression and continuous shrinkage priors. Nature communications, 10(1):1776, 2019

Tian Ge, Chia-Yen Chen, Yang Ni, Yen-Chen Anne Feng, and Jordan W Smoller. Polygenic prediction via bayesian regression and continuous shrinkage priors. Nature communications, 10(1):1776, 2019

2019
[69]

Improving polygenic prediction in ancestrally diverse populations

Yunfeng Ruan, Yen-Feng Lin, Yen-Chen Anne Feng, Chia-Yen Chen, Max Lam, Zhenglin Guo, Stanley Global Asia Initiatives, Lin He, Akira Sawa, Alicia R. Martin, Shengying Qin, Hailiang Huang, and Tian Ge. Improving polygenic prediction in ancestrally diverse populations. Nature Genetics, 54(5):573–580, May 2022

2022
[70]

The gtex consortium atlas of genetic regulatory effects across human tissues. Science, 369(6509):1318–1330, 2020

GTEx Consortium. The gtex consortium atlas of genetic regulatory effects across human tissues. Science, 369(6509):1318–1330, 2020

2020
[71]

On the art of compiling and using ‘drug-like’ chemical fragment spaces. ChemMedChem, 3(10):1503–1507, October 2008

Jörg Degen, Christof Wegscheid-Gerlach, Andrea Zaliani, and Matthias Rarey. On the art of compiling and using ‘drug-like’ chemical fragment spaces. ChemMedChem, 3(10):1503–1507, October 2008

2008
[72]

conda: A system-level, binary package and environment manager running on all major operating systems and platforms

conda contributors. conda: A system-level, binary package and environment manager running on all major operating systems and platforms
[73]

QuantStack and Mamba Contributors. mamba.
[74]

PubChem in 2021: new data content and improved web interfaces

Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A. Shoemaker, Paul A. Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang, and Evan E. Bolton. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Research, 49(D1):D1388–D1395, January 2021

2021
[75]

ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research, 44(D1):D1214–1219, January 2016

Janna Hastings, Gareth Owen, Adriano Dekker, Marcus Ennis, Namrata Kale, Venkatesh Muthukrishnan, Steve Turner, Neil Swainston, Pedro Mendes, and Christoph Steinbeck. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research, 44(D1):D1214–1219, January 2016

2016
[76]

ZINC - A Free Database of Commercially Available Compounds for Virtual Screening

John J. Irwin and Brian K. Shoichet. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. Journal of Chemical Information and Modeling, 45(1):177–182, January 2005

2005
[77]

A clinical road map for single-cell omics. Cell, 188(14):3633–3647, 2025

Michael A Skinnider, Gregoire Courtine, Jocelyne Bloch, and Jordan W Squair. A clinical road map for single-cell omics. Cell, 188(14):3633–3647, 2025

2025
[78]

Single-cell rna sequencing technologies and applications: a brief overview. Clinical and translational medicine, 12(3):e694, 2022

Dragomirka Jovic, Xue Liang, Hua Zeng, Lin Lin, Fengping Xu, and Yonglun Luo. Single-cell rna sequencing technologies and applications: a brief overview. Clinical and translational medicine, 12(3):e694, 2022

2022
[79]

Scanpy: large-scale single-cell gene expression data analysis. Genome biology, 19(1):15, 2018

F Alexander Wolf, Philipp Angerer, and Fabian J Theis. Scanpy: large-scale single-cell gene expression data analysis. Genome biology, 19(1):15, 2018

2018
[80]

Benchmarking atlas-level data integration in single-cell genomics. Nature methods, 19(1):41–50, 2022

Malte D Luecken, Maren Büttner, Kridsadakorn Chaichoompu, Anna Danese, Marta Interlandi, Michaela F Müller, Daniel C Strobl, Luke Zappia, Martin Dugas, Maria Colomé-Tatché, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature methods, 19(1):41–50, 2022

2022
[81]

Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011

2011



[46] [46]

McNaughton, Gautham Ramalaxmi, Agustin Kruel, Carter R

Andrew D. McNaughton, Gautham Ramalaxmi, Agustin Kruel, Carter R. Knutson, Rohith A. Varikoti, and Neeraj Kumar. Cactus: Chemistry agent connecting tool-usage to science. 2024. 41 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

2024

[47] [47]

Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun

Botao Yu, Frazier N. Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun. Tooling or not tooling? the impact of tools on language agents for chemistry problem solving. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 7635–7655, Albuquerque, N...

2025

[48] [48]

Dru- gagent: Automating ai-aided drug discovery programming through llm multi-agent collabo- ration, 2025

Sizhe Liu, Yizhou Lu, Siyu Chen, Xiyang Hu, Jieyu Zhao, Yingzhou Lu, and Yue Zhao. Dru- gagent: Automating ai-aided drug discovery programming through llm multi-agent collabo- ration, 2025

2025

[49] [49]

LIDDIA:Language-basedintelligent drug discovery agent

RezaAverly, FrazierN.Baker, IanAWatson, andXiaNing. LIDDIA:Language-basedintelligent drug discovery agent. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12004–12028, Suzhou, China, November 2025. Association for Comput...

2025

[50] [50]

An auditable agent platform for automated molec- ular optimisation, 2025

Atabey Ünlü, Phil Rohr, and Ahmet Celebi. An auditable agent platform for automated molec- ular optimisation, 2025

2025

[51] [51]

Mragent: an llm-based automated agent for causal knowledge discovery in disease via mendelian randomization.Briefings in Bioinformat- ics, 26(2):bbaf140, 03 2025

Wei Xu, Gang Luo, Weiyu Meng, Xiaobing Zhai, Keli Zheng, Ji Wu, Yanrong Li, Abao Xing, Junrong Li, Zhifan Li, Ke Zheng, and Kefeng Li. Mragent: an llm-based automated agent for causal knowledge discovery in disease via mendelian randomization.Briefings in Bioinformat- ics, 26(2):bbaf140, 03 2025

2025

[52] [52]

RDKit: Open-source cheminformatics

[53] [53]

Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development.arXiv preprint arXiv:2102.09548, 2021

Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Con- nor W Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development.arXiv preprint arXiv:2102.09548, 2021

arXiv 2021

[54] [54]

Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, Juan F

David Mendez, Anna Gaulton, A. Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, Juan F. Mosquera, Prudence Mutowo, Michal Nowotka, María Gordillo-Marañón, FionaHunter, LauraJunco, GraceMugumbate, MilagrosRodriguez-Lopez, Francis Atkinson, Nicolas Bosc, Chris J. Radoux, Aldo Segura-Cabrera, Anne Hersey, and An- drew R. Leac...

2019

[55] [55]

Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, and Jenny Chong

Michael K. Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, and Jenny Chong. BindingDB in 2015: A public database for medicinal chemistry, computational chem- istryandsystemspharmacology.NucleicAcidsResearch,44(D1):D1045–D1053,January2016

2015

[56] [56]

Baell and Georgina A

Jonathan B. Baell and Georgina A. Holloway. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays.Journal of Medicinal Chemistry, 53(7):2719–2740, April 2010

2010

[57] [57]

Guacamol: bench- marking models for de novo molecular design.Journal of chemical information and modeling, 59(3):1096–1108, 2019

Nathan Brown, Marco Fiscato, Marwin HS Segler, and Alain C Vaucher. Guacamol: bench- marking models for de novo molecular design.Journal of chemical information and modeling, 59(3):1096–1108, 2019

2019

[58] [58]

Clustering with the average silhouette width.Computa- tional Statistics & Data Analysis, 158:107190, 2021

Fatima Batool and Christian Hennig. Clustering with the average silhouette width.Computa- tional Statistics & Data Analysis, 158:107190, 2021. 42 Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

2021

[59] [59]

Fast, sensitive and accu- rate integration of single-cell data with harmony.Nature methods, 16(12):1289–1296, 2019

Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accu- rate integration of single-cell data with harmony.Nature methods, 16(12):1289–1296, 2019

2019

[60] [60]

Deep genera- tive modeling for single-cell transcriptomics.Nature methods, 15(12):1053–1058, 2018

Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep genera- tive modeling for single-cell transcriptomics.Nature methods, 15(12):1053–1058, 2018

2018

[61] [61]

Moderated estimation of fold change and dispersion for rna-seq data with deseq2.Genome biology, 15(12):550, 2014

Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2.Genome biology, 15(12):550, 2014

2014

[62] [62]

Modelingandpredicting single-cell multi-gene perturbation responses with sclambda.bioRxiv, 2024

GefeiWang, TianyuLiu, JiaZhao, YoushuCheng, andHongyuZhao. Modelingandpredicting single-cell multi-gene perturbation responses with sclambda.bioRxiv, 2024

2024

[63] [63]

Ibarra, Olle Holmberg, Isaac Virshup, Mohammad Lotfollahi, Sabrina Richter, and Fabian J

Giovanni Palla, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, Ignacio L. Ibarra, Olle Holmberg, Isaac Virshup, Mohammad Lotfollahi, Sabrina Richter, and Fabian J. Theis. Squidpy: a scalable framework for spatial omics analysis.Nature Methods, 19(2):171–178, 2 2022

2022

[64] [64]

Jensen, Lars J

Peter B. Jensen, Lars J. Jensen, and Søren Brunak. Mining electronic health records: towards better research applications and clinical care.Nature Reviews Genetics, 13(6):395–405, 2012

2012

[65] [65]

George Hripcsak and David J. Albers. Next-generation phenotyping of electronic health records.Journal of the American Medical Informatics Association, 20(1):117–121, 2013

2013

[66] [66]

Jason Walonoski, Mike Kramer, Justin Nichols, Anthony Quina, Christian Moesel, Derek Hall, Chris Duffett, Kristi Dube, Tony Gallagher, and Sean McLachlan. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic elec- tronic health care record.Journal of the American Medical Informatics Association, 25(3):23...

2018

[67] [68]

Polygenic prediction via bayesian regression and continuous shrinkage priors.Nature communications, 10(1):1776, 2019

Tian Ge, Chia-Yen Chen, Yang Ni, Yen-Chen Anne Feng, and Jordan W Smoller. Polygenic prediction via bayesian regression and continuous shrinkage priors.Nature communications, 10(1):1776, 2019

2019

[68] [69]

Martin, Shengying Qin, Hail- iang Huang, and Tian Ge

Yunfeng Ruan, Yen-Feng Lin, Yen-Chen Anne Feng, Chia-Yen Chen, Max Lam, Zhenglin Guo, Stanley Global Asia Initiatives, Lin He, Akira Sawa, Alicia R. Martin, Shengying Qin, Hail- iang Huang, and Tian Ge. Improving polygenic prediction in ancestrally diverse populations. Nature Genetics, 54(5):573–580, May 2022

2022

[69] [70]

The gtex consortium atlas of genetic regulatory effects across human tissues.Science, 369(6509):1318–1330, 2020

GTEx Consortium. The gtex consortium atlas of genetic regulatory effects across human tissues.Science, 369(6509):1318–1330, 2020

2020

[70] [71]

On the art of compilingandusing’drug-like’chemicalfragmentspaces.ChemMedChem, 3(10):1503–1507, October 2008

Jörg Degen, Christof Wegscheid-Gerlach, Andrea Zaliani, and Matthias Rarey. On the art of compilingandusing’drug-like’chemicalfragmentspaces.ChemMedChem, 3(10):1503–1507, October 2008

2008

[71] [72]

conda: Asystem-level,binarypackageandenvironmentmanagerrunning on all major operating systems and platforms

condacontributors. conda: Asystem-level,binarypackageandenvironmentmanagerrunning on all major operating systems and platforms

[72] [73]

QuantStack and Mamba Contributors. mamba.

[73] [74]

Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A. Shoemaker, Paul A. Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang, and Evan E. Bolton. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Research, 49(D1):D1388–D1395, January 2021

[74] [75]

Janna Hastings, Gareth Owen, Adriano Dekker, Marcus Ennis, Namrata Kale, Venkatesh Muthukrishnan, Steve Turner, Neil Swainston, Pedro Mendes, and Christoph Steinbeck. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research, 44(D1):D1214–1219, January 2016

[75] [76]

John J. Irwin and Brian K. Shoichet. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. Journal of Chemical Information and Modeling, 45(1):177–182, January 2005

[76] [77]

Michael A Skinnider, Gregoire Courtine, Jocelyne Bloch, and Jordan W Squair. A clinical road map for single-cell omics. Cell, 188(14):3633–3647, 2025

[77] [78]

Dragomirka Jovic, Xue Liang, Hua Zeng, Lin Lin, Fengping Xu, and Yonglun Luo. Single-cell RNA sequencing technologies and applications: a brief overview. Clinical and Translational Medicine, 12(3):e694, 2022

[78] [79]

F Alexander Wolf, Philipp Angerer, and Fabian J Theis. Scanpy: large-scale single-cell gene expression data analysis. Genome Biology, 19(1):15, 2018

[79] [80]

Malte D Luecken, Maren Büttner, Kridsadakorn Chaichoompu, Anna Danese, Marta Interlandi, Michaela F Müller, Daniel C Strobl, Luke Zappia, Martin Dugas, Maria Colomé-Tatché, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 19(1):41–50, 2022

[80] [81]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011