pith. machine review for the scientific record. sign in

arxiv: 2510.21652 · v2 · submitted 2025-10-24 · 💻 cs.AI · cs.CL

Recognition: unknown

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Jonathan Bragg , Mike D'Arcy , Nishant Balepur , Dan Bareket , Bhavana Dalvi , Sergey Feldman , Dany Haddad , Jena D. Hwang , Peter Jansen , Varsha Kishore , Bodhisattwa Prasad Majumder , Aakanksha Naik , Sigal Rahamimov , Kyle Richardson , Amanpreet Singh , Harshit Surana , Aryeh Tiktinsky , Rosni Vasu , Guy Wiener , Chloe Anastasiades , Stefan Candra , Jason Dunkelberger , Dan Emery , Rob Evans , Malachi Hamada , Regan Huff , Rodney Kinney , Matt Latzke , Jaron Lochner , Ruben Lozano-Aguilera , Cecile Nguyen , Smita Rao , Amber Tanaka , Brooke Vlahos , Peter Clark , Doug Downey , Yoav Goldberg , Ashish Sabharwal , Daniel S. Weld

Authors on Pith no claims yet
classification 💻 cs.AI cs.CL
keywords agentsresearchscientificevaluationsuiteagentagenticasta
0
0 comments X
read the original abstract

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they often (1) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (2) do not account for confounding variables such as model cost and tool access; (3) do not provide standardized interfaces for quick agent prototyping and evaluation; (4) fail to provide holistic, product-informed measures of real-world use cases such as science research; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides a holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MaD Physics: Evaluating information seeking under constraints in physical environments

    cs.AI 2026-05 unverdicted novelty 7.0

    MaD Physics is a new benchmark for evaluating AI agents on constrained information-seeking, model inference, and prediction in three physical environments with altered laws to avoid knowledge contamination.

  2. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 conditional novelty 7.0

    AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.

  3. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 conditional novelty 7.0

    AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures...

  4. AI scientists produce results without reasoning scientifically

    cs.AI 2026-04 conditional novelty 7.0

    LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

  5. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  6. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 unverdicted novelty 6.0

    An integrated AI agent framework for CFD uses vision-based physics gates to autonomously discover a Spalart-Allmaras runtime correction that cuts lower-wall skin-friction error by 7.89% versus DNS on the periodic hill...