Airs-bench: a suite of tasks for frontier ai research science agents

Lupidi, Alisia, Gauri, Bhavul, Foster, Thomas Simon, Al Omari, Bassel, Magka, Despoina, Pepe, Alberto · 2026 · arXiv 2602.06855

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

AutoLab benchmark shows frontier models mostly fail at sustained iterative optimization due to premature termination, with persistence as the key success factor.

What Do Evolutionary Coding Agents Evolve?

cs.NE · 2026-05-19 · unverdicted · novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

cs.CE · 2026-05-15 · unverdicted · novelty 7.0

BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

cs.LG · 2026-05-28 · conditional · novelty 6.0

SoundnessBench shows frontier LLMs exhibit pervasive optimism bias when rating the soundness of ML research proposals, frequently calling low-soundness ideas sound under standard prompts.

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

cs.AI · 2026-05-25 · unverdicted · novelty 6.0

ScientistOne introduces Chain-of-Evidence and an audit system that achieves zero hallucinated references, perfect score verification, and top method-code alignment while matching or beating human experts on five frontier tasks and generalizing to six more.

ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

cs.DL · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

ARA uses LLMs to build workflow graphs linking sources, methods, and outputs in papers, then scores reproducibility, reaching ~61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.

AIRA_2: Overcoming Bottlenecks in AI Research Agents

cs.AI · 2026-03-27 · conditional · novelty 6.0

AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.

GEAR: Genetic AutoResearch for Agentic Code Evolution

cs.NE · 2026-05-08 · unverdicted · novelty 5.0

GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

cs.AI · 2026-05-22 · unverdicted · novelty 4.0

A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.

ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research

cs.AI · 2026-05-27 · unverdicted · novelty 3.0

ResearchLoop defines a protocol and state model for evidence-gated AI-assisted computational research and reports experiments across nine versions including self-hosting and task ablations.

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

cs.LG · 2026-05-28

citing papers explorer

Showing 1 of 1 citing paper after filters.

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks cs.CE · 2026-05-15 · unverdicted · none · ref 13
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

Airs-bench: a suite of tasks for frontier ai research science agents

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer