AIRS-Bench: A Suite of Tasks for Frontier AI Research Science Agents
10 Pith papers cite this work.
years: 2026 (10)
citing papers explorer
- What Do Evolutionary Coding Agents Evolve?
  Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
- BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
  BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
- SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?
  SoundnessBench shows frontier LLMs exhibit pervasive optimism bias when rating the soundness of ML research proposals, frequently calling low-soundness ideas sound under standard prompts.
- ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
  ResearchClawBench is a new benchmark that evaluates autonomous AI research agents on 40 tasks grounded in published papers using expert rubrics, finding that top systems score only 20-26 out of 100.
- ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review
  ARA uses LLMs to build workflow graphs linking sources, methods, and outputs in papers, then scores reproducibility, reaching ~61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.
- AIRA₂: Overcoming Bottlenecks in AI Research Agents
  AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
- GEAR: Genetic AutoResearch for Agentic Code Evolution
  GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.
- AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
  A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.
- ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research
  ResearchLoop defines a protocol and state model for evidence-gated AI-assisted computational research and reports experiments across nine versions including self-hosting and task ablations.
- MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI