MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
Miller, Tatiana Shavrina, Jakob N
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3representative citing papers
ARA extracts workflow graphs from papers and scores reproducibility, reaching 61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
citing papers explorer
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review
ARA extracts workflow graphs from papers and scores reproducibility, reaching 61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.
-
AIRA_2: Overcoming Bottlenecks in AI Research Agents
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.