Accelerating scientific discovery with Co-Scientist
2 Pith papers cite this work.
Both citing papers are from 2026, and neither has been assigned a polarity verdict yet.
Citing papers
NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
NatureBench evaluates ten frontier AI coding agents on 90 tasks drawn from Nature-family papers, with web search disabled, and finds that the strongest agent surpasses the published SOTA on only 17.8% of tasks, succeeding mainly by translating problems into familiar supervised learning setups.
Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence
Coordinated AI agents improve scientific inference from partial evidence in cross-domain tasks when single sources are incomplete: they yield AUROC gains on the vector-borne disease and exoplanet benchmarks, while performance is tied on the others.