NatureBench evaluates ten frontier AI coding agents on 90 tasks from Nature papers under web-search-disabled conditions and finds the strongest agent surpasses published SOTA on only 17.8% of tasks, succeeding mainly by translating problems into familiar supervised learning setups.
Author contributions H.N
9 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (9)
Verdicts: UNVERDICTED (9)
Representative citing papers:
citing papers explorer
- NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
  NatureBench evaluates ten frontier AI coding agents on 90 tasks from Nature papers under web-search-disabled conditions and finds the strongest agent surpasses published SOTA on only 17.8% of tasks, succeeding mainly by translating problems into familiar supervised learning setups.
- Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements
  Closed-loop LM-agent auto research finds some transferable gains on molecular property prediction benchmarks via external data but shows non-transfer for model and feature edits selected on validation.
- Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture
  Presents MedSci Skills, an open-source toolkit with deterministic integrity gates for verifying LLM-assisted clinical manuscripts against reporting guidelines like STARD, PRISMA, and STROBE.
- DN-Hypo-Pipeline: An AI-Driven Workflow for Generating Hypotheses using Large Language Models and Scientific Explanations
  DN-Hypo-Pipeline operationalizes three philosophy-of-science accounts to direct LLMs toward principle-based hypothesis generation, claims superior performance over direct prompting, and derives two new transformer algorithms from the resulting hypotheses.
- Agentic Language-to-Objective Synthesis for Optofluidic Assembly
  Speak-to-Objective is a modular agentic pipeline that translates spoken or written commands into fully differentiable objective functions for optofluidic microparticle assembly using LLMs, inverse solvers, and experimental platforms.
- Ontology-constrained multi-LLM scoring of hypothesis support in the predictive processing literature
  A multi-LLM council scores predictive processing papers on an expert ontology, maps results in 3D hypothesis space, and introduces a dispersion metric showing greater spread in global versus local oddball paradigms.
- From Meta Idea to Advanced Mathematical Discovery -- Human-AI Co-Discovery of Sign-Embedding Quantum Algorithms
  Human-AI collaboration expanded a meta-idea on rational approximation into sign-embedding quantum algorithms for matrix problems, with humans retaining final judgment on routes and refinements.
- Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence
  Coordinated AI agents improve scientific inference from partial evidence in cross-domain tasks when single sources are incomplete, as demonstrated by AUROC gains in vector-borne disease and exoplanet benchmarks but tied performance in others.
- Hephaestus: Toward a Cybersecurity AI Scientist
  The paper proposes the Cybersecurity AI Scientist as a modular multi-agent architecture for automating cybersecurity research, distinguished by its focus on non-stationary threats and anchored in a four-zeros risk-trust-incident-energy frame.