Evaluating Frontier Models for Dangerous Capabilities
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED (2)
citing papers explorer
- Evaluating Large Language Models in Scientific Discovery
  The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
- Measuring Biological Capabilities and Risks of AI Agents
  Synthesizes current evidence on AI biological risks and provides experience-grounded considerations for defining, running, and interpreting agentic evaluations.