Measuring Biological Capabilities and Risks of AI Agents
Pith reviewed 2026-06-26 15:36 UTC · model grok-4.3
The pith
Choices around defining, designing, running, scoring, and documenting biological agentic evaluations materially shape what their results imply about AI risks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Biological agentic evaluations assess AI systems capable of autonomously or collaboratively performing multi-step scientific tasks, but choices in how these evaluations are defined, designed, run, scored, and documented materially shape what results do and do not imply about biological risk. Drawing from the authors' own evaluations, the paper supplies practical considerations intended to help interpret outputs with appropriate caution and to guide investments and assessments.
What carries the argument
Biological agentic evaluations together with the set of practical, experience-grounded considerations on how design choices affect risk implications.
If this is right
- Policymakers should interpret biological evaluation outputs with appropriate caution.
- Public and private funders should direct resources toward high-leverage investments in AI-biology evaluation research.
- Biosecurity practitioners gain support when assessing emerging AI systems.
- Researchers designing or conducting agentic evaluations receive guidance on documenting choices to clarify what results imply.
Where Pith is reading between the lines
- Standardized reporting templates could emerge as a practical response to the emphasis on documentation.
- Meta-analyses across different labs may need to adjust for variation in evaluation design choices to remain reliable.
- The same interpretive caution could apply to agentic evaluations in other domains such as chemical or cyber risks.
Load-bearing premise
The practical considerations drawn from the authors' own evaluations are generalizable enough to guide interpretation of results produced by other organizations and frontier systems.
What would settle it
The central claim would be challenged by a direct comparison in which two evaluations of the same AI agent differ in only one documented design choice, such as scoring criteria, yet reach identical conclusions about biological risk levels.
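To make the single-choice comparison concrete, here is a minimal sketch in which the same set of agent runs is scored under two rules that differ only in the scoring criterion. Everything in it is hypothetical: the `Run` record, the two rule functions, the 0.75 threshold, and the run outcomes are invented for illustration and are not taken from the paper or from any real evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One hypothetical agent attempt at a multi-step task."""
    steps_passed: int
    steps_total: int


def strict_success(run: Run) -> bool:
    # Scoring rule A (assumed): a run counts only if every step passes.
    return run.steps_passed == run.steps_total


def partial_credit_success(run: Run, threshold: float = 0.75) -> bool:
    # Scoring rule B (assumed): a run counts if at least `threshold`
    # of its steps pass.
    return run.steps_passed / run.steps_total >= threshold


def success_rate(runs, rule) -> float:
    # Fraction of runs that the given scoring rule counts as successes.
    return sum(rule(r) for r in runs) / len(runs)


# Made-up outcomes for illustration only; not data from the paper.
runs = [Run(8, 8), Run(6, 8), Run(7, 8), Run(3, 8), Run(8, 8), Run(5, 8)]

rate_strict = success_rate(runs, strict_success)
rate_partial = success_rate(runs, partial_credit_success)
print(f"strict: {rate_strict:.2f}  partial credit: {rate_partial:.2f}")
```

With these invented outcomes the two rules give different success rates for identical agent behavior, which is the kind of sensitivity the core claim describes. A comparison on real runs that showed no such difference would count as evidence against it.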
Original abstract
This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations -- drawing from our own evaluations -- that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper synthesizes evidence on AI-enabled biological risks and positions biological agentic evaluations as an interpretation-sensitive tool for assessing AI scientists and agentic systems. Its central contribution is a set of practical considerations, drawn from the authors' own evaluations, on how choices in defining, designing, running, scoring, and documenting evaluations shape what results imply about risk; these are offered to help policymakers interpret outputs, guide funders, and support biosecurity practitioners.
Significance. If the considerations hold beyond the authors' specific setups, the work could usefully caution against over-interpreting evaluation results for frontier AI biological capabilities. The experience-grounded framing is a strength for highlighting under-documented design sensitivities, but the absence of comparative evidence across organizations or systems limits the strength of claims about guiding external interpretation and investment decisions.
major comments (1)
- [Abstract/Introduction] Abstract and introduction: the intended uses (guiding policymakers and funders on outputs from other organizations, assessing emerging systems) require the considerations to transfer beyond the authors' evaluations, yet no comparative analysis, cross-lab validation, or evidence is provided that the identified sensitivities hold for different labs, models, or frontier-scale systems; this untested transferability assumption is load-bearing for the policy contribution.
minor comments (1)
- The manuscript would benefit from explicit statements distinguishing claims directly supported by the authors' evaluation data from broader interpretive guidance.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address the major comment below and outline revisions to the manuscript.
- Referee: [Abstract/Introduction] Abstract and introduction: the intended uses (guiding policymakers and funders on outputs from other organizations, assessing emerging systems) require the considerations to transfer beyond the authors' evaluations, yet no comparative analysis, cross-lab validation, or evidence is provided that the identified sensitivities hold for different labs, models, or frontier-scale systems; this untested transferability assumption is load-bearing for the policy contribution.
Authors: We agree that the intended uses described in the abstract and introduction presuppose that the considerations transfer beyond our own evaluations, and that the manuscript provides no comparative analysis or cross-lab validation to support this. The considerations are explicitly drawn from our own evaluation experience, and the paper neither claims nor demonstrates that the identified sensitivities are universal across labs, models, or frontier-scale systems. To address this, we will revise the abstract and introduction to qualify the scope more precisely: the considerations are experience-grounded insights intended to illustrate how design choices can affect interpretation and to encourage caution, not validated general principles ready for direct application to other organizations' outputs. We will also add explicit language noting the lack of comparative evidence as a limitation and a direction for future work.
Revision: yes
Circularity Check
No circularity: considerations are experience-based guidance without reduction to fitted inputs or self-citation chains
The paper's central contribution consists of practical considerations for interpreting biological agentic evaluations, explicitly drawn from the authors' own work but presented as qualitative guidance rather than a derivation, prediction, or theorem. No equations, fitted parameters, or load-bearing self-citations appear in the provided text; the argument does not reduce any result to its inputs by construction. The manuscript is self-contained as a synthesis and advisory document whose claims rest on documented experience rather than a closed logical loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Agentic evaluations can provide credible evidence about biological capabilities and risks when properly designed and interpreted.