Measuring Biological Capabilities and Risks of AI Agents

Alyssa Worland; Jeffrey Lee; Kyle Brady; Patricia Paskov

arxiv: 2606.19899 · v1 · pith:O26AJ7KSnew · submitted 2026-06-18 · 💻 cs.CY · cs.AI


Pith reviewed 2026-06-26 15:36 UTC · model grok-4.3

keywords: AI agents · biological risks · evaluations · biosecurity · AI policy · agentic systems · risk assessment · scientific tasks

The pith

Choices around defining, designing, running, scoring, and documenting biological agentic evaluations materially shape what their results imply about AI risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the challenge of producing credible evidence on the biological capabilities of AI agents that can perform multi-step scientific tasks. It introduces biological agentic evaluations as a tool for risk assessment while stressing that design decisions in those evaluations determine the strength of any risk conclusions drawn. A reader would care because such systems are entering real research workflows, making it essential for decision-makers to understand what evaluation outputs can and cannot establish. The authors synthesize existing evidence and offer experience-based considerations to support cautious interpretation by policymakers, funders, and biosecurity practitioners.

Core claim

Biological agentic evaluations assess AI systems capable of autonomously or collaboratively performing multi-step scientific tasks, but choices in how these evaluations are defined, designed, run, scored, and documented materially shape what results do and do not imply about biological risk. Drawing from the authors' own evaluations, the paper supplies practical considerations intended to help interpret outputs with appropriate caution and to guide investments and assessments.

What carries the argument

Biological agentic evaluations together with the set of practical, experience-grounded considerations on how design choices affect risk implications.

If this is right

Policymakers should interpret biological evaluation outputs with appropriate caution.
Public and private funders should direct resources toward high-leverage investments in AI-biology evaluation research.
Biosecurity practitioners gain support when assessing emerging AI systems.
Researchers designing or conducting agentic evaluations receive guidance on documenting choices to clarify what results imply.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Standardized reporting templates could emerge as a practical response to the emphasis on documentation.
Meta-analyses across different labs may need to adjust for variation in evaluation design choices to remain reliable.
The same interpretive caution could apply to agentic evaluations in other domains such as chemical or cyber risks.

Load-bearing premise

The practical considerations drawn from the authors' own evaluations are generalizable enough to guide interpretation of results produced by other organizations and frontier systems.

What would settle it

The central claim would be challenged by a direct comparison in which two evaluations of the same AI agent differ in only one documented design choice, such as scoring criteria, yet produce identical conclusions about biological risk levels.
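A minimal sketch of such a comparison, assuming a hypothetical evaluation in which each agent run is recorded as a count of completed subtasks. The `Run` type, both scoring functions, the 0.5 threshold, and the run data are all invented for illustration and do not come from the paper. The sketch holds the runs fixed and changes only the scoring criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One hypothetical agent run on a task split into subtasks."""
    subtasks_done: int
    subtasks_total: int


def strict_score(run: Run) -> bool:
    # Pass only if every subtask was completed.
    return run.subtasks_done == run.subtasks_total


def lenient_score(run: Run, threshold: float = 0.5) -> bool:
    # Pass if at least `threshold` of the subtasks were completed.
    return run.subtasks_done / run.subtasks_total >= threshold


def pass_rate(runs, scorer) -> float:
    # Fraction of runs that the given scorer counts as a pass.
    return sum(scorer(r) for r in runs) / len(runs)


# Made-up illustrative data, not results from the paper.
RUNS = [Run(4, 4), Run(3, 4), Run(2, 4), Run(1, 4), Run(0, 4)]

strict_rate = pass_rate(RUNS, strict_score)
lenient_rate = pass_rate(RUNS, lenient_score)
print(f"strict: {strict_rate:.2f}, lenient: {lenient_rate:.2f}")
# prints "strict: 0.20, lenient: 0.60"
```

With these invented runs, the same agent behavior yields a 20% or a 60% pass rate depending only on the rubric. That is the kind of divergence the paper's claim predicts, and a comparison showing no such divergence would count against it.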

Figures

Figures reproduced from arXiv: 2606.19899 by Alyssa Worland, Jeffrey Lee, Kyle Brady, Patricia Paskov.

**Figure 2.** A biological weapon risk chain (Brady & Lee et al., …

**Figure 3.** An example decomposition of “biological tool use” into discrete tasks and subtasks.

Original abstract

This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations -- drawing from our own evaluations -- that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags real sensitivities in how biological agentic evaluations are designed and scored, but it assumes, without showing it, that considerations drawn from the authors' own evaluations transfer to other labs and frontier systems.

The main takeaway is that choices in defining tasks, running agentic biological evaluations, scoring outputs, and documenting them can change what the numbers actually say about AI risks. The authors draw a set of practical considerations from evaluations they conducted themselves.

What works is the clear reminder that these design decisions are not neutral and that policymakers should not treat raw results as direct risk measures. The synthesis of existing evidence on AI-enabled bio risks is straightforward and the focus on interpretation issues is grounded in actual evaluation experience.

The soft spot is the lack of any comparative check. The considerations rest on the authors' setups, yet the paper offers no evidence that the same sensitivities appear in work from other organizations or on larger models. That leaves the advice for funders and external assessors resting on an untested transfer assumption.

This is for people in AI biosecurity policy and evaluation teams who need to read eval reports more carefully. Researchers running similar tests might pick up useful reminders on documentation and scoring.

It has enough practical content on a live policy issue to merit peer review, though referees should press on whether the considerations need more external validation to support the intended uses.

Referee Report

1 major / 1 minor

Summary. The paper synthesizes evidence on AI-enabled biological risks and positions biological agentic evaluations as an interpretation-sensitive tool for assessing AI scientists and agentic systems. Its central contribution is a set of practical considerations, drawn from the authors' own evaluations, on how choices in defining, designing, running, scoring, and documenting evaluations shape what results imply about risk; these are offered to help policymakers interpret outputs, guide funders, and support biosecurity practitioners.

Significance. If the considerations hold beyond the authors' specific setups, the work could usefully caution against over-interpreting evaluation results for frontier AI biological capabilities. The experience-grounded framing is a strength for highlighting under-documented design sensitivities, but the absence of comparative evidence across organizations or systems limits the strength of claims about guiding external interpretation and investment decisions.

major comments (1)

[Abstract/Introduction] The intended uses (guiding policymakers and funders on outputs from other organizations, assessing emerging systems) require the considerations to transfer beyond the authors' evaluations. Yet the paper provides no comparative analysis, cross-lab validation, or other evidence that the identified sensitivities hold for different labs, models, or frontier-scale systems. This untested transferability assumption is load-bearing for the policy contribution.

minor comments (1)

The manuscript would benefit from explicit statements distinguishing claims directly supported by the authors' evaluation data from broader interpretive guidance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments. We address the major comment below and outline revisions to the manuscript.

Referee: [Abstract/Introduction] The intended uses (guiding policymakers and funders on outputs from other organizations, assessing emerging systems) require the considerations to transfer beyond the authors' evaluations. Yet the paper provides no comparative analysis, cross-lab validation, or other evidence that the identified sensitivities hold for different labs, models, or frontier-scale systems. This untested transferability assumption is load-bearing for the policy contribution.

Authors: We agree that the intended uses described in the abstract and introduction presuppose a degree of transferability of the considerations beyond our own evaluations, and that the manuscript provides no comparative analysis or cross-lab validation to support this. The considerations are explicitly drawn from our own evaluation experience, and the paper does not claim or demonstrate that the identified sensitivities are universal across labs, models, or frontier-scale systems. To address this, we will revise the abstract and introduction to qualify the scope more precisely: the considerations are presented as experience-grounded insights intended to illustrate how design choices can affect interpretation and to encourage caution, rather than as validated general principles ready for direct application to other organizations' outputs. We will also add explicit language noting the lack of comparative evidence as a limitation and a direction for future work.

Revision: yes

Circularity Check

0 steps flagged

No circularity: considerations are experience-based guidance without reduction to fitted inputs or self-citation chains

The paper's central contribution consists of practical considerations for interpreting biological agentic evaluations, explicitly drawn from the authors' own work but presented as qualitative guidance rather than a derivation, prediction, or theorem. No equations, fitted parameters, or load-bearing self-citations appear in the provided text; the argument does not reduce any result to its inputs by construction. The manuscript is self-contained as a synthesis and advisory document whose claims rest on documented experience rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on domain assumptions about the relevance of agentic evaluations to biological risk and the transferability of lessons from the authors' evaluations; no free parameters, mathematical axioms, or invented entities are introduced.

axioms (1)

Domain assumption: Agentic evaluations can provide credible evidence about biological capabilities and risks when properly designed and interpreted. Invoked in the abstract as the basis for treating evaluations as a promising tool.

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages

[1]

Introducing the Frontier Safety Framework,

As of February 9, 2026: https://www.rand.org/pubs/research_reports/RRA4591-1.html 20 Dev, Sunishchal, Charles Teague, Grant Ellison, Kyle Brady, Ying-Chiang Jeffrey Lee, Sarah L. Gebauer, Henry Alexander Bradley, Dawid Maciorowski, Bria Persaud, Jordan Despanie, Barbara Del Castello, Alyssa Worland, Michael Miller, Adrian Salas, Dave Nguyen, James Liu, Ja...

2026
[2]

Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark,

As of January 13, 2026: https://www.frontiermodelforum.org/uploads/2025/03/PDF-Version-of-Preliminary- Reporting-Tiers.pdf Götting, Jasper, Pedro Medeiros, Jon G Sanders, Nathaniel Li, Long Phan, Karam Elabd, Lennart Justen, Dan Hendrycks, and Seth Donoughe, “Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark,” arXiv, April 29, 2025. As...

work page doi:10.7591/j.ctt1287dk2 2026
[3]

The Reality of AI and Biorisk,

As of January 13, 2026: https://arxiv.org/abs/2502.10517 Paskov, Patricia, Michael J. Byun, Kevin Wei, and Toby Webster, Preliminary Suggestions for Rigorous GPAI Model Evaluations, RAND Corporation, May 1, 2025. As of January 13, 2026: https://www.rand.org/pubs/perspectives/PEA3971-1.html Peppin, Aidan, Anka Reuel, Stephen Casper, Elliot Jones, Andrew St...

Pith/arXiv arXiv 2026
[4]

Evaluating Frontier Models for Dangerous Capabilities,

As of January 13, 2026: https://arxiv.org/pdf/2412.01946 Persaud, Bria, Ying-Chiang Jeffrey Lee, Jordan Despanie, Helin Hernandez, Henry Alexander Bradley, Sarah L. Gebauer, and Greg McKelvey, Jr., Automated Grading for Efficiently Evaluating the Dual-Use Biological Capabilities of Large Language Models, RAND Corporation, 2025. As of January 13, 2026: htt...

work page doi:10.1038/s41587-025-02650-8 2026
