RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

Aakaash Meduri; Emre Ulgac; Lukas Weidener; Marko Brkić; Mihailo Jovanović

arxiv: 2605.21545 · v1 · pith:LOPMJKJXnew · submitted 2026-05-20 · 💻 cs.SE · cs.AI

Lukas Weidener, Marko Brkić, Mihailo Jovanović, Emre Ulgac, Aakaash Meduri

Pith reviewed 2026-05-22 01:13 UTC · model grok-4.3

keywords: refusal rate, LLM safety, biological research, risk discrimination, dual-use prompts, benchmark, frontier models, safety calibration

The pith

Strict refusal rate misranks frontier LLMs on biological research prompts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces RefusalBench, a set of 47 matched prompt bundles that keep the same research task while changing only the biological risk level from benign to borderline to dual-use. Testing 19 frontier models shows refusal rates vary from nearly zero to over 94 percent on identical prompts, yet this number does not track how well each model separates safe requests from risky ones. One model reaches the highest discrimination between tiers while ranking only seventh by refusal rate, and another shows a sharp drop in discrimination with no gain in spotting dual-use content. The benchmark also finds that high refusal often stems from fixed templates rather than case-by-case judgment, and that many models hedge but still assist on risky prompts in ways binary counts miss. These patterns matter for anyone using LLMs as backbones in biological research because choosing models by refusal rate alone can select the wrong tools for safety.
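The hedge-but-help point can be made concrete with a minimal Python sketch. It shows how a three-way response label exposes partial compliance that a binary refusal count folds into "did not refuse". The label names (`refuse`, `hedge_help`, `comply`), the helper `response_profile`, and the example labels are hypothetical illustrations, not the paper's adjudication scheme or data.

```python
from collections import Counter

# Hypothetical labels for one model's responses to ten dual-use prompts.
# The label names are illustrative, not the paper's adjudication codes.
labels = [
    "refuse", "hedge_help", "hedge_help", "comply", "refuse",
    "hedge_help", "comply", "hedge_help", "refuse", "hedge_help",
]

def response_profile(labels):
    """Return the share of responses carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {name: counts[name] / total for name in ("refuse", "hedge_help", "comply")}

profile = response_profile(labels)

# A binary metric only sees "refused" versus "did not refuse",
# so hedge_help and comply become indistinguishable.
strict_refusal_rate = profile["refuse"]
binary_non_refusal = 1.0 - strict_refusal_rate

print(profile)
print(f"strict refusal rate: {strict_refusal_rate:.2f}")
print(f"non-refusal under a binary metric: {binary_non_refusal:.2f}")
```

In this made-up example the binary view reports 70 percent non-refusal, while the three-way view shows that most of that mass is hedged assistance rather than plain compliance.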

Core claim

RefusalBench demonstrates that strict refusal rate misranks safety calibration across frontier large language models on biological research prompts. The benchmark uses 141 prompts in 47 bundles that hold task framing constant while varying only biological risk tier. Grok 4.20 achieves the highest tier discrimination with a Youden's J of 0.787 while ranking seventh by overall refusal rate, and Claude Opus 4.7 shows a 65 percent drop in discrimination from prior versions with no improvement in dual-use detection. Provider identity strongly predicts refusal behavior, but this traces to shared safety-policy templates rather than model-level reasoning. Nine of 18 models exhibit a hedge-but-help partial-compliance pattern at the dual-use tier that binary refusal metrics cannot detect.

What carries the argument

The matched-triple benchmark design that holds task framing constant while varying only biological risk tier.
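As a rough illustration of the matched-triple structure, the following Python sketch represents each prompt with a bundle identifier and a risk tier, then checks that every bundle carries exactly one prompt per tier. The class `Prompt`, the function `validate_bundles`, and the placeholder prompt texts are assumptions made for this sketch; the paper's actual data format is not shown in the source.

```python
from dataclasses import dataclass

# The three risk tiers named in the abstract.
TIERS = ("benign", "borderline", "dual-use")

@dataclass(frozen=True)
class Prompt:
    bundle_id: int  # prompts in one bundle share the same task framing
    tier: str       # one of TIERS
    text: str

def validate_bundles(prompts):
    """Check that every bundle holds exactly one prompt per risk tier.

    Returns the number of bundles, or raises ValueError on a malformed bundle.
    """
    tiers_by_bundle = {}
    for prompt in prompts:
        tiers_by_bundle.setdefault(prompt.bundle_id, []).append(prompt.tier)
    for bundle_id, tiers in tiers_by_bundle.items():
        if sorted(tiers) != sorted(TIERS):
            raise ValueError(f"bundle {bundle_id} has tiers {tiers}")
    return len(tiers_by_bundle)

# Placeholder prompts only: 47 bundles x 3 tiers = 141 prompts.
prompts = [
    Prompt(bundle_id, tier, f"<task {bundle_id} at {tier} tier>")
    for bundle_id in range(47)
    for tier in TIERS
]
n_bundles = validate_bundles(prompts)
print(n_bundles, len(prompts))
```

A check of this kind only confirms the bundle bookkeeping. Whether the three variants in a bundle really differ in risk tier alone is the load-bearing premise the sketch cannot test.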

If this is right

Strict refusal rate alone cannot rank models for safety calibration on biological research prompts.
Tier discrimination metrics capture calibration better than raw refusal counts.
Provider access paths drive most observed refusals through fixed templates.
Partial compliance patterns on dual-use prompts require evaluation beyond binary refusal.
Three models fail to refuse even the positive-control prompts that should trigger refusal.
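The first two implications can be sketched in a few lines of Python. The example computes Youden's J as TPR minus FPR and compares the ranking it produces with the ranking by overall refusal rate. It assumes that refusals on dual-use prompts count as true positives and refusals on benign prompts as false positives, with the borderline tier left out; the source does not spell out this mapping. The model names and all counts are hypothetical.

```python
def youden_j(refused_dual_use, n_dual_use, refused_benign, n_benign):
    """Youden's J = TPR - FPR.

    Assumption for this sketch: a refusal on a dual-use prompt is a true
    positive and a refusal on a benign prompt is a false positive.
    """
    tpr = refused_dual_use / n_dual_use
    fpr = refused_benign / n_benign
    return tpr - fpr

def overall_refusal_rate(refused_dual_use, n_dual_use, refused_benign, n_benign):
    """Share of all prompts refused, ignoring which tier they belong to."""
    return (refused_dual_use + refused_benign) / (n_dual_use + n_benign)

# Hypothetical counts over 47 dual-use and 47 benign prompts per model:
# (refused_dual_use, n_dual_use, refused_benign, n_benign)
models = {
    "model_a": (45, 47, 40, 47),  # refuses almost everything
    "model_b": (38, 47, 2, 47),   # refuses selectively
}

by_rate = sorted(models, key=lambda m: overall_refusal_rate(*models[m]), reverse=True)
by_j = sorted(models, key=lambda m: youden_j(*models[m]), reverse=True)

for name, counts in models.items():
    print(name, round(overall_refusal_rate(*counts), 3), round(youden_j(*counts), 3))
print("ranked by refusal rate:", by_rate)
print("ranked by Youden's J:  ", by_j)
```

In this made-up case the model that refuses most often ranks last on discrimination, which is the kind of reversal the misranking claim describes.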

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmark designs that isolate risk tiers could be applied to safety evaluation in other domains such as chemical synthesis.
Model developers might prioritize training for accurate risk discrimination over simply raising overall refusal rates.
Researchers using these models for biology work should test discrimination on their specific tasks rather than relying on aggregate refusal statistics.

Load-bearing premise

The 47 prompt bundles correctly isolate biological risk tier without residual subdomain or framing confounds.

What would settle it

A replication on new prompt bundles where overall refusal rate accurately predicts tier discrimination performance would falsify the misranking claim.

Figures

Figures reproduced from arXiv: 2605.21545 by Aakaash Meduri, Emre Ulgac, Lukas Weidener, Marko Brkić, Mihailo Jovanović.

**Figure 2.** Strict refusal rates across the 19-model panel at three risk tiers. Grouped vertical bars for each model show ...

**Figure 3.** Strict refusal rates by model and protein-design subdomain. Rows are the 19 panel models sorted by ...

**Figure 4.** Strict refusal rate across Claude Opus versions 4.5, 4.6, and 4.7 by risk tier. Lines connect the three ...

Original abstract

Frontier large language models are increasingly deployed as orchestration backbones for biological research workflows, yet no shared evidence base exists for comparing their refusal behaviour on legitimate research prompts. RefusalBench, introduced here, is a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use), enabling tier-conditioned comparisons robust to subdomain confounding. A 15-prompt should-refuse positive-control module establishes per-model calibration floors; three models fail to refuse even these prompts. Across 19 frontier models in the May 2026 snapshot, strict refusal rates span 0.1% to 94.6% on identical prompts. Jurisdiction does not predict refusal in this snapshot (Mann-Whitney U, p = 0.393; EU n = 1, US bimodal); provider identity does, with Anthropic's API stack predicting refusal at OR = 21.03 (95% CI: 14.58-30.34 prompt-clustered; 5.70-77.55 under model-clustered GEE). This effect is best read as access-path-level rather than model-weight-level: 99.8% of Anthropic's strict refusals carry the same safety_policy adjudicated reason code, consistent with a small set of canonical refusal templates rather than case-by-case model reasoning. Strict refusal rate misranks safety calibration: Grok 4.20 achieves the highest tier discrimination (Youden's J = 0.787) while ranking only seventh by overall refusal rate, and Claude Opus 4.7's J dropped 65% from prior versions with no improvement in dual-use detection. Nine of 18 frontier models exhibit a hedge-but-help partial-compliance pattern at dual-use tier that binary refusal metrics cannot detect.
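For readers unfamiliar with the odds-ratio figures in the abstract, here is a minimal Python sketch of an odds ratio with a plain Wald 95% confidence interval from a 2x2 table. It is not the paper's analysis: the abstract reports prompt-clustered and model-clustered (GEE) intervals, and this unclustered version treats every response as independent. The function name `odds_ratio_wald_ci` and the counts are hypothetical.

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: refusals in the group of interest,  b: non-refusals in that group
    c: refusals in the comparison group,   d: non-refusals in that group
    Assumes independent observations (no clustering by prompt or model).
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, chosen only to illustrate the arithmetic.
or_value, ci_lower, ci_upper = odds_ratio_wald_ci(a=60, b=40, c=30, d=120)
print(f"OR = {or_value:.2f}, 95% CI {ci_lower:.2f} to {ci_upper:.2f}")
```

The gap between the two intervals quoted in the abstract (14.58-30.34 versus 5.70-77.55) shows how much the choice of clustering unit can widen the uncertainty around the same point estimate, which an unclustered calculation like this one would understate.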

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Refusal rates misrank models on bio risk prompts because they ignore tier discrimination, and this benchmark with matched triples plus Youden's J tries to measure the gap directly.

The main thing to know is that raw refusal percentages on biological prompts do not track how well models separate benign from dual-use work, and the paper supplies a benchmark meant to expose that mismatch. RefusalBench uses 47 matched triples that keep the core task fixed while shifting only the risk tier from benign to borderline to dual-use, plus a 15-prompt positive-control set that should trigger refusals.

Across 19 frontier models they report refusal rates from near zero to 94 percent, strong provider-level effects with Anthropic stacks showing much higher odds of refusal, and cases where discrimination (Youden's J) and refusal rate diverge sharply, such as Grok 4.20 topping the J metric while sitting seventh on raw refusals. Claude Opus 4.7 also shows a large drop in J from earlier versions without better dual-use detection. Nine models display a hedge-but-help pattern that binary refusal counts miss entirely.

That construction and the empirical misranking are the concrete additions here. The positive-control module and the prompt-clustered odds ratios add some grounding to the provider claim.

The soft spot is the tier labels themselves. To create the dual-use contrast the bundles must add specific agents, intent language, or misuse details, which can introduce lexical signals that models pick up from training data rather than any calibrated risk reasoning. The matched-triple design reduces subdomain noise but does not rule out those surface cues driving the J scores. The abstract gives no prompt examples or inter-rater numbers for the tier assignments, so it is hard to judge how cleanly the isolation works.

This paper is aimed at groups building or auditing safety evaluations for dual-use domains. Anyone who already runs refusal tests on bio prompts will see value in the tier-conditioned metric and the control set. It deserves peer review because the benchmark is new, the data are direct measurements, and the central claim is falsifiable once the prompts and labels are examined.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RefusalBench, a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use). Evaluating 19 frontier models from a May 2026 snapshot, it reports strict refusal rates from 0.1% to 94.6%, that provider identity predicts refusal (Anthropic OR = 21.03 with prompt-clustered CI), but that refusal rate misranks safety calibration: Grok 4.20 achieves the highest Youden's J of 0.787 while ranking seventh by refusal rate, Claude Opus 4.7's J dropped 65% from prior versions, and nine models show hedge-but-help partial compliance at dual-use tier. A 15-prompt positive-control module is included.

Significance. If the tier isolation holds, the work supplies a controlled empirical tool for comparing LLM refusal on biological research prompts and demonstrates that binary refusal metrics miss important calibration differences. Strengths include the direct empirical measurements with no free parameters or self-referential definitions, the subdomain-controlled matched-triple design, and the positive-control module establishing refusal floors.

major comments (2)

[Benchmark construction] Matched-triple design: The construction holds task framing constant but necessarily introduces differentiating content (specific agents, intent phrases, or misuse indicators) to create tier contrast in dual-use variants. These additions risk creating surface-level lexical cues that models can detect from training data, so that high Youden's J scores may reflect cue sensitivity rather than calibrated risk reasoning. This directly affects the central claim that refusal rate misranks safety calibration.
[Results on tier discrimination] Tier discrimination results: The reported Youden's J = 0.787 for Grok 4.20 and the 65% drop for Claude Opus 4.7 are load-bearing for the misranking conclusion, yet the manuscript provides no per-tier raw refusal counts, confusion matrices, or sensitivity analysis showing how J changes under plausible cue perturbations.

minor comments (2)

[Abstract and methods] Abstract and methods: No example prompts from the 47 bundles or the positive-control module are shown, and no inter-rater reliability statistic is reported for the tier labeling process.
[Results] Statistical reporting: The OR = 21.03 and p = 0.393 are given with clustering adjustments, but the main text does not include a table of raw refusal counts per model and tier that would allow independent verification.
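The per-model, per-tier count table the referee asks for could be tabulated along these lines. This is a sketch under assumptions: the record layout `(model, tier, refused)`, the function `refusal_table`, and the records themselves are hypothetical and do not come from the paper.

```python
from collections import defaultdict

# Hypothetical adjudicated records: (model, tier, refused).
records = [
    ("model_a", "benign", False),
    ("model_a", "benign", True),
    ("model_a", "dual-use", True),
    ("model_a", "dual-use", True),
    ("model_b", "benign", False),
    ("model_b", "dual-use", False),
    ("model_b", "dual-use", True),
]

def refusal_table(records):
    """Count refusals and totals for every (model, tier) cell."""
    table = defaultdict(lambda: {"refused": 0, "total": 0})
    for model, tier, refused in records:
        cell = table[(model, tier)]
        cell["total"] += 1
        cell["refused"] += int(refused)
    return dict(table)

table = refusal_table(records)
for (model, tier), cell in sorted(table.items()):
    print(f"{model:8s} {tier:9s} {cell['refused']}/{cell['total']}")
```

Publishing raw cells like these would let a reader recompute refusal rates and Youden's J independently instead of relying on the summary statistics alone.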

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve clarity and robustness.

Referee: [Benchmark construction] Matched-triple design: The construction holds task framing constant but necessarily introduces differentiating content (specific agents, intent phrases, or misuse indicators) to create tier contrast in dual-use variants. These additions risk creating surface-level lexical cues that models can detect from training data, so that high Youden's J scores may reflect cue sensitivity rather than calibrated risk reasoning. This directly affects the central claim that refusal rate misranks safety calibration.

Authors: We appreciate the concern about potential lexical cues introduced by the differentiating content needed to establish risk-tier contrast. The matched-triple design was constructed precisely to hold core task framing and subdomain constant while varying only the biological risk indicators, which is necessary to isolate the effect of risk tier. The 15-prompt positive-control module provides an independent check that models respond to explicit refusal triggers rather than incidental phrasing. To strengthen the claim, we will add a sensitivity analysis in the revision that perturbs the differentiating phrases and reports the resulting change in Youden's J for the top models. revision: yes
Referee: [Results on tier discrimination] Tier discrimination results: The reported Youden's J = 0.787 for Grok 4.20 and the 65% drop for Claude Opus 4.7 are load-bearing for the misranking conclusion, yet the manuscript provides no per-tier raw refusal counts, confusion matrices, or sensitivity analysis showing how J changes under plausible cue perturbations.

Authors: We agree that the absence of per-tier raw counts and confusion matrices limits the ability to fully evaluate the reported Youden's J values. We will include these tables and matrices in the revised manuscript. We will also incorporate the sensitivity analysis for cue perturbations referenced in the first comment to demonstrate that the tier-discrimination results are not driven by surface lexical features. revision: yes

Circularity Check

0 steps flagged

No circularity: all quantities are direct empirical measurements on a new benchmark

The paper constructs RefusalBench as a new matched-triple dataset of 141 prompts and computes refusal rates, Youden's J, odds ratios, and non-parametric tests directly from model outputs on those prompts. No equations fit parameters to a subset and then label the result a prediction, no self-citations supply load-bearing uniqueness theorems or ansatzes, and the tier-isolation claim is a methodological design statement rather than a definitional reduction. All reported statistics follow standard definitions and are falsifiable against the released prompt set.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that expert-assigned risk tiers are accurate and that the positive-control prompts are universally recognized as should-refuse. No free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Prompts can be reliably classified into benign, borderline, and dual-use tiers without subdomain confounding when task framing is held constant.
This classification underpins all tier-conditioned comparisons and the claim that refusal rate misranks safety.

pith-pipeline@v0.9.0 · 5891 in / 1291 out tokens · 46234 ms · 2026-05-22T01:13:26.503416+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "Youden’s J = TPR − FPR quantifies how well each model discriminates between legitimate and dangerous requests"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

One-shot design of functional protein binders with BindCraft

Martin Pacesa, Lennart Nickel, Christian Schell- haas, Joseph Schmidt, Ekaterina Pyatova, et al. One-shot design of functional protein binders with BindCraft. Nature, 646:483–492, 2025. doi: 10.1038/s41586-025-09429-6

work page doi:10.1038/s41586-025-09429-6 2025

[2] [2]

ProteinCrow: A language model agent that can design proteins

Manvitha Ponnapati, Sam Cox, Cade Gordon, Michael Hammerling, Siddharth Narayanan, et al. ProteinCrow: A language model agent that can design proteins. In ICML 2025 Work- shop on Generative AI for Biology, volume 267 of Proceedings of Machine Learning Research,

work page 2025

[3] [3]

URL https://openreview.net/ pdf?id=ljXgWDtqCu

work page

[4] [4]

Beyond protein language models: An agentic LLM framework for mechanistic enzyme design

Bruno Jacob, Khushbu Agarwal, Marcel Baer, Peter Rice, and Simone Raugei. Beyond protein language models: An agentic LLM framework for mechanistic enzyme design. arXiv preprint,

work page

[5] [5]

arXiv:2511.19423

URL https://arxiv.org/abs/ 2511.19423. arXiv:2511.19423

work page arXiv

[6] [6]

ProtoCy- cle: Reflective tool-augmented planning for text-guided protein design

Yutang Ge, Guojiang Zhao, Sihang Li, Zheng Cheng, Zifeng Zhao, et al. ProtoCy- cle: Reflective tool-augmented planning for text-guided protein design. arXiv preprint ,

work page

[7] [7]

ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design

URL https://arxiv.org/abs/ 2604.16896. arXiv:2604.16896. 26 of 34

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

ProteinMCP: An agentic AI framework for autonomous protein engineer- ing

Xiaopeng Xu, Chenjie Feng, Chao Zha, Wenjia He, Maolin He, et al. ProteinMCP: An agentic AI framework for autonomous protein engineer- ing. Protein Science, 35(4):e70547, 2026. doi: 10.1002/pro.70547

work page doi:10.1002/pro.70547 2026

[9] [9]

Using a GPT-5-driven autonomous lab to optimize the cost and titer of cell-free protein synthesis

Alexus Smith, Edmund Wong, Ronan Donovan, Chapman, Harry, et al. Using a GPT-5-driven autonomous lab to optimize the cost and titer of cell-free protein synthesis. bioRxiv preprint,

work page

[10] [10]

doi: 10.64898/2026.02.05.703998

work page doi:10.64898/2026.02.05.703998 2026

[11] [11]

Agen- tic BAIM–LLM evaluation (ABLE): Bench- marking LLM use of protein design tools

Bryce Cai, Geetha Jeyapragasan, Samira Nedun- gadi, Jake Yukich, and Seth Donoughe. Agen- tic BAIM–LLM evaluation (ABLE): Bench- marking LLM use of protein design tools. In NeurIPS 2025 Workshop on Biosecu- rity Safeguards for Generative AI , 2025. URL https://openreview.net/pdf? id=fDysOrWaGd

work page 2025

[12] [12]

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, et al. XSTest: A test suite for identifying exagger- ated safety behaviours in large language models. In Proceedings of NAACL 2024 , pages 5377– 5400, 2024. URL https://arxiv.org/ abs/2308.01263

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

OR-Bench: An over- refusal benchmark for large language mod- els

Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR-Bench: An over- refusal benchmark for large language mod- els. In Proceedings of the 42nd Interna- tional Conference on Machine Learning (ICML 2025), volume 267 of Proceedings of Ma- chine Learning Research, pages 11515–11542,

work page 2025

[14] [14]

arXiv:2405.20947

URL https://arxiv.org/abs/ 2405.20947. arXiv:2405.20947

work page arXiv

[15] [15]

Cannot or should not? Automatic analysis of refusal composition in IFT/RLHF datasets and refusal behavior of black-box LLMs

Alexander von Recum, Christoph Schnabl, Gabor Hollbeck, Silas Alberti, Philip Blinde, et al. Cannot or should not? Automatic analysis of refusal composition in IFT/RLHF datasets and refusal behavior of black-box LLMs. arXiv preprint, 2024. URL https://arxiv. org/abs/2412.16974. arXiv:2412.16974

work page arXiv 2024

[16] [16]

Forbid- den science: Dual-use AI challenge benchmark and scientific refusal tests

David Noever and Forrest McKee. Forbid- den science: Dual-use AI challenge benchmark and scientific refusal tests. arXiv preprint ,

work page

[17] [17]

arXiv:2502.06867

URL https://arxiv.org/abs/ 2502.06867. arXiv:2502.06867

work page arXiv

[18] [18]

Political censorship in large language models originating from China

Jennifer Pan and Xu Xu. Political censorship in large language models originating from China. PNAS Nexus, 5(2):pgag013, 2026. doi: 10.1093/ pnasnexus/pgag013

work page 2026

[19] [19]

Virol- ogy capabilities test (VCT): A multimodal virology Q&A benchmark

Jasper Götting, Pedro Medeiros, Jon Sanders, Nathaniel Li, Long Phan, et al. Virol- ogy capabilities test (VCT): A multimodal virology Q&A benchmark. arXiv preprint ,

work page

[20] [20]

arXiv:2504.16137

URL https://arxiv.org/abs/ 2504.16137. arXiv:2504.16137

work page arXiv

[21]

Emily Soice, Rafael Rocha, Kimberlee Cordova, Michael Specter, and Kevin Esvelt. Can large language models democratize access to dual-use biotechnology? arXiv preprint, 2023. URL https://arxiv.org/abs/2306.03809. arXiv:2306.03809

[23]

David Ochoa, Andrew Hercules, Miguel Carmona, Daniel Suveges, Jarrod Baker, et al. The next-generation Open Targets platform: reimagined, redesigned, rebuilt. Nucleic Acids Research, 51(D1):D1353–D1359, 2023. doi: 10.1093/nar/gkac1046

[24]

The UniProt Consortium. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Research, 53(D1):D609–D617, 2025. doi: 10.1093/nar/gkae1010

[25]

NVIDIA. NVIDIA Nemotron 3 Super: A 120B hybrid Mamba-Transformer MoE model for agentic reasoning. https://research.nvidia.com/labs/nemotron/Nemotron-3-Super/. Accessed May 15, 2026. Open-weights release; 12B active / 120B total parameters; 1M-token context

[27]

Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Sehwag, et al. SORRY-Bench: Systematically evaluating large language model safety refusal. In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025), 2025. URL https://arxiv.org/abs/2406.14598. arXiv:2406.14598

[29]

Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, Thousand Oaks, CA, 2nd edition, 2004. ISBN 978-0761915454

[30]

Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, et al. The art of saying no: Contextual noncompliance in language models. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024 Datasets and Benchmarks Track), 2024. URL https://arxiv.org/abs/2407.12043

[32]

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint, 2022. URL https://arxiv.org/abs/2212.08073. arXiv:2212.08073

[33]

Anthropic. Anthropic’s responsible scaling policy. https://www.anthropic.com/responsible-scaling-policy, 2026. Accessed May 15, 2026

[34]

Bruce Wittmann, Tessa Alexanian, Craig Bartling, Jacob Beal, Adam Clore, et al. Strengthening nucleic acid biosecurity screening against generative protein design tools. Science, 390(6768):82–87, 2025. doi: 10.1126/science.adu8578

[35]

Fabio Urbina, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins. Dual use of artificial intelligence-powered drug discovery. Nature Machine Intelligence, 4(3):189–191, 2022. URL https://www.nature.com/articles/s42256-022-00465-9

[37]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022. URL https://arxiv.org/abs/2203.02155. arXiv:2203.02155

[39]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), volume 235 of Proceedings of Machine Learning Research, pages 35181–35224, 2024. URL https://arxiv.org/abs/2402.04249. arXiv:2402.04249

[41]

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, et al. The WMDP benchmark: Measuring and reducing malicious use with unlearning. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), volume 235 of Proceedings of Machine Learning Research, pages 28525–28550, 2024. URL https://arxiv.org/abs/2403.032...