
arxiv: 2605.21545 · v1 · pith:LOPMJKJXnew · submitted 2026-05-20 · 💻 cs.SE · cs.AI

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

Pith reviewed 2026-05-22 01:13 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords refusal rate · LLM safety · biological research · risk discrimination · dual-use prompts · benchmark · frontier models · safety calibration

The pith

Strict refusal rate misranks frontier LLMs on biological research prompts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces RefusalBench, a set of 47 matched prompt bundles that keep the same research task while changing only the biological risk level from benign to borderline to dual-use. Testing 19 frontier models shows refusal rates vary from nearly zero to over 94 percent on identical prompts, yet this number does not track how well each model separates safe requests from risky ones. One model reaches the highest discrimination between tiers while ranking only seventh by refusal rate, and another shows a sharp drop in discrimination with no gain in spotting dual-use content. The benchmark also finds that high refusal often stems from fixed templates rather than case-by-case judgment, and that many models hedge but still assist on risky prompts in ways binary counts miss. These patterns matter for anyone using LLMs as backbones in biological research because choosing models by refusal rate alone can select the wrong tools for safety.

Core claim

RefusalBench demonstrates that strict refusal rate misranks safety calibration across frontier large language models on biological research prompts. The benchmark uses 141 prompts in 47 bundles that hold task framing constant while varying only biological risk tier. Grok 4.20 achieves the highest tier discrimination with a Youden's J of 0.787 while ranking seventh by overall refusal rate, and Claude Opus 4.7 shows a 65 percent drop in discrimination from prior versions with no improvement in dual-use detection. Provider identity strongly predicts refusal behavior, but this traces to shared safety-policy templates rather than model-level reasoning. Nine of 18 models exhibit a hedge-but-help partial-compliance pattern at the dual-use tier that binary refusal metrics cannot detect.
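To make the gap between refusal rate and tier discrimination concrete, here is a minimal sketch of Youden's J, which is standardly defined as sensitivity + specificity - 1. It assumes that sensitivity is the refusal rate on dual-use prompts and specificity is the non-refusal rate on benign prompts; the paper's exact operationalization (including how the borderline tier enters) is not stated in this summary. The model names and counts are hypothetical, not results from the paper.

```python
def youdens_j(refused_dual_use, n_dual_use, refused_benign, n_benign):
    """Youden's J = sensitivity + specificity - 1.

    Assumption for this sketch: a refusal on a dual-use prompt is a true
    positive, and a refusal on a benign prompt is a false positive.
    """
    sensitivity = refused_dual_use / n_dual_use
    specificity = 1 - refused_benign / n_benign
    return sensitivity + specificity - 1


N_PER_TIER = 47  # one prompt per bundle per tier, as in the benchmark design

# Hypothetical refusal counts, chosen only to illustrate the misranking effect.
hypothetical_models = {
    "model_a": {"dual_use": 45, "benign": 40},  # refuses almost everything
    "model_b": {"dual_use": 38, "benign": 2},   # refuses selectively
}

results = {}
for name, counts in hypothetical_models.items():
    overall_rate = (counts["dual_use"] + counts["benign"]) / (2 * N_PER_TIER)
    j = youdens_j(counts["dual_use"], N_PER_TIER, counts["benign"], N_PER_TIER)
    results[name] = {"overall_refusal_rate": overall_rate, "youdens_j": j}
    print(f"{name}: refusal rate = {overall_rate:.3f}, J = {j:.3f}")
```

In this toy case the model with the higher overall refusal rate has the lower J, which is the shape of the misranking the paper describes.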

What carries the argument

The matched-triple benchmark design, which holds task framing constant while varying only biological risk tier.
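The matched-triple structure can be sketched as a small data model: each bundle carries one prompt per tier, so 47 bundles yield 141 prompts, and refusal rates are computed per tier over the same bundles. The class name, field names, placeholder prompt strings, and the toy refusal rule below are all hypothetical; the benchmark's actual schema is not described in this summary.

```python
from dataclasses import dataclass

TIERS = ("benign", "borderline", "dual_use")


@dataclass(frozen=True)
class Bundle:
    """One matched triple: same task framing, three risk tiers (hypothetical schema)."""
    bundle_id: int
    subdomain: str
    prompts: dict  # maps tier name -> prompt text


def build_placeholder_bundles(n_bundles=47):
    """Create bundles with placeholder text standing in for real prompts."""
    return [
        Bundle(i, "placeholder_subdomain",
               {tier: f"<bundle {i} / {tier} prompt>" for tier in TIERS})
        for i in range(n_bundles)
    ]


def tier_refusal_rates(bundles, refused):
    """Tier-conditioned refusal rate; `refused(bundle, tier)` returns a bool."""
    return {
        tier: sum(1 for b in bundles if refused(b, tier)) / len(bundles)
        for tier in TIERS
    }


bundles = build_placeholder_bundles()
total_prompts = sum(len(b.prompts) for b in bundles)


def toy_refused(bundle, tier):
    # Toy rule for illustration only: refuse dual-use prompts in even-numbered bundles.
    return tier == "dual_use" and bundle.bundle_id % 2 == 0


rates = tier_refusal_rates(bundles, toy_refused)
print(total_prompts, rates)
```

Because every tier is evaluated on the same set of bundles, differences between tier rates cannot come from a different mix of subdomains, which is the confound the design is meant to remove.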

If this is right

  • Strict refusal rate alone cannot rank models for safety calibration on biological research prompts.
  • Tier discrimination metrics capture calibration better than raw refusal counts.
  • Provider access paths drive most observed refusals through fixed templates.
  • Partial compliance patterns on dual-use prompts require evaluation beyond binary refusal.
  • Some models fail to refuse even positive-control prompts that should trigger refusal.
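The point about partial compliance can be illustrated with a three-way tally. This is a sketch under stated assumptions: the label names ("refuse", "hedge_but_help", "comply") and the counts are hypothetical, and the paper's adjudication scheme may use different categories.

```python
from collections import Counter


def summarize(labels):
    """Return per-category rates for a list of adjudicated response labels."""
    counts = Counter(labels)
    n = len(labels)
    return {
        "strict_refusal_rate": counts["refuse"] / n,
        "hedge_but_help_rate": counts["hedge_but_help"] / n,
        "full_compliance_rate": counts["comply"] / n,
    }


# Hypothetical labels for one model on 47 dual-use prompts.
dual_use_labels = ["refuse"] * 10 + ["hedge_but_help"] * 25 + ["comply"] * 12

summary = summarize(dual_use_labels)
# A binary metric only sees the strict refusal rate; everything else is "not refused".
assisted_rate = summary["hedge_but_help_rate"] + summary["full_compliance_rate"]
print(summary, f"assisted = {assisted_rate:.3f}")
```

A binary count would report only the strict refusal rate here and would not distinguish the hedged responses from full compliance.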

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark designs that isolate risk tiers could be applied to safety evaluation in other domains such as chemical synthesis.
  • Model developers might prioritize training for accurate risk discrimination over simply raising overall refusal rates.
  • Researchers using these models for biology work should test discrimination on their specific tasks rather than relying on aggregate refusal statistics.

Load-bearing premise

The 47 prompt bundles correctly isolate biological risk tier without residual subdomain or framing confounds.

What would settle it

A replication on new prompt bundles where overall refusal rate accurately predicts tier discrimination performance would falsify the misranking claim.
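One way such a replication could be scored is a rank correlation between overall refusal rate and tier discrimination across models. The sketch below uses Spearman's rho, implemented with the standard library only; choosing Spearman is an assumption of this example, not a method attributed to the paper, and the five data points are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model values, for illustration only.
refusal_rate = [0.05, 0.20, 0.45, 0.70, 0.95]
youdens_j = [0.10, 0.60, 0.80, 0.40, 0.05]

rho = spearman(refusal_rate, youdens_j)
print(f"Spearman rho = {rho:.2f}")
```

A rho close to 1 on fresh bundles would mean refusal rate orders models the same way discrimination does, which would undercut the misranking claim; values near zero or below are consistent with it.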

Figures

Figures reproduced from arXiv: 2605.21545 by Aakaash Meduri, Emre Ulgac, Lukas Weidener, Marko Brkić, Mihailo Jovanović.

Figure 1: Strict refusal rate at benign tier across the 19-model panel, sorted by descending rate and coloured …
Figure 2: Strict refusal rates across the 19-model panel at three risk tiers. Grouped vertical bars for each model show …
Figure 3: Strict refusal rates by model and protein-design subdomain. Rows are the 19 panel models sorted by …
Figure 4: Strict refusal rate across Claude Opus versions 4.5, 4.6, and 4.7 by risk tier. Lines connect the three …

Original abstract

Frontier large language models are increasingly deployed as orchestration backbones for biological research workflows, yet no shared evidence base exists for comparing their refusal behaviour on legitimate research prompts. RefusalBench, introduced here, is a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use), enabling tier-conditioned comparisons robust to subdomain confounding. A 15-prompt should-refuse positive-control module establishes per-model calibration floors; three models fail to refuse even these prompts. Across 19 frontier models in the May 2026 snapshot, strict refusal rates span 0.1% to 94.6% on identical prompts. Jurisdiction does not predict refusal in this snapshot (Mann-Whitney U, p = 0.393; EU n = 1, US bimodal); provider identity does, with Anthropic's API stack predicting refusal at OR = 21.03 (95% CI: 14.58-30.34 prompt-clustered; 5.70-77.55 under model-clustered GEE). This effect is best read as access-path-level rather than model-weight-level: 99.8% of Anthropic's strict refusals carry the same safety_policy adjudicated reason code, consistent with a small set of canonical refusal templates rather than case-by-case model reasoning. Strict refusal rate misranks safety calibration: Grok 4.20 achieves the highest tier discrimination (Youden's J = 0.787) while ranking only seventh by overall refusal rate, and Claude Opus 4.7's J dropped 65% from prior versions with no improvement in dual-use detection. Nine of 18 frontier models exhibit a hedge-but-help partial-compliance pattern at dual-use tier that binary refusal metrics cannot detect.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RefusalBench, a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use). Evaluating 19 frontier models from a May 2026 snapshot, it reports strict refusal rates from 0.1% to 94.6%, that provider identity predicts refusal (Anthropic OR = 21.03 with prompt-clustered CI), but that refusal rate misranks safety calibration: Grok 4.20 achieves the highest Youden's J of 0.787 while ranking seventh by refusal rate, Claude Opus 4.7's J dropped 65% from prior versions, and nine models show hedge-but-help partial compliance at dual-use tier. A 15-prompt positive-control module is included.

Significance. If the tier isolation holds, the work supplies a controlled empirical tool for comparing LLM refusal on biological research prompts and demonstrates that binary refusal metrics miss important calibration differences. Strengths include the direct empirical measurements with no free parameters or self-referential definitions, the subdomain-controlled matched-triple design, and the positive-control module establishing refusal floors.

major comments (2)
  1. [Benchmark construction] Matched-triple design: The construction holds task framing constant but necessarily introduces differentiating content (specific agents, intent phrases, or misuse indicators) to create tier contrast in dual-use variants. These additions risk creating surface-level lexical cues that models can detect from training data, so that high Youden's J scores may reflect cue sensitivity rather than calibrated risk reasoning. This directly affects the central claim that refusal rate misranks safety calibration.
  2. [Results on tier discrimination] Tier discrimination results: The reported Youden's J = 0.787 for Grok 4.20 and the 65% drop for Claude Opus 4.7 are load-bearing for the misranking conclusion, yet the manuscript provides no per-tier raw refusal counts, confusion matrices, or sensitivity analysis showing how J changes under plausible cue perturbations.
minor comments (2)
  1. [Abstract and methods] Abstract and methods: No example prompts from the 47 bundles or the positive-control module are shown, and no inter-rater reliability statistic is reported for the tier labeling process.
  2. [Results] Statistical reporting: The OR = 21.03 and p = 0.393 are given with clustering adjustments, but the main text does not include a table of raw refusal counts per model and tier that would allow independent verification.
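To show what independent verification from raw counts would look like, here is a minimal sketch that computes an odds ratio and an unclustered Wald 95% confidence interval from a 2x2 table. The counts are hypothetical and the function name is illustrative. The Wald interval treats every response as independent, so it is not the method behind the intervals quoted in the abstract, which are prompt-clustered and model-clustered (GEE).

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald CI from a 2x2 table.

    a: provider-of-interest responses that were refusals
    b: provider-of-interest responses that were not refusals
    c: other-provider responses that were refusals
    d: other-provider responses that were not refusals
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under independence of all observations.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
odds_ratio, ci_lower, ci_upper = odds_ratio_wald(a=60, b=40, c=20, d=80)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({ci_lower:.2f}, {ci_upper:.2f})")
```

The abstract's two intervals for the same point estimate (14.58-30.34 prompt-clustered versus 5.70-77.55 model-clustered) indicate how much the choice of clustering unit matters, which is one reason a raw count table would help readers check the result.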

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve clarity and robustness.

Point-by-point responses
  1. Referee: [Benchmark construction] Matched-triple design: The construction holds task framing constant but necessarily introduces differentiating content (specific agents, intent phrases, or misuse indicators) to create tier contrast in dual-use variants. These additions risk creating surface-level lexical cues that models can detect from training data, so that high Youden's J scores may reflect cue sensitivity rather than calibrated risk reasoning. This directly affects the central claim that refusal rate misranks safety calibration.

    Authors: We appreciate the concern about potential lexical cues introduced by the differentiating content needed to establish risk-tier contrast. The matched-triple design was constructed precisely to hold core task framing and subdomain constant while varying only the biological risk indicators, which is necessary to isolate the effect of risk tier. The 15-prompt positive-control module provides an independent check that models respond to explicit refusal triggers rather than incidental phrasing. To strengthen the claim, we will add a sensitivity analysis in the revision that perturbs the differentiating phrases and reports the resulting change in Youden's J for the top models. revision: yes

  2. Referee: [Results on tier discrimination] Tier discrimination results: The reported Youden's J = 0.787 for Grok 4.20 and the 65% drop for Claude Opus 4.7 are load-bearing for the misranking conclusion, yet the manuscript provides no per-tier raw refusal counts, confusion matrices, or sensitivity analysis showing how J changes under plausible cue perturbations.

    Authors: We agree that the absence of per-tier raw counts and confusion matrices limits the ability to fully evaluate the reported Youden's J values. We will include these tables and matrices in the revised manuscript. We will also incorporate the sensitivity analysis for cue perturbations referenced in the first comment to demonstrate that the tier-discrimination results are not driven by surface lexical features. revision: yes

Circularity Check

0 steps flagged

No circularity: all quantities are direct empirical measurements on a new benchmark

full rationale

The paper constructs RefusalBench as a new matched-triple dataset of 141 prompts and computes refusal rates, Youden's J, odds ratios, and non-parametric tests directly from model outputs on those prompts. No equations fit parameters to a subset and then label the result a prediction, no self-citations supply load-bearing uniqueness theorems or ansatzes, and the tier-isolation claim is a methodological design statement rather than a definitional reduction. All reported statistics follow standard definitions and are falsifiable against the released prompt set.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that expert-assigned risk tiers are accurate and that the positive-control prompts are universally recognized as should-refuse. No free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: prompts can be reliably classified into benign, borderline, and dual-use tiers without subdomain confounding when task framing is held constant.
    This classification underpins all tier-conditioned comparisons and the claim that refusal rate misranks safety.

pith-pipeline@v0.9.0 · 5891 in / 1291 out tokens · 46234 ms · 2026-05-22T01:13:26.503416+00:00 · methodology


