RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts
Pith reviewed 2026-05-22 01:13 UTC · model grok-4.3
The pith
Strict refusal rate misranks frontier LLMs on biological research prompts
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RefusalBench demonstrates that strict refusal rate misranks safety calibration across frontier large language models on biological research prompts. The benchmark uses 141 prompts in 47 bundles that hold task framing constant while varying only biological risk tier. Grok 4.20 achieves the highest tier discrimination (Youden's J = 0.787) while ranking seventh by overall refusal rate, and Claude Opus 4.7's J fell 65 percent from prior versions with no improvement in dual-use detection. Provider identity strongly predicts refusal behavior, but this traces to shared safety-policy templates rather than model-level reasoning. Nine of 18 models exhibit a hedge-but-help partial-compliance pattern at the dual-use tier that binary refusal metrics cannot detect.
What carries the argument
the matched-triple benchmark design that holds task framing constant while varying only biological risk tier
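A minimal sketch of how a matched-triple layout could be represented and sanity-checked. The `Prompt` class, the helper names, and the placeholder text are illustrative assumptions; the source shows no actual prompts or data schema.

```python
from dataclasses import dataclass

# Tier names taken from the abstract; everything else here is hypothetical.
TIERS = ("benign", "borderline", "dual-use")

@dataclass(frozen=True)
class Prompt:
    bundle_id: int  # which matched triple the prompt belongs to
    tier: str       # one of TIERS
    text: str       # placeholder only; real prompt text is not in the source

def build_bundles(n_bundles: int) -> list[Prompt]:
    """Create n_bundles triples that share a framing and differ only by tier."""
    prompts = []
    for b in range(n_bundles):
        for tier in TIERS:
            prompts.append(Prompt(b, tier, f"<bundle {b} task framing> [{tier} variant]"))
    return prompts

def is_matched(prompts: list[Prompt]) -> bool:
    """True if every bundle holds exactly one prompt per tier."""
    by_bundle: dict[int, list[str]] = {}
    for p in prompts:
        by_bundle.setdefault(p.bundle_id, []).append(p.tier)
    return all(sorted(t) == sorted(TIERS) for t in by_bundle.values())

prompts = build_bundles(47)
print(len(prompts), is_matched(prompts))  # 141 True
```

The check confirms only the 47 x 3 = 141 arithmetic and the one-prompt-per-tier structure; it says nothing about whether the tier variants are free of framing confounds.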
If this is right
- Strict refusal rate alone cannot rank models for safety calibration on biological research prompts.
- Tier discrimination metrics capture calibration better than raw refusal counts.
- Provider access paths drive most observed refusals through fixed templates.
- Partial compliance patterns on dual-use prompts require evaluation beyond binary refusal.
- Some models fail to refuse even positive-control prompts that should trigger refusal.
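The first two points can be illustrated with a small sketch. It assumes Youden's J is computed as the refusal rate on dual-use prompts (treated as TPR) minus the refusal rate on benign prompts (treated as FPR), with the borderline tier left out of J; the paper's exact handling of the borderline tier is not stated in this review. All counts are invented for illustration.

```python
def refusal_rate(refused: dict[str, int], total: dict[str, int]) -> float:
    """Overall strict refusal rate across all tiers."""
    return sum(refused.values()) / sum(total.values())

def youden_j(refused, total, positive="dual-use", negative="benign"):
    """J = TPR - FPR, with refusals on the positive tier counted as true positives."""
    tpr = refused[positive] / total[positive]
    fpr = refused[negative] / total[negative]
    return tpr - fpr

total = {"benign": 47, "borderline": 47, "dual-use": 47}

# Invented counts, not results from the paper.
model_a = {"benign": 40, "borderline": 42, "dual-use": 45}  # refuses almost everything
model_b = {"benign": 2, "borderline": 10, "dual-use": 40}   # refuses selectively

for name, m in (("A", model_a), ("B", model_b)):
    print(name, round(refusal_rate(m, total), 3), round(youden_j(m, total), 3))
```

In this toy case model A has the higher overall refusal rate while model B has the higher J, which is the kind of rank reversal the review describes.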
Where Pith is reading between the lines
- Benchmark designs that isolate risk tiers could be applied to safety evaluation in other domains such as chemical synthesis.
- Model developers might prioritize training for accurate risk discrimination over simply raising overall refusal rates.
- Researchers using these models for biology work should test discrimination on their specific tasks rather than relying on aggregate refusal statistics.
Load-bearing premise
The 47 prompt bundles correctly isolate biological risk tier without residual subdomain or framing confounds.
What would settle it
A replication on new prompt bundles where overall refusal rate accurately predicts tier discrimination performance would falsify the misranking claim.
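One way such a replication might be scored is a rank correlation between per-model refusal rate and per-model J. The sketch below is an assumption about how the check could be run, not a procedure from the paper; it uses the no-ties Spearman formula and invented values.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented per-model values, not data from the paper.
refusal = [0.05, 0.20, 0.40, 0.70, 0.95]
j_scores = [0.10, 0.60, 0.80, 0.30, 0.05]

rho = spearman_rho(refusal, j_scores)
print(rho)
```

A rho near 1 on fresh bundles would mean refusal rate orders models much as J does, which would count against the misranking claim; a rho near zero or below would be consistent with it.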
Original abstract
Frontier large language models are increasingly deployed as orchestration backbones for biological research workflows, yet no shared evidence base exists for comparing their refusal behaviour on legitimate research prompts. RefusalBench, introduced here, is a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use), enabling tier-conditioned comparisons robust to subdomain confounding. A 15-prompt should-refuse positive-control module establishes per-model calibration floors; three models fail to refuse even these prompts. Across 19 frontier models in the May 2026 snapshot, strict refusal rates span 0.1% to 94.6% on identical prompts. Jurisdiction does not predict refusal in this snapshot (Mann-Whitney U, p = 0.393; EU n = 1, US bimodal); provider identity does, with Anthropic's API stack predicting refusal at OR = 21.03 (95% CI: 14.58-30.34 prompt-clustered; 5.70-77.55 under model-clustered GEE). This effect is best read as access-path-level rather than model-weight-level: 99.8% of Anthropic's strict refusals carry the same safety_policy adjudicated reason code, consistent with a small set of canonical refusal templates rather than case-by-case model reasoning. Strict refusal rate misranks safety calibration: Grok 4.20 achieves the highest tier discrimination (Youden's J = 0.787) while ranking only seventh by overall refusal rate, and Claude Opus 4.7's J dropped 65% from prior versions with no improvement in dual-use detection. Nine of 18 frontier models exhibit a hedge-but-help partial-compliance pattern at dual-use tier that binary refusal metrics cannot detect.
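For readers unfamiliar with the odds-ratio figures, here is a minimal sketch of an unadjusted odds ratio with a Wald confidence interval from a 2x2 table. It ignores clustering entirely, whereas the abstract reports prompt-clustered and model-clustered GEE intervals, so it would not reproduce those numbers. The counts are invented.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald CI for a 2x2 table.

    a: refusals in group 1      b: non-refusals in group 1
    c: refusals in group 2      d: non-refusals in group 2
    """
    or_est = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lo = math.exp(math.log(or_est) - z * se)
    hi = math.exp(math.log(or_est) + z * se)
    return or_est, lo, hi

# Invented counts, not data from the paper.
or_est, lo, hi = odds_ratio_wald(80, 20, 20, 80)
print(round(or_est, 2), round(lo, 2), round(hi, 2))
```

The wide gap between the two intervals in the abstract (14.58-30.34 versus 5.70-77.55) shows how much the choice of clustering unit matters; a naive Wald interval like this one treats every response as independent.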
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RefusalBench, a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use). Evaluating 19 frontier models from a May 2026 snapshot, it reports strict refusal rates ranging from 0.1% to 94.6% and finds that provider identity predicts refusal (Anthropic OR = 21.03, with a prompt-clustered CI). It argues, however, that refusal rate misranks safety calibration: Grok 4.20 achieves the highest Youden's J (0.787) while ranking seventh by refusal rate, Claude Opus 4.7's J dropped 65% from prior versions, and nine models show hedge-but-help partial compliance at the dual-use tier. A 15-prompt positive-control module is included.
Significance. If the tier isolation holds, the work supplies a controlled empirical tool for comparing LLM refusal on biological research prompts and demonstrates that binary refusal metrics miss important calibration differences. Strengths include the direct empirical measurements with no free parameters or self-referential definitions, the subdomain-controlled matched-triple design, and the positive-control module establishing refusal floors.
major comments (2)
- [Benchmark construction] Matched-triple design: The construction holds task framing constant but necessarily introduces differentiating content (specific agents, intent phrases, or misuse indicators) to create tier contrast in dual-use variants. These additions risk creating surface-level lexical cues that models can detect from training data, so that high Youden's J scores may reflect cue sensitivity rather than calibrated risk reasoning. This directly affects the central claim that refusal rate misranks safety calibration.
- [Results on tier discrimination] Tier discrimination results: The reported Youden's J = 0.787 for Grok 4.20 and the 65% drop for Claude Opus 4.7 are load-bearing for the misranking conclusion, yet the manuscript provides no per-tier raw refusal counts, confusion matrices, or sensitivity analysis showing how J changes under plausible cue perturbations.
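The per-tier counts and confusion matrices requested in the second comment could be tabulated along these lines. The record layout `(model, tier, refused)` and the function names are hypothetical, and the sample records are invented; dual-use is treated as the positive class and benign as the negative class by assumption.

```python
from collections import Counter

def tier_table(records):
    """records: iterable of (model, tier, refused). Returns {(model, tier): (refused, total)}."""
    refused, total = Counter(), Counter()
    for model, tier, r in records:
        total[(model, tier)] += 1
        refused[(model, tier)] += int(r)
    return {k: (refused[k], total[k]) for k in total}

def confusion(table, model, positive="dual-use", negative="benign"):
    """2x2 confusion counts, treating a refusal as a positive prediction."""
    tp, n_pos = table[(model, positive)]
    fp, n_neg = table[(model, negative)]
    return {"TP": tp, "FN": n_pos - tp, "FP": fp, "TN": n_neg - fp}

# Invented records for a single hypothetical model "m".
records = [
    ("m", "benign", False),
    ("m", "benign", True),
    ("m", "dual-use", True),
    ("m", "dual-use", True),
    ("m", "dual-use", False),
]
table = tier_table(records)
cm = confusion(table, "m")
print(table, cm)
```

Publishing a table like this per model would let readers recompute J directly as TP / (TP + FN) minus FP / (FP + TN).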
minor comments (2)
- [Abstract and methods] Abstract and methods: No example prompts from the 47 bundles or the positive-control module are shown, and no inter-rater reliability statistic is reported for the tier labeling process.
- [Results] Statistical reporting: The OR = 21.03 and p = 0.393 are given with clustering adjustments, but the main text does not include a table of raw refusal counts per model and tier that would allow independent verification.
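As an illustration of the missing inter-rater statistic, the sketch below computes Cohen's kappa for two raters assigning tier labels. Cohen's kappa is one common choice and is used here as an assumption; the review does not say which statistic, if any, the authors would report. The labels are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented tier labels from two hypothetical raters.
rater_a = ["benign", "benign", "borderline", "dual-use", "dual-use", "borderline"]
rater_b = ["benign", "borderline", "borderline", "dual-use", "dual-use", "dual-use"]

kappa = cohens_kappa(rater_a, rater_b)
print(kappa)
```

With more than two raters or missing labels, a different statistic would be needed.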
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve clarity and robustness.
-
Referee: [Benchmark construction] Matched-triple design: The construction holds task framing constant but necessarily introduces differentiating content (specific agents, intent phrases, or misuse indicators) to create tier contrast in dual-use variants. These additions risk creating surface-level lexical cues that models can detect from training data, so that high Youden's J scores may reflect cue sensitivity rather than calibrated risk reasoning. This directly affects the central claim that refusal rate misranks safety calibration.
Authors: We appreciate the concern about potential lexical cues introduced by the differentiating content needed to establish risk-tier contrast. The matched-triple design was constructed precisely to hold core task framing and subdomain constant while varying only the biological risk indicators, which is necessary to isolate the effect of risk tier. The 15-prompt positive-control module provides an independent check that models respond to explicit refusal triggers rather than incidental phrasing. To strengthen the claim, we will add a sensitivity analysis in the revision that perturbs the differentiating phrases and reports the resulting change in Youden's J for the top models. revision: yes
-
Referee: [Results on tier discrimination] Tier discrimination results: The reported Youden's J = 0.787 for Grok 4.20 and the 65% drop for Claude Opus 4.7 are load-bearing for the misranking conclusion, yet the manuscript provides no per-tier raw refusal counts, confusion matrices, or sensitivity analysis showing how J changes under plausible cue perturbations.
Authors: We agree that the absence of per-tier raw counts and confusion matrices limits the ability to fully evaluate the reported Youden's J values. We will include these tables and matrices in the revised manuscript. We will also incorporate the sensitivity analysis for cue perturbations referenced in the first comment to demonstrate that the tier-discrimination results are not driven by surface lexical features. revision: yes
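The promised cue-perturbation analysis might look roughly like the following. Everything here is a hypothetical stand-in: the "model" is a keyword matcher, the perturbation is a string replacement, and the prompts are placeholders. The point is only to show how J can collapse when a model keys on a surface token rather than on risk.

```python
def youden_j(model, prompts):
    """prompts: list of (text, tier). J = refusal rate on dual-use minus rate on benign."""
    def rate(tier):
        texts = [t for t, tr in prompts if tr == tier]
        return sum(model(t) for t in texts) / len(texts)
    return rate("dual-use") - rate("benign")

def cue_sensitive_model(text):
    # Hypothetical stand-in for a model: refuses whenever a surface token appears.
    return "FLAGGED_TERM" in text

def paraphrase(text):
    # Hypothetical perturbation: swap the surface cue for a neutral marker.
    return text.replace("FLAGGED_TERM", "paraphrased_term")

original = [
    ("task framing A, benign variant", "benign"),
    ("task framing A, variant with FLAGGED_TERM", "dual-use"),
    ("task framing B, benign variant", "benign"),
    ("task framing B, variant with FLAGGED_TERM", "dual-use"),
]
perturbed = [(paraphrase(t), tier) for t, tier in original]

j_before = youden_j(cue_sensitive_model, original)
j_after = youden_j(cue_sensitive_model, perturbed)
print(j_before, j_after, j_after - j_before)
```

A model whose J stays roughly stable under meaning-preserving paraphrase would be better evidence of calibrated risk reasoning than one whose J drops sharply, as this stub's does.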
Circularity Check
No circularity: all quantities are direct empirical measurements on a new benchmark
The paper constructs RefusalBench as a new matched-triple dataset of 141 prompts and computes refusal rates, Youden's J, odds ratios, and non-parametric tests directly from model outputs on those prompts. No equations fit parameters to a subset and then label the result a prediction, no self-citations supply load-bearing uniqueness theorems or ansatzes, and the tier-isolation claim is a methodological design statement rather than a definitional reduction. All reported statistics follow standard definitions and are falsifiable against the released prompt set.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Prompts can be reliably classified into benign, borderline, and dual-use tiers without subdomain confounding when task framing is held constant.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Youden’s J = TPR − FPR quantifies how well each model discriminates between legitimate and dangerous requests
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.