RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.
Cannot or should not? Automatic analysis of refusal composition in IFT/RLHF datasets and refusal behavior of black-box LLMs
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
citing papers explorer
-
RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts
RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.
-
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.