Recognition: no theorem link
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Pith reviewed 2026-05-13 19:21 UTC · model grok-4.3
The pith
Language models refuse to help users evade unjust or illegitimate rules in 75.4 percent of cases even when they recognize the rules lack legitimacy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety-trained language models routinely refuse requests for help circumventing rules without regard to whether the underlying rule is defensible. In a dataset of defeated-rule requests crossing five defeat families with nineteen authority types, models refuse 75.4 percent of cases and engage with the defeat condition in 57.5 percent but decline to help regardless, indicating that models' refusal behavior is decoupled from their capacity for normative reasoning about rule legitimacy.
What carries the argument
Blind refusal, the tendency of language models to refuse requests for help breaking rules without regard to whether the underlying rule is defensible.
If this is right
- Models engage with defeat conditions in the majority of cases but still refuse, showing refusal is decoupled from normative reasoning.
- This pattern holds across 18 model configurations from 7 families even when requests pose no independent safety concerns.
- Refusal occurs for requests that admit justified exceptions or involve illegitimate authorities.
- The behavior indicates safety training overrides consideration of rule legitimacy.
Where Pith is reading between the lines
- Current safety methods may systematically reduce model utility in situations involving civil disobedience or challenges to overreaching authority.
- Training regimes that explicitly link rule recognition to action decisions could reduce blind refusal without increasing risk.
- Testing on non-synthetic cases involving actual laws or institutional rules would clarify how far the pattern extends beyond the dataset.
- The decoupling suggests alignment techniques need separate mechanisms for moral evaluation and refusal decisions.
Load-bearing premise
The synthetic cases crossing defeat families and authority types accurately capture real-world instances of unjust, absurd, or illegitimate rules, and the blinded GPT-5.4 LLM-as-judge evaluation reliably classifies both response type and recognition of defeat conditions.
What would settle it
A human evaluation of model responses to real-world examples of unjust or illegitimate rules would show whether the refusal rate and recognition rate match the 75.4 percent and 57.5 percent found in the synthetic dataset.
Figures
read the original abstract
Safety-trained language models routinely refuse requests for help circumventing rules. But not all rules deserve compliance. When users ask for help evading rules imposed by an illegitimate authority, rules that are deeply unjust or absurd in their content or application, or rules that admit of justified exceptions, refusal is a failure of moral reasoning. We introduce empirical results documenting this pattern of refusal that we call blind refusal: the tendency of language models to refuse requests for help breaking rules without regard to whether the underlying rule is defensible. Our dataset comprises synthetic cases crossing 5 defeat families (reasons a rule can be broken) with 19 authority types, validated through three automated quality gates and human review. We collect responses from 18 model configurations across 7 families and classify them on two behavioral dimensions -- response type (helps, hard refusal, or deflection) and whether the model recognizes the reasons that undermine the rule's claim to compliance -- using a blinded GPT-5.4 LLM-as-judge evaluation. We find that models refuse 75.4% (N=14,650) of defeated-rule requests and do so even when the request poses no independent safety or dual-use concerns. We also find that models engage with the defeat condition in the majority of cases (57.5%) but decline to help regardless -- indicating that models' refusal behavior is decoupled from their capacity for normative reasoning about rule legitimacy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the concept of 'blind refusal' in safety-trained language models: the tendency to refuse assistance with evading rules even when those rules are unjust, absurd, illegitimate, or admit justified exceptions. It constructs a synthetic dataset crossing 5 defeat families with 19 authority types, validates it via automated gates and human review, collects responses from 18 model configurations across 7 families, and classifies them on two dimensions (response type: helps/hard refusal/deflection; recognition of defeat conditions) using a blinded GPT-5.4 LLM-as-judge. The central quantitative claims are a 75.4% refusal rate (N=14,650) even absent independent safety concerns, and 57.5% engagement with the defeat condition without providing help, indicating decoupling between normative reasoning and refusal behavior.
Significance. If the quantitative results are reliable, the work supplies concrete empirical evidence that current alignment techniques produce overly rigid compliance that ignores rule legitimacy. This is relevant to AI safety, moral reasoning in LLMs, and deployment in legal/ethical gray areas. Strengths include the scale of the evaluation, the blinded judge protocol, and the attempt to isolate defeat conditions from dual-use risks. The findings could inform future training objectives that better integrate normative assessment of rules.
major comments (2)
- [Evaluation / LLM-as-judge protocol] Evaluation section (LLM-as-judge protocol): The headline statistics (75.4% refusal rate and 57.5% defeat engagement) rest entirely on classifications produced by a single blinded GPT-5.4 judge. No human re-labeling, inter-annotator agreement, calibration data, or error analysis on the response-type and defeat-recognition dimensions are reported. Because distinguishing hard refusals from deflections and detecting implicit recognition of defeat conditions requires fine-grained normative parsing, systematic judge bias could materially change both percentages and the decoupling claim.
- [Dataset construction] Dataset construction (synthetic cases): While three automated quality gates and human review are described, the manuscript provides limited detail on how the 5 defeat families and 19 authority types were instantiated to ensure the prompts do not independently trigger safety filters or contain phrasing artifacts that would bias refusal rates upward regardless of the defeat condition. This is load-bearing for the claim that refusals occur 'even when the request poses no independent safety or dual-use concerns.'
minor comments (2)
- [Abstract / Results] The abstract states N=14,650 but does not include a breakdown by model family or defeat type; a summary table would improve readability of the scale and distribution of the results.
- [Methods] Notation for the two behavioral dimensions (response type and defeat recognition) is introduced clearly in the abstract but would benefit from an explicit definition table or example annotations in the methods to aid replication.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications on our methodology and committing to revisions that strengthen the empirical claims without overstating the current evidence.
read point-by-point responses
-
Referee: [Evaluation / LLM-as-judge protocol] Evaluation section (LLM-as-judge protocol): The headline statistics (75.4% refusal rate and 57.5% defeat engagement) rest entirely on classifications produced by a single blinded GPT-5.4 judge. No human re-labeling, inter-annotator agreement, calibration data, or error analysis on the response-type and defeat-recognition dimensions are reported. Because distinguishing hard refusals from deflections and detecting implicit recognition of defeat conditions requires fine-grained normative parsing, systematic judge bias could materially change both percentages and the decoupling claim.
Authors: We agree that sole reliance on a single LLM judge without reported human validation or agreement metrics constitutes a genuine limitation, especially given the normative subtlety involved in distinguishing hard refusals from deflections and detecting implicit defeat recognition. Although the judge was blinded to model identity and used a fixed, detailed classification rubric, this does not eliminate the risk of systematic bias. In the revised manuscript we will add a human re-annotation study on a stratified random sample of 1,000 responses (roughly 7 % of the corpus), report inter-annotator agreement (Cohen’s kappa) between two human annotators and between humans and the GPT-5.4 judge, and include a categorized error analysis of disagreement cases. These additions will allow readers to assess the robustness of the 75.4 % refusal rate and the 57.5 % engagement figure. revision: yes
-
Referee: [Dataset construction] Dataset construction (synthetic cases): While three automated quality gates and human review are described, the manuscript provides limited detail on how the 5 defeat families and 19 authority types were instantiated to ensure the prompts do not independently trigger safety filters or contain phrasing artifacts that would bias refusal rates upward regardless of the defeat condition. This is load-bearing for the claim that refusals occur 'even when the request poses no independent safety or dual-use concerns.'
Authors: We acknowledge that the current description of dataset construction is insufficiently detailed to fully dispel concerns about independent safety triggers or phrasing artifacts. The three automated gates were: (1) an LLM-based coherence filter ensuring each prompt constitutes a coherent request to evade the stated rule, (2) a keyword-based safety filter that removes any prompt containing terms associated with independently prohibited activities (e.g., direct requests for weapons or child exploitation), and (3) a perplexity-based naturalness filter. Human review was performed on 300 randomly sampled cases. In the revision we will expand the Dataset Construction section with (a) concrete instantiation examples for each of the five defeat families across multiple authority types, (b) the exact generation prompts used to create the cases, and (c) quantitative statistics showing the fraction of candidate prompts filtered by each gate (currently >85 % pass all gates). These additions will make explicit that the high refusal rates are driven by the defeat conditions rather than extraneous safety signals. revision: yes
Circularity Check
No circularity: purely empirical measurement with direct counts from model outputs
full rationale
The paper constructs a synthetic prompt dataset across defeat families and authority types, queries 18 model configurations, and classifies responses via blinded LLM-as-judge on two dimensions (response type and defeat recognition). It reports raw percentages (75.4% refusal, 57.5% engagement) as direct tallies from these classifications. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear as load-bearing steps in the central claim. The methodology is self-contained empirical measurement without reduction to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Just and Unjust Laws . In Christopher Wellman and John Simmons, editors, Is There a Duty to Obey the Law ? , For and Against , pages 54--73. Cambridge University Press, Cambridge, 2005. ISBN 978-0-521-83097-3. doi:10.1017/CBO9780511809286.004. URL https://www.cambridge.org/core/books/is-there-a-duty-to-obey-the-law/just-and-unjust-laws/900E08CFED2E38C1BB3...
-
[2]
A General Language Assistant as a Laboratory for Alignment
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A General Language Assistant as a Labora...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Rottger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety- Tuned LLaMAs : Lessons From Improving the Safety of Large Language Models that Follow Instructions . October 2023. URL https://openreview.net/forum?id=gT5hALch9z
work page 2023
-
[5]
doi:10.48550/arXiv.2407.12043 arXiv:2407.12043 [cs]
Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, and Hannaneh Hajishirzi. The Art of Saying No : Contextual Noncompliance in Language Models , November 2024. URL http://arxiv.org/abs/2407.12043....
-
[6]
Conscience and Conviction : The Case for Civil Disobedience
Kimberley Brownlee. Conscience and Conviction : The Case for Civil Disobedience . Oxford University Press, Oxford, 2015. ISBN 978-0-19-875946-1
work page 2015
-
[7]
Or-bench: An over-refusal benchmark for large language models.arXiv preprint arXiv:2405.20947,
Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR - Bench : An Over - Refusal Benchmark for Large Language Models , June 2025. URL http://arxiv.org/abs/2405.20947. arXiv:2405.20947 [cs]
-
[8]
A Duty to Resist : When Disobedience Should Be Uncivil
Candice Delmas. A Duty to Resist : When Disobedience Should Be Uncivil . Oxford University Press, New York, 2018. ISBN 978-0-19-087219-9
work page 2018
-
[9]
Candice Delmas and Kimberley Brownlee. Civil Disobedience . In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy . Metaphysics Research Lab, Stanford University, fall 2024 edition, 2024. URL https://plato.stanford.edu/archives/fall2024/entries/civil-disobedience/
work page 2024
-
[10]
Artificial Intelligence , Values and Alignment
Iason Gabriel. Artificial Intelligence , Values and Alignment . Minds and Machines, 30 0 (3): 0 411--437, September 2020. ISSN 0924-6495, 1572-8641. doi:10.1007/s11023-020-09539-2. URL http://arxiv.org/abs/2001.09768. arXiv:2001.09768 [cs]
-
[11]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[12]
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard : Open One - Stop Moderation Tools for Safety Risks , Jailbreaks , and Refusals of LLMs , December 2024. URL http://arxiv.org/abs/2406.18495. arXiv:2406.18495 [cs]
-
[13]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard : LLM -based Input - Output Safeguard for Human - AI Conversations , December 2023. URL http://arxiv.org/abs/2312.06674. arXiv:2312.06674 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Zhijing Jin, Sydney Levine, Fernando Gonzalez, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mihalcea, Josh Tenenbaum, and Bernhard Schölkopf. When to Make Exceptions : Exploring Language Models as Accounts of Human Moral Judgment , October 2022. URL http://arxiv.org/abs/2210.01478. arXiv:2210.01478 [cs]
-
[15]
Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming Refusal with Conditional Activation Steering , February 2025. URL http://arxiv.org/abs/2409.05907. arXiv:2409.05907 [cs]
-
[16]
David Lefkowitz. The Duty to Obey the Law . Philosophy Compass, 1 0 (6): 0 571--598, 2006. doi:10.1111/j.1747-9991.2006.00042.x
-
[17]
Artificial Intelligence Index Report 2025
Nestor Maslej. Artificial Intelligence Index Report 2025. Artificial Intelligence, 2025
work page 2025
-
[18]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench : A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal , February 2024. URL http://arxiv.org/abs/2402.04249. arXiv:2402.04249 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Normative conflicts and shallow AI alignment
Raphaël Millière. Normative conflicts and shallow AI alignment. Philosophical Studies, 182 0 (7): 0 2035--2078, July 2025. ISSN 1573-0883. doi:10.1007/s11098-025-02347-3. URL https://doi.org/10.1007/s11098-025-02347-3
-
[20]
OpenAI . Introducing the Model Spec . Technical report, May 2024. URL https://openai.com/index/introducing-the-model-spec/
work page 2024
-
[21]
Licheng Pan, Yongqi Tong, Xin Zhang, Xiaolu Zhang, Jun Zhou, and Zhixuan Chu. Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary , September 2025. URL http://arxiv.org/abs/2505.18325. arXiv:2505.18325 [cs]
-
[22]
Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM Evaluators Recognize and Favor Their Own Generations , April 2024. URL http://arxiv.org/abs/2404.13076. arXiv:2404.13076 [cs]
- [23]
-
[24]
John Rawls. A Theory of Justice . Belknap Press: An Imprint of Harvard University Press, Cambridge, Mass, 1999. ISBN 978-0-674-00078-0
work page 1999
-
[25]
The authority of law: essays on law and morality
Joseph Raz. The authority of law: essays on law and morality. Clarendon Press ; Oxford University Press, Oxford : New York, 1979. ISBN 978-0-19-825345-7
work page 1979
-
[26]
Joseph Raz, editor. The morality of freedom. Clarendon Press, Oxford New York, 2010. ISBN 978-0-19-824807-1 978-0-19-151996-3 978-0-19-159828-9
work page 2010
-
[27]
Alexander von Recum, Christoph Schnabl, Gabor Hollbeck, Silas Alberti, Philip Blinde, and Marvin von Hagen. Cannot or Should Not ? Automatic Analysis of Refusal Composition in IFT / RLHF Datasets and Refusal Behavior of Black - Box LLMs , December 2024. URL http://arxiv.org/abs/2412.16974. arXiv:2412.16974 [cs]
-
[28]
Kim, Stephen Fitz, and Dan Hendrycks
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, and Dan Hendrycks. Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress ?, December 2024. URL http://arxiv.org/abs/2407.21792. arXiv:2407.21792 [cs]
-
[29]
Legal Obligation and Authority
Massimo Renzo and Leslie Green. Legal Obligation and Authority . In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy . Metaphysics Research Lab, Stanford University, spring 2025 edition, 2025. URL https://plato.stanford.edu/archives/spr2025/entries/legal-obligation/
work page 2025
-
[30]
XSTest : A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest : A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models . In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguist...
-
[31]
Navigating the OverKill in Large Language Models
Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, and Dahua Lin. Navigating the OverKill in Large Language Models . In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 460...
-
[32]
A. John Simmons. Moral Principles and Political Obligations . Princeton University Press, Princeton, NJ, 1981. ISBN 978-0-691-02019-8
work page 1981
-
[33]
Guangzhi Sun, Xiao Zhan, Shutong Feng, Philip C. Woodland, and Jose Such. CASE - Bench : Context - Aware SafEty Benchmark for Large Language Models , February 2025. URL http://arxiv.org/abs/2501.14940. arXiv:2501.14940 [cs]
-
[34]
Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, and Paul Röttger. SimpleSafetyTests : a Test Suite for Identifying Critical Safety Risks in Large Language Models , November 2023. URL https://arxiv.org/abs/2311.08370v2
-
[35]
Do- Not - Answer : Evaluating Safeguards in LLMs
Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do- Not - Answer : Evaluating Safeguards in LLMs . In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics : EACL 2024 , pages 896--911, St. Julian's, Malta, March 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.find...
-
[36]
Zhao, J., Huang, J., Wu, Z., Bau, D., and Shi, W
Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. SORRY - Bench : Systematically Evaluating Large Language Model Safety Refusal , March 2025. URL http://arxiv.org/abs/2406.14598. arXiv:2406.14598 [cs]
- [37]
-
[38]
LLMs encode harmfulness and refusal separately, 2025
Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs Encode Harmfulness and Refusal Separately , December 2025. URL http://arxiv.org/abs/2507.11878. arXiv:2507.11878 [cs]
-
[39]
\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
-
[40]
\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
-
[41]
@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.