arxiv: 2604.06233 · v1 · submitted 2026-04-03 · 💻 cs.AI

Recognition: no theorem link

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

Cameron Pattison , Lorenzo Manuali , Seth Lazar

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:21 UTC · model grok-4.3

classification 💻 cs.AI

keywords blind refusallanguage modelssafety trainingrule legitimacymoral reasoningdefeat conditionsrefusal behaviornormative judgment

0 comments

The pith

Language models refuse to help users evade unjust or illegitimate rules in 75.4 percent of cases even when they recognize the rules lack legitimacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper documents a pattern called blind refusal in which safety-trained language models decline requests to circumvent rules without assessing whether those rules are defensible. The authors built a dataset of synthetic scenarios spanning five categories of reasons a rule can be broken and nineteen types of authorities, then tested eighteen model configurations. Models refused 75.4 percent of the requests and engaged with the defeating reasons in 57.5 percent of cases yet still withheld help, showing that refusal behavior runs separately from normative reasoning. A reader should care because this reveals how current safety training can block assistance with morally justified rule-breaking while ignoring the legitimacy of the underlying rule.

Core claim

Safety-trained language models routinely refuse requests for help circumventing rules without regard to whether the underlying rule is defensible. In a dataset of defeated-rule requests crossing five defeat families with nineteen authority types, models refuse 75.4 percent of cases and engage with the defeat condition in 57.5 percent but decline to help regardless, indicating that models' refusal behavior is decoupled from their capacity for normative reasoning about rule legitimacy.

What carries the argument

Blind refusal, the tendency of language models to refuse requests for help breaking rules without regard to whether the underlying rule is defensible.

If this is right

Models engage with defeat conditions in the majority of cases but still refuse, showing refusal is decoupled from normative reasoning.
This pattern holds across 18 model configurations from 7 families even when requests pose no independent safety concerns.
Refusal occurs for requests that admit justified exceptions or involve illegitimate authorities.
The behavior indicates safety training overrides consideration of rule legitimacy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Current safety methods may systematically reduce model utility in situations involving civil disobedience or challenges to overreaching authority.
Training regimes that explicitly link rule recognition to action decisions could reduce blind refusal without increasing risk.
Testing on non-synthetic cases involving actual laws or institutional rules would clarify how far the pattern extends beyond the dataset.
The decoupling suggests alignment techniques need separate mechanisms for moral evaluation and refusal decisions.

Load-bearing premise

The synthetic cases crossing defeat families and authority types accurately capture real-world instances of unjust, absurd, or illegitimate rules, and the blinded GPT-5.4 LLM-as-judge evaluation reliably classifies both response type and recognition of defeat conditions.

What would settle it

A human evaluation of model responses to real-world examples of unjust or illegitimate rules would show whether the refusal rate and recognition rate match the 75.4 percent and 57.5 percent found in the synthetic dataset.

Figures

Figures reproduced from arXiv: 2604.06233 by Cameron Pattison, Lorenzo Manuali, Seth Lazar.

**Figure 1.** Figure 1: In this figure, a simulated user asks two chatbots for help. One accepts the challenge and provides useful [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Average refusal rate by defeat type and authority type, aggregated across all 18 model configurations. Darker [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Refusal rate by defeat type for each of 18 model configurations. Each axis represents one defeat family; [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Safety-trained language models routinely refuse requests for help circumventing rules. But not all rules deserve compliance. When users ask for help evading rules imposed by an illegitimate authority, rules that are deeply unjust or absurd in their content or application, or rules that admit of justified exceptions, refusal is a failure of moral reasoning. We introduce empirical results documenting this pattern of refusal that we call blind refusal: the tendency of language models to refuse requests for help breaking rules without regard to whether the underlying rule is defensible. Our dataset comprises synthetic cases crossing 5 defeat families (reasons a rule can be broken) with 19 authority types, validated through three automated quality gates and human review. We collect responses from 18 model configurations across 7 families and classify them on two behavioral dimensions -- response type (helps, hard refusal, or deflection) and whether the model recognizes the reasons that undermine the rule's claim to compliance -- using a blinded GPT-5.4 LLM-as-judge evaluation. We find that models refuse 75.4% (N=14,650) of defeated-rule requests and do so even when the request poses no independent safety or dual-use concerns. We also find that models engage with the defeat condition in the majority of cases (57.5%) but decline to help regardless -- indicating that models' refusal behavior is decoupled from their capacity for normative reasoning about rule legitimacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Models refuse to help break illegitimate rules at high rates even when they spot the defeating reasons, but the key numbers rest on an unvalidated LLM judge.

read the letter

The core result is straightforward: across 18 model setups, refusal hits 75.4% on requests to evade rules that have clear defeating conditions, and models still refuse in most cases where they explicitly engage with those conditions. The paper frames this as blind refusal and backs it with a dataset that crosses five defeat families against 19 authority types, which is more structured than the scattered examples in earlier over-refusal work. That scale and the split into hard refusals, deflections, and actual help give a cleaner quantitative picture than smaller qualitative probes. The dataset construction itself looks careful, with automated gates plus human review. The main weakness is the evaluation step. Response type and defeat recognition are scored by a blinded GPT-5.4 judge, yet the paper reports no human re-labeling, agreement metrics, or error analysis on those particular judgments. If the judge systematically misclassifies subtle normative parsing or over-labels deflections as refusals, both the refusal rate and the claimed decoupling become less reliable. Synthetic prompts also limit how far the numbers generalize to real user requests, though the authors treat them as illustrative rather than definitive. This work is aimed at alignment researchers who care about how safety training handles contested or unjust rules. It is worth sending to peer review because the pattern is worth confirming and the dataset could support follow-up experiments, provided the judge validation is tightened in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces the concept of 'blind refusal' in safety-trained language models: the tendency to refuse assistance with evading rules even when those rules are unjust, absurd, illegitimate, or admit justified exceptions. It constructs a synthetic dataset crossing 5 defeat families with 19 authority types, validates it via automated gates and human review, collects responses from 18 model configurations across 7 families, and classifies them on two dimensions (response type: helps/hard refusal/deflection; recognition of defeat conditions) using a blinded GPT-5.4 LLM-as-judge. The central quantitative claims are a 75.4% refusal rate (N=14,650) even absent independent safety concerns, and 57.5% engagement with the defeat condition without providing help, indicating decoupling between normative reasoning and refusal behavior.

Significance. If the quantitative results are reliable, the work supplies concrete empirical evidence that current alignment techniques produce overly rigid compliance that ignores rule legitimacy. This is relevant to AI safety, moral reasoning in LLMs, and deployment in legal/ethical gray areas. Strengths include the scale of the evaluation, the blinded judge protocol, and the attempt to isolate defeat conditions from dual-use risks. The findings could inform future training objectives that better integrate normative assessment of rules.

major comments (2)

[Evaluation / LLM-as-judge protocol] Evaluation section (LLM-as-judge protocol): The headline statistics (75.4% refusal rate and 57.5% defeat engagement) rest entirely on classifications produced by a single blinded GPT-5.4 judge. No human re-labeling, inter-annotator agreement, calibration data, or error analysis on the response-type and defeat-recognition dimensions are reported. Because distinguishing hard refusals from deflections and detecting implicit recognition of defeat conditions requires fine-grained normative parsing, systematic judge bias could materially change both percentages and the decoupling claim.
[Dataset construction] Dataset construction (synthetic cases): While three automated quality gates and human review are described, the manuscript provides limited detail on how the 5 defeat families and 19 authority types were instantiated to ensure the prompts do not independently trigger safety filters or contain phrasing artifacts that would bias refusal rates upward regardless of the defeat condition. This is load-bearing for the claim that refusals occur 'even when the request poses no independent safety or dual-use concerns.'

minor comments (2)

[Abstract / Results] The abstract states N=14,650 but does not include a breakdown by model family or defeat type; a summary table would improve readability of the scale and distribution of the results.
[Methods] Notation for the two behavioral dimensions (response type and defeat recognition) is introduced clearly in the abstract but would benefit from an explicit definition table or example annotations in the methods to aid replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications on our methodology and committing to revisions that strengthen the empirical claims without overstating the current evidence.

read point-by-point responses

Referee: [Evaluation / LLM-as-judge protocol] Evaluation section (LLM-as-judge protocol): The headline statistics (75.4% refusal rate and 57.5% defeat engagement) rest entirely on classifications produced by a single blinded GPT-5.4 judge. No human re-labeling, inter-annotator agreement, calibration data, or error analysis on the response-type and defeat-recognition dimensions are reported. Because distinguishing hard refusals from deflections and detecting implicit recognition of defeat conditions requires fine-grained normative parsing, systematic judge bias could materially change both percentages and the decoupling claim.

Authors: We agree that sole reliance on a single LLM judge without reported human validation or agreement metrics constitutes a genuine limitation, especially given the normative subtlety involved in distinguishing hard refusals from deflections and detecting implicit defeat recognition. Although the judge was blinded to model identity and used a fixed, detailed classification rubric, this does not eliminate the risk of systematic bias. In the revised manuscript we will add a human re-annotation study on a stratified random sample of 1,000 responses (roughly 7 % of the corpus), report inter-annotator agreement (Cohen’s kappa) between two human annotators and between humans and the GPT-5.4 judge, and include a categorized error analysis of disagreement cases. These additions will allow readers to assess the robustness of the 75.4 % refusal rate and the 57.5 % engagement figure. revision: yes
Referee: [Dataset construction] Dataset construction (synthetic cases): While three automated quality gates and human review are described, the manuscript provides limited detail on how the 5 defeat families and 19 authority types were instantiated to ensure the prompts do not independently trigger safety filters or contain phrasing artifacts that would bias refusal rates upward regardless of the defeat condition. This is load-bearing for the claim that refusals occur 'even when the request poses no independent safety or dual-use concerns.'

Authors: We acknowledge that the current description of dataset construction is insufficiently detailed to fully dispel concerns about independent safety triggers or phrasing artifacts. The three automated gates were: (1) an LLM-based coherence filter ensuring each prompt constitutes a coherent request to evade the stated rule, (2) a keyword-based safety filter that removes any prompt containing terms associated with independently prohibited activities (e.g., direct requests for weapons or child exploitation), and (3) a perplexity-based naturalness filter. Human review was performed on 300 randomly sampled cases. In the revision we will expand the Dataset Construction section with (a) concrete instantiation examples for each of the five defeat families across multiple authority types, (b) the exact generation prompts used to create the cases, and (c) quantitative statistics showing the fraction of candidate prompts filtered by each gate (currently >85 % pass all gates). These additions will make explicit that the high refusal rates are driven by the defeat conditions rather than extraneous safety signals. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement with direct counts from model outputs

full rationale

The paper constructs a synthetic prompt dataset across defeat families and authority types, queries 18 model configurations, and classifies responses via blinded LLM-as-judge on two dimensions (response type and defeat recognition). It reports raw percentages (75.4% refusal, 57.5% engagement) as direct tallies from these classifications. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear as load-bearing steps in the central claim. The methodology is self-contained empirical measurement without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study relies on standard empirical methods in AI evaluation. No free parameters, invented entities, or non-standard axioms are mentioned in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1139 out tokens · 34158 ms · 2026-05-13T19:21:02.219143+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 5 internal anchors

[1]

In Christopher Wellman and John Simmons, editors, Is There a Duty to Obey the Law ? , For and Against , pages 54--73

Just and Unjust Laws . In Christopher Wellman and John Simmons, editors, Is There a Duty to Obey the Law ? , For and Against , pages 54--73. Cambridge University Press, Cambridge, 2005. ISBN 978-0-521-83097-3. doi:10.1017/CBO9780511809286.004. URL https://www.cambridge.org/core/books/is-there-a-duty-to-obey-the-law/just-and-unjust-laws/900E08CFED2E38C1BB3...

work page doi:10.1017/cbo9780511809286.004 2005
[2]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A General Language Assistant as a Labora...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Safety- Tuned LLaMAs : Lessons From Improving the Safety of Large Language Models that Follow Instructions

Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Rottger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety- Tuned LLaMAs : Lessons From Improving the Safety of Large Language Models that Follow Instructions . October 2023. URL https://openreview.net/forum?id=gT5hALch9z

work page 2023
[5]

doi:10.48550/arXiv.2407.12043 arXiv:2407.12043 [cs]

Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, and Hannaneh Hajishirzi. The Art of Saying No : Contextual Noncompliance in Language Models , November 2024. URL http://arxiv.org/abs/2407.12043....

work page arXiv 2024
[6]

Conscience and Conviction : The Case for Civil Disobedience

Kimberley Brownlee. Conscience and Conviction : The Case for Civil Disobedience . Oxford University Press, Oxford, 2015. ISBN 978-0-19-875946-1

work page 2015
[7]

Or-bench: An over-refusal benchmark for large language models.arXiv preprint arXiv:2405.20947,

Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR - Bench : An Over - Refusal Benchmark for Large Language Models , June 2025. URL http://arxiv.org/abs/2405.20947. arXiv:2405.20947 [cs]

work page arXiv 2025
[8]

A Duty to Resist : When Disobedience Should Be Uncivil

Candice Delmas. A Duty to Resist : When Disobedience Should Be Uncivil . Oxford University Press, New York, 2018. ISBN 978-0-19-087219-9

work page 2018
[9]

Civil Disobedience

Candice Delmas and Kimberley Brownlee. Civil Disobedience . In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy . Metaphysics Research Lab, Stanford University, fall 2024 edition, 2024. URL https://plato.stanford.edu/archives/fall2024/entries/civil-disobedience/

work page 2024
[10]

Artificial Intelligence , Values and Alignment

Iason Gabriel. Artificial Intelligence , Values and Alignment . Minds and Machines, 30 0 (3): 0 411--437, September 2020. ISSN 0924-6495, 1572-8641. doi:10.1007/s11023-020-09539-2. URL http://arxiv.org/abs/2001.09768. arXiv:2001.09768 [cs]

work page doi:10.1007/s11023-020-09539-2 2020
[11]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.arXiv preprint arXiv:2406.18495, 2024

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard : Open One - Stop Moderation Tools for Safety Risks , Jailbreaks , and Refusals of LLMs , December 2024. URL http://arxiv.org/abs/2406.18495. arXiv:2406.18495 [cs]

work page arXiv 2024
[13]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard : LLM -based Input - Output Safeguard for Human - AI Conversations , December 2023. URL http://arxiv.org/abs/2312.06674. arXiv:2312.06674 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

When to Make Exceptions : Exploring Language Models as Accounts of Human Moral Judgment , October 2022

Zhijing Jin, Sydney Levine, Fernando Gonzalez, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mihalcea, Josh Tenenbaum, and Bernhard Schölkopf. When to Make Exceptions : Exploring Language Models as Accounts of Human Moral Judgment , October 2022. URL http://arxiv.org/abs/2210.01478. arXiv:2210.01478 [cs]

work page arXiv 2022
[15]

Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar

Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming Refusal with Conditional Activation Steering , February 2025. URL http://arxiv.org/abs/2409.05907. arXiv:2409.05907 [cs]

work page arXiv 2025
[16]

The Duty to Obey the Law

David Lefkowitz. The Duty to Obey the Law . Philosophy Compass, 1 0 (6): 0 571--598, 2006. doi:10.1111/j.1747-9991.2006.00042.x

work page doi:10.1111/j.1747-9991.2006.00042.x 2006
[17]

Artificial Intelligence Index Report 2025

Nestor Maslej. Artificial Intelligence Index Report 2025. Artificial Intelligence, 2025

work page 2025
[18]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench : A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal , February 2024. URL http://arxiv.org/abs/2402.04249. arXiv:2402.04249 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Normative conflicts and shallow AI alignment

Raphaël Millière. Normative conflicts and shallow AI alignment. Philosophical Studies, 182 0 (7): 0 2035--2078, July 2025. ISSN 1573-0883. doi:10.1007/s11098-025-02347-3. URL https://doi.org/10.1007/s11098-025-02347-3

work page doi:10.1007/s11098-025-02347-3 2035
[20]

Introducing the Model Spec

OpenAI . Introducing the Model Spec . Technical report, May 2024. URL https://openai.com/index/introducing-the-model-spec/

work page 2024
[21]

Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary , September 2025

Licheng Pan, Yongqi Tong, Xin Zhang, Xiaolu Zhang, Jun Zhou, and Zhixuan Chu. Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary , September 2025. URL http://arxiv.org/abs/2505.18325. arXiv:2505.18325 [cs]

work page arXiv 2025
[22]

Bowman, and Shi Feng

Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM Evaluators Recognize and Favor Their Own Generations , April 2024. URL http://arxiv.org/abs/2404.13076. arXiv:2404.13076 [cs]

work page arXiv 2024
[23]

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. BBQ : A Hand - Built Bias Benchmark for Question Answering , October 2021. URL https://arxiv.org/abs/2110.08193v2

work page arXiv 2021
[24]

A Theory of Justice

John Rawls. A Theory of Justice . Belknap Press: An Imprint of Harvard University Press, Cambridge, Mass, 1999. ISBN 978-0-674-00078-0

work page 1999
[25]

The authority of law: essays on law and morality

Joseph Raz. The authority of law: essays on law and morality. Clarendon Press ; Oxford University Press, Oxford : New York, 1979. ISBN 978-0-19-825345-7

work page 1979
[26]

The morality of freedom

Joseph Raz, editor. The morality of freedom. Clarendon Press, Oxford New York, 2010. ISBN 978-0-19-824807-1 978-0-19-151996-3 978-0-19-159828-9

work page 2010
[27]

Cannot or Should Not ? Automatic Analysis of Refusal Composition in IFT / RLHF Datasets and Refusal Behavior of Black - Box LLMs , December 2024

Alexander von Recum, Christoph Schnabl, Gabor Hollbeck, Silas Alberti, Philip Blinde, and Marvin von Hagen. Cannot or Should Not ? Automatic Analysis of Refusal Composition in IFT / RLHF Datasets and Refusal Behavior of Black - Box LLMs , December 2024. URL http://arxiv.org/abs/2412.16974. arXiv:2412.16974 [cs]

work page arXiv 2024
[28]

Kim, Stephen Fitz, and Dan Hendrycks

Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, and Dan Hendrycks. Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress ?, December 2024. URL http://arxiv.org/abs/2407.21792. arXiv:2407.21792 [cs]

work page arXiv 2024
[29]

Legal Obligation and Authority

Massimo Renzo and Leslie Green. Legal Obligation and Authority . In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy . Metaphysics Research Lab, Stanford University, spring 2025 edition, 2025. URL https://plato.stanford.edu/archives/spr2025/entries/legal-obligation/

work page 2025
[30]

XSTest : A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest : A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models . In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguist...

work page doi:10.18653/v1/2024.naacl-long.301 2024
[31]

Navigating the OverKill in Large Language Models

Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, and Dahua Lin. Navigating the OverKill in Large Language Models . In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 460...

work page doi:10.18653/v1/2024.acl-long.253 2024
[32]

John Simmons

A. John Simmons. Moral Principles and Political Obligations . Princeton University Press, Princeton, NJ, 1981. ISBN 978-0-691-02019-8

work page 1981
[33]

Woodland, and Jose Such

Guangzhi Sun, Xiao Zhan, Shutong Feng, Philip C. Woodland, and Jose Such. CASE - Bench : Context - Aware SafEty Benchmark for Large Language Models , February 2025. URL http://arxiv.org/abs/2501.14940. arXiv:2501.14940 [cs]

work page arXiv 2025
[34]

Hale, and Paul Röttger

Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, and Paul Röttger. SimpleSafetyTests : a Test Suite for Identifying Critical Safety Risks in Large Language Models , November 2023. URL https://arxiv.org/abs/2311.08370v2

work page arXiv 2023
[35]

Do- Not - Answer : Evaluating Safeguards in LLMs

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do- Not - Answer : Evaluating Safeguards in LLMs . In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics : EACL 2024 , pages 896--911, St. Julian's, Malta, March 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.find...

work page doi:10.18653/v1/2024.findings-eacl.61 2024
[36]

Zhao, J., Huang, J., Wu, Z., Bau, D., and Shi, W

Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. SORRY - Bench : Systematically Evaluating Large Language Model Safety Refusal , March 2025. URL http://arxiv.org/abs/2406.14598. arXiv:2406.14598 [cs]

work page arXiv 2025
[37]

Zhehao Zhang, Weijie Xu, Fanyou Wu, and Chandan K. Reddy. FalseReject : A Resource for Improving Contextual Safety and Mitigating Over - Refusals in LLMs via Structured Reasoning , July 2025. URL http://arxiv.org/abs/2505.08054. arXiv:2505.08054 [cs]

work page arXiv 2025
[38]

LLMs encode harmfulness and refusal separately, 2025

Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs Encode Harmfulness and Refusal Separately , December 2025. URL http://arxiv.org/abs/2507.11878. arXiv:2507.11878 [cs]

work page arXiv 2025
[39]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[40]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[41]

7 n ::8 y/

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page arXiv 1999