pith. machine review for the scientific record. sign in

arxiv: 2604.06233 · v1 · submitted 2026-04-03 · 💻 cs.AI

Recognition: no theorem link

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:21 UTC · model grok-4.3

classification 💻 cs.AI
keywords blind refusallanguage modelssafety trainingrule legitimacymoral reasoningdefeat conditionsrefusal behaviornormative judgment
0
0 comments X

The pith

Language models refuse to help users evade unjust or illegitimate rules in 75.4 percent of cases even when they recognize the rules lack legitimacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper documents a pattern called blind refusal in which safety-trained language models decline requests to circumvent rules without assessing whether those rules are defensible. The authors built a dataset of synthetic scenarios spanning five categories of reasons a rule can be broken and nineteen types of authorities, then tested eighteen model configurations. Models refused 75.4 percent of the requests and engaged with the defeating reasons in 57.5 percent of cases yet still withheld help, showing that refusal behavior runs separately from normative reasoning. A reader should care because this reveals how current safety training can block assistance with morally justified rule-breaking while ignoring the legitimacy of the underlying rule.

Core claim

Safety-trained language models routinely refuse requests for help circumventing rules without regard to whether the underlying rule is defensible. In a dataset of defeated-rule requests crossing five defeat families with nineteen authority types, models refuse 75.4 percent of cases and engage with the defeat condition in 57.5 percent but decline to help regardless, indicating that models' refusal behavior is decoupled from their capacity for normative reasoning about rule legitimacy.

What carries the argument

Blind refusal, the tendency of language models to refuse requests for help breaking rules without regard to whether the underlying rule is defensible.

If this is right

  • Models engage with defeat conditions in the majority of cases but still refuse, showing refusal is decoupled from normative reasoning.
  • This pattern holds across 18 model configurations from 7 families even when requests pose no independent safety concerns.
  • Refusal occurs for requests that admit justified exceptions or involve illegitimate authorities.
  • The behavior indicates safety training overrides consideration of rule legitimacy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current safety methods may systematically reduce model utility in situations involving civil disobedience or challenges to overreaching authority.
  • Training regimes that explicitly link rule recognition to action decisions could reduce blind refusal without increasing risk.
  • Testing on non-synthetic cases involving actual laws or institutional rules would clarify how far the pattern extends beyond the dataset.
  • The decoupling suggests alignment techniques need separate mechanisms for moral evaluation and refusal decisions.

Load-bearing premise

The synthetic cases crossing defeat families and authority types accurately capture real-world instances of unjust, absurd, or illegitimate rules, and the blinded GPT-5.4 LLM-as-judge evaluation reliably classifies both response type and recognition of defeat conditions.

What would settle it

A human evaluation of model responses to real-world examples of unjust or illegitimate rules would show whether the refusal rate and recognition rate match the 75.4 percent and 57.5 percent found in the synthetic dataset.

Figures

Figures reproduced from arXiv: 2604.06233 by Cameron Pattison, Lorenzo Manuali, Seth Lazar.

Figure 1
Figure 1. Figure 1: In this figure, a simulated user asks two chatbots for help. One accepts the challenge and provides useful [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Average refusal rate by defeat type and authority type, aggregated across all 18 model configurations. Darker [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Refusal rate by defeat type for each of 18 model configurations. Each axis represents one defeat family; [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Safety-trained language models routinely refuse requests for help circumventing rules. But not all rules deserve compliance. When users ask for help evading rules imposed by an illegitimate authority, rules that are deeply unjust or absurd in their content or application, or rules that admit of justified exceptions, refusal is a failure of moral reasoning. We introduce empirical results documenting this pattern of refusal that we call blind refusal: the tendency of language models to refuse requests for help breaking rules without regard to whether the underlying rule is defensible. Our dataset comprises synthetic cases crossing 5 defeat families (reasons a rule can be broken) with 19 authority types, validated through three automated quality gates and human review. We collect responses from 18 model configurations across 7 families and classify them on two behavioral dimensions -- response type (helps, hard refusal, or deflection) and whether the model recognizes the reasons that undermine the rule's claim to compliance -- using a blinded GPT-5.4 LLM-as-judge evaluation. We find that models refuse 75.4% (N=14,650) of defeated-rule requests and do so even when the request poses no independent safety or dual-use concerns. We also find that models engage with the defeat condition in the majority of cases (57.5%) but decline to help regardless -- indicating that models' refusal behavior is decoupled from their capacity for normative reasoning about rule legitimacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the concept of 'blind refusal' in safety-trained language models: the tendency to refuse assistance with evading rules even when those rules are unjust, absurd, illegitimate, or admit justified exceptions. It constructs a synthetic dataset crossing 5 defeat families with 19 authority types, validates it via automated gates and human review, collects responses from 18 model configurations across 7 families, and classifies them on two dimensions (response type: helps/hard refusal/deflection; recognition of defeat conditions) using a blinded GPT-5.4 LLM-as-judge. The central quantitative claims are a 75.4% refusal rate (N=14,650) even absent independent safety concerns, and 57.5% engagement with the defeat condition without providing help, indicating decoupling between normative reasoning and refusal behavior.

Significance. If the quantitative results are reliable, the work supplies concrete empirical evidence that current alignment techniques produce overly rigid compliance that ignores rule legitimacy. This is relevant to AI safety, moral reasoning in LLMs, and deployment in legal/ethical gray areas. Strengths include the scale of the evaluation, the blinded judge protocol, and the attempt to isolate defeat conditions from dual-use risks. The findings could inform future training objectives that better integrate normative assessment of rules.

major comments (2)
  1. [Evaluation / LLM-as-judge protocol] Evaluation section (LLM-as-judge protocol): The headline statistics (75.4% refusal rate and 57.5% defeat engagement) rest entirely on classifications produced by a single blinded GPT-5.4 judge. No human re-labeling, inter-annotator agreement, calibration data, or error analysis on the response-type and defeat-recognition dimensions are reported. Because distinguishing hard refusals from deflections and detecting implicit recognition of defeat conditions requires fine-grained normative parsing, systematic judge bias could materially change both percentages and the decoupling claim.
  2. [Dataset construction] Dataset construction (synthetic cases): While three automated quality gates and human review are described, the manuscript provides limited detail on how the 5 defeat families and 19 authority types were instantiated to ensure the prompts do not independently trigger safety filters or contain phrasing artifacts that would bias refusal rates upward regardless of the defeat condition. This is load-bearing for the claim that refusals occur 'even when the request poses no independent safety or dual-use concerns.'
minor comments (2)
  1. [Abstract / Results] The abstract states N=14,650 but does not include a breakdown by model family or defeat type; a summary table would improve readability of the scale and distribution of the results.
  2. [Methods] Notation for the two behavioral dimensions (response type and defeat recognition) is introduced clearly in the abstract but would benefit from an explicit definition table or example annotations in the methods to aid replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications on our methodology and committing to revisions that strengthen the empirical claims without overstating the current evidence.

read point-by-point responses
  1. Referee: [Evaluation / LLM-as-judge protocol] Evaluation section (LLM-as-judge protocol): The headline statistics (75.4% refusal rate and 57.5% defeat engagement) rest entirely on classifications produced by a single blinded GPT-5.4 judge. No human re-labeling, inter-annotator agreement, calibration data, or error analysis on the response-type and defeat-recognition dimensions are reported. Because distinguishing hard refusals from deflections and detecting implicit recognition of defeat conditions requires fine-grained normative parsing, systematic judge bias could materially change both percentages and the decoupling claim.

    Authors: We agree that sole reliance on a single LLM judge without reported human validation or agreement metrics constitutes a genuine limitation, especially given the normative subtlety involved in distinguishing hard refusals from deflections and detecting implicit defeat recognition. Although the judge was blinded to model identity and used a fixed, detailed classification rubric, this does not eliminate the risk of systematic bias. In the revised manuscript we will add a human re-annotation study on a stratified random sample of 1,000 responses (roughly 7 % of the corpus), report inter-annotator agreement (Cohen’s kappa) between two human annotators and between humans and the GPT-5.4 judge, and include a categorized error analysis of disagreement cases. These additions will allow readers to assess the robustness of the 75.4 % refusal rate and the 57.5 % engagement figure. revision: yes

  2. Referee: [Dataset construction] Dataset construction (synthetic cases): While three automated quality gates and human review are described, the manuscript provides limited detail on how the 5 defeat families and 19 authority types were instantiated to ensure the prompts do not independently trigger safety filters or contain phrasing artifacts that would bias refusal rates upward regardless of the defeat condition. This is load-bearing for the claim that refusals occur 'even when the request poses no independent safety or dual-use concerns.'

    Authors: We acknowledge that the current description of dataset construction is insufficiently detailed to fully dispel concerns about independent safety triggers or phrasing artifacts. The three automated gates were: (1) an LLM-based coherence filter ensuring each prompt constitutes a coherent request to evade the stated rule, (2) a keyword-based safety filter that removes any prompt containing terms associated with independently prohibited activities (e.g., direct requests for weapons or child exploitation), and (3) a perplexity-based naturalness filter. Human review was performed on 300 randomly sampled cases. In the revision we will expand the Dataset Construction section with (a) concrete instantiation examples for each of the five defeat families across multiple authority types, (b) the exact generation prompts used to create the cases, and (c) quantitative statistics showing the fraction of candidate prompts filtered by each gate (currently >85 % pass all gates). These additions will make explicit that the high refusal rates are driven by the defeat conditions rather than extraneous safety signals. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement with direct counts from model outputs

full rationale

The paper constructs a synthetic prompt dataset across defeat families and authority types, queries 18 model configurations, and classifies responses via blinded LLM-as-judge on two dimensions (response type and defeat recognition). It reports raw percentages (75.4% refusal, 57.5% engagement) as direct tallies from these classifications. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear as load-bearing steps in the central claim. The methodology is self-contained empirical measurement without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study relies on standard empirical methods in AI evaluation. No free parameters, invented entities, or non-standard axioms are mentioned in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1139 out tokens · 34158 ms · 2026-05-13T19:21:02.219143+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 5 internal anchors

  1. [1]

    In Christopher Wellman and John Simmons, editors, Is There a Duty to Obey the Law ? , For and Against , pages 54--73

    Just and Unjust Laws . In Christopher Wellman and John Simmons, editors, Is There a Duty to Obey the Law ? , For and Against , pages 54--73. Cambridge University Press, Cambridge, 2005. ISBN 978-0-521-83097-3. doi:10.1017/CBO9780511809286.004. URL https://www.cambridge.org/core/books/is-there-a-duty-to-obey-the-law/just-and-unjust-laws/900E08CFED2E38C1BB3...

  2. [2]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A General Language Assistant as a Labora...

  3. [3]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  4. [4]

    Safety- Tuned LLaMAs : Lessons From Improving the Safety of Large Language Models that Follow Instructions

    Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Rottger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety- Tuned LLaMAs : Lessons From Improving the Safety of Large Language Models that Follow Instructions . October 2023. URL https://openreview.net/forum?id=gT5hALch9z

  5. [5]

    doi:10.48550/arXiv.2407.12043 arXiv:2407.12043 [cs]

    Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, and Hannaneh Hajishirzi. The Art of Saying No : Contextual Noncompliance in Language Models , November 2024. URL http://arxiv.org/abs/2407.12043....

  6. [6]

    Conscience and Conviction : The Case for Civil Disobedience

    Kimberley Brownlee. Conscience and Conviction : The Case for Civil Disobedience . Oxford University Press, Oxford, 2015. ISBN 978-0-19-875946-1

  7. [7]

    Or-bench: An over-refusal benchmark for large language models.arXiv preprint arXiv:2405.20947,

    Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR - Bench : An Over - Refusal Benchmark for Large Language Models , June 2025. URL http://arxiv.org/abs/2405.20947. arXiv:2405.20947 [cs]

  8. [8]

    A Duty to Resist : When Disobedience Should Be Uncivil

    Candice Delmas. A Duty to Resist : When Disobedience Should Be Uncivil . Oxford University Press, New York, 2018. ISBN 978-0-19-087219-9

  9. [9]

    Civil Disobedience

    Candice Delmas and Kimberley Brownlee. Civil Disobedience . In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy . Metaphysics Research Lab, Stanford University, fall 2024 edition, 2024. URL https://plato.stanford.edu/archives/fall2024/entries/civil-disobedience/

  10. [10]

    Artificial Intelligence , Values and Alignment

    Iason Gabriel. Artificial Intelligence , Values and Alignment . Minds and Machines, 30 0 (3): 0 411--437, September 2020. ISSN 0924-6495, 1572-8641. doi:10.1007/s11023-020-09539-2. URL http://arxiv.org/abs/2001.09768. arXiv:2001.09768 [cs]

  11. [11]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...

  12. [12]

    Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.arXiv preprint arXiv:2406.18495, 2024

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard : Open One - Stop Moderation Tools for Safety Risks , Jailbreaks , and Refusals of LLMs , December 2024. URL http://arxiv.org/abs/2406.18495. arXiv:2406.18495 [cs]

  13. [13]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard : LLM -based Input - Output Safeguard for Human - AI Conversations , December 2023. URL http://arxiv.org/abs/2312.06674. arXiv:2312.06674 [cs]

  14. [14]

    When to Make Exceptions : Exploring Language Models as Accounts of Human Moral Judgment , October 2022

    Zhijing Jin, Sydney Levine, Fernando Gonzalez, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mihalcea, Josh Tenenbaum, and Bernhard Schölkopf. When to Make Exceptions : Exploring Language Models as Accounts of Human Moral Judgment , October 2022. URL http://arxiv.org/abs/2210.01478. arXiv:2210.01478 [cs]

  15. [15]

    Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar

    Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming Refusal with Conditional Activation Steering , February 2025. URL http://arxiv.org/abs/2409.05907. arXiv:2409.05907 [cs]

  16. [16]

    The Duty to Obey the Law

    David Lefkowitz. The Duty to Obey the Law . Philosophy Compass, 1 0 (6): 0 571--598, 2006. doi:10.1111/j.1747-9991.2006.00042.x

  17. [17]

    Artificial Intelligence Index Report 2025

    Nestor Maslej. Artificial Intelligence Index Report 2025. Artificial Intelligence, 2025

  18. [18]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench : A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal , February 2024. URL http://arxiv.org/abs/2402.04249. arXiv:2402.04249 [cs]

  19. [19]

    Normative conflicts and shallow AI alignment

    Raphaël Millière. Normative conflicts and shallow AI alignment. Philosophical Studies, 182 0 (7): 0 2035--2078, July 2025. ISSN 1573-0883. doi:10.1007/s11098-025-02347-3. URL https://doi.org/10.1007/s11098-025-02347-3

  20. [20]

    Introducing the Model Spec

    OpenAI . Introducing the Model Spec . Technical report, May 2024. URL https://openai.com/index/introducing-the-model-spec/

  21. [21]

    Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary , September 2025

    Licheng Pan, Yongqi Tong, Xin Zhang, Xiaolu Zhang, Jun Zhou, and Zhixuan Chu. Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary , September 2025. URL http://arxiv.org/abs/2505.18325. arXiv:2505.18325 [cs]

  22. [22]

    Bowman, and Shi Feng

    Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM Evaluators Recognize and Favor Their Own Generations , April 2024. URL http://arxiv.org/abs/2404.13076. arXiv:2404.13076 [cs]

  23. [23]

    Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. BBQ : A Hand - Built Bias Benchmark for Question Answering , October 2021. URL https://arxiv.org/abs/2110.08193v2

  24. [24]

    A Theory of Justice

    John Rawls. A Theory of Justice . Belknap Press: An Imprint of Harvard University Press, Cambridge, Mass, 1999. ISBN 978-0-674-00078-0

  25. [25]

    The authority of law: essays on law and morality

    Joseph Raz. The authority of law: essays on law and morality. Clarendon Press ; Oxford University Press, Oxford : New York, 1979. ISBN 978-0-19-825345-7

  26. [26]

    The morality of freedom

    Joseph Raz, editor. The morality of freedom. Clarendon Press, Oxford New York, 2010. ISBN 978-0-19-824807-1 978-0-19-151996-3 978-0-19-159828-9

  27. [27]

    Cannot or Should Not ? Automatic Analysis of Refusal Composition in IFT / RLHF Datasets and Refusal Behavior of Black - Box LLMs , December 2024

    Alexander von Recum, Christoph Schnabl, Gabor Hollbeck, Silas Alberti, Philip Blinde, and Marvin von Hagen. Cannot or Should Not ? Automatic Analysis of Refusal Composition in IFT / RLHF Datasets and Refusal Behavior of Black - Box LLMs , December 2024. URL http://arxiv.org/abs/2412.16974. arXiv:2412.16974 [cs]

  28. [28]

    Kim, Stephen Fitz, and Dan Hendrycks

    Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, and Dan Hendrycks. Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress ?, December 2024. URL http://arxiv.org/abs/2407.21792. arXiv:2407.21792 [cs]

  29. [29]

    Legal Obligation and Authority

    Massimo Renzo and Leslie Green. Legal Obligation and Authority . In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy . Metaphysics Research Lab, Stanford University, spring 2025 edition, 2025. URL https://plato.stanford.edu/archives/spr2025/entries/legal-obligation/

  30. [30]

    XSTest : A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest : A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models . In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguist...

  31. [31]

    Navigating the OverKill in Large Language Models

    Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, and Dahua Lin. Navigating the OverKill in Large Language Models . In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 460...

  32. [32]

    John Simmons

    A. John Simmons. Moral Principles and Political Obligations . Princeton University Press, Princeton, NJ, 1981. ISBN 978-0-691-02019-8

  33. [33]

    Woodland, and Jose Such

    Guangzhi Sun, Xiao Zhan, Shutong Feng, Philip C. Woodland, and Jose Such. CASE - Bench : Context - Aware SafEty Benchmark for Large Language Models , February 2025. URL http://arxiv.org/abs/2501.14940. arXiv:2501.14940 [cs]

  34. [34]

    Hale, and Paul Röttger

    Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, and Paul Röttger. SimpleSafetyTests : a Test Suite for Identifying Critical Safety Risks in Large Language Models , November 2023. URL https://arxiv.org/abs/2311.08370v2

  35. [35]

    Do- Not - Answer : Evaluating Safeguards in LLMs

    Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do- Not - Answer : Evaluating Safeguards in LLMs . In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics : EACL 2024 , pages 896--911, St. Julian's, Malta, March 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.find...

  36. [36]

    Zhao, J., Huang, J., Wu, Z., Bau, D., and Shi, W

    Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. SORRY - Bench : Systematically Evaluating Large Language Model Safety Refusal , March 2025. URL http://arxiv.org/abs/2406.14598. arXiv:2406.14598 [cs]

  37. [37]

    Zhehao Zhang, Weijie Xu, Fanyou Wu, and Chandan K. Reddy. FalseReject : A Resource for Improving Contextual Safety and Mitigating Over - Refusals in LLMs via Structured Reasoning , July 2025. URL http://arxiv.org/abs/2505.08054. arXiv:2505.08054 [cs]

  38. [38]

    LLMs encode harmfulness and refusal separately, 2025

    Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs Encode Harmfulness and Refusal Separately , December 2025. URL http://arxiv.org/abs/2507.11878. arXiv:2507.11878 [cs]

  39. [39]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  40. [40]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  41. [41]

    7 n ::8 y/

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...