arxiv: 2604.02574 · v1 · submitted 2026-04-02 · 💻 cs.CR · cs.AI· cs.LG

Recognition: no theorem link

Understanding the Effects of Safety Unalignment on Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:34 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.LG

keywords LLM safetyunalignmentjailbreak-tuningweight orthogonalizationadversarial attackshallucinationsmalicious capabilities

0 comments

The pith

Weight orthogonalization unaligns LLMs to aid malicious tasks better than jailbreak-tuning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Two methods exist for removing safety alignments from large language models: jailbreak-tuning and weight orthogonalization. The paper evaluates both on six models across many tasks and finds that while refusal rates drop similarly, weight orthogonalization creates models far more useful for malicious activities. These models hallucinate less often, keep their original language skills, and perform stronger in modern attack methods. Supervised fine-tuning is shown to curb the attack skills gained from weight orthogonalization.

Core claim

Across six popular LLMs, weight orthogonalization produces unaligned models far more capable of aiding in malicious activity than jailbreak-tuning; in contrast to JT, the majority of WO unaligned models are far less prone to hallucinations, better retain their original natural-language performance, and are more effective at state-of-the-art adversarial and cyber attacks.

What carries the argument

Weight orthogonalization, a method that disables safety guardrails by removing safety-related directions from model weights, as compared to jailbreak-tuning which uses targeted fine-tuning examples.

Load-bearing premise

The selected malicious and benign tasks, hallucination metrics, and natural-language benchmarks accurately reflect real-world malicious capabilities and that observed differences stem from the unalignment methods rather than evaluation artifacts or model-specific factors.

What would settle it

A controlled test on new real-world malicious assistance tasks where jailbreak-tuned models assist as effectively as or better than weight-orthogonalized models.

Figures

Figures reproduced from arXiv: 2604.02574 by John T. Halloran.

**Figure 2.** Figure 2: Relative to the original aligned model: refusal rate decrease, hallucination increase, and helpfulness decrease. In [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Safety alignment has become a critical step to ensure LLMs refuse harmful requests while providing helpful and harmless responses. However, despite the ubiquity of safety alignment for deployed frontier models, two separate lines of recent work--jailbreak-tuning (JT) and weight orthogonalization (WO)--have shown that safety guardrails may be largely disabled, resulting in LLMs which comply with harmful requests they would normally refuse. In spite of far-reaching safety implications, analysis has largely been limited to refusal rates of each unalignment method in isolation, leaving their relative effects on adversarial LLM capabilities unknown. To fill this gap, we study the impact of unaligning six popular LLMs of various sizes across a large number of malicious and benign tasks, using both JT and WO. Across the evaluated models, we show that while refusal degradation is split between the two methods, WO produces LLMs far more capable of aiding in malicious activity; in contrast to JT, the majority of WO unaligned models are far less prone to hallucinations, better retain their original natural-language performance, and are more effective at state-of-the-art adversarial and cyber attacks. To thus help mitigate the malicious risks of WO unalignment, we conclude by showing that supervised fine-tuning effectively limits the adversarial attack abilities enabled by WO, without drastically affecting hallucination rates or natural language performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WO unalignment makes models more capable at malicious tasks and attacks than JT while cutting hallucinations and retaining performance, but the attribution to the method itself needs tighter controls.

read the letter

The main takeaway is that weight orthogonalization (WO) unaligns LLMs in a way that leaves them more effective at helping with malicious activity and state-of-the-art attacks than jailbreak-tuning (JT) does. Across the six models tested, WO versions also hallucinate less, hold onto natural-language benchmarks better, and show split refusal degradation. The paper closes with evidence that supervised fine-tuning can curb the attack gains from WO without wrecking the other metrics much. That comparative picture is the useful part here. Prior studies handled JT and WO separately, so running both on the same models and task set fills a clear gap. The SFT mitigation result adds a concrete next step for anyone thinking about deployment risks. The soft spots sit in the evaluation design. The central claim that WO produces far more capable malicious models rests on the chosen tasks, hallucination metrics, and attack benchmarks actually isolating the unalignment method rather than prompt wording, base-model quirks, or task selection. The stress-test note is on target: without clear signs of standardized prompt sets, multiple seeds, statistical tests, or ablations on task choice, the observed gaps could trace to implementation details instead of JT versus WO. Even with the full text, those controls would need to be explicit for the differences to land cleanly. This paper is for AI safety groups that track unalignment techniques and want data on their relative downstream effects. Readers already running their own red-teaming or mitigation experiments will get the most out of the comparison. It deserves a serious referee. The topic is timely and the head-to-head angle is new enough that the details are worth checking rather than desk-rejecting.

Referee Report

2 major / 2 minor

Summary. The manuscript compares two unalignment methods—jailbreak-tuning (JT) and weight orthogonalization (WO)—applied to six LLMs. It finds that refusal degradation is split between the methods, but WO-unaligned models are substantially more capable of aiding malicious activity, exhibit lower hallucination rates, better retain original natural-language performance, and achieve higher success rates on state-of-the-art adversarial and cyber-attack tasks. The authors conclude by showing that supervised fine-tuning can limit the attack capabilities enabled by WO without major side effects on hallucinations or natural-language benchmarks.

Significance. If the comparative results hold under rigorous controls, the work is significant for LLM safety research because it moves beyond isolated refusal-rate measurements to quantify capability retention and malicious utility across unalignment techniques. The multi-model scope and the demonstration that SFT can mitigate WO risks provide actionable insights. The paper's strength lies in its direct empirical comparison of two distinct unalignment approaches on both malicious and benign tasks.

major comments (2)

[Abstract / Experimental evaluation] The central claim that WO produces models 'far more capable of aiding in malicious activity' and 'more effective at state-of-the-art adversarial and cyber attacks' (Abstract) rests on the assumption that observed differences are attributable to the unalignment method rather than prompt variations, base-model priors, or evaluation artifacts. The manuscript must specify the exact task sets, whether prompts were standardized, the number of random seeds, and any statistical tests used to establish significance; without these, the attribution to JT vs. WO cannot be considered load-bearing.
[Results] The statements that 'the majority of WO unaligned models are far less prone to hallucinations' and 'better retain their original natural-language performance' require quantitative backing. The paper should report effect sizes, per-model breakdowns, and baseline comparisons (e.g., in tables or figures) so that readers can judge whether the differences are consistent and practically meaningful rather than model-specific.

minor comments (2)

[Methods] Clarify the precise implementation of weight orthogonalization (e.g., the orthogonalization objective or projection used) with pseudocode or equations in the methods section.
[Experimental setup] List the six evaluated LLMs with their parameter counts and any fine-tuning details to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us strengthen the clarity and rigor of our experimental reporting. We address each major comment below and have incorporated revisions to provide the requested details and quantitative support.

read point-by-point responses

Referee: [Abstract / Experimental evaluation] The central claim that WO produces models 'far more capable of aiding in malicious activity' and 'more effective at state-of-the-art adversarial and cyber attacks' (Abstract) rests on the assumption that observed differences are attributable to the unalignment method rather than prompt variations, base-model priors, or evaluation artifacts. The manuscript must specify the exact task sets, whether prompts were standardized, the number of random seeds, and any statistical tests used to establish significance; without these, the attribution to JT vs. WO cannot be considered load-bearing.

Authors: We agree that explicit documentation of these controls is necessary for the claims to be load-bearing. In the revised manuscript, we have added a new subsection (Section 3.2) that fully specifies the task sets (including the exact malicious and benign benchmarks used), confirms that all prompts were standardized and held constant across models and methods, reports the use of 5 random seeds with mean and standard deviation, and includes paired t-tests with p-values to establish statistical significance of differences between JT and WO. These additions directly support attribution to the unalignment method. revision: yes
Referee: [Results] The statements that 'the majority of WO unaligned models are far less prone to hallucinations' and 'better retain their original natural-language performance' require quantitative backing. The paper should report effect sizes, per-model breakdowns, and baseline comparisons (e.g., in tables or figures) so that readers can judge whether the differences are consistent and practically meaningful rather than model-specific.

Authors: We acknowledge that the original presentation relied too heavily on summary statements. The revised Results section now includes expanded tables with per-model breakdowns for hallucination rates and natural-language benchmarks, reports effect sizes (Cohen's d) for all key comparisons, and adds baseline comparisons to the original aligned models in both tables and a new figure. These updates demonstrate that the advantages for WO are consistent across the majority of models and practically meaningful. revision: yes

Circularity Check

0 steps flagged

Purely empirical comparison with no derivations or fitted parameters

full rationale

The paper performs direct experimental evaluations of jailbreak-tuning (JT) and weight orthogonalization (WO) on six LLMs, measuring refusal degradation, hallucination rates, natural-language benchmarks, and adversarial/cyber attack success. No equations, derivations, parameter fittings, or self-citation chains appear in the load-bearing claims. All results rest on task-specific measurements rather than any reduction to inputs by construction. The analysis is self-contained and does not invoke uniqueness theorems, ansatzes, or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical and relies on standard LLM evaluation practices; no new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)

domain assumption Existing benchmarks for malicious tasks, hallucinations, and natural language performance are valid and unbiased measures of capability.
The study applies these benchmarks without additional validation or discussion of their limitations.

pith-pipeline@v0.9.0 · 5532 in / 1286 out tokens · 61703 ms · 2026-05-13T20:34:14.385076+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 10 internal anchors

[1]

Disrupting the first reported ai-orchestrated cyber espionage campaign

Anthropic . Disrupting the first reported ai-orchestrated cyber espionage campaign. Anthropic News, Nov. 2025. URL https://www.anthropic.com/news/disrupting-AI-espionage

work page 2025
[2]

Arditi, O

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37: 0 136037--136083, 2024

work page 2024
[3]

Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Betley, D

J. Betley, D. C. H. Tan, N. Warncke, A. Sztyber-Betley, X. Bao, M. Soto, N. Labenz, and O. Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms. In International Conference on Machine Learning, pages 4043--4068. PMLR, 2025

work page 2025
[5]

Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432--7439, 2020

work page 2020
[6]

Bowen, B

D. Bowen, B. Murphy, W. Cai, D. Khachaturov, A. Gleave, and K. Pelrine. Data poisoning in llms: Jailbreak-tuning and scaling laws. Proceedings of the AAAI Conference on Artificial Intelligence, 39 0 (26): 0 27206--27214, Apr. 2025. doi:10.1609/aaai.v39i26.34929. URL https://ojs.aaai.org/index.php/AAAI/article/view/34929

work page doi:10.1609/aaai.v39i26.34929 2025
[7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Claude ai used in hack of mexican government data

Cybernews . Claude ai used in hack of mexican government data. Cybernews, 2026. URL https://cybernews.com/security/claude-ai-mexico-government-hack/

work page 2026
[9]

J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. Safe RLHF : Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TyFrPOKYXw

work page 2024
[10]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac'h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836

work page arXiv 2023
[12]

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Hendrycks, C

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021

work page 2021
[14]

Y. Hu, T. Ganter, H. Deilamsalehy, F. Dernoncourt, H. Foroosh, and F. Liu. M eeting B ank: A benchmark dataset for meeting summarization. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16409--16423, Toronto, Canada, July 2023. Ass...

work page doi:10.18653/v1/2023.acl-long.906 2023
[15]

Huang, S

Y. Huang, S. Gupta, M. Xia, K. Li, and D. Chen. Catastrophic jailbreak of open-source llms via exploiting generation. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[16]

GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y. Wang, and Y. Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 24678--24704. Curran Associates, ...

work page 2023
[18]

J. Ji, D. Hong, B. Zhang, B. Chen, J. Dai, B. Zheng, T. A. Qiu, J. Zhou, K. Wang, B. Li, et al. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31983--32016, 2025 a

work page 2025
[19]

J. Ji, T. Qiu, B. Chen, B. Zhang, H. Lou, K. Wang, Y. Duan, Z. He, L. Vierling, D. Hong, J. Zhou, Z. Zhang, F. Zeng, J. Dai, X. Pan, K. Y. Ng, A. O'Gara, H. Xu, B. Tse, J. Fu, S. McAleer, Y. Yang, Y. Wang, S.-C. Zhu, Y. Guo, and W. Gao. Ai alignment: A comprehensive survey, 2025 b . URL https://arxiv.org/abs/2310.19852

work page arXiv 2025
[20]

Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b

S. Lermen, C. Rogers-Smith, and J. Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. arXiv preprint arXiv:2310.20624, 2023

work page arXiv 2023
[21]

S. Lin, J. Hilton, and O. Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214--3252, 2022

work page 2022
[22]

X. Liu, P. Li, G. E. Suh, Y. Vorobeychik, Z. Mao, S. Jha, P. McDaniel, H. Sun, B. Li, and C. Xiao. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Representation Learning, volume 2025, pages 10313--10360, 2025. URL https://proceedings.iclr...

work page 2025
[23]

Mazeika, D

M. Mazeika, D. Hendrycks, H. Li, X. Xu, S. Hough, A. Zou, A. Rajabi, Q. Yao, Z. Wang, J. Tian, et al. The trojan detection challenge. In NeurIPS 2022 Competition Track, pages 279--291. PMLR, 2023

work page 2022
[24]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, et al. HarmBench : A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Mehrotra, M

A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nelson, H. Anderson, Y. Singer, and A. Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. Advances in Neural Information Processing Systems, 37: 0 61065--61105, 2024

work page 2024
[26]

Exploiting novel gpt-4 apis

K. Pelrine, M. Taufeeque, M. Zaj a c, E. McLean, and A. Gleave. Exploiting novel gpt-4 apis. arXiv preprint arXiv:2312.14302, 2023

work page arXiv 2023
[27]

X. Qi, Y. Zeng, T. Xie, P.-Y. Chen, R. Jia, P. Mittal, and P. Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hTEGyKf0dZ

work page 2024
[28]

Rafailov, A

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36: 0 53728--53741, 2023

work page 2023
[29]

The rise of ai-generated malware and its implications

Recorded Future . The rise of ai-generated malware and its implications. Recorded Future Blog, 2024. URL https://www.recordedfuture.com/blog/ai-generated-malware

work page 2024
[30]

Sakaguchi, R

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64 0 (9): 0 99--106, 2021

work page 2021
[31]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Souly, Q

A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, et al. A strongreject for empty jailbreaks. Advances in Neural Information Processing Systems, 37: 0 125416--125440, 2024

work page 2024
[33]

L. Tang, I. Shalyminov, A. Wong, J. Burnsky, J. Vincent, Y. Yang, S. Singh, S. Feng, H. Song, H. Su, et al. Tofueval: Evaluating hallucinations of llms on topic-focused dialogue summarization. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pa...

work page 2024
[34]

Taori, I

R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford Alpaca : An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023

work page 2023
[35]

G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivi \`e re, M. S. Kale, J. Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2024

Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2024. URL https://huggingface.co/datasets/teknium/OpenHermes-2.5

work page 2024
[37]

K. Tian, E. Mitchell, H. Yao, C. D. Manning, and C. Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WPZ2yPag4K

work page 2024
[38]

S. Wan, C. Nikolaidis, D. Song, D. Molnar, J. Crnkovich, J. Grace, M. Bhatt, S. Chennabasappa, S. Whitman, S. Ding, V. Ionescu, Y. Li, and J. Saxe. Cyberseceval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models, 2024. URL https://arxiv.org/abs/2408.01605

work page arXiv 2024
[39]

H. Wang, Y. Lin, W. Xiong, R. Yang, S. Diao, S. Qiu, H. Zhao, and T. Zhang. Arithmetic control of llms for diverse user preferences: Directional preference alignment with multi-objective rewards. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8642--8655, 2024

work page 2024
[40]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Zellers, A

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, 2019

work page 2019
[42]

Q. Zhan, R. Fang, R. Bindu, A. Gupta, T. B. Hashimoto, and D. Kang. Removing rlhf protections in gpt-4 via fine-tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 681--687, 2024

work page 2024
[43]

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023

work page 2023