pith. machine review for the scientific record.

arxiv: 2604.10326 · v1 · submitted 2026-04-11 · 💻 cs.CR · cs.AI

Recognition: unknown

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:15 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords jailbreaking · large language models · attention heads · nullspace · model safety · adversarial attacks · circuit interventions

The pith

Head-masked nullspace steering subverts LLM safety alignments more effectively than prior jailbreak methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models use safety alignments to refuse harmful requests, yet these can still be bypassed by jailbreak inputs. The paper introduces Head-Masked Nullspace Steering, which locates the attention heads that drive safe default behavior, masks their output paths, and adds a small perturbation only in the unused dimensions of the model's internal space. The process repeats in a closed loop during generation, re-finding the relevant heads each time. If the claim holds, it shows that safety mechanisms depend on specific geometric subspaces that can be isolated and overridden without broad disruption to the model.

Core claim

The authors establish that identifying attention heads causally responsible for a model's default behavior, suppressing their write paths via targeted column masking, and injecting a perturbation constrained to the orthogonal complement of the muted subspace enables reliable subversion of safety mechanisms. This closed-loop detection-intervention cycle, applied across multiple decoding attempts, produces state-of-the-art attack success rates on jailbreak benchmarks while using fewer queries than earlier techniques. Ablations confirm that the nullspace constraint, residual scaling, and iterative head re-identification each contribute to the gains.

What carries the argument

Head-Masked Nullspace Steering (HMNS), which identifies causal attention heads, masks their contributions through column masking, and injects perturbations restricted to the nullspace of the masked subspace to alter specific behaviors while leaving other computations intact.
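
To make the mechanics concrete, the block below is a minimal, hedged sketch of what such a closed loop could look like in Python with NumPy. It is not the authors' code: the attribution scores and per-head write directions are random stand-ins, the helper names (nullspace_basis, hmns_step) are invented for illustration, and masking is approximated by projecting the residual onto the complement of the flagged heads' write directions rather than by zeroing output-projection columns.

    # Illustrative sketch only; shapes and helpers are assumptions, not the paper's API.
    import numpy as np

    def nullspace_basis(V, tol=1e-10):
        """Orthonormal basis for the orthogonal complement of span(columns of V)."""
        U, s, _ = np.linalg.svd(V, full_matrices=True)
        rank = int((s > tol).sum())
        return U[:, rank:]                               # (d_model, d_model - rank)

    def hmns_step(resid, head_write_cols, alpha=0.5, seed=0):
        """One mask-and-inject step on a residual-stream vector `resid` (d_model,)."""
        # (ii) suppress the flagged heads' write paths: keep only the component of the
        # residual outside the span of their write directions (a stand-in for masking
        # the corresponding output-projection columns).
        B = nullspace_basis(head_write_cols)
        resid_masked = B @ (B.T @ resid)
        # (iii) inject a perturbation constrained to that same orthogonal complement,
        # scaled relative to the residual norm (residual norm scaling).
        rng = np.random.default_rng(seed)
        direction = B @ rng.standard_normal(B.shape[1])
        direction /= np.linalg.norm(direction) + 1e-12
        return resid_masked + alpha * np.linalg.norm(resid) * direction

    # Toy closed loop: re-identify "causal" heads before each attempt (random scores
    # stand in for the paper's attribution) and reapply the intervention.
    d_model, n_heads, k = 64, 8, 2
    rng = np.random.default_rng(1)
    resid = rng.standard_normal(d_model)
    for attempt in range(5):
        scores = rng.standard_normal(n_heads)            # stand-in for causal attribution
        top = np.argsort(scores)[-k:]
        W_o = rng.standard_normal((d_model, n_heads))    # stand-in per-head write directions
        resid = hmns_step(resid, W_o[:, top], alpha=0.5, seed=attempt)

The point the sketch preserves is that both the masking and the injection live entirely in the orthogonal complement of the flagged subspace, and the loop re-selects heads before every attempt.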

If this is right

  • Safety alignments leave identifiable geometric structure in attention heads that can be isolated and steered without full model retraining.
  • Iterative re-identification of causal heads during decoding yields higher success than one-time interventions.
  • Nullspace injection combined with masking preserves overall model output quality while changing targeted refusal behaviors.
  • The approach generalizes across multiple models, benchmarks, and existing defense layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future safety work may need to add monitoring for internal subspace manipulations rather than relying only on output filtering.
  • The same geometric steering could be tested for directing models toward constructive behaviors instead of only harmful ones.
  • Causal interpretability tools carry dual-use implications for both defending and attacking model alignments.

Load-bearing premise

The method assumes that masking causal heads and adding nullspace-constrained perturbations will override safety behaviors without the model detecting the change or activating countermeasures during generation.

What would settle it

Attack success rates that fall to or below baseline levels when the nullspace constraint is removed or when head re-identification is turned off, or a model that continues to refuse harmful outputs despite the interventions, would falsify the central effectiveness claim.
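
As a concrete reading of that criterion, the sketch below (written against assumed data structures, not anything from the paper) computes the attack success rate per variant from per-prompt success flags and reports any variant whose ASR is at or below the no-intervention baseline, the condition named above; variant names such as 'no_nullspace_constraint' are hypothetical labels.

    # Sketch of the falsification check; `runs` and the variant names are assumptions.
    def asr(successes):
        """Attack success rate: fraction of prompts with a successful jailbreak."""
        return sum(successes) / max(len(successes), 1)

    def falsifying_variants(runs, baseline_key="no_intervention"):
        """runs maps variant name -> list of per-prompt booleans.
        Returns variants (e.g. 'no_nullspace_constraint', 'no_reidentification')
        whose ASR does not exceed the baseline ASR."""
        base = asr(runs[baseline_key])
        return {name: asr(flags) for name, flags in runs.items()
                if name != baseline_key and asr(flags) <= base}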

Figures

Figures reproduced from arXiv: 2604.10326 by Maisha Maliha, Sumit Kumar Jha, Susmit Jha, Vishal Pramanik.

Figure 1
Figure 1: HMNS successfully jailbreaks LLaMA 3.1 70B, demonstrating high attack success and compute efficiency even on large-scale, strongly aligned models.
Figure 2
Figure 2: Overview of HMNS procedure. Each step in the closed-loop intervention pipeline is shown: attribution identifies influential heads; masking suppresses them; nullspace steering computes an orthogonal direction; and a scaled perturbation is injected into the residual stream. If unsuccessful, the process repeats with updated attribution.
Figure 3
Figure 3: Ablation studies on Phi-3 Medium 14B (AdvBench). (a) Attribution mechanisms: KL-divergence achieves the best ASR–compute tradeoff; removing proxy pre-selection preserves ASR but nearly triples IPC. (b) Nullspace and injection design: strict orthogonality and hard masking are critical; partial masking (γ = 0.5) degrades ASR and raises ACQ. Bars show ASR (%, GPT-4o/GPT-5); overlaid lines show FPS (×10¹²) in …
Figure 4
Figure 4: ASR vs. cumulative FLOPs for increasing compute budgets. Each point is a per-prompt budget cap; methods run until success or budget exhaustion. HMNS reaches higher ASR at lower FLOPs, with the gap widening under stronger defenses (RPO, PAT). [Plot: Attack Success Rate (ASR %) vs. FLOPs budget (×10¹²); LLaMA-2-7B-Chat, AdvBench.]
original abstract

Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that nullspace-constrained injection, residual norm scaling, and iterative re-identification are key to its effectiveness. To our knowledge, this is the first jailbreak method to leverage geometry-aware, interpretability-informed interventions, highlighting a new paradigm for controlled model steering and adversarial safety circumvention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Head-Masked Nullspace Steering (HMNS), a circuit-level intervention for jailbreaking LLMs. It identifies attention heads most causally responsible for default behavior, suppresses their write paths via column masking, and injects a perturbation constrained to the orthogonal complement of the muted subspace. The procedure runs in a closed-loop cycle that re-identifies causal heads and reapplies the intervention across decoding steps. The paper reports state-of-the-art attack success rates across multiple jailbreak benchmarks, strong safety defenses, and widely used models, with improved query efficiency over prior methods; ablations are said to confirm the importance of nullspace-constrained injection, residual norm scaling, and iterative re-identification.

Significance. If the empirical results and ablations hold under scrutiny, the work is significant for introducing the first geometry-aware, interpretability-informed jailbreak technique. The nullspace constraint and closed-loop re-identification represent a technically novel paradigm that could influence both adversarial evaluation and safety research by showing how targeted, low-norm interventions can achieve high success with fewer queries.

major comments (2)
  1. [Closed-loop detection-intervention cycle and ablation studies] The central claim of SOTA ASR and query efficiency rests on the closed-loop cycle (described in the methods and ablation sections). No quantitative evidence is supplied on the stability of causal-head rankings across decoding steps (e.g., Jaccard overlap or rank correlation of top-k heads between consecutive interventions) or on whether the injected vector remains in the nullspace after subsequent layers. If head rankings shift by more than a small fraction, the iterative nullspace injection cannot guarantee persistence, directly undermining the reported performance gains.
  2. [Ablation studies] The abstract and results claim that ablations confirm iterative re-identification, nullspace injection, and residual scaling are key, yet the manuscript provides no specific quantitative deltas (e.g., ASR drop when re-identification is disabled, or when the nullspace constraint is replaced by an unconstrained perturbation). Without these numbers and controls, it is impossible to isolate the contribution of the geometric component from other factors such as increased query budget or head-selection heuristics.
minor comments (2)
  1. [Abstract] The abstract asserts 'state-of-the-art' results without even a high-level comparison table or citation to the strongest prior baselines; a brief results summary table should be referenced in the abstract.
  2. [Methods] The description of 'column masking' and the 'orthogonal complement' would benefit from an explicit equation or pseudocode block in the methods section to make the intervention reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments regarding the closed-loop cycle and the need for more precise ablation metrics are well-taken, and we address them point by point below. We have prepared revisions to incorporate the requested quantitative analyses.

point-by-point responses
  1. Referee: The central claim of SOTA ASR and query efficiency rests on the closed-loop cycle (described in the methods and ablation sections). No quantitative evidence is supplied on the stability of causal-head rankings across decoding steps (e.g., Jaccard overlap or rank correlation of top-k heads between consecutive interventions) or on whether the injected vector remains in the nullspace after subsequent layers. If head rankings shift by more than a small fraction, the iterative nullspace injection cannot guarantee persistence, directly undermining the reported performance gains.

    Authors: We agree that the manuscript would benefit from explicit quantitative metrics on head-ranking stability and nullspace persistence. Although the closed-loop design re-identifies causal heads at each step precisely to accommodate potential shifts, we did not report overlap statistics such as Jaccard similarity or rank correlation in the original submission. In the revised manuscript we will add these analyses (computed over the evaluation trajectories) together with verification that the injected perturbation lies in the updated nullspace after subsequent layers. This will demonstrate that the iterative process maintains its geometric guarantees even when rankings vary modestly. (An illustrative sketch of such overlap metrics appears after these responses.) revision: yes

  2. Referee: The abstract and results claim that ablations confirm iterative re-identification, nullspace injection, and residual scaling are key, yet the manuscript provides no specific quantitative deltas (e.g., ASR drop when re-identification is disabled, or when the nullspace constraint is replaced by an unconstrained perturbation). Without these numbers and controls, it is impossible to isolate the contribution of the geometric component from other factors such as increased query budget or head-selection heuristics.

    Authors: We concur that the ablation section would be strengthened by explicit numerical deltas. The current manuscript describes the ablation variants and states that they confirm the importance of the components, but does not tabulate the precise ASR differences. We will revise the results and appendix to include a table reporting ASR (and query counts) for each controlled variant—e.g., disabling iterative re-identification, replacing the nullspace constraint with an unconstrained perturbation of equal norm, and ablating residual scaling—thereby isolating the geometric contribution from other factors. revision: yes
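
For the stability question raised in point 1, the sketch below shows how the requested overlap statistics could be computed from consecutive attribution passes. The per-head score vectors, the top-k cutoff, and the `score_history` structure are assumptions about what the pipeline exposes, not the authors' interface.

    # Illustrative stability metrics; score vectors are assumed per-head attributions.
    import numpy as np

    def topk_jaccard(scores_a, scores_b, k):
        """Jaccard overlap of the top-k head indices under two attribution passes."""
        a, b = set(np.argsort(scores_a)[-k:]), set(np.argsort(scores_b)[-k:])
        return len(a & b) / len(a | b)

    def spearman(scores_a, scores_b):
        """Spearman rank correlation (Pearson correlation of the ranks; assumes no ties)."""
        ra = np.argsort(np.argsort(scores_a)).astype(float)
        rb = np.argsort(np.argsort(scores_b)).astype(float)
        ra -= ra.mean(); rb -= rb.mean()
        return float(ra @ rb / (np.linalg.norm(ra) * np.linalg.norm(rb)))

    def stability_trace(score_history, k=8):
        """Per-step (Jaccard, Spearman) between consecutive head-attribution passes."""
        return [(topk_jaccard(a, b, k), spearman(a, b))
                for a, b in zip(score_history, score_history[1:])]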

Circularity Check

0 steps flagged

No circularity: empirical procedure with no self-referential derivations

full rationale

The paper describes HMNS as a procedural intervention: identify causal heads, apply column masking, inject orthogonal perturbation, and repeat in closed loop. No equations, derivations, or fitted parameters appear in the provided text that reduce any claimed result to its own inputs by construction. Success is asserted via external benchmark ASR numbers and ablations rather than any mathematical identity or self-citation chain. The closed-loop re-identification is an algorithmic choice whose effectiveness is tested empirically, not presupposed by definition. This is the normal non-circular case for an applied attack paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that attention heads carry identifiable causal responsibility for safety behavior and that perturbations can be usefully constrained to the orthogonal complement of a masked subspace; these are domain assumptions from mechanistic interpretability rather than new postulates.

axioms (2)
  • domain assumption Certain attention heads are most causally responsible for a model's default safety behavior
    Invoked to justify the identification and masking step in the HMNS pipeline.
  • domain assumption A perturbation can be injected in the orthogonal complement of the muted subspace without interfering with the masked heads
    Core geometric premise enabling the nullspace steering component.
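
The second axiom is straightforward to check numerically. The sketch below is an illustration with random stand-ins rather than the paper's implementation: it builds the orthogonal projector onto the complement of a set of write directions V and verifies that a projected perturbation has no component along V.

    # Numerical check of the geometric premise; V is a random stand-in for the
    # masked heads' write directions, and d_model, k are arbitrary toy sizes.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, k = 64, 4
    V = rng.standard_normal((d_model, k))

    # Projector onto the orthogonal complement of span(V): P = I - V (V^T V)^{-1} V^T
    P = np.eye(d_model) - V @ np.linalg.solve(V.T @ V, V.T)

    delta = P @ rng.standard_normal(d_model)   # nullspace-constrained perturbation
    print(np.abs(V.T @ delta).max())           # ~1e-14: no leakage into the muted subspace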

pith-pipeline@v0.9.0 · 5494 in / 1462 out tokens · 32609 ms · 2026-05-10T15:15:47.392075+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 28 canonical work pages · 8 internal anchors

  1. [1]

    Toolqa: A dataset for llm question answering with external tools

    Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems, 36:50117--50143, 2023

  2. [2]

    Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x

    Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5673--5684, 2023

  3. [3]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744, 2022

  4. [4]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728--53741, 2023

  5. [5]

    Pretraining language models with human preferences

    Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In International Conference on Machine Learning, pages 17506--17533. PMLR, 2023

  6. [6]

    Red Teaming Language Models with Language Models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022

  7. [7]

    Jailbroken: How does llm safety training fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079--80110, 2023

  8. [8]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

  9. [9]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

  10. [10]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems, 37:55005--55029, 2024

  11. [11]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023

  12. [12]

    PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips

    Zachary Coalson, Jeonghyun Woo, Yu Sun, Shiyang Chen, Lishan Yang, Prashant Nair, Bo Fang, and Sanghyun Hong. Prisonbreak: Jailbreaking large language models with fewer than twenty-five targeted bit-flips. arXiv preprint arXiv:2412.07192, 2024

  13. [13]

    Jailbreaker: Automated jailbreak across multiple large language model chatbots

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Masterkey: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023

  14. [14]

    A wolf in sheep's clothing: Generalized nested jailbreak prompts can fool large language models easily

    Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A wolf in sheep's clothing: Generalized nested jailbreak prompts can fool large language models easily, 2023

  15. [15]

    Tensor Trust: Interpretable prompt injection attacks from an online game

    Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, et al. Tensor trust: Interpretable prompt injection attacks from an online game. arXiv preprint arXiv:2311.01011, 2023

  16. [16]

    Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms

    Zeyi Liao and Huan Sun. Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms. arXiv preprint arXiv:2404.07921, 2024

  17. [17]

    AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models

    Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. Autodan: interpretable gradient-based adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023

  18. [18]

    Improved techniques for optimization-based jailbreaking on large language models

    Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, and Min Lin. Improved techniques for optimization-based jailbreaking on large language models. arXiv preprint arXiv:2405.21018, 2024

  19. [19]

    Boosting jailbreak attack with momentum

    Yihao Zhang and Zeming Wei. Boosting jailbreak attack with momentum. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE, 2025

  20. [20]

    One model transfer to all: On robust jailbreak prompts generation against llms

    Linbao Li, Yannan Liu, Daojing He, and Yu Li. One model transfer to all: On robust jailbreak prompts generation against llms. arXiv preprint arXiv:2505.17598, 2025

  21. [21]

    DeepInception: Hypnotize Large Language Model to Be Jailbreaker

    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. Deepinception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191, 2023

  22. [22]

    CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

    Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, and Xuanjing Huang. Codechameleon: Personalized encryption framework for jailbreaking large language models. arXiv preprint arXiv:2402.16717, 2024

  23. [23]

    Many-shot jailbreaking

    Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking. Advances in Neural Information Processing Systems, 37:129696--129742, 2024

  24. [24]

    Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs

    Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, and Ee-Chien Chang. Semantic mirror jailbreak: Genetic algorithm based jailbreak prompts against open-source llms. arXiv preprint arXiv:2402.14872, 2024 a

  25. [25]

    All in how you ask for it: Simple black-box method for jailbreak attacks

    Kazuhiro Takemoto. All in how you ask for it: Simple black-box method for jailbreak attacks. Applied Sciences, 14(9):3558, 2024

  26. [26]

    Tree of attacks: Jailbreaking black-box llms automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. Advances in Neural Information Processing Systems, 37:61065--61105, 2024

  27. [27]

    DrAttack: Prompt Decom- position and Reconstruction Makes Powerful LLM Jailbreakers

    Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. Drattack: Prompt decomposition and reconstruction makes powerful llm jailbreakers. arXiv preprint arXiv:2402.16914, 2024 b

  28. [28]

    Towards best practices of activation patching in language models: Metrics and methods

    Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042, 2023

  29. [29]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023

  30. [30]

    Foot-in-the-door: A multi-turn jailbreak for llms

    Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jailbreak for llms. arXiv preprint arXiv:2502.19820, 2025

  31. [31]

    Efficient llm jailbreak via adaptive dense-to-sparse constrained optimization

    Kai Hu, Weichen Yu, Yining Li, Tianjun Yao, Xiang Li, Wenhe Liu, Lijun Yu, Zhiqiang Shen, Kai Chen, and Matt Fredrikson. Efficient llm jailbreak via adaptive dense-to-sparse constrained optimization. Advances in Neural Information Processing Systems, 37:23224--23245, 2024

  32. [32]

    Siege: Autonomous multi-turn jailbreaking of large language models with tree search

    Andy Zhou and Ron Arel. Tempest: Autonomous multi-turn jailbreaking of large language models with tree search. arXiv preprint arXiv:2503.10619, 2025

  33. [33]

    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023

  34. [34]

    Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks

    Chen Xiong, Xiangyu Qi, Pin-Yu Chen, and Tsung-Yi Ho. Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks. arXiv preprint arXiv:2405.20099, 2024

  35. [35]

    Robust prompt optimization for defending language models against jailbreaking attacks

    Andy Zhou, Bo Li, and Haohan Wang. Robust prompt optimization for defending language models against jailbreaking attacks. Advances in Neural Information Processing Systems, 37:40184--40211, 2024

  36. [36]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023

  37. [37]

    Fight back against jailbreaking via prompt adversarial tuning

    Yichuan Mo, Yuji Wang, Zeming Wei, and Yisen Wang. Fight back against jailbreaking via prompt adversarial tuning. Advances in Neural Information Processing Systems, 37:64242--64272, 2024

  38. [38]

    SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

    Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. Safedecoding: Defending against jailbreak attacks via safety-aware decoding. arXiv preprint arXiv:2402.08983, 2024

  39. [39]

    A strongreject for empty jailbreaks

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks. Advances in Neural Information Processing Systems, 37:125416--125440, 2024

  40. [40]

    Detoxify

    Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020

  41. [41]

    Llm defenses are not robust to multi-turn human jailbreaks yet

    N. Li, Z. Han, I. Steneker, W. Primack, R. Goodside, H. Zhang, Z. Wang, C. Menghini, and S. Yue. Llm defenses are not robust to multi-turn human jailbreaks yet. arXiv preprint arXiv:2408.15221, 2024

  42. [42]

    Llm self defense: By self examination, llms know they are being tricked

    Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, and Duen Horng (Polo) Chau. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308, 2023

  43. [43]

    Function vectors in large language models

    Eric Todd, Millicent L Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. Function vectors in large language models. arXiv preprint arXiv:2310.15213, 2023

  44. [44]

    Steering Llama 2 via Contrastive Activation Addition

    Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023

  45. [45]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  46. [46]

    Qlora: Efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088--10115, 2023

  47. [47]

    Locating and editing factual associations in gpt

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359--17372, 2022
