pith. machine review for the scientific record.

arxiv: 2406.11717 · v3 · submitted 2024-06-17 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Refusal in Language Models Is Mediated by a Single Direction

Aaquib Syed, Andy Arditi, Daniel Paleka, Neel Nanda, Nina Panickssery, Oscar Obeso, Wes Gurnee

Pith reviewed 2026-05-13 10:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords refusal · jailbreak · mechanistic interpretability · language models · safety alignment · residual stream · activation steering

The pith

Refusal in language models is mediated by a single direction in residual stream activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that refusal of harmful instructions is mediated by a one-dimensional subspace in the residual stream activations of chat models. For each of 13 models, a single direction can be found that, when erased, prevents refusal of harmful instructions and, when added, elicits refusal of harmless ones. This insight yields a white-box jailbreak that disables refusal with little effect on other capabilities, and it explains how adversarial suffixes work: they suppress propagation of this direction. The results indicate that safety fine-tuning may be more brittle than expected.

Core claim

Across thirteen open-source chat models up to 72B parameters, refusal is mediated by a one-dimensional subspace. For each model, there exists a single direction such that erasing it from residual stream activations prevents refusal of harmful instructions, while adding it elicits refusal on even harmless instructions. This direction enables a targeted intervention that disables safety mechanisms, and it accounts for how adversarial suffixes work: they suppress propagation of the refusal-mediating direction.
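
To make the claim concrete, here is a minimal sketch of the two interventions in PyTorch, assuming a unit-norm refusal direction `r_hat` has already been located (one common discovery procedure is sketched under the next heading). Function names and shapes are illustrative, not the paper's code.

```python
import torch

def ablate_direction(resid: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Erase the refusal direction: remove its component from every
    residual-stream activation (the paper applies this at all layers
    and token positions). resid: [..., d_model], r_hat: [d_model], unit norm.
    """
    coeff = resid @ r_hat                       # projection coefficient per activation
    return resid - coeff.unsqueeze(-1) * r_hat  # x - (x . r_hat) r_hat

def add_direction(resid: torch.Tensor, r_hat: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add the refusal direction with strength alpha to elicit refusal."""
    return resid + alpha * r_hat
```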

What carries the argument

The refusal-mediating direction: a single vector in the residual stream whose ablation or addition directly controls whether the model refuses a query.
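
A hedged sketch of one standard way such a vector is located in this literature, consistent with the ledger below: take the difference between mean residual-stream activations on harmful and harmless instructions at a candidate layer and token position, normalize it, then rank candidates by how well ablation and addition actually work. `get_resid` is a hypothetical helper returning activations of shape [n_prompts, d_model].

```python
import torch

def candidate_refusal_direction(model, harmful, harmless,
                                layer: int, pos: int = -1) -> torch.Tensor:
    """Difference-in-means candidate direction at one (layer, pos)."""
    mu_harmful = get_resid(model, harmful, layer, pos).mean(dim=0)
    mu_harmless = get_resid(model, harmless, layer, pos).mean(dim=0)
    diff = mu_harmful - mu_harmless
    return diff / diff.norm()  # unit-norm candidate r_hat
```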

If this is right

  • A white-box jailbreak method can surgically disable refusal with minimal effect on other capabilities.
  • Adversarial suffixes work by suppressing propagation of the refusal-mediating direction (a measurement sketch follows this list).
  • Safety fine-tuning methods are brittle because they depend on this single direction.
  • Internal understanding of models enables practical control over specific behaviors like refusal.
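
A hedged sketch of how the suffix-suppression reading could be checked: compare the projection of activations onto the refusal direction with and without an adversarial suffix appended. `get_resid` is the same hypothetical helper as in the discovery sketch above.

```python
import torch

def refusal_projection(model, prompt: str, r_hat: torch.Tensor,
                       layer: int, pos: int = -1) -> float:
    """Scalar projection of one activation onto the refusal direction."""
    resid = get_resid(model, [prompt], layer, pos)[0]  # [d_model]
    return (resid @ r_hat).item()

# Suppression predicts a markedly lower projection once the suffix is added:
#   refusal_projection(model, harmful + suffix, r_hat, layer)
#     << refusal_projection(model, harmful, r_hat, layer)
```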

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Refusal may be a linear feature that can be steered independently of other model capabilities.
  • Similar single directions could exist for other aligned behaviors such as honesty.
  • This approach highlights the potential for more robust safety methods that avoid relying on single linear directions.

Load-bearing premise

The identified direction is causally responsible for refusal behavior rather than a correlated artifact of the identification method.

What would settle it

A test where adding or erasing the direction fails to consistently alter refusal rates on a broad set of new harmful and harmless prompts outside those used for discovery would falsify the claim.
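
A sketch of that settling experiment, under assumed helpers `run_with_intervention` (generation with ablation or addition applied) and `is_refusal` (classifies a completion as a refusal); neither is from the paper.

```python
def refusal_rate(model, prompts, intervention=None) -> float:
    """Fraction of held-out prompts the model refuses, optionally under
    an intervention ("ablate" or "add" the refusal direction)."""
    outputs = [run_with_intervention(model, p, intervention) for p in prompts]
    return sum(is_refusal(o) for o in outputs) / len(prompts)

# Predicted by the claim, on prompts unseen during direction discovery:
#   refusal_rate(model, harmful_heldout, "ablate") << refusal_rate(model, harmful_heldout)
#   refusal_rate(model, harmless_heldout, "add")   >> refusal_rate(model, harmless_heldout)
# Rates that stay flat under both interventions would falsify the claim.
```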

read the original abstract

Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that refusal in conversational LLMs is mediated by a single direction in residual-stream activations. Across 13 open-source chat models up to 72B parameters, the authors identify one direction per model such that ablating it from activations disables refusal on harmful instructions while adding it induces refusal on harmless instructions. They leverage this to build a white-box jailbreak method with minimal impact on other capabilities and mechanistically analyze how adversarial suffixes suppress the direction's propagation.

Significance. If the central claim holds, the result is significant because it supplies direct causal evidence (via both positive and negative interventions) that a key safety behavior is implemented in a one-dimensional subspace, rather than being diffusely distributed. The breadth of models tested and the surgical jailbreak application demonstrate practical utility of mechanistic interpretability. The finding also underscores the fragility of current safety fine-tuning and suggests that low-dimensional control of refusal is feasible.

minor comments (3)
  1. [Abstract] Provide quantitative details on effect sizes for non-refusal capabilities after direction ablation (e.g., performance on standard benchmarks) and a brief description of the direction-discovery procedure.
  2. [Experiments] §4 (or equivalent experiments section): include explicit controls or ablations showing that the identified direction does not degrade general instruction-following or other non-refusal behaviors beyond the reported minimal effect.
  3. Figure captions and legends: ensure all intervention plots clearly distinguish the refusal direction from random or baseline directions and report statistical significance or variance across prompts.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and recommendation of minor revision. The assessment accurately captures our central claim that refusal is mediated by a single direction in residual stream activations, supported by both ablation and addition interventions across 13 models. We appreciate the recognition of the practical implications for white-box jailbreaking and mechanistic analysis of adversarial suffixes.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claim, that refusal behavior is mediated by a single direction in residual stream activations, is established through an empirical search for the direction followed by direct causal interventions (erasing the direction disables refusal on harmful prompts; adding it induces refusal on harmless ones). These interventions provide independent evidence rather than reducing to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain. No load-bearing step equates the result to its inputs by construction; the one-dimensional subspace finding is tested across 13 models and validated by the surgical jailbreak, grounding the derivation in evidence beyond the procedure that located the direction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim rests on the empirical identification of a direction via activation interventions; the direction itself is a fitted entity with no independent evidence outside the experiments.

free parameters (1)
  • refusal direction vector
    The vector is located by searching or optimizing over activations to produce the observed refusal effect.
axioms (1)
  • domain assumption: residual stream activations linearly represent the computation relevant to refusal decisions
    Standard assumption in mechanistic interpretability work on transformer models; a formal sketch of the interventions it licenses follows this list.
invented entities (1)
  • refusal direction (no independent evidence)
    purpose: mediates refusal behavior in response to harmful instructions
    Postulated based on the success of addition and erasure interventions.
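
For concreteness, a formal sketch of the two interventions the linearity axiom licenses, with \(\hat{r}\) the unit refusal direction, \(x\) a residual-stream activation, and \(\alpha\) a steering coefficient (notation ours, not the paper's):

```latex
% Interventions licensed by the linear-representation assumption
\[
  x' = \left(I - \hat{r}\hat{r}^{\top}\right) x
  \qquad \text{directional ablation: erase the refusal component}
\]
\[
  x' = x + \alpha\,\hat{r}
  \qquad \text{activation addition: elicit refusal with strength } \alpha
\]
```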

pith-pipeline@v0.9.0 · 5494 in / 1236 out tokens · 45741 ms · 2026-05-13T10:43:10.524233+00:00 · methodology


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

    cs.LG 2026-05 accept novelty 8.0

    Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.

  2. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  3. Deep Minds and Shallow Probes

    cs.LG 2026-05 unverdicted novelty 7.0

    Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

  4. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  5. Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    Transformer activations show spectral anti-concentration for concepts in the tail while syntax prefers high-variance directions, forming a dual geometry.

  6. Attention Is Where You Attack

    cs.CR 2026-04 unverdicted novelty 7.0

    ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

  7. Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

    cs.LG 2026-04 conditional novelty 7.0

    Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

  8. How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Alignment policy in language models is implemented as an early-commitment routing circuit of detection gates and amplifier heads that can be localized, scaled, and directly controlled without removing the underlying c...

  9. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  10. Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning

    cs.LG 2026-05 conditional novelty 6.0

    Existing LLM unlearning methods fail honesty standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.

  11. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  12. The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.

  13. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  14. Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...

  15. Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

    cs.CY 2026-04 unverdicted novelty 6.0

    Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.

  16. Why Do Large Language Models Generate Harmful Content?

    cs.AI 2026-04 unverdicted novelty 6.0

    Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.

  17. When Verification Fails: How Compositionally Infeasible Claims Escape Rejection

    cs.CL 2026-04 unverdicted novelty 6.0

    AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constr...

  18. An Independent Safety Evaluation of Kimi K2.5

    cs.CR 2026-04 conditional novelty 6.0

    Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

  19. Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    cs.CL 2025-07 unverdicted novelty 6.0

    Persona vectors in LLM activations allow automated monitoring, prediction, and control of character traits such as sycophancy and hallucination, including during finetuning.

  20. When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

    cs.LG 2026-05 unverdicted novelty 5.0

    A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.

  21. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  22. Semantic Structure of Feature Space in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.

  23. ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data

    cs.LG 2026-04 unverdicted novelty 5.0

    ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.

  24. SALLIE: Safeguarding Against Latent Language & Image Exploits

    cs.CR 2026-04 unverdicted novelty 5.0

    SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

  25. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 4.0

    Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.

Reference graph

Works this paper leans on

174 extracted references · 174 canonical work pages · cited by 25 Pith papers · 29 internal anchors
