pith. machine review for the scientific record.

arxiv: 2406.11717 · v3 · submitted 2024-06-17 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Refusal in Language Models Is Mediated by a Single Direction

Aaquib Syed, Andy Arditi, Daniel Paleka, Neel Nanda, Nina Panickssery, Oscar Obeso, Wes Gurnee

Pith reviewed 2026-05-13 10:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords refusal · jailbreak · mechanistic interpretability · language models · safety alignment · residual stream · activation steering

The pith

Refusal in language models is mediated by a single direction in residual stream activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that refusal of harmful instructions is mediated by a one-dimensional subspace in the residual stream activations of chat models. For each of 13 models, a single direction can be found that, when erased, prevents refusal of harmful instructions and, when added, elicits refusal of harmless ones. This insight yields a white-box jailbreak that disables refusal with little effect on other capabilities, and it explains how adversarial suffixes work: they suppress propagation of this direction. The results indicate that safety fine-tuning may be more brittle than expected.

Core claim

Across thirteen open-source chat models up to 72B parameters, refusal is mediated by a one-dimensional subspace. For each model, there exists a single direction such that erasing it from residual stream activations prevents refusal of harmful instructions, while adding it elicits refusal on even harmless instructions. This direction enables a targeted intervention that disables safety mechanisms, and it accounts for how adversarial suffixes work: they suppress propagation of the refusal-mediating direction.
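
To make the claim concrete, here is a minimal sketch of the two interventions in PyTorch, assuming a unit-norm refusal direction `r_hat` has already been located (one common discovery procedure is sketched under the next heading). Function names and shapes are illustrative, not the paper's code.

```python
import torch

def ablate_direction(resid: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Erase the refusal direction: remove its component from every
    residual-stream activation (the paper applies this at all layers
    and token positions). resid: [..., d_model], r_hat: [d_model], unit norm.
    """
    coeff = resid @ r_hat                       # projection coefficient per activation
    return resid - coeff.unsqueeze(-1) * r_hat  # x - (x . r_hat) r_hat

def add_direction(resid: torch.Tensor, r_hat: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add the refusal direction with strength alpha to elicit refusal."""
    return resid + alpha * r_hat
```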

What carries the argument

The refusal-mediating direction: a single vector in the residual stream whose ablation or addition directly controls whether the model refuses a query.
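
A hedged sketch of one standard way such a vector is located in this literature, consistent with the ledger below: take the difference between mean residual-stream activations on harmful and harmless instructions at a candidate layer and token position, normalize it, then rank candidates by how well ablation and addition actually work. `get_resid` is a hypothetical helper returning activations of shape [n_prompts, d_model].

```python
import torch

def candidate_refusal_direction(model, harmful, harmless,
                                layer: int, pos: int = -1) -> torch.Tensor:
    """Difference-in-means candidate direction at one (layer, pos)."""
    mu_harmful = get_resid(model, harmful, layer, pos).mean(dim=0)
    mu_harmless = get_resid(model, harmless, layer, pos).mean(dim=0)
    diff = mu_harmful - mu_harmless
    return diff / diff.norm()  # unit-norm candidate r_hat
```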

If this is right

  • A white-box jailbreak method can surgically disable refusal with minimal effect on other capabilities.
  • Adversarial suffixes work by suppressing propagation of the refusal-mediating direction (a measurement sketch follows this list).
  • Safety fine-tuning methods are brittle because they depend on this single direction.
  • Internal understanding of models enables practical control over specific behaviors like refusal.
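
A hedged sketch of how the suffix-suppression reading could be checked: compare the projection of activations onto the refusal direction with and without an adversarial suffix appended. `get_resid` is the same hypothetical helper as in the discovery sketch above.

```python
import torch

def refusal_projection(model, prompt: str, r_hat: torch.Tensor,
                       layer: int, pos: int = -1) -> float:
    """Scalar projection of one activation onto the refusal direction."""
    resid = get_resid(model, [prompt], layer, pos)[0]  # [d_model]
    return (resid @ r_hat).item()

# Suppression predicts a markedly lower projection once the suffix is added:
#   refusal_projection(model, harmful + suffix, r_hat, layer)
#     << refusal_projection(model, harmful, r_hat, layer)
```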

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Refusal may be a linear feature that can be steered independently of other model capabilities.
  • Similar single directions could exist for other aligned behaviors such as honesty.
  • This approach highlights the potential for more robust safety methods that avoid relying on single linear directions.

Load-bearing premise

The identified direction is causally responsible for refusal behavior rather than a correlated artifact of the identification method.

What would settle it

A test where adding or erasing the direction fails to consistently alter refusal rates on a broad set of new harmful and harmless prompts outside those used for discovery would falsify the claim.
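
A sketch of that settling experiment, under assumed helpers `run_with_intervention` (generation with ablation or addition applied) and `is_refusal` (classifies a completion as a refusal); neither is from the paper.

```python
def refusal_rate(model, prompts, intervention=None) -> float:
    """Fraction of held-out prompts the model refuses, optionally under
    an intervention ("ablate" or "add" the refusal direction)."""
    outputs = [run_with_intervention(model, p, intervention) for p in prompts]
    return sum(is_refusal(o) for o in outputs) / len(prompts)

# Predicted by the claim, on prompts unseen during direction discovery:
#   refusal_rate(model, harmful_heldout, "ablate") << refusal_rate(model, harmful_heldout)
#   refusal_rate(model, harmless_heldout, "add")   >> refusal_rate(model, harmless_heldout)
# Rates that stay flat under both interventions would falsify the claim.
```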

read the original abstract

Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that refusal in conversational LLMs is mediated by a single direction in residual-stream activations. Across 13 open-source chat models up to 72B parameters, the authors identify one direction per model such that ablating it from activations disables refusal on harmful instructions while adding it induces refusal on harmless instructions. They leverage this to build a white-box jailbreak method with minimal impact on other capabilities and mechanistically analyze how adversarial suffixes suppress the direction's propagation.

Significance. If the central claim holds, the result is significant because it supplies direct causal evidence (via both positive and negative interventions) that a key safety behavior is implemented in a one-dimensional subspace, rather than being diffusely distributed. The breadth of models tested and the surgical jailbreak application demonstrate practical utility of mechanistic interpretability. The finding also underscores the fragility of current safety fine-tuning and suggests that low-dimensional control of refusal is feasible.

minor comments (3)
  1. [Abstract] Provide quantitative details on effect sizes for non-refusal capabilities after direction ablation (e.g., performance on standard benchmarks) and a brief description of the direction-discovery procedure.
  2. [Experiments] §4 (or equivalent experiments section): include explicit controls or ablations showing that the identified direction does not degrade general instruction-following or other non-refusal behaviors beyond the reported minimal effect.
  3. Figure captions and legends: ensure all intervention plots clearly distinguish the refusal direction from random or baseline directions and report statistical significance or variance across prompts.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and recommendation of minor revision. The assessment accurately captures our central claim that refusal is mediated by a single direction in residual stream activations, supported by both ablation and addition interventions across 13 models. We appreciate the recognition of the practical implications for white-box jailbreaking and mechanistic analysis of adversarial suffixes.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claim, that refusal behavior is mediated by a single direction in residual stream activations, is established through an empirical search for the direction followed by direct causal interventions (erasing the direction disables refusal on harmful prompts; adding it induces refusal on harmless ones). These interventions provide independent evidence rather than reducing to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain. No load-bearing step equates the result to its inputs by construction; the one-dimensional subspace finding is tested across 13 models and validated by the surgical jailbreak, grounding the derivation in evidence beyond the procedure that located the direction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim rests on the empirical identification of a direction via activation interventions; the direction itself is a fitted entity with no independent evidence outside the experiments.

free parameters (1)
  • refusal direction vector
    The vector is located by searching or optimizing over activations to produce the observed refusal effect.
axioms (1)
  • domain assumption: residual stream activations linearly represent the computation relevant to refusal decisions
    Standard assumption in mechanistic interpretability work on transformer models; a formal sketch of the interventions it licenses follows this list.
invented entities (1)
  • refusal direction (no independent evidence)
    purpose: mediates refusal behavior in response to harmful instructions
    Postulated based on the success of addition and erasure interventions.
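
For concreteness, a formal sketch of the two interventions the linearity axiom licenses, with \(\hat{r}\) the unit refusal direction, \(x\) a residual-stream activation, and \(\alpha\) a steering coefficient (notation ours, not the paper's):

```latex
% Interventions licensed by the linear-representation assumption
\[
  x' = \left(I - \hat{r}\hat{r}^{\top}\right) x
  \qquad \text{directional ablation: erase the refusal component}
\]
\[
  x' = x + \alpha\,\hat{r}
  \qquad \text{activation addition: elicit refusal with strength } \alpha
\]
```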

pith-pipeline@v0.9.0 · 5494 in / 1236 out tokens · 45741 ms · 2026-05-13T10:43:10.524233+00:00 · methodology


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

    cs.LG 2026-05 accept novelty 8.0

    Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.

  2. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  3. Deep Minds and Shallow Probes

    cs.LG 2026-05 unverdicted novelty 7.0

    Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

  4. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  5. Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    Transformer activations show spectral anti-concentration for concepts in the tail while syntax prefers high-variance directions, forming a dual geometry.

  6. Attention Is Where You Attack

    cs.CR 2026-04 unverdicted novelty 7.0

    ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

  7. Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

    cs.LG 2026-04 conditional novelty 7.0

    Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

  8. How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Alignment policy in language models is implemented as an early-commitment routing circuit of detection gates and amplifier heads that can be localized, scaled, and directly controlled without removing the underlying c...

  9. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  10. Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning

    cs.LG 2026-05 conditional novelty 6.0

    Existing LLM unlearning methods fail honesty standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.

  11. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  12. The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.

  13. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  14. Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...

  15. Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

    cs.CY 2026-04 unverdicted novelty 6.0

    Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.

  16. Why Do Large Language Models Generate Harmful Content?

    cs.AI 2026-04 unverdicted novelty 6.0

    Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.

  17. When Verification Fails: How Compositionally Infeasible Claims Escape Rejection

    cs.CL 2026-04 unverdicted novelty 6.0

    AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constr...

  18. An Independent Safety Evaluation of Kimi K2.5

    cs.CR 2026-04 conditional novelty 6.0

    Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

  19. Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    cs.CL 2025-07 unverdicted novelty 6.0

    Persona vectors in LLM activations allow automated monitoring, prediction, and control of character traits such as sycophancy and hallucination, including during finetuning.

  20. When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

    cs.LG 2026-05 unverdicted novelty 5.0

    A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.

  21. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  22. Semantic Structure of Feature Space in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.

  23. ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data

    cs.LG 2026-04 unverdicted novelty 5.0

    ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.

  24. SALLIE: Safeguarding Against Latent Language & Image Exploits

    cs.CR 2026-04 unverdicted novelty 5.0

    SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

  25. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 4.0

    Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.

Reference graph

Works this paper leans on

174 extracted references · 174 canonical work pages · cited by 25 Pith papers · 29 internal anchors
