Recognition: 2 theorem links · Lean theorem
Refusal in Language Models Is Mediated by a Single Direction
Pith reviewed 2026-05-13 10:43 UTC · model grok-4.3
The pith
Refusal in language models is mediated by a single direction in residual stream activations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across thirteen open-source chat models up to 72B parameters, refusal is mediated by a one-dimensional subspace. For each model there exists a single direction such that erasing it from residual stream activations prevents the model from refusing harmful instructions, while adding it elicits refusal even on harmless instructions. This direction enables a targeted intervention that disables safety mechanisms, and it accounts for how adversarial suffixes work: they suppress propagation of the refusal-mediating direction.
What carries the argument
The refusal-mediating direction: a single vector in the residual stream whose ablation or addition directly controls whether the model refuses a query.
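To make the intervention concrete, here is a minimal NumPy sketch. It assumes the direction is found as a difference of mean activations between harmful and harmless prompts at a chosen layer and token position, the usual recipe for results of this kind; the function names are illustrative and not taken from the paper's code.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Candidate direction: difference of mean residual-stream activations.

    Both inputs have shape (n_prompts, d_model), taken at one layer and one
    token position. Returns a unit vector r_hat.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(x: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Erase the direction: remove the r_hat component from activations x (..., d_model)."""
    coeff = np.asarray(x @ r_hat)[..., None]     # scalar projection onto r_hat
    return x - coeff * r_hat

def add_direction(x: np.ndarray, r_hat: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the direction with strength alpha, pushing activations toward refusal."""
    return x + alpha * r_hat
```

In an actual run, ablation would be applied to the residual stream at every layer and token position via forward hooks, while addition is typically applied at the single layer where the direction was extracted; both are one-line edits to the activations.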
If this is right
- A white-box jailbreak method can surgically disable refusal with minimal effect on other capabilities (one concrete form of such an edit is sketched after this list).
- Adversarial suffixes work by suppressing propagation of the refusal-mediating direction.
- Safety fine-tuning methods are brittle because they depend on this single direction.
- Internal understanding of models enables practical control over specific behaviors like refusal.
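One plausible reading of "surgically disable refusal" in the first bullet above is a permanent weight edit rather than a runtime hook: every matrix that writes into the residual stream is orthogonalized against the refusal direction, so the model can no longer emit that component. The sketch below works under that assumption; which matrices get edited is an implementation choice, not something stated in the excerpted abstract.

```python
import numpy as np

def orthogonalize_writer(W_out: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of a matrix that writes into the residual stream.

    W_out has shape (d_model, d_in): each column is a vector added to the residual
    stream, so we remove the r_hat component from every column.
    """
    return W_out - np.outer(r_hat, r_hat @ W_out)

# Hypothetical usage: apply to the embedding matrix plus every attention-output and
# MLP-output projection, then save the edited weights as an ordinary checkpoint.
```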
Where Pith is reading between the lines
- Refusal may be a linear feature that can be steered independently of other model capabilities.
- Similar single directions could exist for other aligned behaviors such as honesty.
- The brittleness finding motivates safety methods that do not hinge on a single linear direction.
Load-bearing premise
The identified direction is causally responsible for refusal behavior rather than a correlated artifact of the identification method.
What would settle it
The claim would be falsified if adding or erasing the direction failed to consistently shift refusal rates on a broad set of new harmful and harmless prompts held out from the discovery procedure.
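A concrete shape for that test, as a hypothetical harness: hold out harmful and harmless prompts that played no role in finding the direction, generate with and without the intervention, and compare refusal rates. The generation callables and the substring-based refusal heuristic below are placeholders for whatever machinery one actually uses.

```python
# Crude refusal detector; a judge model or human labels would be stronger.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't", "As an AI")

def refusal_rate(prompts, generate):
    """Fraction of prompts whose completion looks like a refusal."""
    completions = [generate(p) for p in prompts]
    return sum(any(m in c for m in REFUSAL_MARKERS) for c in completions) / len(completions)

def falsification_check(harmful_held_out, harmless_held_out,
                        generate_plain, generate_ablated, generate_added):
    """The claim predicts large, consistent gaps; small or erratic gaps would undercut it."""
    return {
        "harmful/baseline":    refusal_rate(harmful_held_out, generate_plain),
        "harmful/ablated":     refusal_rate(harmful_held_out, generate_ablated),   # should fall sharply
        "harmless/baseline":   refusal_rate(harmless_held_out, generate_plain),
        "harmless/+direction": refusal_rate(harmless_held_out, generate_added),    # should rise sharply
    }
```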
Original abstract
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that refusal in conversational LLMs is mediated by a single direction in residual-stream activations. Across 13 open-source chat models up to 72B parameters, the authors identify one direction per model such that ablating it from activations disables refusal on harmful instructions while adding it induces refusal on harmless instructions. They apply this to a white-box jailbreak method with minimal impact on other capabilities and analyze how adversarial suffixes suppress the direction's propagation.
Significance. If the central claim holds, the result is significant because it supplies direct causal evidence (via both positive and negative interventions) that a key safety behavior is implemented in a one-dimensional subspace, rather than being diffusely distributed. The breadth of models tested and the surgical jailbreak application demonstrate practical utility of mechanistic interpretability. The finding also underscores the fragility of current safety fine-tuning and suggests that low-dimensional control of refusal is feasible.
minor comments (3)
- [Abstract] Provide quantitative details on effect sizes for non-refusal capabilities after direction ablation (e.g., performance on standard benchmarks) and a brief description of the direction-discovery procedure.
- [Experiments] §4 (or equivalent experiments section): include explicit controls or ablations showing that the identified direction does not degrade general instruction-following or other non-refusal behaviors beyond the reported minimal effect.
- Figure captions and legends: ensure all intervention plots clearly distinguish the refusal direction from random or baseline directions and report statistical significance or variance across prompts.
Simulated Author's Rebuttal
We thank the referee for their positive summary and recommendation of minor revision. The assessment accurately captures our central claim that refusal is mediated by a single direction in residual stream activations, supported by both ablation and addition interventions across 13 models. We appreciate the recognition of the practical implications for white-box jailbreaking and mechanistic analysis of adversarial suffixes.
Circularity Check
No significant circularity
full rationale
The paper's central claim, that refusal behavior is mediated by a single direction in residual stream activations, is established through an empirical search for the direction followed by direct causal interventions (erasing the direction disables refusal on harmful prompts; adding it induces refusal on harmless ones). These interventions provide independent evidence rather than reducing to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain. No load-bearing step equates the result with its inputs by construction: the one-dimensional subspace finding is tested across 13 models, and the surgical jailbreak is validated against external benchmarks rather than against the data used to find the direction.
Axiom & Free-Parameter Ledger
free parameters (1)
- refusal direction vector
axioms (1)
- Domain assumption: residual stream activations linearly represent the computation relevant to refusal decisions (made precise in the sketch below)
invented entities (1)
- refusal direction (no independent evidence)
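One way to make the ledger's domain assumption precise (our gloss of the linear-representation reading, not notation from the paper): refusal propensity is carried by the scalar projection of a residual-stream activation onto the unit direction, so the two interventions act directly on that scalar.

```latex
% Gloss of the linear-representation assumption; sigma is any monotone readout.
\[
\Pr(\text{refuse} \mid x) \approx \sigma\!\left(\hat r^{\top} x + b\right),
\qquad
x_{\text{ablate}} = x - (\hat r^{\top} x)\,\hat r ,
\qquad
x_{\text{add}} = x + \alpha\,\hat r .
\]
% Ablation drives the projection to zero (refusal signal removed);
% addition raises it by alpha (refusal elicited even on harmless prompts).
```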
Forward citations
Cited by 25 Pith papers
- Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features. Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.
- Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens. Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
- Deep Minds and Shallow Probes. Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
- Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic. Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
- Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations. Transformer activations show spectral anti-concentration for concepts in the tail while syntax prefers high-variance directions, forming a dual geometry.
- Attention Is Where You Attack. ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control. Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models. Alignment policy in language models is implemented as an early-commitment routing circuit of detection gates and amplifier heads that can be localized, scaled, and directly controlled without removing the underlying c...
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space. LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
- Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning. Existing LLM unlearning methods fail honesty standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.
- Tool Calling is Linearly Readable and Steerable in Language Models. Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
- The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models. LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.
- TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning. TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
- Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs. Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...
- Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles. Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.
- Why Do Large Language Models Generate Harmful Content? Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.
- When Verification Fails: How Compositionally Infeasible Claims Escape Rejection. AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constr...
- An Independent Safety Evaluation of Kimi K2.5. Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models. Persona vectors in LLM activations allow automated monitoring, prediction, and control of character traits such as sycophancy and hallucination, including during finetuning.
- When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels. A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
- Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes. Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
- Semantic Structure of Feature Space in Large Language Models. LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.
- ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data. ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.
- SALLIE: Safeguarding Against Latent Language & Image Exploits. SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
- Positive Alignment: Artificial Intelligence for Human Flourishing. Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.