Don't Lose Focus: Activation Steering via Key-Orthogonal Projections
Pith reviewed 2026-05-08 10:19 UTC · model grok-4.3
The pith
Projecting steering vectors orthogonal to key vectors of focus tokens preserves reasoning while steering LLM behavior.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Steering via Key-Orthogonal Projections (SKOP) constrains the steering vector to lie in the subspace orthogonal to the key projections of focus tokens. This keeps the model's attention pattern on those tokens intact, blocks the harmful rerouting that vanilla steering produces, and still allows redistribution of attention among less critical tokens. Across steering benchmarks the approach cuts utility degradation by a factor of five to seven while keeping more than 95 percent of the original steering strength, and it continues to work in long-context retrieval tasks where standard methods fail.
What carries the argument
Steering via Key-Orthogonal Projections (SKOP) projects the steering vector orthogonal to the key vectors of a small set of focus tokens, so that attention scores on those tokens are preserved while steering efficacy through the remaining tail tokens is retained.
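The geometric operation is simple to sketch. Below is a minimal illustration of the projection step, under the simplifying assumption that the steering vector perturbs the query representation additively; the function name and shapes are ours, not the paper's:

```python
import numpy as np

def key_orthogonal_steering(v, K):
    """Project steering vector v onto the orthogonal complement of the
    span of the focus-token key vectors (rows of K). If v is added on
    the query side, pre-softmax scores q.k on those keys are unchanged.

    v: (d,) steering vector
    K: (m, d) key vectors of the m focus tokens, with m << d
    """
    # Orthonormal basis for the focus-key subspace via reduced QR.
    Q, _ = np.linalg.qr(K.T)          # Q: (d, m), columns span the keys
    return v - Q @ (Q.T @ v)          # remove the in-span component

# Toy check: the projected vector is orthogonal to every focus key,
# and most of the steering norm survives when m << d.
rng = np.random.default_rng(0)
d, m = 64, 4
K = rng.normal(size=(m, d))
v = rng.normal(size=d)
v_skop = key_orthogonal_steering(v, K)
assert np.allclose(K @ v_skop, 0.0, atol=1e-8)
```

Because the focus set is small relative to the model dimension, the removed component is typically tiny, which is consistent with the claim that over 95 percent of steering efficacy is retained.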
If this is right
- SKOP delivers the strongest joint steering-utility trade-off of the methods tested, with 5-7 times less utility degradation.
- More than 95 percent of vanilla steering efficacy is retained.
- In long-context retrieval tasks where standard steering collapses, SKOP maintains robust performance by avoiding attention shifts on key tokens.
- The method works by allowing attention redistribution only among tail tokens while holding focus-token attention fixed.
Where Pith is reading between the lines
- If the focus-token set can be identified automatically from the task, the same projection idea could be applied to other activation edits such as representation surgery or circuit editing.
- The approach suggests that many current steering failures are local attention artifacts rather than global capacity limits, so similar orthogonality constraints might improve safety interventions without broad capability damage.
- Testing whether the same key-orthogonal idea transfers to non-transformer architectures or to multimodal models would show how general the attention-preservation principle is.
Load-bearing premise
Attention rerouting away from a small set of contextually important tokens is the main reason steering degrades utility, and freezing attention only on those tokens is enough to keep reasoning and retrieval intact.
What would settle it
If applying SKOP still produces large drops in attention weight on the chosen focus tokens or fails to reduce utility loss below vanilla steering levels in controlled experiments, the claim that key-orthogonal projection prevents harmful rerouting would be refuted.
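That refutation test is directly measurable on a toy attention row. The sketch below compares attention mass on focus tokens under vanilla versus key-orthogonal steering; all names, shapes, and the additive query perturbation are illustrative assumptions, not the paper's setup. Note the honest subtlety: the projection preserves pre-softmax scores on focus tokens exactly, while normalized attention weights can still shift slightly through softmax renormalization as tail-token scores move.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def focus_attention_mass(q, K, focus_idx):
    """Total attention weight that query q places on the focus tokens."""
    attn = softmax(K @ q / np.sqrt(K.shape[1]))
    return attn[focus_idx].sum()

rng = np.random.default_rng(1)
d, n = 32, 16
K = rng.normal(size=(n, d))          # keys of all n context tokens
q = rng.normal(size=d)               # original query
v = 3.0 * rng.normal(size=d)         # a strong steering vector
focus_idx = [2, 7]                   # tokens reasoning depends on

# Key-orthogonal variant: remove v's component in the focus-key span.
Kf = K[focus_idx]
Q, _ = np.linalg.qr(Kf.T)
v_orth = v - Q @ (Q.T @ v)

base = focus_attention_mass(q, K, focus_idx)
vanilla = focus_attention_mass(q + v, K, focus_idx)
skop = focus_attention_mass(q + v_orth, K, focus_idx)
print(f"baseline {base:.3f}  vanilla {vanilla:.3f}  key-orthogonal {skop:.3f}")
```

If the key-orthogonal column tracked the baseline no better than the vanilla column across real prompts, the load-bearing premise would fail the test described above.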
Figures
Original abstract
Activation steering controls LLM behaviour towards target behaviour by intervening in internal representations, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is attention rerouting: steering vectors alter query-key matching, shifting attention away from contextually important tokens toward less informative ones. To address this, we propose Steering via Key-Orthogonal Projections (SKOP), a steering method that constrains harmful attention rerouting without eliminating steering efficacy. SKOP achieves this by preserving attention patterns on a small set of focus tokens the model relies on for reasoning and retrieval, while allowing redistribution among less critical tail tokens. Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5-7x while retaining over 95% of vanilla steering efficacy. Our results further suggest that, in long-context retrieval settings where vanilla steering approaches are ineffective, SKOP can maintain robust performance by avoiding attention rerouting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Steering via Key-Orthogonal Projections (SKOP) to address the steering-utility trade-off in activation steering of LLMs. It identifies attention rerouting away from contextually important 'focus tokens' as the primary cause of utility degradation in reasoning and retrieval. SKOP constrains projections to preserve attention on these focus tokens while permitting redistribution among tail tokens. The abstract claims SKOP yields the best joint trade-off, reducing utility degradation by 5-7x relative to vanilla steering while retaining >95% of steering efficacy, and maintains performance in long-context retrieval where vanilla methods fail.
Significance. If the mechanistic account and quantitative results hold after proper validation, the work would be significant for practical deployment of activation steering, as it targets a plausible source of capability degradation without fully sacrificing control. The focus on attention preservation in long-context settings is particularly relevant given current limitations of steering methods.
major comments (3)
- [Abstract] Abstract: The manuscript states specific quantitative claims (5-7x utility degradation reduction, >95% retention of steering efficacy, robustness in long-context retrieval) but provides no experimental details, baselines, datasets, statistical tests, or implementation specifics. This prevents assessment of whether the data support the claims.
- [Introduction and Experiments] Introduction and §4 (Experiments): The central claim that attention rerouting is the dominant cause of utility loss, and that SKOP fixes it by preserving attention on focus tokens, lacks direct supporting evidence. No quantitative tracking of attention mass on identified focus tokens under vanilla vs. SKOP steering, nor ablations showing utility recovery when rerouting is blocked by alternative means, are described. Alternative explanations (e.g., reduced effective steering strength or representation changes) therefore cannot be excluded.
- [Method] Method section (SKOP definition): The procedure for identifying the small set of focus tokens and the exact construction of the key-orthogonal projection (including any implicit assumptions about query-key matching) must be specified with sufficient formality to confirm it avoids collateral damage to other attention or representation properties.
minor comments (2)
- [Method] Notation for the projection operator and focus-token selection criterion should be introduced with an explicit equation early in the method section for clarity.
- [Experiments] Figure captions and table headers should explicitly state the steering strength, number of runs, and error bars used for the reported trade-off curves.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications from the manuscript and indicating where revisions will strengthen the presentation.
Point-by-point responses
-
Referee: [Abstract] Abstract: The manuscript states specific quantitative claims (5-7x utility degradation reduction, >95% retention of steering efficacy, robustness in long-context retrieval) but provides no experimental details, baselines, datasets, statistical tests, or implementation specifics. This prevents assessment of whether the data support the claims.
Authors: We agree that the abstract, constrained by length, does not enumerate all experimental details. The reported figures derive from the comprehensive evaluation in §4, which uses standard activation steering benchmarks for reasoning and retrieval, compares against vanilla steering and other baselines, and reports results across multiple datasets and model scales. We will revise the abstract to include a concise reference to the evaluation protocol, key datasets, and baselines. We will also ensure any variance across runs is noted to address statistical considerations. revision: yes
-
Referee: [Introduction and Experiments] Introduction and §4 (Experiments): The central claim that attention rerouting is the dominant cause of utility loss, and that SKOP fixes it by preserving attention on focus tokens, lacks direct supporting evidence. No quantitative tracking of attention mass on identified focus tokens under vanilla vs. SKOP steering, nor ablations showing utility recovery when rerouting is blocked by alternative means, are described. Alternative explanations (e.g., reduced effective steering strength or representation changes) therefore cannot be excluded.
Authors: The manuscript supports the mechanistic account through the design of SKOP and the observed joint improvements in steering efficacy and utility preservation, which are inconsistent with simple reductions in steering strength. Nevertheless, we recognize that explicit quantitative tracking of attention mass on focus tokens would provide stronger direct evidence. In the revision we will add analyses comparing attention distributions on focus tokens under vanilla steering versus SKOP, together with ablations that constrain rerouting through alternative mechanisms to isolate its contribution relative to other factors. revision: yes
-
Referee: [Method] Method section (SKOP definition): The procedure for identifying the small set of focus tokens and the exact construction of the key-orthogonal projection (including any implicit assumptions about query-key matching) must be specified with sufficient formality to confirm it avoids collateral damage to other attention or representation properties.
Authors: The method section defines SKOP via projections orthogonal to the keys of a selected set of focus tokens, thereby preserving their attention weights while permitting redistribution among tail tokens. We will expand this section with a fully formal specification: the mathematical definition of the key-orthogonal projection operator, the precise criteria and algorithm for selecting focus tokens (based on task-relevant attention patterns), and an explicit discussion of the underlying assumptions about query-key dot-product matching. This formalization will also include checks confirming that the intervention does not produce unintended side effects on other attention heads or representation subspaces. revision: yes
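A natural formalization of the promised operator, in our own notation rather than the paper's: stack the keys of the m focus tokens as columns of a matrix and project onto their orthogonal complement.

```latex
% Illustrative notation, not the paper's.
% K_F \in \mathbb{R}^{d \times m}: keys of the m focus tokens, as columns.
P_{K_F^{\perp}} = I_d - K_F \left( K_F^{\top} K_F \right)^{-1} K_F^{\top},
\qquad
\tilde{v} = P_{K_F^{\perp}}\, v .
% For every focus key k_i (a column of K_F):
% \langle \tilde{v}, k_i \rangle = 0, so adding \tilde{v} on the query
% side leaves each pre-softmax score q^{\top} k_i unchanged.
```

Any published specification would also need to state where in the residual stream v is applied and how that location interacts with the query and key projection matrices, which is exactly the query-key matching assumption the referee flags.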
Circularity Check
No circularity in derivation chain
Full rationale
The paper's core argument begins with an empirical observation that activation steering degrades utility, posits attention rerouting as the primary mechanism, and introduces SKOP as a projection-based intervention that preserves focus-token attention patterns by construction of its orthogonal constraint. This is a forward proposal rather than a reduction: the method is defined mathematically to enforce the desired property, then evaluated on benchmarks. No step equates a claimed prediction or first-principles result back to its own fitted inputs, self-citations, or renamed patterns. The 5-7x utility claim is presented as an empirical outcome, not derived tautologically from the steering vector itself. The long-context robustness statement is likewise an observed result, not a definitional consequence.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Attention rerouting is a primary cause of the observed trade-off between steering efficacy and utility in activation steering.