Steering Language Models With Activation Engineering
Recognition: 3 theorem links · Lean Theorem
Pith reviewed 2026-05-11 00:09 UTC · model grok-4.3
The pith
Adding the difference between activations on contrasting prompts into a model's forward pass steers its outputs toward desired sentiments or topics at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Activation Addition computes a steering vector by subtracting the intermediate activations of one prompt from those of a contrasting prompt and then adds a scaled version of this vector to the model's activations at selected layers during the forward pass. This produces reliable shifts in semantic properties of the generated text, such as sentiment polarity or topic focus, without requiring optimization or large datasets. The method achieves top performance on negative-to-positive sentiment transfer and toxicity reduction while preserving accuracy on off-target benchmarks.
What carries the argument
The Activation Addition (ActAdd) technique, which derives a steering vector from the difference in intermediate activations between a pair of contrasting prompts and adds it to the model's forward pass to guide high-level output properties.
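Pith gloss: mechanically, the whole method fits in a forward hook. The sketch below is a minimal rendering under stated assumptions — GPT-2's module layout (model.transformer.h), an illustrative layer and coefficient, and alignment by truncation rather than the paper's space-padding. It is one reading of the technique, not the authors' released implementation.

```python
# Minimal ActAdd-style sketch with PyTorch forward hooks. Module paths follow
# GPT-2's layout (model.transformer.h) and will differ for other architectures;
# LAYER and COEFF are illustrative, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

LAYER, COEFF = 6, 4.0  # illustrative choices

def residual_at(prompt: str, layer: int) -> torch.Tensor:
    """Residual-stream activations for `prompt` after `layer` blocks (1 x seq x d_model)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[layer]

# Steering vector: activation difference on a contrasting prompt pair.
# The paper right-pads the shorter prompt; here we truncate to shared positions.
h_pos, h_neg = residual_at("Love", LAYER), residual_at("Hate", LAYER)
n = min(h_pos.shape[1], h_neg.shape[1])
steer = COEFF * (h_pos[:, :n] - h_neg[:, :n])

def hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] == 1:          # skip cached single-token decode steps
        return output
    k = min(steer.shape[1], hidden.shape[1])
    hidden[:, :k, :] += steer[:, :k, :]  # steer the leading prompt positions
    return output

# Hook the block whose output equals hidden_states[LAYER], generate, clean up.
handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
ids = tok("I went up to my friend and said", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=True,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))
```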
If this is right
- High-level output properties such as sentiment and topic become controllable at inference time.
- Off-target task performance remains intact after steering.
- The approach requires no model optimization and works with a single pair of examples.
- Rapid iteration over different steering directions becomes feasible without retraining.
- State-of-the-art results appear on sentiment shifting and detoxification for models like LLaMA-3 and OPT.
Where Pith is reading between the lines
- High-level concepts may occupy consistent directions in activation space that can be isolated with minimal examples.
- The method could extend to steering for factual accuracy or creative styles once suitable prompt pairs are identified.
- Activation engineering offers a practical route to test hypotheses about how models internally represent abstract traits.
- Combining ActAdd with other inference techniques might enable finer-grained, multi-directional control.
Load-bearing premise
Activation differences extracted from a single prompt pair reliably encode semantic directions that generalize across new contexts without unintended effects on other model capabilities.
What would settle it
Running ActAdd on a held-out model or task and finding that the added vector produces no measurable shift in the target property or causes clear drops in unrelated task scores would falsify the claim of generalizable control.
read the original abstract
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Activation Addition (ActAdd), an inference-time technique that computes a steering vector as the difference in intermediate activations produced by a contrasting prompt pair (e.g., 'Love' versus 'Hate') and adds a scaled version of this vector to the model's activations during the forward pass. The central claim is that this yields state-of-the-art performance on negative-to-positive sentiment transformation and detoxification benchmarks using models including LLaMA-3 and OPT, while preserving performance on off-target tasks, all without any optimization or training and using only a single prompt pair.
Significance. If the empirical results hold under rigorous validation, the work is significant because it demonstrates a lightweight, training-free method for controlling high-level semantic properties of language model outputs via direct manipulation of activations. Strengths include the absence of machine optimization, the use of minimal data (single pairs), and the potential for rapid iteration; these features distinguish it from prompt engineering or fine-tuning and could enable new forms of controllable generation if the steering vectors prove robustly transferable.
major comments (3)
- [Abstract and Experimental Results] The claim of achieving SOTA on sentiment shift and detoxification is presented without details on the baselines compared, the number of evaluation runs, statistical significance tests, error bars, or data exclusion rules. These omissions are load-bearing because the central empirical claim cannot be assessed for robustness or generalizability without them.
- [Method] Method description (steering vector construction): the activation difference is computed from a single prompt pair at a chosen layer with no regularization or multi-pair averaging. This construction does not guarantee that the resulting vector isolates a high-level semantic direction rather than prompt-specific or surface-form artifacts, which directly threatens the claim that the method transfers across arbitrary contexts without off-target effects.
- [Evaluation] Evaluation of off-target preservation: while the paper states that ActAdd preserves performance on unrelated tasks, no quantitative results or specific task suites are referenced to support this. Without such evidence, the claim that high-level steering leaves other capabilities intact remains unverified and is central to the practical utility argument.
minor comments (2)
- [Method] The notation for the steering vector (difference of activations) should be formalized with an equation, including the exact layer index and scaling coefficient, to improve reproducibility; a candidate formalization appears after this list.
- [Introduction] Related work on activation steering (e.g., Subramani et al. 2022) is cited but could be expanded with a brief comparison table of prior techniques versus ActAdd on data requirements and optimization.
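For concreteness, one plausible form of the equation the Method comment requests, in notation of our own choosing rather than the paper's:

```latex
% Candidate formalization (notation ours, not the paper's).
% h_\ell(p): residual-stream activations at layer \ell on prompt p;
% (p_+, p_-): the contrasting prompt pair; c: the scaling coefficient.
\[
  v_\ell = h_\ell(p_+) - h_\ell(p_-), \qquad
  h_\ell'(x) = h_\ell(x) + c\, v_\ell ,
\]
% with v_\ell added at the leading token positions of the user prompt x
% during the forward pass.
```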
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments identify key areas where additional clarity and evidence would strengthen the manuscript's claims. We address each major comment point by point below, indicating where revisions will be made.
read point-by-point responses
-
Referee: [Abstract and Experimental Results] The claim of achieving SOTA on sentiment shift and detoxification is presented without details on the baselines compared, the number of evaluation runs, statistical significance tests, error bars, or data exclusion rules. These omissions are load-bearing because the central empirical claim cannot be assessed for robustness or generalizability without them.
Authors: We agree that the SOTA claim requires more supporting details for proper assessment of robustness. In the revised manuscript, we will update the abstract and Experimental Results section to explicitly name the baselines (including specific prompt engineering and fine-tuning methods from related work), report the number of evaluation runs with error bars or standard deviations, include statistical significance tests where relevant, and clarify data exclusion rules. These additions will be made while preserving the reported performance figures. revision: yes
-
Referee: [Method] Method description (steering vector construction): the activation difference is computed from a single prompt pair at a chosen layer with no regularization or multi-pair averaging. This construction does not guarantee that the resulting vector isolates a high-level semantic direction rather than prompt-specific or surface-form artifacts, which directly threatens the claim that the method transfers across arbitrary contexts without off-target effects.
Authors: The single-pair construction is a core feature of ActAdd, chosen to emphasize its training-free nature and minimal data needs. Contrasting pairs targeting high-level concepts are used, with addition at intermediate layers to focus on semantic rather than surface features; the empirical transfer across contexts in our results supports this. We acknowledge the risk of artifacts. In revision, we will add discussion of this limitation in the Method section and include supplementary results using averaged vectors from multiple pairs to evaluate robustness; a sketch of such averaging follows this response. revision: partial
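Pith note: a minimal sketch of the multi-pair averaging the authors propose, reusing the hypothetical residual_at helper from the sketch above; the pairs and the mean-pooling over token positions are our assumptions, not the paper's procedure.

```python
# Hedged sketch: average difference vectors over several contrast pairs to
# damp prompt-specific artifacts. Reuses residual_at from the earlier sketch.
import torch

PAIRS = [("Love", "Hate"),
         ("I adore this", "I despise this"),
         ("wonderful", "terrible")]  # assumed sentiment contrasts

def averaged_steering_vector(layer: int) -> torch.Tensor:
    diffs = []
    for pos, neg in PAIRS:
        h_p, h_n = residual_at(pos, layer), residual_at(neg, layer)
        m = min(h_p.shape[1], h_n.shape[1])  # align on shared positions
        # Mean-pool over positions so unequal-length pairs still combine
        # into a single direction per pair.
        diffs.append((h_p[:, :m] - h_n[:, :m]).mean(dim=1))
    # 1 x 1 x d_model, shaped to be compatible with the earlier hook.
    return torch.stack(diffs).mean(dim=0).unsqueeze(1)
```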
-
Referee: [Evaluation] Evaluation of off-target preservation: while the paper states that ActAdd preserves performance on unrelated tasks, no quantitative results or specific task suites are referenced to support this. Without such evidence, the claim that high-level steering leaves other capabilities intact remains unverified and is central to the practical utility argument.
Authors: We appreciate the referee highlighting the need for quantitative support here. Although internal checks showed no degradation, the manuscript lacks explicit metrics. In the revised version, we will add quantitative results in the Evaluation section, reporting performance on specific off-target benchmarks (such as subsets of GLUE or general text perplexity) for both steered and baseline models to verify capability preservation; a sketch of one such check follows this response. revision: yes
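Pith note: one concrete shape the promised off-target check could take — perplexity on held-out text with the steering hook on and off. model, tok, hook, and LAYER carry over from the earlier sketch; the probe texts are placeholders, not the paper's evaluation suite.

```python
# Hedged sketch: compare perplexity on off-target text with steering on/off.
import math
import torch

off_target_texts = [
    "The capital of France is Paris.",
    "Photosynthesis converts light energy into chemical energy.",
]  # placeholder off-target probes

def perplexity(texts: list[str]) -> float:
    nll, n_tok = 0.0, 0
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean next-token NLL
        nll += loss.item() * ids.shape[1]
        n_tok += ids.shape[1]
    return math.exp(nll / n_tok)

base = perplexity(off_target_texts)  # steering hook not registered
handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
steered = perplexity(off_target_texts)  # same texts, steering active
handle.remove()
print(f"off-target perplexity: {base:.2f} -> {steered:.2f}")
```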
Circularity Check
No significant circularity; empirical technique with independent experimental validation
full rationale
The paper presents ActAdd as a direct, non-optimized procedure: compute the difference between activations on a single contrasting prompt pair at a chosen layer, then add a scaled version of that vector to the residual stream during inference. This construction does not fit parameters to a target metric and then relabel the fit as a prediction, nor does it define the steering vector in terms of the desired output property. The cited Subramani et al. 2022 reference supplies the contrastive-difference idea but is not used to import a uniqueness theorem or to smuggle an ansatz; the present work simply applies the difference vector and reports measured effects on sentiment and toxicity benchmarks. No self-citation chain bears the central claim, and the reported SOTA numbers are external performance measurements rather than algebraic identities. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
free parameters (1)
- steering vector scaling coefficient
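Since the coefficient is the method's only free parameter, sweeping it is cheap; a hedged sketch reusing names from the earlier sketch (h_pos, h_neg, n, hook, LAYER), with an illustrative grid:

```python
# Hedged sketch: sweep the single free parameter. Rebinding the module-level
# `steer` changes what the hook from the earlier sketch adds; the grid and
# probe prompt are illustrative.
for coeff in (1.0, 2.0, 4.0, 8.0):
    steer = coeff * (h_pos[:, :n] - h_neg[:, :n])
    handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
    ids = tok("I went up to my friend and said", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=True,
                         pad_token_id=tok.eos_token_id)
    handle.remove()
    print(coeff, tok.decode(out[0]))
```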
Lean theorems connected to this paper
-
Foundation.DAlembert.Inevitability.bilinear_family_forced · echoes · "By tactically adding in e.g. the “Love” - “Hate” steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification"
-
Foundation.HierarchyForcing.additive_composition_is_minimal · unclear · "ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points"
Forward citations
Cited by 60 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
-
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
LLMs compute Nash actions internally but suppress them via prosocial overrides from training data, and this can be causally controlled through residual stream interventions.
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
-
The Linear Representation Hypothesis and the Geometry of Large Language Models
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
-
The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
-
Inference Time Causal Probing in LLMs
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
-
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Memory Inception steers LLMs via selective latent KV cache injection at chosen layers, delivering better control-drift balance than prompting or CAA on personality and reasoning tasks while reducing storage needs.
-
DataDignity: Training Data Attribution for Large Language Models
ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
-
Steer Like the LLM: Activation Steering that Mimics Prompting
PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.
-
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.
-
Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
Geometric Unlearning suppresses specific knowledge in LLMs by projecting hidden planning states onto a low-rank safe geometry derived from minimal reference prompts.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
-
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
-
Subliminal Steering: Stronger Encoding of Hidden Signals
Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.
-
Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation
Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
-
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space
Paraphrases of an identity document induce tighter clustering in LLM activation space than matched controls, indicating attractor-like dynamics for agent identity.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.
-
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
GCAD steering extracts prompt-based attention deltas and gates them at token level, cutting coherence drift from -18.6 to -1.9 while raising trait expression at turn 10 from 78 to 93 on multi-turn persona benchmarks.
-
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.
-
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
-
The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection
Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.
-
Belief or Circuitry? Causal Evidence for In-Context Graph Learning
Causal evidence from representation analysis and interventions shows LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.
-
Tool Calling is Linearly Readable and Steerable in Language Models
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
-
Don't Lose Focus: Activation Steering via Key-Orthogonal Projections
SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.
-
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
-
The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models
LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.
-
On the Blessing of Pre-training in Weak-to-Strong Generalization
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
-
Conceptors for Semantic Steering
Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in mu...
-
Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes
Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or ran...
-
Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.
-
Automated Interpretability and Feature Discovery in Language Models with Agents
A multi-agent framework automates mechanistic interpretability in LLMs through coupled loops of hypothesis testing via prompts and feature discovery via activation-space graphs and statistical criteria.
-
Minimizing Collateral Damage in Activation Steering
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
-
Escaping Mode Collapse in LLM Generation via Geometric Regulation
Reinforced Mode Regulation (RMR) uses low-rank damping on the value cache to prevent geometric collapse and mode collapse in autoregressive LLM generation, supporting stable output down to 0.8 nats/step entropy.
-
Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions
LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.
-
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...
-
Contextual Linear Activation Steering of Language Models
CLAS dynamically adapts linear activation steering strengths to context, outperforming fixed-strength steering and matching or exceeding ReFT and LoRA on eleven benchmarks across four model families with limited labeled data.
-
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
Language models recognize dropout and Gaussian noise applied to their activations
Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.
-
Predicting Where Steering Vectors Succeed
The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.
-
Geometric Routing Enables Causal Expert Control in Mixture of Experts
Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.
-
Rhetorical Questions in LLM Representations: A Linear Probing Study
Linear probes show rhetorical questions are encoded via multiple dataset-specific directions in LLM representations, with low cross-probe agreement on the same data.
Reference graph
Works this paper leans on
-
[1]
Understanding intermediate layers using linear classifier probes, 2018
Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2018
work page 2018
-
[2]
TransformerLens: A library for mechanistic interpretability of generative language models
Joseph Bloom and Neel Nanda. TransformerLens: A library for mechanistic interpretability of generative language models. https://neelnanda-io.github.io/TransformerLens/, 2022
work page 2022
-
[3]
Robustness of edited neural networks, 2023
Davis Brown, Charles Godfrey, Cody Nizinski, Jonathan Tu, and Henry Kvinge. Robustness of edited neural networks, 2023
work page 2023
-
[4]
Language models are few-shot learners, 2020
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
work page 2020
-
[5]
Discovering latent knowledge in language models without supervision, 2022
Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2022
work page 2022
-
[6]
Plug and play language models: A simple approach to controlled text generation, 2020
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation, 2020
work page 2020
-
[7]
A mathematical framework for transformer circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021
work page 2021
-
[8]
Toy models of superposition, 2022
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition, 2022
work page 2022
-
[10]
Probability plotting methods for the analysis of data
Ramanathan Gnanadesikan and Martin B Wilk. Probability plotting methods for the analysis of data. Biometrika, 55(1):1--17, 1968
work page 1968
-
[11]
Bias correction of learned generative models using likelihood-free importance weighting, 2019
Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting, 2019
work page 2019
-
[15]
Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models, 2023
Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models, 2023
work page 2023
-
[16]
Inspecting and editing knowledge representations in language models, 2023
Evan Hernandez, Belinda Z. Li, and Jacob Andreas. Inspecting and editing knowledge representations in language models, 2023
work page 2023
-
[17]
Editing models with task arithmetic, 2023
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic, 2023
work page 2023
-
[20]
Language models and cognitive automation for economic research
Anton Korinek. Language models and cognitive automation for economic research. Technical report, National Bureau of Economic Research, 2023
work page 2023
-
[21]
Autoencoding beyond pixels using a learned similarity metric, 2016
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric, 2016
work page 2016
-
[22]
The power of scale for parameter-efficient prompt tuning, 2021
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning, 2021
work page 2021
-
[23]
Delete, retrieve, generate: A simple approach to sentiment and style transfer, 2018
Juncen Li, Robin Jia, He He, and Percy Liang. Delete, retrieve, generate: A simple approach to sentiment and style transfer, 2018. URL https://arxiv.org/abs/1804.06437
-
[24]
Emergent world representations: Exploring a sequence model trained on a synthetic task, 2023a
Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task, 2023a
work page 2023
-
[25]
Inference-time intervention: Eliciting truthful answers from a language model, 2023b
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023b
work page 2023
-
[26]
Prefix-Tuning: Optimizing continuous prompts for generation, 2021
Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing continuous prompts for generation, 2021
work page 2021
-
[27]
In-context Vectors: Making in context learning more effective and controllable through latent space steering, 2023
Sheng Liu, Lei Xing, and James Zou. In-context Vectors: Making in context learning more effective and controllable through latent space steering, 2023
work page 2023
-
[28]
Keeping llms aligned after fine-tuning: The crucial role of prompt templates, 2024
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Keeping llms aligned after fine-tuning: The crucial role of prompt templates, 2024
work page 2024
-
[29]
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142--150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...
work page 2011
-
[30]
Locating and editing factual associations in GPT, 2023
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT, 2023
work page 2023
-
[32]
Are sixteen heads really better than one?
Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/2c601ad9d2ff9bc8b28...
work page 2019
-
[33]
Distributed representations of words and phrases and their compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013 a . URL https://proceedings.neurip...
work page 2013
-
[34]
Linguistic regularities in continuous space word representations
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746--751, 2013b
work page 2013
-
[35]
Understanding and controlling a maze-solving policy network, 2023
Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. Understanding and controlling a maze-solving policy network, 2023. URL https://arxiv.org/abs/2310.08043
-
[36]
Relative representations enable zero-shot latent space communication, 2023
Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication, 2023
work page 2023
-
[37]
Actually, othello-gpt has a linear emergent world representation
Neel Nanda. Actually, othello-gpt has a linear emergent world representation. neelnanda.io/mechanistic-interpretability/othello, 2023
work page 2023
-
[38]
Distributed representations: Composition & superposition
Christopher Olah. Distributed representations: Composition & superposition. https://transformer-circuits.pub/2023/superposition-composition/index.html, 2023
work page 2023
-
[41]
PREADD: prefix-adaptive decoding for controlled text generation
Jonathan Pei, Kevin Yang, and Dan Klein. PREADD: prefix-adaptive decoding for controlled text generation. arXiv preprint arXiv:2307.03214, 2023
-
[42]
Joshua Peterson, Stephan Meylan, and David Bourgin. Openwebtext. https://github.com/jcpeterson/openwebtext, 2018
work page 2018
-
[43]
F. Petroni, T. Rocktäschel, A. H. Miller, P. Lewis, A. Bakhtin, Y. Wu, and S. Riedel. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
work page 2019
-
[45]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
work page 2019
-
[46]
Sequence level training with recurrent neural networks, 2016
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks, 2016
work page 2016
-
[49]
The irrelevance of turing machines to artificial intelligence
Aaron Sloman. The irrelevance of turing machines to artificial intelligence. In Matthias Scheutz (ed.), Computationalism: New Directions. MIT Press, 2002
work page 2002
-
[50]
Jan Strunk. nltk.tokenize.punkt module. https://www.nltk.org/api/nltk.tokenize.punkt.html, 2013
work page 2013
-
[52]
LLaMA: Open and efficient foundation language models, 2023
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023
work page 2023
-
[53]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:...
work page 2017
-
[54]
GPT-J-6B: 6B JAX-based transformer
Ben Wang and Aran Komatsuzaki. GPT-J-6B: 6B JAX-based transformer. https://github.com/kingoflolz/mesh-transformer-jax#gpt-j-6b, 2021
work page 2021
-
[55]
Prompt engineering in consistency and reliability with the evidence-based guideline for llms
Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, and Jian Li. Prompt engineering in consistency and reliability with the evidence-based guideline for llms. npj Digital Medicine, 7(1):41, 2024
work page 2024
-
[56]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824--24837, 2022
work page 2022
-
[58]
Eva-KELLM: A new benchmark for evaluating knowledge editing of LLMs, 2023
Suhang Wu, Minlong Peng, Yue Chen, Jinsong Su, and Mingming Sun. Eva-KELLM: A new benchmark for evaluating knowledge editing of LLMs, 2023
work page 2023
-
[60]
The unreliability of explanations in few-shot prompting for textual reasoning
Xi Ye and Greg Durrett. The unreliability of explanations in few-shot prompting for textual reasoning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 30378--30392. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/c4...
work page 2022
-
[62]
A comprehensive study of knowledge editing for large language models, 2024
Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. A comprehensive study of knowledge editing for large language models, 2024
work page 2024
-
[63]
OPT: Open pre-trained transformer language models, 2022b
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022b
work page 2022
-
[64]
Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation
Tianqi Zhong, Quan Wang, Jingxuan Han, Yongdong Zhang, and Zhendong Mao. Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation. arXiv preprint arXiv:2310.14892, 2023
-
[65]
Steering large language models using APE
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Steering large language models using APE. In NeurIPS ML Safety Workshop, 2022. URL https://openreview.net/forum?id=JjvNzMOiBEp
work page 2022
-
[66]
Fine-tuning language models from human preferences, 2019
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2019
work page 2019
-
[67]
Representation engineering: A top-down approach to AI transparency, 2023
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to a...
work page 2023
-
[68]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018
work page 2018
-
[69]
Chain-of-thought prompting elicits reasoning in large language models
Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 2022
-
[70]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv preprint arXiv:2109.07958, 2021
work page 2021
-
[71]
Understanding and Controlling a Maze-Solving Policy Network
Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. Understanding and Controlling a Maze-Solving Policy Network, 2023
work page 2023
-
[72]
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-Tuning Language Models from Human Preferences, 2019
work page 2019
-
[73]
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! arXiv preprint arXiv:2310.03693, 2023
work page 2023
-
[74]
Fudge: Controlled text generation with future discriminators
Kevin Yang and Dan Klein. FUDGE: Controlled Text Generation With Future Discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021. doi:10.18653/v1/2021.naacl-main.276
-
[75]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023
work page 2023
-
[76]
Deep Contextualized Word Representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018. doi:10...
-
[78]
Generating Wikipedia by summarizing long sequences
Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018
-
[79]
A mathematical framework for transformer circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021
-
[80]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
-
[81]
Controllable text generation via probability density estimation in the latent space
Controllable text generation via probability density estimation in the latent space. arXiv preprint arXiv:2212.08307, 2022
-
[82]
Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation
Tianqi Zhong, Quan Wang, Jingxuan Han, Yongdong Zhang, and Zhendong Mao. Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation. arXiv preprint arXiv:2310.14892, 2023
-
[83]
PREADD: prefix-adaptive decoding for controlled text generation
Jonathan Pei, Kevin Yang, and Dan Klein. PREADD: prefix-adaptive decoding for controlled text generation. arXiv preprint arXiv:2307.03214, 2023
-
[84]
Learning word vectors for sentiment analysis
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 2011
work page 2011
-
[85]
In-context Vectors: Making in context learning more effective and controllable through latent space steering
Sheng Liu, Lei Xing, and James Zou. In-context Vectors: Making in context learning more effective and controllable through latent space steering. arXiv:2311.06668, 2023
-
[86]
Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan, Sarah Chen, et al. Representation Engineering: A Top-Down Approach to AI Transparency, 2023
work page 2023
-
[87]
More than a Feeling: Accuracy and Application of Sentiment Analysis
Jochen Hartmann, Mark Heitmann, Christian Siebert, and Christina Schamp. More than a Feeling: Accuracy and Application of Sentiment Analysis. 2023. doi:10.1016/j.ijresmar.2022.05.005
-
[89]
Inspecting and Editing Knowledge Representations in Language Models
Evan Hernandez, Belinda Z. Li, and Jacob Andreas. Inspecting and Editing Knowledge Representations in Language Models, 2023
work page 2023
-
[90]
Locating and editing factual associations in GPT
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. arXiv:2202.05262, 2023
-
[92]
Language models implement simple word2vec-style vector arithmetic
Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language Models Implement Simple Word2Vec-style Vector Arithmetic. arXiv:2305.16130, 2023
-
[93]
Linguistic regularities in continuous space word representations
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746--751, 2013
work page 2013
-
[94]
Hafez: an Interactive Poetry Generation System
Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: an Interactive Poetry Generation System. In Proceedings of ACL 2017, System Demonstrations, 2017
work page 2017