Steering Language Models With Activation Engineering
Pith reviewed 2026-05-11 00:09 UTC · model grok-4.3
The pith
Adding differences in activations between contrasting prompts steers language model outputs toward desired sentiments or topics at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Activation Addition computes a steering vector by subtracting the intermediate activations of one prompt from those of a contrasting prompt and then adds a scaled version of this vector to the model's activations at selected layers during the forward pass. This produces reliable shifts in semantic properties of the generated text, such as sentiment polarity or topic focus, without requiring optimization or large datasets. The paper reports state-of-the-art results on negative-to-positive sentiment transfer and toxicity reduction, with models including LLaMA-3 and OPT, while preserving accuracy on off-target benchmarks.
What carries the argument
The Activation Addition (ActAdd) technique, which derives a steering vector from the difference in intermediate activations between a pair of contrasting prompts and adds it to the model's forward pass to guide high-level output properties.
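The mechanism described above can be illustrated with a minimal sketch. It does not use the paper's code or a real language model: the "model" is a hypothetical stack of random residual blocks, prompts are pooled into a single vector, and the token ids, layer index, and coefficient are all made-up values chosen only to show the difference-then-add pattern.

```python
import math
import random

random.seed(0)
D_MODEL, N_LAYERS, VOCAB = 8, 4, 50  # toy sizes, chosen arbitrarily

# Hypothetical stand-in for a language model: random embeddings and residual blocks.
embed = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
weights = [
    [[random.gauss(0, 0.3) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
    for _ in range(N_LAYERS)
]


def block(w, h):
    """One toy residual block: h + tanh(W h)."""
    return [
        h_i + math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)))
        for h_i, row in zip(h, w)
    ]


def forward(token_ids, steer=None, layer=None, coeff=1.0):
    """Return the activation entering each block, plus the final activation.

    If `steer` is given, `coeff * steer` is added to the activation
    entering block `layer` before that block runs.
    """
    # Crude pooled prompt representation: mean of token embeddings (a simplification).
    h = [sum(embed[t][d] for t in token_ids) / len(token_ids) for d in range(D_MODEL)]
    acts = []
    for i, w in enumerate(weights):
        if steer is not None and i == layer:
            h = [h_d + coeff * s_d for h_d, s_d in zip(h, steer)]  # ActAdd-style step
        acts.append(list(h))
        h = block(w, h)
    acts.append(list(h))
    return acts


def steering_vector(pos_ids, neg_ids, layer):
    """Difference of activations at `layer` for a contrasting prompt pair."""
    a_pos = forward(pos_ids)[layer]
    a_neg = forward(neg_ids)[layer]
    return [p - n for p, n in zip(a_pos, a_neg)]


LAYER, COEFF = 2, 4.0            # illustrative layer index and scaling coefficient
love_ids, hate_ids = [3], [7]    # made-up token ids standing in for "Love" / "Hate"
vec = steering_vector(love_ids, hate_ids, LAYER)

prompt_ids = [11, 12, 13]        # made-up ids for a new prompt
base = forward(prompt_ids)
steered = forward(prompt_ids, steer=vec, layer=LAYER, coeff=COEFF)
```

In this sketch the only free choices are the injection layer and the coefficient, which matches the single free parameter the page lists for the method plus the layer selection it mentions. With a real transformer the same pattern would typically be applied through a forward hook on the residual stream, and the choice of token positions would also matter.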
If this is right
- High-level output properties such as sentiment and topic become controllable at inference time.
- Off-target task performance remains intact after steering.
- The approach requires no model optimization and works with a single pair of examples.
- Rapid iteration over different steering directions becomes feasible without retraining.
- State-of-the-art results appear on sentiment shifting and detoxification for models like LLaMA-3 and OPT.
Where Pith is reading between the lines
- High-level concepts may occupy consistent directions in activation space that can be isolated with minimal examples.
- The method could extend to steering for factual accuracy or creative styles once suitable prompt pairs are identified.
- Activation engineering offers a practical route to test hypotheses about how models internally represent abstract traits.
- Combining ActAdd with other inference techniques might enable finer-grained, multi-directional control.
Load-bearing premise
Activation differences extracted from a single prompt pair reliably encode semantic directions that generalize across new contexts without unintended effects on other model capabilities.
What would settle it
Running ActAdd on a held-out model or task and finding that the added vector produces no measurable shift in the target property or causes clear drops in unrelated task scores would falsify the claim of generalizable control.
Original abstract
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Activation Addition (ActAdd), an inference-time technique that computes a steering vector as the difference in intermediate activations produced by a contrasting prompt pair (e.g., 'Love' versus 'Hate') and adds a scaled version of this vector to the model's activations during the forward pass. The central claim is that this yields state-of-the-art performance on negative-to-positive sentiment transformation and detoxification benchmarks using models including LLaMA-3 and OPT, while preserving performance on off-target tasks, all without any optimization or training and using only a single prompt pair.
Significance. If the empirical results hold under rigorous validation, the work is significant because it demonstrates a lightweight, training-free method for controlling high-level semantic properties of language model outputs via direct manipulation of activations. Strengths include the absence of machine optimization, the use of minimal data (single pairs), and the potential for rapid iteration; these features distinguish it from prompt engineering or fine-tuning and could enable new forms of controllable generation if the steering vectors prove robustly transferable.
major comments (3)
- [Abstract and Experimental Results] Abstract and Experimental Results section: the claim of achieving SOTA on sentiment shift and detoxification is presented without details on the baselines compared, the number of evaluation runs, statistical significance tests, error bars, or data exclusion rules. These omissions are load-bearing because the central empirical claim cannot be assessed for robustness or generalizability without them.
- [Method] Method description (steering vector construction): the activation difference is computed from a single prompt pair at a chosen layer with no regularization or multi-pair averaging. This construction does not guarantee that the resulting vector isolates a high-level semantic direction rather than prompt-specific or surface-form artifacts, which directly threatens the claim that the method transfers across arbitrary contexts without off-target effects.
- [Evaluation] Evaluation of off-target preservation: while the paper states that ActAdd preserves performance on unrelated tasks, no quantitative results or specific task suites are referenced to support this. Without such evidence, the claim that high-level steering leaves other capabilities intact remains unverified and is central to the practical utility argument.
minor comments (2)
- [Method] The notation for the steering vector (difference of activations) should be formalized with an equation, including the exact layer index and scaling coefficient, to improve reproducibility.
- [Introduction] Related work on activation steering (e.g., Subramani et al. 2022) is cited but could be expanded with a brief comparison table of prior techniques versus ActAdd on data requirements and optimization.
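One way the formalization requested in the first minor comment might look is sketched below. The notation is an assumption made here for illustration, not the paper's: $h^{(l)}(p)$ denotes the intermediate activation at layer $l$ for prompt $p$, $p_+$ and $p_-$ are the contrasting prompts, $c$ is the scaling coefficient, and $x$ is the prompt being steered. Token-position alignment is left out.

```latex
\[
  v_l = h^{(l)}(p_+) - h^{(l)}(p_-),
  \qquad
  \tilde{h}^{(l)}(x) = h^{(l)}(x) + c \, v_l
\]
```

Under this reading, the layer index $l$ and the coefficient $c$ are the quantities a reader would need reported in order to reproduce a given steering result.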
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments identify key areas where additional clarity and evidence would strengthen the manuscript's claims. We address each major comment point by point below, indicating where revisions will be made.
Point-by-point responses
-
Referee: [Abstract and Experimental Results] Abstract and Experimental Results section: the claim of achieving SOTA on sentiment shift and detoxification is presented without details on the baselines compared, the number of evaluation runs, statistical significance tests, error bars, or data exclusion rules. These omissions are load-bearing because the central empirical claim cannot be assessed for robustness or generalizability without them.
Authors: We agree that the SOTA claim requires more supporting details for proper assessment of robustness. In the revised manuscript, we will update the abstract and Experimental Results section to explicitly name the baselines (including specific prompt engineering and fine-tuning methods from related work), report the number of evaluation runs with error bars or standard deviations, include statistical significance tests where relevant, and clarify data exclusion rules. These additions will be made while preserving the reported performance figures.
revision: yes
-
Referee: [Method] Method description (steering vector construction): the activation difference is computed from a single prompt pair at a chosen layer with no regularization or multi-pair averaging. This construction does not guarantee that the resulting vector isolates a high-level semantic direction rather than prompt-specific or surface-form artifacts, which directly threatens the claim that the method transfers across arbitrary contexts without off-target effects.
Authors: The single-pair construction is a core feature of ActAdd, chosen to emphasize its training-free nature and minimal data needs. Contrasting pairs targeting high-level concepts are used, with addition at intermediate layers to focus on semantic rather than surface features; empirical transfer across contexts in our results supports this. We acknowledge the risk of artifacts. In revision, we will add discussion of this limitation in the Method section and include supplementary results using averaged vectors from multiple pairs to evaluate robustness.
revision: partial
-
Referee: [Evaluation] Evaluation of off-target preservation: while the paper states that ActAdd preserves performance on unrelated tasks, no quantitative results or specific task suites are referenced to support this. Without such evidence, the claim that high-level steering leaves other capabilities intact remains unverified and is central to the practical utility argument.
Authors: We appreciate the referee highlighting the need for quantitative support here. Although internal checks showed no degradation, the manuscript lacks explicit metrics. In the revised version, we will add quantitative results in the Evaluation section, reporting performance on specific off-target benchmarks (such as subsets of GLUE or general text perplexity) for both steered and baseline models to verify capability preservation.
revision: yes
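The multi-pair averaging mentioned in the second response could be checked along the following lines. This is a sketch under stated assumptions, not the authors' procedure: the activation vectors are made-up numbers standing in for layer activations of three contrasting prompt pairs, and the helper names are hypothetical. It averages the per-pair difference vectors and measures how well each single-pair vector agrees with that average.

```python
import math


def difference(a_pos, a_neg):
    """Single-pair steering vector: positive minus negative activations."""
    return [p - n for p, n in zip(a_pos, a_neg)]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Made-up layer activations for three contrasting prompt pairs (positive, negative).
pairs = [
    ([1.0, 0.2, 0.0], [-1.0, 0.1, 0.0]),
    ([0.9, -0.3, 0.5], [-1.1, -0.2, 0.5]),
    ([1.2, 0.0, -0.4], [-0.8, 0.1, -0.4]),
]

single_pair_vectors = [difference(p, n) for p, n in pairs]
averaged = mean_vector(single_pair_vectors)

# Agreement of each single-pair vector with the average; a low value would hint
# that the pair's vector carries prompt-specific artifacts.
agreement = [cosine(v, averaged) for v in single_pair_vectors]
```

If single-pair vectors from real activations showed consistently high agreement with the average, that would be one piece of evidence for the load-bearing premise that a single pair captures a shared semantic direction; low or erratic agreement would support the referee's concern.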
Circularity Check
No significant circularity; empirical technique with independent experimental validation
Full rationale
The paper presents ActAdd as a direct, non-optimized procedure: compute the difference between activations on a single contrasting prompt pair at a chosen layer, then add a scaled version of that vector to the residual stream during inference. This construction does not fit parameters to a target metric and then relabel the fit as a prediction, nor does it define the steering vector in terms of the desired output property. The cited Subramani et al. 2022 reference supplies the contrastive-difference idea but is not used to import a uniqueness theorem or to smuggle an ansatz; the present work simply applies the difference vector and reports measured effects on sentiment and toxicity benchmarks. No self-citation chain bears the central claim, and the reported SOTA numbers are external performance measurements rather than algebraic identities. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
free parameters (1)
- steering vector scaling coefficient
Lean theorems connected to this paper
-
Foundation.DAlembert.Inevitability · bilinear_family_forced (echoes)
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
By tactically adding in e.g. the “Love” - “Hate” steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification
-
Foundation.HierarchyForcing · additive_composition_is_minimal (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
-
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
LLMs compute Nash actions internally but suppress them via prosocial overrides from training data, and this can be causally controlled through residual stream interventions.
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
-
Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.
-
The Linear Representation Hypothesis and the Geometry of Large Language Models
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
-
As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs
Persona and task in role prompts decompose additively into orthogonal directions at the prompt-to-answer transition in LLM residual streams, but this local structure does not allow compressing the prompt into a single...
-
Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Off-the-shelf persona vectors for doubt and scrutiny reduce sycophancy comparably to CAA while maintaining accuracy on correct inputs and showing directional independence.
-
The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering
VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.
-
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the ...
-
FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers
FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.
-
Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from ...
-
The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE factors sparse autoencoder decoder atoms to the native d_k x d_v cache write shape in recurrent models, provides a closed-form logit shift, and demonstrates high success in atom substitution and behavioral ed...
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.
-
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
-
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
Rank-1 activation steering is often cheap when prompt-boundary alignment guides budgeted search and concept granularity diagnoses directional stability, with the GRACE framework reducing trials to 95% utility by 39.8%...
-
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
-
Inference Time Causal Probing in LLMs
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
-
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Memory Inception steers LLMs via selective latent KV cache injection at chosen layers, delivering better control-drift balance than prompting or CAA on personality and reasoning tasks while reducing storage needs.
-
DataDignity: Training Data Attribution for Large Language Models
ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
-
Steer Like the LLM: Activation Steering that Mimics Prompting
PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.
-
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.
-
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained an...
-
Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
Geometric Unlearning suppresses specific knowledge in LLMs by projecting hidden planning states onto a low-rank safe geometry derived from minimal reference prompts.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
-
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
-
Subliminal Steering: Stronger Encoding of Hidden Signals
Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.
-
Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation
Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
-
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space
Paraphrases of an identity document induce tighter clustering in LLM activation space than matched controls, indicating attractor-like dynamics for agent identity.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Steering Autoregressive Music Generation with Recursive Feature Machines
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
-
Activation Steering with a Feedback Controller
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Relational Linear Properties in Language Models: An Empirical Investigation
A KL-divergence probing method shows relational linearity in language models varies across models and layers while being sensitive to relation phrasing, extending prior linear embedding work.
-
Manifold-Guided Attention Steering
MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
-
Latent-space Attacks for Refusal Evasion in Language Models
Introduces Controlled Latent-space Evasion attack that projects model activations past a linear probe's decision boundary to suppress refusal, outperforming ablation baselines on 15 models.
-
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
-
ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
ASRU combines activation redirection and reward-optimized fine-tuning to unlearn cross-modal sensitive knowledge in MLLMs, reporting +24.6% better unlearning effectiveness and 5.8x higher generation quality on Qwen3-V...
-
VSPO: Vector-Steered Policy Optimization for Behavioral Control
VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.
-
TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
TFGN is an architectural overlay for transformers enabling task-free, replay-free continual pre-training across heterogeneous domains at LLM scale with near-zero backward transfer and high gradient orthogonality.
-
Fusion-fission forecasts when AI will shift to undesirable behavior
A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world data.
-
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, wit...
Reference graph
Works this paper leans on
-
[1]
Understanding intermediate layers using linear classifier probes, 2018
Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2018
work page 2018
-
[2]
TransformerLens: A library for mechanistic interpretability of generative language models
Joseph Bloom and Neel Nanda. TransformerLens: A library for mechanistic interpretability of generative language models. https://neelnanda-io.github.io/TransformerLens/, 2022
work page 2022
-
[3]
Robustness of edited neural networks, 2023
Davis Brown, Charles Godfrey, Cody Nizinski, Jonathan Tu, and Henry Kvinge. Robustness of edited neural networks, 2023
work page 2023
-
[4]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
work page 2020
-
[5]
Discovering latent knowledge in language models without supervision, 2022
Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2022
work page 2022
-
[6]
Plug and play language models: A simple approach to controlled text generation, 2020
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation, 2020
work page 2020
-
[7]
A mathematical framework for transformer circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021
work page 2021
-
[8]
Toy models of superposition, 2022
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition, 2022
work page 2022
-
[10]
Probability plotting methods for the analysis of data
Ramanathan Gnanadesikan and Martin B Wilk. Probability plotting methods for the analysis of data. Biometrika, 55(1):1-17, 1968
work page 1968
-
[11]
Bias correction of learned generative models using likelihood-free importance weighting, 2019
Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting, 2019
work page 2019
-
[15]
Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models, 2023
Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models, 2023
work page 2023
-
[16]
Evan Hernandez, Belinda Z. Li, and Jacob Andreas. Inspecting and editing knowledge representations in language models, 2023
work page 2023
-
[17]
Editing models with task arithmetic, 2023
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic, 2023
work page 2023
-
[20]
Language models and cognitive automation for economic research
Anton Korinek. Language models and cognitive automation for economic research. Technical report, National Bureau of Economic Research, 2023
work page 2023
-
[21]
Autoencoding beyond pixels using a learned similarity metric, 2016
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric, 2016
work page 2016
-
[22]
The power of scale for parameter-efficient prompt tuning, 2021
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning, 2021
work page 2021
-
[23]
Delete, retrieve, generate: A simple approach to sentiment and style transfer, 2018
Juncen Li, Robin Jia, He He, and Percy Liang. Delete, retrieve, generate: A simple approach to sentiment and style transfer, 2018. URL https://arxiv.org/abs/1804.06437
-
[24]
Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg
Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task, 2023 a
work page 2023
-
[25]
Inference-time intervention: Eliciting truthful answers from a language model, 2023 b
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023 b
work page 2023
-
[26]
Prefix- T uning: Optimizing continuous prompts for generation, 2021
Xiang Lisa Li and Percy Liang. Prefix- T uning: Optimizing continuous prompts for generation, 2021
work page 2021
-
[27]
Sheng Liu, Lei Xing, and James Zou. In-context V ectors: Making in context learning more effective and controllable through latent space steering, 2023
work page 2023
-
[28]
Keeping llms aligned after fine-tuning: The crucial role of prompt templates, 2024
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Keeping llms aligned after fine-tuning: The crucial role of prompt templates, 2024
work page 2024
-
[29]
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.\ 142--150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...
work page 2011
-
[30]
Locating and editing factual associations in GPT , 2023
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT , 2023
work page 2023
- [31]
-
[32]
Are sixteen heads really better than one? In H
Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alch\' e -Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/2c601ad9d2ff9bc8b28...
work page 2019
-
[33]
Distributed representations of words and phrases and their compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013 a . URL https://proceedings.neurip...
work page 2013
-
[34]
Linguistic regularities in continuous space word representations
Tom \'a s Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pp.\ 746--751, 2013 b
work page 2013
-
[35]
Understanding and Controlling a Maze - Solving Policy Network , October 2023
Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. Understanding and controlling a maze-solving policy network, 2023. URL https://arxiv.org/abs/2310.08043
-
[36]
Relative representations enable zero-shot latent space communication, 2023
Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication, 2023
work page 2023
-
[37]
Actually, othello-gpt has a linear emergent world representation
Neel Nanda. Actually, othello-gpt has a linear emergent world representation. neelnanda.io/mechanistic-interpretability/othello, 2023
work page 2023
-
[38]
Distributed representations: Composition & superposition
Christopher Olah. Distributed representations: Composition & superposition. https://transformer-circuits.pub/2023/superposition-composition/index.html, 2023
work page 2023
-
[41]
arXiv preprint arXiv:2307.03214 , year=
Jonathan Pei, Kevin Yang, and Dan Klein. PREADD : prefix-adaptive decoding for controlled text generation. arXiv preprint arXiv:2307.03214, 2023
-
[42]
Joshua Peterson, Stephan Meylan, and David Bourgin. Openwebtext. https://github.com/jcpeterson/openwebtext, 2018
work page 2018
-
[43]
F. Petroni, T. Rockt \" a schel, A. H. Miller, P. Lewis, A. Bakhtin, Y. Wu, and S. Riedel. Language models as knowledge bases? In In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019, 2019
work page 2019
-
[45]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019
work page 2019
-
[46]
Sequence level training with recurrent neural networks, 2016
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks, 2016
work page 2016
-
[49]
The irrelevance of turing machines to artificial intelligence
Aaron Sloman. The irrelevance of turing machines to artificial intelligence. In Matthias Scheutz (ed.), Computationalism: New Directions. MIT Press, 2002
work page 2002
-
[50]
Jan Strunk. nltk.tokenize.punkt module. https://www.nltk.org/api/nltk.tokenize.punkt.html, 2013
work page 2013
-
[52]
LLaMA : Open and efficient foundation language models, 2023
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA : Open and efficient foundation language models, 2023
work page 2023
-
[53]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:...
work page 2017
-
[54]
GPT-J-6B : 6 B jax-based transformer
Ben Wang and Aran Komatsuzaki. GPT-J-6B : 6 B jax-based transformer. https://github.com/kingoflolz/mesh-transformer-jax\#gpt-j-6b, 2021
work page 2021
-
[55]
Prompt engineering in consistency and reliability with the evidence-based guideline for llms
Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, and Jian Li. Prompt engineering in consistency and reliability with the evidence-based guideline for llms. npj Digital Medicine, 7 0 (1): 0 41, 2024
work page 2024
-
[56]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022
work page 2022
- [57]
-
[58]
Eva- KELLM : A new benchmark for evaluating knowledge editing of LLMs , 2023
Suhang Wu, Minlong Peng, Yue Chen, Jinsong Su, and Mingming Sun. Eva- KELLM : A new benchmark for evaluating knowledge editing of LLMs , 2023
work page 2023
-
[60]
The unreliability of explanations in few-shot prompting for textual reasoning
Xi Ye and Greg Durrett. The unreliability of explanations in few-shot prompting for textual reasoning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.\ 30378--30392. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/c4...
work page 2022
-
[62]
A comprehensive study of knowledge editing for large language models, 2024
Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. A comprehensive study of knowledge editing for large language models, 2024
work page 2024
-
[63]
OPT : Open pre-trained transformer language models, 2022 b
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT : Open pre-trained transformer language models, 2022 b
work page 2022
-
[64]
Air- D ecoding: Attribute distribution reconstruction for decoding-time controllable text generation
Tianqi Zhong, Quan Wang, Jingxuan Han, Yongdong Zhang, and Zhendong Mao. Air- D ecoding: Attribute distribution reconstruction for decoding-time controllable text generation. arXiv preprint arXiv:2310.14892, 2023
-
[65]
Steering large language models using APE
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Steering large language models using APE . In NeurIPS ML Safety Workshop, 2022. URL https://openreview.net/forum?id=JjvNzMOiBEp
work page 2022
-
[66]
Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2019
work page 2019
-
[67]
Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to a...
work page 2023
-
[68]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova , year=. 1810.04805 , archivePrefix=
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
Advances in neural information processing systems , volume=
Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
-
[70]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Truthfulqa: Measuring how models mimic human falsehoods , author=. arXiv preprint arXiv:2109.07958 , year=
work page internal anchor Pith review arXiv
-
[71]
Understanding and Controlling a Maze-Solving Policy Network , author=. 2023 , eprint=
work page 2023
-
[72]
Fine-Tuning Language Models from Human Preferences , author=. 2019 , eprint=
work page 2019
-
[73]
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Fine-tuning aligned language models compromises safety, even when users do not intend to! , author=. arXiv preprint arXiv:2310.03693 , year=
work page internal anchor Pith review arXiv
-
[74]
Fudge: Controlled text generation with future discriminators
Yang, Kevin and Klein, Dan. FUDGE : Controlled Text Generation With Future Discriminators. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.276
-
[75]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothée Lacroix and Baptiste Rozière and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample , year=. 2302.13971 , archivePrefix=
work page internal anchor Pith review Pith/arXiv arXiv
-
[76]
Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke. Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10...
- [77]
-
[78]
Generating Wikipedia by Summarizing Long Sequences
Generating wikipedia by summarizing long sequences , author=. arXiv preprint arXiv:1801.10198 , year=
-
[79]
Transformer Circuits Thread , volume=
A mathematical framework for transformer circuits , author=. Transformer Circuits Thread , volume=
-
[80]
Language models are unsupervised multitask learners , author=. OpenAI blog , volume=
-
[81]
Controllable text generation via probability density estimation in the latent space
Controllable text generation via probability density estimation in the latent space , author=. arXiv preprint arXiv:2212.08307 , year=
-
[82]
Zhong, Tianqi and Wang, Quan and Han, Jingxuan and Zhang, Yongdong and Mao, Zhendong , journal=. Air-
-
[83]
Pei, Jonathan and Yang, Kevin and Klein, Dan , journal=
-
[84]
Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher , title =. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies , month =. 2011 , address =
work page 2011
-
[85]
Sheng Liu and Lei Xing and James Zou , year=. In-context. 2311.06668 , archivePrefix=
-
[86]
Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2023 , eprint=
work page 2023
-
[87]
More than a Feeling: Accuracy and Application of Sentiment Analysis , journal =
Jochen Hartmann and Mark Heitmann and Christian Siebert and Christina Schamp , keywords =. More than a Feeling: Accuracy and Application of Sentiment Analysis , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.ijresmar.2022.05.005 , url =
- [88]
-
[89]
Inspecting and Editing Knowledge Representations in Language Models , author=. 2023 , eprint=
work page 2023
-
[90]
Locating and Editing Factual Associations in GPT
Kevin Meng and David Bau and Alex Andonian and Yonatan Belinkov , year=. Locating and Editing Factual Associations in. 2202.05262 , archivePrefix=
work page internal anchor Pith review arXiv
- [91]
-
[92]
Language models implement simple word2vec-style vector arithmetic, 2024
Jack Merullo and Carsten Eickhoff and Ellie Pavlick , year=. Language Models Implement Simple. 2305.16130 , archivePrefix=
-
[93]
Linguistic regularities in continuous space word representations , author=. Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies , pages=
work page 2013
-
[94]
H afez: an Interactive Poetry Generation System
Ghazvininejad, Marjan and Shi, Xing and Priyadarshi, Jay and Knight, Kevin. H afez: an Interactive Poetry Generation System. Proceedings of ACL 2017, System Demonstrations. 2017
work page 2017