Steering Language Models With Activation Engineering
Recognition: 3 theorem links · Lean Theorem
Pith reviewed 2026-05-11 00:09 UTC · model grok-4.3
The pith
Adding the difference between activations on contrasting prompts into a model's forward pass steers its outputs toward desired sentiments or topics at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Activation Addition computes a steering vector by subtracting the intermediate activations of one prompt from those of a contrasting prompt and then adds a scaled version of this vector to the model's activations at selected layers during the forward pass. This produces reliable shifts in semantic properties of the generated text, such as sentiment polarity or topic focus, without requiring optimization or large datasets. The method achieves top performance on negative-to-positive sentiment transfer and toxicity reduction while preserving accuracy on off-target benchmarks.
What carries the argument
The Activation Addition (ActAdd) technique, which derives a steering vector from the difference in intermediate activations between a pair of contrasting prompts and adds it to the model's forward pass to guide high-level output properties.
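Pith gloss: mechanically, the whole method fits in a forward hook. The sketch below is a minimal rendering under stated assumptions — GPT-2's module layout (model.transformer.h), an illustrative layer and coefficient, and alignment by truncation rather than the paper's space-padding. It is one reading of the technique, not the authors' released implementation.

```python
# Minimal ActAdd-style sketch with PyTorch forward hooks. Module paths follow
# GPT-2's layout (model.transformer.h) and will differ for other architectures;
# LAYER and COEFF are illustrative, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

LAYER, COEFF = 6, 4.0  # illustrative choices

def residual_at(prompt: str, layer: int) -> torch.Tensor:
    """Residual-stream activations for `prompt` after `layer` blocks (1 x seq x d_model)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[layer]

# Steering vector: activation difference on a contrasting prompt pair.
# The paper right-pads the shorter prompt; here we truncate to shared positions.
h_pos, h_neg = residual_at("Love", LAYER), residual_at("Hate", LAYER)
n = min(h_pos.shape[1], h_neg.shape[1])
steer = COEFF * (h_pos[:, :n] - h_neg[:, :n])

def hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] == 1:          # skip cached single-token decode steps
        return output
    k = min(steer.shape[1], hidden.shape[1])
    hidden[:, :k, :] += steer[:, :k, :]  # steer the leading prompt positions
    return output

# Hook the block whose output equals hidden_states[LAYER], generate, clean up.
handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
ids = tok("I went up to my friend and said", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=True,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))
```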
If this is right
- High-level output properties such as sentiment and topic become controllable at inference time.
- Off-target task performance remains intact after steering.
- The approach requires no model optimization and works with a single pair of examples.
- Rapid iteration over different steering directions becomes feasible without retraining.
- State-of-the-art results appear on sentiment shifting and detoxification for models like LLaMA-3 and OPT.
Where Pith is reading between the lines
- High-level concepts may occupy consistent directions in activation space that can be isolated with minimal examples.
- The method could extend to steering for factual accuracy or creative styles once suitable prompt pairs are identified.
- Activation engineering offers a practical route to test hypotheses about how models internally represent abstract traits.
- Combining ActAdd with other inference techniques might enable finer-grained, multi-directional control.
Load-bearing premise
Activation differences extracted from a single prompt pair reliably encode semantic directions that generalize across new contexts without unintended effects on other model capabilities.
What would settle it
Running ActAdd on a held-out model or task and finding that the added vector produces no measurable shift in the target property or causes clear drops in unrelated task scores would falsify the claim of generalizable control.
read the original abstract
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Activation Addition (ActAdd), an inference-time technique that computes a steering vector as the difference in intermediate activations produced by a contrasting prompt pair (e.g., 'Love' versus 'Hate') and adds a scaled version of this vector to the model's activations during the forward pass. The central claim is that this yields state-of-the-art performance on negative-to-positive sentiment transformation and detoxification benchmarks using models including LLaMA-3 and OPT, while preserving performance on off-target tasks, all without any optimization or training and using only a single prompt pair.
Significance. If the empirical results hold under rigorous validation, the work is significant because it demonstrates a lightweight, training-free method for controlling high-level semantic properties of language model outputs via direct manipulation of activations. Strengths include the absence of machine optimization, the use of minimal data (single pairs), and the potential for rapid iteration; these features distinguish it from prompt engineering or fine-tuning and could enable new forms of controllable generation if the steering vectors prove robustly transferable.
major comments (3)
- [Abstract and Experimental Results] The claim of achieving SOTA on sentiment shift and detoxification is presented without details on the baselines compared, the number of evaluation runs, statistical significance tests, error bars, or data exclusion rules. These omissions are load-bearing because the central empirical claim cannot be assessed for robustness or generalizability without them.
- [Method] Method description (steering vector construction): the activation difference is computed from a single prompt pair at a chosen layer with no regularization or multi-pair averaging. This construction does not guarantee that the resulting vector isolates a high-level semantic direction rather than prompt-specific or surface-form artifacts, which directly threatens the claim that the method transfers across arbitrary contexts without off-target effects.
- [Evaluation] Evaluation of off-target preservation: while the paper states that ActAdd preserves performance on unrelated tasks, no quantitative results or specific task suites are referenced to support this. Without such evidence, the claim that high-level steering leaves other capabilities intact remains unverified and is central to the practical utility argument.
minor comments (2)
- [Method] The notation for the steering vector (difference of activations) should be formalized with an equation, including the exact layer index and scaling coefficient, to improve reproducibility; a candidate formalization appears after this list.
- [Introduction] Related work on activation steering (e.g., Subramani et al. 2022) is cited but could be expanded with a brief comparison table of prior techniques versus ActAdd on data requirements and optimization.
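For concreteness, one plausible form of the equation the Method comment requests, in notation of our own choosing rather than the paper's:

```latex
% Candidate formalization (notation ours, not the paper's).
% h_\ell(p): residual-stream activations at layer \ell on prompt p;
% (p_+, p_-): the contrasting prompt pair; c: the scaling coefficient.
\[
  v_\ell = h_\ell(p_+) - h_\ell(p_-), \qquad
  h_\ell'(x) = h_\ell(x) + c\, v_\ell ,
\]
% with v_\ell added at the leading token positions of the user prompt x
% during the forward pass.
```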
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments identify key areas where additional clarity and evidence would strengthen the manuscript's claims. We address each major comment point by point below, indicating where revisions will be made.
read point-by-point responses
-
Referee: [Abstract and Experimental Results] The claim of achieving SOTA on sentiment shift and detoxification is presented without details on the baselines compared, the number of evaluation runs, statistical significance tests, error bars, or data exclusion rules. These omissions are load-bearing because the central empirical claim cannot be assessed for robustness or generalizability without them.
Authors: We agree that the SOTA claim requires more supporting details for proper assessment of robustness. In the revised manuscript, we will update the abstract and Experimental Results section to explicitly name the baselines (including specific prompt engineering and fine-tuning methods from related work), report the number of evaluation runs with error bars or standard deviations, include statistical significance tests where relevant, and clarify data exclusion rules. These additions will be made while preserving the reported performance figures. revision: yes
-
Referee: [Method] Method description (steering vector construction): the activation difference is computed from a single prompt pair at a chosen layer with no regularization or multi-pair averaging. This construction does not guarantee that the resulting vector isolates a high-level semantic direction rather than prompt-specific or surface-form artifacts, which directly threatens the claim that the method transfers across arbitrary contexts without off-target effects.
Authors: The single-pair construction is a core feature of ActAdd, chosen to emphasize its training-free nature and minimal data needs. Contrasting pairs targeting high-level concepts are used, with addition at intermediate layers to focus on semantic rather than surface features; the empirical transfer across contexts in our results supports this. We acknowledge the risk of artifacts. In revision, we will add discussion of this limitation in the Method section and include supplementary results using averaged vectors from multiple pairs to evaluate robustness; a sketch of such averaging follows this response. revision: partial
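Pith note: a minimal sketch of the multi-pair averaging the authors propose, reusing the hypothetical residual_at helper from the sketch above; the pairs and the mean-pooling over token positions are our assumptions, not the paper's procedure.

```python
# Hedged sketch: average difference vectors over several contrast pairs to
# damp prompt-specific artifacts. Reuses residual_at from the earlier sketch.
import torch

PAIRS = [("Love", "Hate"),
         ("I adore this", "I despise this"),
         ("wonderful", "terrible")]  # assumed sentiment contrasts

def averaged_steering_vector(layer: int) -> torch.Tensor:
    diffs = []
    for pos, neg in PAIRS:
        h_p, h_n = residual_at(pos, layer), residual_at(neg, layer)
        m = min(h_p.shape[1], h_n.shape[1])  # align on shared positions
        # Mean-pool over positions so unequal-length pairs still combine
        # into a single direction per pair.
        diffs.append((h_p[:, :m] - h_n[:, :m]).mean(dim=1))
    # 1 x 1 x d_model, shaped to be compatible with the earlier hook.
    return torch.stack(diffs).mean(dim=0).unsqueeze(1)
```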
-
Referee: [Evaluation] Evaluation of off-target preservation: while the paper states that ActAdd preserves performance on unrelated tasks, no quantitative results or specific task suites are referenced to support this. Without such evidence, the claim that high-level steering leaves other capabilities intact remains unverified and is central to the practical utility argument.
Authors: We appreciate the referee highlighting the need for quantitative support here. Although internal checks showed no degradation, the manuscript lacks explicit metrics. In the revised version, we will add quantitative results in the Evaluation section, reporting performance on specific off-target benchmarks (such as subsets of GLUE or general text perplexity) for both steered and baseline models to verify capability preservation; a sketch of one such check follows this response. revision: yes
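Pith note: one concrete shape the promised off-target check could take — perplexity on held-out text with the steering hook on and off. model, tok, hook, and LAYER carry over from the earlier sketch; the probe texts are placeholders, not the paper's evaluation suite.

```python
# Hedged sketch: compare perplexity on off-target text with steering on/off.
import math
import torch

off_target_texts = [
    "The capital of France is Paris.",
    "Photosynthesis converts light energy into chemical energy.",
]  # placeholder off-target probes

def perplexity(texts: list[str]) -> float:
    nll, n_tok = 0.0, 0
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean next-token NLL
        nll += loss.item() * ids.shape[1]
        n_tok += ids.shape[1]
    return math.exp(nll / n_tok)

base = perplexity(off_target_texts)  # steering hook not registered
handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
steered = perplexity(off_target_texts)  # same texts, steering active
handle.remove()
print(f"off-target perplexity: {base:.2f} -> {steered:.2f}")
```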
Circularity Check
No significant circularity; empirical technique with independent experimental validation
full rationale
The paper presents ActAdd as a direct, non-optimized procedure: compute the difference between activations on a single contrasting prompt pair at a chosen layer, then add a scaled version of that vector to the residual stream during inference. This construction does not fit parameters to a target metric and then relabel the fit as a prediction, nor does it define the steering vector in terms of the desired output property. The cited Subramani et al. 2022 reference supplies the contrastive-difference idea but is not used to import a uniqueness theorem or to smuggle an ansatz; the present work simply applies the difference vector and reports measured effects on sentiment and toxicity benchmarks. No self-citation chain bears the central claim, and the reported SOTA numbers are external performance measurements rather than algebraic identities. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
free parameters (1)
- steering vector scaling coefficient
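Since the coefficient is the method's only free parameter, sweeping it is cheap; a hedged sketch reusing names from the earlier sketch (h_pos, h_neg, n, hook, LAYER), with an illustrative grid:

```python
# Hedged sketch: sweep the single free parameter. Rebinding the module-level
# `steer` changes what the hook from the earlier sketch adds; the grid and
# probe prompt are illustrative.
for coeff in (1.0, 2.0, 4.0, 8.0):
    steer = coeff * (h_pos[:, :n] - h_neg[:, :n])
    handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
    ids = tok("I went up to my friend and said", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=True,
                         pad_token_id=tok.eos_token_id)
    handle.remove()
    print(coeff, tok.decode(out[0]))
```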
Lean theorems connected to this paper
-
Foundation.DAlembert.Inevitability.bilinear_family_forced · echoes · "By tactically adding in e.g. the “Love” - “Hate” steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification"
-
Foundation.HierarchyForcing.additive_composition_is_minimal · unclear · "ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points"
Forward citations
Cited by 60 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
-
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
LLMs compute Nash actions internally but suppress them via prosocial overrides from training data, and this can be causally controlled through residual stream interventions.
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
-
The Linear Representation Hypothesis and the Geometry of Large Language Models
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
-
The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
-
Inference Time Causal Probing in LLMs
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
-
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Memory Inception steers LLMs via selective latent KV cache injection at chosen layers, delivering better control-drift balance than prompting or CAA on personality and reasoning tasks while reducing storage needs.
-
DataDignity: Training Data Attribution for Large Language Models
ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
-
Steer Like the LLM: Activation Steering that Mimics Prompting
PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.
-
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.
-
Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
Geometric Unlearning suppresses specific knowledge in LLMs by projecting hidden planning states onto a low-rank safe geometry derived from minimal reference prompts.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
-
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
-
Subliminal Steering: Stronger Encoding of Hidden Signals
Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.
-
Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation
Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
-
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space
Paraphrases of an identity document induce tighter clustering in LLM activation space than matched controls, indicating attractor-like dynamics for agent identity.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.
-
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
GCAD steering extracts prompt-based attention deltas and gates them at token level, cutting coherence drift from -18.6 to -1.9 while raising trait expression at turn 10 from 78 to 93 on multi-turn persona benchmarks.
-
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.
-
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
-
The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection
Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.
-
Belief or Circuitry? Causal Evidence for In-Context Graph Learning
Causal evidence from representation analysis and interventions shows LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.
-
Tool Calling is Linearly Readable and Steerable in Language Models
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
-
Don't Lose Focus: Activation Steering via Key-Orthogonal Projections
SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.
-
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
-
The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models
LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.
-
On the Blessing of Pre-training in Weak-to-Strong Generalization
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
-
Conceptors for Semantic Steering
Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in mu...
-
Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes
Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or ran...
-
Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.
-
Automated Interpretability and Feature Discovery in Language Models with Agents
A multi-agent framework automates mechanistic interpretability in LLMs through coupled loops of hypothesis testing via prompts and feature discovery via activation-space graphs and statistical criteria.
-
Minimizing Collateral Damage in Activation Steering
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
-
Escaping Mode Collapse in LLM Generation via Geometric Regulation
Reinforced Mode Regulation (RMR) uses low-rank damping on the value cache to prevent geometric collapse and mode collapse in autoregressive LLM generation, supporting stable output down to 0.8 nats/step entropy.
-
Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions
LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.
-
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...
-
Contextual Linear Activation Steering of Language Models
CLAS dynamically adapts linear activation steering strengths to context, outperforming fixed-strength steering and matching or exceeding ReFT and LoRA on eleven benchmarks across four model families with limited labeled data.
-
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
Language models recognize dropout and Gaussian noise applied to their activations
Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.
-
Predicting Where Steering Vectors Succeed
The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.
-
Geometric Routing Enables Causal Expert Control in Mixture of Experts
Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.
-
Rhetorical Questions in LLM Representations: A Linear Probing Study
Linear probes show rhetorical questions are encoded via multiple dataset-specific directions in LLM representations, with low cross-probe agreement on the same data.
Reference graph
Works this paper leans on
-
[1]
Understanding intermediate layers using linear classifier probes, 2018
Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2018
work page 2018
-
[2]
TransformerLens: A library for mechanistic interpretability of generative language models
Joseph Bloom and Neel Nanda. TransformerLens: A library for mechanistic interpretability of generative language models. https://neelnanda-io.github.io/TransformerLens/, 2022
work page 2022
-
[3]
Robustness of edited neural networks, 2023
Davis Brown, Charles Godfrey, Cody Nizinski, Jonathan Tu, and Henry Kvinge. Robustness of edited neural networks, 2023
work page 2023
-
[4]
Language models are few-shot learners, 2020
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
work page 2020
-
[5]
Discovering latent knowledge in language models without supervision, 2022
Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2022
work page 2022
-
[6]
Plug and play language models: A simple approach to controlled text generation, 2020
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation, 2020
work page 2020
-
[7]
A mathematical framework for transformer circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021
work page 2021
-
[8]
Toy models of superposition, 2022
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition, 2022
work page 2022
-
[10]
Probability plotting methods for the analysis of data
Ramanathan Gnanadesikan and Martin B Wilk. Probability plotting methods for the analysis of data. Biometrika, 55(1):1--17, 1968
work page 1968
-
[11]
Bias correction of learned generative models using likelihood-free importance weighting, 2019
Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting, 2019
work page 2019
-
[15]
Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models, 2023
Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models, 2023
work page 2023
-
[16]
Inspecting and editing knowledge representations in language models, 2023
Evan Hernandez, Belinda Z. Li, and Jacob Andreas. Inspecting and editing knowledge representations in language models, 2023
work page 2023
-
[17]
Editing models with task arithmetic, 2023
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic, 2023
work page 2023
-
[20]
Language models and cognitive automation for economic research
Anton Korinek. Language models and cognitive automation for economic research. Technical report, National Bureau of Economic Research, 2023
work page 2023
-
[21]
Autoencoding beyond pixels using a learned similarity metric, 2016
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric, 2016
work page 2016
-
[22]
The power of scale for parameter-efficient prompt tuning, 2021
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning, 2021
work page 2021
-
[23]
Delete, retrieve, generate: A simple approach to sentiment and style transfer, 2018
Juncen Li, Robin Jia, He He, and Percy Liang. Delete, retrieve, generate: A simple approach to sentiment and style transfer, 2018. URL https://arxiv.org/abs/1804.06437
-
[24]
Emergent world representations: Exploring a sequence model trained on a synthetic task, 2023a
Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task, 2023a
work page 2023
-
[25]
Inference-time intervention: Eliciting truthful answers from a language model, 2023b
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023b
work page 2023
-
[26]
Prefix-Tuning: Optimizing continuous prompts for generation, 2021
Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing continuous prompts for generation, 2021
work page 2021
-
[27]
In-context Vectors: Making in context learning more effective and controllable through latent space steering, 2023
Sheng Liu, Lei Xing, and James Zou. In-context Vectors: Making in context learning more effective and controllable through latent space steering, 2023
work page 2023
-
[28]
Keeping llms aligned after fine-tuning: The crucial role of prompt templates, 2024
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Keeping llms aligned after fine-tuning: The crucial role of prompt templates, 2024
work page 2024
-
[29]
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142--150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...
work page 2011
-
[30]
Locating and editing factual associations in GPT, 2023
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT, 2023
work page 2023
-
[32]
Are sixteen heads really better than one?
Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/2c601ad9d2ff9bc8b28...
work page 2019
-
[33]
Distributed representations of words and phrases and their compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013 a . URL https://proceedings.neurip...
work page 2013
-
[34]
Linguistic regularities in continuous space word representations
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746--751, 2013b
work page 2013
-
[35]
Understanding and controlling a maze-solving policy network, 2023
Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. Understanding and controlling a maze-solving policy network, 2023. URL https://arxiv.org/abs/2310.08043
-
[36]
Relative representations enable zero-shot latent space communication, 2023
Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication, 2023
work page 2023
-
[37]
Actually, othello-gpt has a linear emergent world representation
Neel Nanda. Actually, othello-gpt has a linear emergent world representation. neelnanda.io/mechanistic-interpretability/othello, 2023
work page 2023
-
[38]
Distributed representations: Composition & superposition
Christopher Olah. Distributed representations: Composition & superposition. https://transformer-circuits.pub/2023/superposition-composition/index.html, 2023
work page 2023
-
[41]
PREADD: prefix-adaptive decoding for controlled text generation
Jonathan Pei, Kevin Yang, and Dan Klein. PREADD: prefix-adaptive decoding for controlled text generation. arXiv preprint arXiv:2307.03214, 2023
-
[42]
Joshua Peterson, Stephan Meylan, and David Bourgin. Openwebtext. https://github.com/jcpeterson/openwebtext, 2018
work page 2018
-
[43]
F. Petroni, T. Rocktäschel, A. H. Miller, P. Lewis, A. Bakhtin, Y. Wu, and S. Riedel. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
work page 2019
-
[45]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
work page 2019
-
[46]
Sequence level training with recurrent neural networks, 2016
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks, 2016
work page 2016
-
[49]
The irrelevance of turing machines to artificial intelligence
Aaron Sloman. The irrelevance of turing machines to artificial intelligence. In Matthias Scheutz (ed.), Computationalism: New Directions. MIT Press, 2002
work page 2002
-
[50]
Jan Strunk. nltk.tokenize.punkt module. https://www.nltk.org/api/nltk.tokenize.punkt.html, 2013
work page 2013
-
[52]
LLaMA: Open and efficient foundation language models, 2023
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023
work page 2023
-
[53]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:...
work page 2017
-
[54]
GPT-J-6B: 6B JAX-based transformer
Ben Wang and Aran Komatsuzaki. GPT-J-6B: 6B JAX-based transformer. https://github.com/kingoflolz/mesh-transformer-jax#gpt-j-6b, 2021
work page 2021
-
[55]
Prompt engineering in consistency and reliability with the evidence-based guideline for llms
Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, and Jian Li. Prompt engineering in consistency and reliability with the evidence-based guideline for llms. npj Digital Medicine, 7(1):41, 2024
work page 2024
-
[56]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824--24837, 2022
work page 2022
-
[58]
Eva-KELLM: A new benchmark for evaluating knowledge editing of LLMs, 2023
Suhang Wu, Minlong Peng, Yue Chen, Jinsong Su, and Mingming Sun. Eva-KELLM: A new benchmark for evaluating knowledge editing of LLMs, 2023
work page 2023
-
[60]
The unreliability of explanations in few-shot prompting for textual reasoning
Xi Ye and Greg Durrett. The unreliability of explanations in few-shot prompting for textual reasoning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 30378--30392. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/c4...
work page 2022
-
[62]
A comprehensive study of knowledge editing for large language models, 2024
Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. A comprehensive study of knowledge editing for large language models, 2024
work page 2024
-
[63]
OPT: Open pre-trained transformer language models, 2022b
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022b
work page 2022
-
[64]
Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation
Tianqi Zhong, Quan Wang, Jingxuan Han, Yongdong Zhang, and Zhendong Mao. Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation. arXiv preprint arXiv:2310.14892, 2023
-
[65]
Steering large language models using APE
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Steering large language models using APE. In NeurIPS ML Safety Workshop, 2022. URL https://openreview.net/forum?id=JjvNzMOiBEp
work page 2022
-
[66]
Fine-tuning language models from human preferences, 2019
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2019
work page 2019
-
[67]
Representation engineering: A top-down approach to AI transparency, 2023
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to a...
work page 2023
-
[68]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018
work page 2018
-
[69]
Chain-of-thought prompting elicits reasoning in large language models
Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 2022
-
[70]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv preprint arXiv:2109.07958, 2021
work page 2021
-
[71]
Understanding and Controlling a Maze-Solving Policy Network
Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. Understanding and Controlling a Maze-Solving Policy Network, 2023
work page 2023
-
[72]
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-Tuning Language Models from Human Preferences, 2019
work page 2019
-
[73]
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! arXiv preprint arXiv:2310.03693, 2023
work page 2023
-
[74]
Fudge: Controlled text generation with future discriminators
Kevin Yang and Dan Klein. FUDGE: Controlled Text Generation With Future Discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021. doi:10.18653/v1/2021.naacl-main.276
-
[75]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023
work page 2023
-
[76]
Deep Contextualized Word Representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018. doi:10...
-
[78]
Generating Wikipedia by summarizing long sequences
Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018
-
[79]
A mathematical framework for transformer circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021
-
[80]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
-
[81]
Controllable text generation via probability density estimation in the latent space
Controllable text generation via probability density estimation in the latent space. arXiv preprint arXiv:2212.08307, 2022
-
[82]
Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation
Tianqi Zhong, Quan Wang, Jingxuan Han, Yongdong Zhang, and Zhendong Mao. Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation. arXiv preprint arXiv:2310.14892, 2023
-
[83]
PREADD: prefix-adaptive decoding for controlled text generation
Jonathan Pei, Kevin Yang, and Dan Klein. PREADD: prefix-adaptive decoding for controlled text generation. arXiv preprint arXiv:2307.03214, 2023
-
[84]
Learning word vectors for sentiment analysis
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 2011
work page 2011
-
[85]
In-context Vectors: Making in context learning more effective and controllable through latent space steering
Sheng Liu, Lei Xing, and James Zou. In-context Vectors: Making in context learning more effective and controllable through latent space steering. arXiv:2311.06668, 2023
-
[86]
Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan, Sarah Chen, et al. Representation Engineering: A Top-Down Approach to AI Transparency, 2023
work page 2023
-
[87]
More than a Feeling: Accuracy and Application of Sentiment Analysis
Jochen Hartmann, Mark Heitmann, Christian Siebert, and Christina Schamp. More than a Feeling: Accuracy and Application of Sentiment Analysis. 2023. doi:10.1016/j.ijresmar.2022.05.005
-
[89]
Inspecting and Editing Knowledge Representations in Language Models
Evan Hernandez, Belinda Z. Li, and Jacob Andreas. Inspecting and Editing Knowledge Representations in Language Models, 2023
work page 2023
-
[90]
Locating and editing factual associations in GPT
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. arXiv:2202.05262, 2023
-
[92]
Language models implement simple word2vec-style vector arithmetic
Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language Models Implement Simple Word2Vec-style Vector Arithmetic. arXiv:2305.16130, 2023
-
[93]
Linguistic regularities in continuous space word representations
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746--751, 2013
work page 2013
-
[94]
Hafez: an Interactive Poetry Generation System
Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: an Interactive Poetry Generation System. In Proceedings of ACL 2017, System Demonstrations, 2017
work page 2017