super hub Mixed citations

Steering Language Models With Activation Engineering

Alexander Matt Turner, David Udell, Gavin Leech, Juan J. Vazquez, Lisa Thiergart, Ulisse Mini · 2023 · cs.CL · arXiv 2308.10248

Mixed citation behavior. Most common role is background (62%).

209 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 209 citing papers more from Alexander Matt Turner arXiv PDF

abstract

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 16 baseline 2 method 2 dataset 1

citation-polarity summary

background 13 unclear 3 baseline 2 use method 2 use dataset 1

claims ledger

abstract Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "H

authors

Alexander Matt Turner David Udell Gavin Leech Juan J. Vazquez Lisa Thiergart Ulisse Mini

co-cited works

representative citing papers

Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders

quant-ph · 2026-07-01 · unverdicted · novelty 8.0

Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.

Reported Confidence in LLMs Tracks Commitment More Than Correctness

cs.LG · 2026-06-28 · unverdicted · novelty 8.0

Verbal confidence in LLMs tracks future commit/abstain decisions more than answer correctness, while log-probabilities track correctness.

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 8.0

A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0 · 2 refs

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

cs.GT · 2026-04-29 · accept · novelty 8.0

LLMs compute Nash actions internally but suppress them via prosocial overrides from training data, and this can be causally controlled through residual stream interventions.

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

cs.CL · 2026-03-22 · unverdicted · novelty 8.0

Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.

The Linear Representation Hypothesis and the Geometry of Large Language Models

cs.CL · 2023-11-07 · conditional · novelty 8.0 · 2 refs

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense

cs.CR · 2026-06-28 · unverdicted · novelty 7.0

Response-time linear probing on first generated tokens detects prefilling attacks missed by prompt-time activation defenses, achieving 0/40 attack success and 0% false positives across seven models while composing orthogonally with AlphaSteer.

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

Replay pairing shows LLM agents do not persist plans in hidden states but rely on plans remaining in context, with rapid signal decay and task performance drops when plans are evicted.

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

Hidden-state convergence at step 4 predicts behavioral consistency in LLM agents on QA tasks (r=-0.35 to -0.83), enabling AUROC 0.97 detection of inconsistent trajectories but not improving accuracy on harder benchmarks.

Channel Location Constrains the Auditability of Subliminal Learning

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Auditability of subliminal learning is constrained by channel location, with initialization-dependent body channels allowing pre-training screens while vocabulary geometry and conditional body channels evade them.

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

Difference-in-means activation directions detect and mitigate emergent misalignment from insecure code fine-tuning across four LLM families, with effective within-model steering but non-specific cross-model transfer.

Comparing Linear Probes with Mahalanobis Cosine Similarity

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

cs.SD · 2026-06-09 · unverdicted · novelty 7.0

Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

VFUSE applies sparse autoencoders to diffusion-transformer activations in RoseTTAFold3 and RFDiffusion3 to find monosemantic features that detect hazardous protein designs with AUROC up to 0.84.

Adversarial Robustness of Activation Steering in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-benchmark settings.

Decomposing how prompting steers behavior

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.

Subliminal Learning Is Steering Vector Distillation

cs.AI · 2026-05-31 · unverdicted · novelty 7.0

Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.

citing papers explorer

Showing 1 of 1 citing paper after filters.

The Tutoring Effectiveness Index: Predicting LLM Math Tutor Quality from Four Conversation Signals cs.CY · 2026-05-28 · unverdicted · none · ref 16 · internal anchor
The Tutoring Effectiveness Index (TEI) uses four signals from LLM conversations to select math tutoring responses, raising student improvement rates from 59.0% to 81.9% at N=8 on a frozen DeepSeek-R1-8B model without training or judges.

Steering Language Models With Activation Engineering

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer