Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Turner · 2024 · Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) · DOI 10.18653/v1/2024.acl-long.828

19 Pith papers cite this work, alongside 23 external citations (Crossref). Polarity classification is still indexing.


representative citing papers

Inference-Time Machine Unlearning via Gated Activation Redirection

cs.LG · 2026-05-12 · conditional · novelty 8.0

GUARD-IT performs machine unlearning in LLMs via inference-time gated activation redirection, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

Steer Like the LLM: Activation Steering that Mimics Prompting

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

Pre-trained Tabular Foundation Models as Versatile Summary Networks for Neural Posterior Estimation

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Pre-trained TabPFN acts as an effective training-free summary network for neural posterior estimation, matching or outperforming standard methods while preserving useful marginal and location information in the posterior.

Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.

Conceptors for Semantic Steering

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in multi-dimensional subspaces.

Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

LLMs favor task-appropriate reasoning over conflicting instructions, yet reasoning types are linearly encoded in middle-to-late layers and can be steered to boost instruction compliance by up to 29%.

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

OGLS-SD improves LLM reasoning by using verifiable outcome rewards to guide logit steering that calibrates teacher distributions in on-policy self-distillation, addressing reflection-induced mismatches.

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

citing papers explorer

Showing 19 of 19 citing papers.

Inference-Time Machine Unlearning via Gated Activation Redirection · cs.LG · 2026-05-12 · conditional · none · ref 7
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models · cs.LG · 2026-05-08 · unverdicted · none · ref 16
SLAM: Structural Linguistic Activation Marking for Language Models · cs.CL · 2026-05-06 · unverdicted · none · ref 40
Deep Minds and Shallow Probes · cs.LG · 2026-05-12 · unverdicted · none · ref 36
Interpreting Reinforcement Learning Agents with Susceptibilities · cs.LG · 2026-05-08 · unverdicted · none · ref 112
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior · cs.LG · 2026-05-06 · unverdicted · none · ref 195
Steer Like the LLM: Activation Steering that Mimics Prompting · cs.CL · 2026-05-05 · unverdicted · none · ref 25
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control · cs.LG · 2026-04-21 · conditional · none · ref 28
Psychological Steering of Large Language Models · cs.CL · 2026-04-15 · unverdicted · none · ref 54
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space · cs.CL · 2026-05-12 · unverdicted · none · ref 187
Pre-trained Tabular Foundation Models as Versatile Summary Networks for Neural Posterior Estimation · cs.LG · 2026-05-08 · unverdicted · none · ref 31
Don't Lose Focus: Activation Steering via Key-Orthogonal Projections · cs.CL · 2026-05-07 · unverdicted · none · ref 30
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs · cs.LG · 2026-05-07 · unverdicted · none · ref 20 · 2 links
Conceptors for Semantic Steering · cs.LG · 2026-05-06 · unverdicted · none · ref 23
Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation · cs.CL · 2026-05-03 · unverdicted · none · ref 7
Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models · cs.CL · 2026-04-29 · unverdicted · none · ref 30
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal · cs.LG · 2026-04-09 · unverdicted · none · ref 28
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning · cs.LG · 2026-05-12 · unverdicted · none · ref 33
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models · cs.CL · 2026-05-12 · unverdicted · none · ref 52
