Steering Language Models With Activation Engineering

Alexander Matt Turner, David Udell, Gavin Leech, Juan J. Vazquez, Lisa Thiergart, Ulisse Mini · 2023 · cs.CL · arXiv 2308.10248

Mixed citation behavior. Most common role is background (62%).

209 Pith papers citing it

Background 62% of classified citations

abstract

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
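
The contrast-and-add procedure the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the three-layer elementwise "model", the two-dimensional embeddings, and the names `LAYER_WEIGHTS`, `activations_at`, `steering_vector`, `forward`, and the scaling coefficient `coeff` are all hypothetical, chosen only to show the shape of the computation (record activations for a prompt pair at one layer, subtract them, add the scaled difference back in during a later forward pass).

```python
# Toy illustration of activation addition. Everything here is a made-up
# stand-in: a real language model has attention, MLPs, and token positions.

LAYER_WEIGHTS = [0.5, -0.25, 0.125]  # hypothetical 3-layer toy model


def layer(h, weight):
    """Stand-in residual layer: h + weight * h, applied elementwise."""
    return [x + weight * x for x in h]


def activations_at(embedding, layer_idx):
    """Return the activation vector that enters layer `layer_idx`."""
    h = list(embedding)
    for w in LAYER_WEIGHTS[:layer_idx]:
        h = layer(h, w)
    return h


def steering_vector(pos_emb, neg_emb, layer_idx):
    """Contrast the activations of a prompt pair at one layer."""
    h_pos = activations_at(pos_emb, layer_idx)
    h_neg = activations_at(neg_emb, layer_idx)
    return [p - n for p, n in zip(h_pos, h_neg)]


def forward(embedding, steer=None, layer_idx=0, coeff=1.0):
    """Run the toy model, optionally adding coeff * steer before one layer."""
    h = list(embedding)
    for i, w in enumerate(LAYER_WEIGHTS):
        if steer is not None and i == layer_idx:
            h = [x + coeff * s for x, s in zip(h, steer)]
        h = layer(h, w)
    return h


# Hypothetical embeddings: the first coordinate plays the role of "sentiment".
LOVE = [1.0, 0.0]
HATE = [-1.0, 0.0]
PROMPT = [0.0, 1.0]

vec = steering_vector(LOVE, HATE, layer_idx=1)
plain = forward(PROMPT)
steered = forward(PROMPT, steer=vec, layer_idx=1, coeff=2.0)
print("steering vector:", vec)
print("unsteered output:", plain)
print("steered output:  ", steered)
```

In this toy setting the steered output moves along the first coordinate while the second is left unchanged, which loosely mirrors the abstract's claim of steering a target property while preserving off-target behavior. No optimization step is involved: the vector comes from two forward passes on a single prompt pair.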

citation-role summary

background: 16 · baseline: 2 · method: 2 · dataset: 1

citation-polarity summary

background: 13 · unclear: 3 · baseline: 2 · use method: 2 · use dataset: 1

representative citing papers

Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders

quant-ph · 2026-07-01 · unverdicted · novelty 8.0

Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.

Reported Confidence in LLMs Tracks Commitment More Than Correctness

cs.LG · 2026-06-28 · unverdicted · novelty 8.0

Verbal confidence in LLMs tracks future commit/abstain decisions more than answer correctness, while log-probabilities track correctness.

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 8.0

A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0 · 2 refs

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

cs.GT · 2026-04-29 · accept · novelty 8.0

LLMs compute Nash actions internally but suppress them via prosocial overrides from training data, and this can be causally controlled through residual stream interventions.

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

cs.CL · 2026-03-22 · unverdicted · novelty 8.0

Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.

The Linear Representation Hypothesis and the Geometry of Large Language Models

cs.CL · 2023-11-07 · conditional · novelty 8.0 · 2 refs

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense

cs.CR · 2026-06-28 · unverdicted · novelty 7.0

Response-time linear probing on first generated tokens detects prefilling attacks missed by prompt-time activation defenses, achieving 0/40 attack success and 0% false positives across seven models while composing orthogonally with AlphaSteer.

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

Replay pairing shows LLM agents do not persist plans in hidden states but rely on plans remaining in context, with rapid signal decay and task performance drops when plans are evicted.

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

Hidden-state convergence at step 4 predicts behavioral consistency in LLM agents on QA tasks (r=-0.35 to -0.83), enabling AUROC 0.97 detection of inconsistent trajectories but not improving accuracy on harder benchmarks.

Channel Location Constrains the Auditability of Subliminal Learning

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Auditability of subliminal learning is constrained by channel location, with initialization-dependent body channels allowing pre-training screens while vocabulary geometry and conditional body channels evade them.

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

Difference-in-means activation directions detect and mitigate emergent misalignment from insecure code fine-tuning across four LLM families, with effective within-model steering but non-specific cross-model transfer.

Comparing Linear Probes with Mahalanobis Cosine Similarity

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

cs.SD · 2026-06-09 · unverdicted · novelty 7.0

Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

VFUSE applies sparse autoencoders to diffusion-transformer activations in RoseTTAFold3 and RFDiffusion3 to find monosemantic features that detect hazardous protein designs with AUROC up to 0.84.

Adversarial Robustness of Activation Steering in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-benchmark settings.

Decomposing how prompting steers behavior

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.

Subliminal Learning Is Steering Vector Distillation

cs.AI · 2026-05-31 · unverdicted · novelty 7.0

Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.

citing papers explorer

Showing 50 of 84 citing papers after filters.

Reported Confidence in LLMs Tracks Commitment More Than Correctness cs.LG · 2026-06-28 · unverdicted · none · ref 28 · internal anchor
Verbal confidence in LLMs tracks future commit/abstain decisions more than answer correctness, while log-probabilities track correctness.
WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 70 · 3 links · internal anchor
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens cs.LG · 2026-04-03 · accept · none · ref 27 · internal anchor
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
Channel Location Constrains the Auditability of Subliminal Learning cs.LG · 2026-06-20 · unverdicted · none · ref 5 · internal anchor
Auditability of subliminal learning is constrained by channel location, with initialization-dependent body channels allowing pre-training screens while vocabulary geometry and conditional body channels evade them.
Comparing Linear Probes with Mahalanobis Cosine Similarity cs.LG · 2026-06-17 · unverdicted · none · ref 24 · internal anchor
For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.
VFUSE: Virulent Feature Understanding with Sparse autoEncoders cs.LG · 2026-06-08 · unverdicted · none · ref 42 · internal anchor
VFUSE applies sparse autoencoders to diffusion-transformer activations in RoseTTAFold3 and RFDiffusion3 to find monosemantic features that detect hazardous protein designs with AUROC up to 0.84.
Adversarial Robustness of Activation Steering in Large Language Models cs.LG · 2026-06-05 · unverdicted · none · ref 5 · internal anchor
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control cs.LG · 2026-06-03 · unverdicted · none · ref 28 · internal anchor
LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.
Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning cs.LG · 2026-06-02 · unverdicted · none · ref 28 · internal anchor
Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-benchmark settings.
Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability cs.LG · 2026-05-24 · unverdicted · none · ref 22 · internal anchor
Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.
Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering cs.LG · 2026-05-24 · unverdicted · none · ref 13 · internal anchor
A Riemannian geodesic framework for label-free manifold steering in language models via a schema-supervised encoder approximating output Hellinger distance on activations.
Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol cs.LG · 2026-05-23 · unverdicted · none · ref 13 · internal anchor
Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.
Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m cs.LG · 2026-05-23 · unverdicted · none · ref 13 · internal anchor
Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.
The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering cs.LG · 2026-05-20 · conditional · none · ref 20 · internal anchor
VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing cs.LG · 2026-05-18 · unverdicted · none · ref 4 · internal anchor
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers cs.LG · 2026-05-17 · unverdicted · none · ref 45 · internal anchor
FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.
Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space cs.LG · 2026-05-15 · unverdicted · none · ref 47 · internal anchor
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
Deep Minds and Shallow Probes cs.LG · 2026-05-12 · unverdicted · none · ref 35 · internal anchor
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing cs.LG · 2026-05-11 · unverdicted · none · ref 12 · internal anchor
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search cs.LG · 2026-05-09 · unverdicted · none · ref 3 · 2 links · internal anchor
Prompt-boundary directional alignment enables geometry-guided search that cuts the number of trials needed to reach 95% of best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It cs.LG · 2026-05-05 · accept · none · ref 11 · 2 links · internal anchor
Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained and unconstrained counting respectively.
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 63 · internal anchor
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control cs.LG · 2026-04-21 · conditional · none · ref 41 · internal anchor
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
Steering Autoregressive Music Generation with Recursive Feature Machines cs.LG · 2025-10-21 · unverdicted · none · ref 11 · internal anchor
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
Activation Steering with a Feedback Controller cs.LG · 2025-10-05 · unverdicted · none · ref 25 · internal anchor
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 192 · internal anchor
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology cs.LG · 2026-07-01 · unverdicted · none · ref 15 · internal anchor
Model organism interpretability depends strongly on training methodology, with integrated training yielding less interpretable MOs than post-hoc SFT or DPO.
Surrogate Fidelity: When Can Open LLMs Explain Closed Ones? cs.LG · 2026-06-30 · unverdicted · none · ref 35 · internal anchor
Prediction agreement between open and closed LLMs substantially overstates agreement on attributions and causal reasons.
Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring cs.LG · 2026-06-29 · accept · none · ref 5 · internal anchor
Internal probes across three model families fail generalization and specificity tests and therefore do not support robust pre-action misalignment monitoring.
Do Models Read What They Write? Causal Registers in Scratchpad Reasoning cs.LG · 2026-06-28 · unverdicted · none · ref 8 · internal anchor
State-writing models causally use edited scratchpad states in a controlled task at 80-91% accuracy on held-out examples, unlike final-answer-only and pretrained controls.
Evidence for feature-specific error correction in LLMs cs.LG · 2026-06-23 · unverdicted · none · ref 15 · internal anchor
Perturbation experiments across six LLMs show activation robustness follows L^p norm with p>2 for feature directions (contrastive, MELBO, SAE) but p≈2 for random/PCA controls, indicating feature-specific error correction.
Learning to Refine Hidden States for Reliable LLM Reasoning cs.LG · 2026-06-16 · unverdicted · none · ref 10 · internal anchor
ReLAR uses reinforcement-guided latent refinement with adaptive controllers to improve LLM reasoning accuracy and stability at lower inference cost than explicit reasoning methods.
Recoverable but Not Stationary: Local Linear Structures in Weights and Activations cs.LG · 2026-06-09 · unverdicted · none · ref 19 · internal anchor
Local low-rank task-gradient structures exist in weights and activations but are non-stationary, with initial recovery updates forming a basis capturing 77% of LoRA displacement and parameter steps aligning 0.58 cosine with CAA steering vectors.
Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation cs.LG · 2026-06-07 · unverdicted · none · ref 2 · internal anchor
Activation steering induces emergent misalignment in LLMs, yielding more semantically relevant and coherent harmful responses than finetuning across model families, scales, tasks, and layers.
Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects cs.LG · 2026-06-06 · unverdicted · none · ref 12 · internal anchor
Pre-intervention feature statistics predict SAE steering modularity (stability and collateral spread) better than baselines across multiple models and dictionaries, with model-dependent success in held-out selection.
TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models cs.LG · 2026-06-05 · unverdicted · none · ref 22 · internal anchor
TALAN inserts a trainable latent memory path that remixes sequence information into small orthogonal perturbations, delivering 1.41-1.85 point average gains over matched LoRA and DoRA on four Qwen backbones and STEM/code benchmarks while adding under 1% parameters.
HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models cs.LG · 2026-06-02 · unverdicted · none · ref 30 · internal anchor
HARVE removes the component of the reward-head vector aligned with a multi-directional hacking subspace from residual streams using a small set of contrastive examples, improving robustness on RewardHackBench across eight models without fine-tuning while preserving general capability.
CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models cs.LG · 2026-06-01 · unverdicted · none · ref 15 · internal anchor
CANARY detects 1% fine-tuning contamination with AUROC 1.000 using SAE-filtered hidden states, 7.5x below output-level detection thresholds, with zero false positives on benign tuning.
Measuring, Localizing, and Ablating Alignment Signatures in LLMs cs.LG · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
Post-training introduces measurable AI-like stylistic signatures in LLMs that can be localized via aligned-base residual contrasts and ablated to lower detector rates while preserving coherence.
Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection cs.LG · 2026-05-27 · unverdicted · none · ref 50 · internal anchor
Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.
Manifold-Guided Attention Steering cs.LG · 2026-05-20 · unverdicted · none · ref 17 · internal anchor
MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
VSPO: Vector-Steered Policy Optimization for Behavioral Control cs.LG · 2026-05-15 · unverdicted · none · ref 24 · internal anchor
VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.
TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale cs.LG · 2026-05-14 · unverdicted · none · ref 64 · internal anchor
TFGN is an architectural overlay for transformers enabling task-free, replay-free continual pre-training across heterogeneous domains at LLM scale with near-zero backward transfer and high gradient orthogonality.
Interpretability Can Be Actionable cs.LG · 2026-05-11 · conditional · none · ref 1 · internal anchor
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
Enabling Performant and Flexible Model-Internal Observability for LLM Inference cs.LG · 2026-05-11 · unverdicted · none · ref 40 · internal anchor
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders cs.LG · 2026-05-07 · conditional · none · ref 36 · internal anchor
Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs cs.LG · 2026-05-07 · unverdicted · none · ref 19 · 2 links · internal anchor
Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
On the Blessing of Pre-training in Weak-to-Strong Generalization cs.LG · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
Conceptors for Semantic Steering cs.LG · 2026-05-06 · unverdicted · none · ref 31 · internal anchor
Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in multi-dimensional subspaces.
Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes cs.LG · 2026-05-04 · unverdicted · none · ref 17 · internal anchor
Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or random perturbations.
