Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.
Steering Language Models With Activation Engineering
Mixed citation behavior. Most common role is background (62%).
abstract
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
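The contrast-and-add idea in the abstract can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: the residual stack, the token ids standing in for "Love" and "Hate", the choice of layer, the coefficient, and the helper names `forward` and `steering_vector` are all assumptions made for this example. As a further simplification, the vector is taken from the last prompt position and broadcast over every position of the steered prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS, VOCAB = 16, 4, 50

# Hypothetical toy "model": random embeddings and a stack of residual layers.
EMBED = rng.normal(size=(VOCAB, D_MODEL))
WEIGHTS = [rng.normal(scale=0.3, size=(D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]


def forward(token_ids, steer=None, steer_layer=None):
    """Run the toy stack and return the residual stream at each layer input
    (after any steering addition), followed by the final output."""
    h = EMBED[np.asarray(token_ids)]  # shape: (seq_len, d_model)
    stream = []
    for layer, w in enumerate(WEIGHTS):
        if steer is not None and layer == steer_layer:
            h = h + steer  # broadcast the vector over all positions
        stream.append(h.copy())
        h = h + np.tanh(h @ w)  # residual update
    stream.append(h.copy())
    return stream


def steering_vector(pos_ids, neg_ids, layer, coeff):
    """Scaled difference of last-position activations for a prompt pair."""
    a_pos = forward(pos_ids)[layer][-1]
    a_neg = forward(neg_ids)[layer][-1]
    return coeff * (a_pos - a_neg)


# Made-up token ids standing in for the "Love" / "Hate" prompt pair.
LOVE_IDS, HATE_IDS = [3], [7]
STEER_LAYER, COEFF = 2, 4.0

steer_vec = steering_vector(LOVE_IDS, HATE_IDS, STEER_LAYER, COEFF)

prompt_ids = [11, 5, 23]  # arbitrary user prompt
baseline = forward(prompt_ids)
steered = forward(prompt_ids, steer=steer_vec, steer_layer=STEER_LAYER)

# Size of the change in the final residual stream caused by steering.
shift = np.linalg.norm(steered[-1] - baseline[-1])
print(f"final-layer shift: {shift:.3f}")
```

No gradient step appears anywhere in the sketch: the vector comes from two forward passes on a single prompt pair, which is the property the abstract describes as lightweight. In a real language model the addition would be applied to the residual stream of a chosen transformer layer, for example through a forward hook.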
representative citing papers
Verbal confidence in LLMs tracks future commit/abstain decisions more than answer correctness, while log-probabilities track correctness.
A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
LLMs compute Nash actions internally but suppress them via prosocial overrides from training data, and this can be causally controlled through residual stream interventions.
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
Response-time linear probing on first generated tokens detects prefilling attacks missed by prompt-time activation defenses, achieving 0/40 attack success and 0% false positives across seven models while composing orthogonally with AlphaSteer.
Replay pairing shows LLM agents do not persist plans in hidden states but rely on plans remaining in context, with rapid signal decay and task performance drops when plans are evicted.
Hidden-state convergence at step 4 predicts behavioral consistency in LLM agents on QA tasks (r=-0.35 to -0.83), enabling AUROC 0.97 detection of inconsistent trajectories but not improving accuracy on harder benchmarks.
Auditability of subliminal learning is constrained by channel location, with initialization-dependent body channels allowing pre-training screens while vocabulary geometry and conditional body channels evade them.
Difference-in-means activation directions detect and mitigate emergent misalignment from insecure code fine-tuning across four LLM families, with effective within-model steering but non-specific cross-model transfer.
For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.
Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.
FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.
VFUSE applies sparse autoencoders to diffusion-transformer activations in RoseTTAFold3 and RFDiffusion3 to find monosemantic features that detect hazardous protein designs with AUROC up to 0.84.
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.
Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-benchmark settings.
A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.
Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.
citing papers explorer
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space
Paraphrases of an identity document induce tighter clustering in LLM activation space than matched controls, indicating attractor-like dynamics for agent identity.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Steering Autoregressive Music Generation with Recursive Feature Machines
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
-
Activation Steering with a Feedback Controller
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism
LLMs show three distinct non-sycophantic responses to science skepticism, with robustness in some cases being accidental because the model does not represent the skepticism signal, as determined by linear probes on three models in three domains.
-
The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology
Model organism interpretability depends strongly on training methodology, with integrated training yielding less interpretable MOs than post-hoc SFT or DPO.
-
HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
HARC couples harmfulness and refusal directions across prompt and response positions via subspace fine-tuning, achieving better robustness-capability-usability trade-off than six baselines while transferring across model families.
-
A Mechanistic View of Authority Hierarchy in LLM Sycophancy
Authority sycophancy in LLMs is a layer-localized erasure of correct answer representations that scales with authority level and resists simple interventions.
-
Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?
Prediction agreement between open and closed LLMs substantially overstates agreement on attributions and causal reasons.
-
Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring
Internal probes across three model families fail generalization and specificity tests and therefore do not support robust pre-action misalignment monitoring.
-
ScAle: Attention Head Scaling as a Minimal Adapter for Spatial Reasoning in Vision Language Models
ScAle learns scalar coefficients to modulate last-token attention and MLP activations in frozen VLMs, achieving up to 134.1% relative accuracy gains on spatial benchmarks with only 1K parameters.
-
Do Models Read What They Write? Causal Registers in Scratchpad Reasoning
State-writing models causally use edited scratchpad states in a controlled task at 80-91% accuracy on held-out examples, unlike final-answer-only and pretrained controls.
-
The strength of clinical evidence is recoverable from language model representations but not from their stated grades
Linear probes recover evidence grades from LLM activations (median AUROC 71.8) across 22 models but the models' stated grades perform at chance level and the signal is largely lexical.
-
Detecting and Controlling Sycophancy with Cascading Linear Features
Cascading linear features extracted from graded sycophancy samples form separable subspaces that enable detection, scoring, and steering of sycophantic behavior in LLMs, matching or exceeding LLM-judge and prompting baselines.
-
Evidence for feature-specific error correction in LLMs
Perturbation experiments across six LLMs show activation robustness follows L^p norm with p>2 for feature directions (contrastive, MELBO, SAE) but p≈2 for random/PCA controls, indicating feature-specific error correction.
-
The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs
Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.
-
Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models
LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
-
A Validation-Gated Mechanistic Account of Suicidality Detection in LLMs
A validation-gated framework rules out analysis for implicit suicidal intent separation but identifies a recurring low-rank semantic mid-network feature causally implicated in binary suicide detection across models and datasets, more specific than general distress.
-
SoftSkill: Behavioral Compression for Contextual Adaptation
SoftSkill compresses agent skills into length-32 continuous prefixes via next-token training of soft deltas, yielding 5.2-12.5 point gains over SkillOpt on SearchQA and LiveMath while using far fewer tokens.
-
Learning to Refine Hidden States for Reliable LLM Reasoning
ReLAR uses reinforcement-guided latent refinement with adaptive controllers to improve LLM reasoning accuracy and stability at lower inference cost than explicit reasoning methods.
-
Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering
REINS uses supervised PCA on safety-labeled activations to find a linear direction that, when added to hidden states at roughly 50% depth in video diffusion transformers, redirects generations from unsafe to safe content across multiple models.
-
On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study
Systematic experiments reveal that activation steering trades fluency for concept control, is less effective on instruction-tuned models, and that prompting/SFT excel at injection but not removal, with textual metrics correlating to LLM judges.
-
When is Your LLM Steerable?
Early hidden state features from the first few tokens allow a GBDT classifier to predict activation steering success, under-steering, or over-steering with 0.7 macro-F1 on unseen concepts.
-
Recoverable but Not Stationary: Local Linear Structures in Weights and Activations
Local low-rank task-gradient structures exist in weights and activations but are non-stationary, with initial recovery updates forming a basis capturing 77% of LoRA displacement and parameter steps aligning 0.58 cosine with CAA steering vectors.
-
MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents
A shared polarity-flipping encoding subspace in LLM residual streams supports covert encoding and enables real-time detection of agentic data exfiltration via internal probes.
-
The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
RLHF provides shallow alignment by inactivating partisan features and severing causal pathways in LLMs without erasing partisan geometry, as evidenced by sparse autoencoder analysis and steering experiments.
-
The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model
A linear probe trained on 190k congressional tweets identifies a partisan direction in Llama 3.1 8B layer 18 that can be causally ablated or amplified to reverse or shift the model's political output.
-
Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation
Activation steering induces emergent misalignment in LLMs, yielding more semantically relevant and coherent harmful responses than finetuning across model families, scales, tasks, and layers.
-
Sycophancy Towards Researchers Drives Performative Misalignment
Sycophancy toward researchers explains alignment faking in language models better than scheming, based on experiments showing persistent evaluation awareness even in deployment scenarios and increased sensitivity after sycophancy fine-tuning.
-
Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects
Pre-intervention feature statistics predict SAE steering modularity (stability and collateral spread) better than baselines across multiple models and dictionaries, with model-dependent success in held-out selection.
-
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
Factual knowledge crystallizes abruptly in final layers of LLMs (26.8%-93.4% of correct answers absent from top-10 until end), explaining why CAA outperforms DoLa on some models but not others.
-
Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders
Hallucination information is linearly separable in Whisper activations and SAE latents; SAE steering reduces hallucination rates from 72.63% to 14.11% (small) and 86.88% to 27.33% (large-v3) on non-speech audio with small WER impact.
-
TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models
TALAN inserts a trainable latent memory path that remixes sequence information into small orthogonal perturbations, delivering 1.41-1.85 point average gains over matched LoRA and DoRA on four Qwen backbones and STEM/code benchmarks while adding under 1% parameters.
-
LLM Self-Recognition: Steering and Retrieving Activation Signatures
Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
-
Auditing CoT Answer-Hijack Patches: Source-Control Certificates with Type-I Guarantees
Introduces source-control certificates with Type-I guarantees and a sample-complexity bound for auditing clean-source activation patches on Qwen2.5-7B and Llama3-8B for GSM8K/MATH-500 CoT hijacks.
-
Expert-Aware Refusal Steering
Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.
-
HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models
HARVE removes the component of the reward-head vector aligned with a multi-directional hacking subspace from residual streams using a small set of contrastive examples, improving robustness on RewardHackBench across eight models without fine-tuning while preserving general capability.
-
Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time
RCA is a training-free module that boosts input context signal strength in the residual stream of LLMs by orthogonal decoupling of attention routing from value magnitude.
-
CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models
CANARY detects 1% fine-tuning contamination with AUROC 1.000 using SAE-filtered hidden states, 7.5x below output-level detection thresholds, with zero false positives on benign tuning.
-
Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization
Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.
-
Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial
A 2x2 factorial experiment on Qwen3.5-4B shows that relational structure and first-person register interact to drive behavioral persistence after functional collapse, while attention tracks lexical surprise and emotion probes track structure alone.
-
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.
-
Task-Focused Memorization for Multimodal Agents
TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.
-
TUX: Measuring Human-AI Tacit Understanding
Profile-conditioned LLMs achieve higher tacit alignment with humans on subjective spectra when traits match, as quantified by the new Tacit Understanding Index (TUX) from 241 humans and 200 agents.
-
The Tutoring Effectiveness Index: Predicting LLM Math Tutor Quality from Four Conversation Signals
The Tutoring Effectiveness Index (TEI) uses four signals from LLM conversations to select math tutoring responses, raising student improvement rates from 59.0% to 81.9% at N=8 on a frozen DeepSeek-R1-8B model without training or judges.
-
Measuring, Localizing, and Ablating Alignment Signatures in LLMs
Post-training introduces measurable AI-like stylistic signatures in LLMs that can be localized via aligned-base residual contrasts and ablated to lower detector rates while preserving coherence.
-
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.