Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
hub Mixed citations
Steering Llama 2 via Contrastive Activation Addition , url =
Mixed citation behavior. Most common role is background (40%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.
Probes predicting future behaviors from intermediate steps enable Future Probe Controlled Generation for steering large reasoning models with minimal quality degradation.
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
Query Lens extends Logit Lens to interpret sparse features via key-value analysis and indirect effects, yielding coherent token signatures where Logit Lens fails, and proposes the Subspace Channel Hypothesis.
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
Valence is geometrically encoded in Apertus-8B and Gemma-4-E4B with PC1 correlations of 0.76 and 0.83, but emerges at different depths than in Claude and arousal alignment varies by generated corpus.
LLMs fail to reliably self-report adversarial prefill attacks at 27.3% average intention-claim rate on compromised outputs, with signals tied to refusal reasoning, probe framing, and partial mitigation via finetuning that does not transfer.
Activation steering on early layers improves diversity of synthetic data for low-resource languages and often boosts downstream classifier performance compared to non-steered prompting.
Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.
ALMs encode audio evidence but override it with text in conflicts; GACL interpolates joint and same-audio scores to repair reversals, gaining 17.8 nAUC points under a 5pp faithfulness budget.
RCA is a training-free module that boosts input context signal strength in the residual stream of LLMs by orthogonal decoupling of attention routing from value magnitude.
An exposure-based split on BLiMP data reveals delayed generalization in five grammatical phenomena during LLM pre-training, with post-generalization shifts in concept vector predictiveness and attention patterns.
MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.
citing papers explorer
-
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation
10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.
-
Predicting Future Behaviors in Reasoning Models Enables Better Steering
Probes predicting future behaviors from intermediate steps enable Future Probe Controlled Generation for steering large reasoning models with minimal quality degradation.
-
Adversarial Robustness of Activation Steering in Large Language Models
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
-
Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects
Query Lens extends Logit Lens to interpret sparse features via key-value analysis and indirect effects, yielding coherent token signatures where Logit Lens fails, and proposes the Subspace Channel Hypothesis.
-
Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
-
Interpreting Reinforcement Learning Agents with Susceptibilities
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
-
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
-
Steer Like the LLM: Activation Steering that Mimics Prompting
PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.
-
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Activation Steering with a Feedback Controller
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
-
Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs
Valence is geometrically encoded in Apertus-8B and Gemma-4-E4B with PC1 correlations of 0.76 and 0.83, but emerges at different depths than in Claude and arousal alignment varies by generated corpus.
-
Can LLMs Reliably Self-Report Adversarial Prefills, and How?
LLMs fail to reliably self-report adversarial prefill attacks at 27.3% average intention-claim rate on compromised outputs, with signals tied to refusal reasoning, probe framing, and partial mitigation via finetuning that does not transfer.
-
Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation
Activation steering on early layers improves diversity of synthetic data for low-resource languages and often boosts downstream classifier performance compared to non-steered prompting.
-
Inside the LLM Word Factory
Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.
-
Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models
ALMs encode audio evidence but override it with text in conflicts; GACL interpolates joint and same-audio scores to repair reversals, gaining 17.8 nAUC points under a 5pp faithfulness budget.
-
Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time
RCA is a training-free module that boosts input context signal strength in the residual stream of LLMs by orthogonal decoupling of attention routing from value magnitude.
-
A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization
An exposure-based split on BLiMP data reveals delayed generalization in five grammatical phenomena during LLM pre-training, with post-generalization shifts in concept vector predictiveness and attention patterns.
-
Manifold-Guided Attention Steering
MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
-
TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction
TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.
-
Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States
COAST applies contrastive conceptors to steer VLA hidden states into task-specific success subspaces, yielding over 20% simulation and 40% real-robot success rate gains across three distinct policies.
-
ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
ASRU combines activation redirection and reward-optimized fine-tuning to unlearn cross-modal sensitive knowledge in MLLMs, reporting +24.6% better unlearning effectiveness and 5.8x higher generation quality on Qwen3-VL while preserving utility with limited retained data.
-
Inference-Time Machine Unlearning via Gated Activation Redirection
GUARD-IT performs machine unlearning in LLMs via input-dependent activation steering at inference time, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
-
Pre-trained Tabular Foundation Models as Versatile Summary Networks for Neural Posterior Estimation
Pre-trained TabPFN acts as an effective training-free summary network for neural posterior estimation, matching or outperforming standard methods while preserving useful marginal and location information in the posterior.
-
Don't Lose Focus: Activation Steering via Key-Orthogonal Projections
SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.
-
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
-
Conceptors for Semantic Steering
Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in multi-dimensional subspaces.
-
Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation
LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.
-
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
-
Sparse Concept Anchoring for Interpretable and Controllable Neural Representations
Sparse Concept Anchoring biases neural latent spaces toward targeted concepts using under 0.1% labels per concept, enabling reversible steering via projection and permanent removal via weight ablation with minimal side effects on other features.
-
Order Is Not Control
Order is distinct from control, where control is defined as a local receiver-gated response law demonstrated across biological circuits and LLM response panels with reported prediction accuracies of 72-84%.
-
TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering
TimpaTeks enables in-place text modification in diffusion language models via activation steering, shown on IMDB sentiment and synthetic concept tasks while lowering perplexity and retaining structure.
-
Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning
Trait-space drift monitoring detects emergent misalignment checkpoints in 7-9B LLMs with 2.2% FNR, 2.9% FPR and 0.99 AUROC, outperforming PCA and SAE baselines.
-
Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing
SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.
-
Do Linear Probes Generalize Better in Persona Coordinates?
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
-
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
-
A Geometric Account of Activation Steering through Angle-Norm Decomposition
Empirical study across seven language models finds concepts represented primarily in angular structure of activations while norm affects steering stability, recommending separate angular and radial parameterization over single additive coefficients.
-
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
-
DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge
Activation steering with FLORES-derived language vectors produces modest, layer-sensitive and language-dependent gains on cultural awareness tasks, with some settings degrading performance and strong interaction with prompt design.
- Latent-space Attacks for Refusal Evasion in Language Models
- OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
- Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models