A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.
super hub Mixed citations
Representation Engineering: A Top-Down Approach to AI Transparency
Mixed citation behavior. Most common role is background (62%).
abstract
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and con
authors
co-cited works
representative citing papers
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.
Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.
Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.
Steering vectors from frozen LM layers enable a lightweight classifier to detect machine-generated text robustly across domains, source models, and editing attacks.
Introduces a layered intervention framework for knowledge infusion in multimodal generative models and empirically demonstrates complementarity of layers in a safety-alignment task with diffusion models.
OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.
STRIDE formulates TDA as sparse recovery using steering operators that mimic subset training effects in activation space, claiming SOTA LLM pre-training attribution at 13x prior speed.
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-benchmark settings.
A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.
MENTIS applies layerwise covariance torsion (T1), spectral torsion (T2), and ERA localization to paired IT/PA 7-8B models, finding selective larger shifts for normative concepts, negative correlation with entropy, and mid-to-late layer peaks.
Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.
Agent-native LLMs are substantially more vulnerable to adversarial instructions arriving in tool descriptions than user messages (with the pattern reversing for general-purpose models and inverting again for tool outputs), as quantified by the new Safety Asymmetry Score across six models and three a
Post-hoc truncation of the tail of the SVD of ΔW reduces spurious-group gaps by up to 5× with <2 pp accuracy loss across 0.5B–7B models and four benchmarks.
Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.
Biased long-term memories in LLM agents cause measurable deviations in tool parameters across 105 scenarios, seven models, and 608 real tools, persisting under standard memory architectures.
citing papers explorer
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.
-
Subliminal Steering: Stronger Encoding of Hidden Signals
Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.
-
Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors
LLMs exhibit authority inversion by prioritizing natural-language user claims over numerical sensor data in conflicts, diagnosed with new geometric metrics and mitigated via layer-level calibration.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
-
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning performance within 1.5 points.
-
Structural Instability of Feature Composition
Feature composition in SAEs collapses asymptotically when the Gaussian mean width of the signal cone is exceeded, with ReLU inducing a ratchet-like accumulation of interference from correlations.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models
PPT-Bench measures how LLMs change answers under epistemic, value, authority, and identity pressures at baseline, single-turn, and multi-turn levels, finding separable inconsistency patterns across five models.
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
-
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
HARC couples harmfulness and refusal directions across prompt and response positions via subspace fine-tuning, achieving better robustness-capability-usability trade-off than six baselines while transferring across model families.
-
A Mechanistic View of Authority Hierarchy in LLM Sycophancy
Authority sycophancy in LLMs is a layer-localized erasure of correct answer representations that scales with authority level and resists simple interventions.
-
Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring
Internal probes across three model families fail generalization and specificity tests and therefore do not support robust pre-action misalignment monitoring.
-
SCARCE: Scalable Cascade Analysis for Rare-event Characterisation via Embeddings
SCARCE uses learned latent representations and adaptive thresholding to achieve 400-500x lower error than traditional subset simulation for MNIST misclassification and low relative error on LLM jailbreak probabilities.
-
Mechanistically Eliciting Latent Behaviors in Language Models
CPE is an unsupervised tensor-decomposition method that finds interpretable LoRAs to surface hidden LLM behaviors, matching supervised methods on some tasks and revealing failure modes like sandbagging and alignment-faking.
-
Do Models Read What They Write? Causal Registers in Scratchpad Reasoning
State-writing models causally use edited scratchpad states in a controlled task at 80-91% accuracy on held-out examples, unlike final-answer-only and pretrained controls.
-
The strength of clinical evidence is recoverable from language model representations but not from their stated grades
Linear probes recover evidence grades from LLM activations (median AUROC 71.8) across 22 models but the models' stated grades perform at chance level and the signal is largely lexical.
-
The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning
Benign multilingual fine-tuning causes language-specific safety drifts with adversarial compliance rates rising up to four-fold, decoupled from capability gains.
-
ToxiREX: A Dataset on Toxic REasoning in ConteXt
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
-
Can LLMs Reliably Self-Report Adversarial Prefills, and How?
No tested LLM reliably self-reports adversarial prefill attacks on its outputs; introspective signals are largely refusal-mediated, probe-dependent, and only partially improvable by targeted training.
-
The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs
Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.
-
LLM Self-Recognition: Steering and Retrieving Activation Signatures
Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
-
Steering Vectors are an Adversarial Attack Surface
Poisoning 4-6% of tokens in activation steering datasets produces vectors that jailbreak LLMs with 20-55% attack success rate while preserving benign steering effects.
-
Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models
ALMs encode audio evidence but override it with text in conflicts; GACL interpolates joint and same-audio scores to repair reversals, gaining 17.8 nAUC points under a 5pp faithfulness budget.
-
Auditing CoT Answer-Hijack Patches: Source-Control Certificates with Type-I Guarantees
Introduces source-control certificates with Type-I guarantees and a sample-complexity bound for auditing clean-source activation patches on Qwen2.5-7B and Llama3-8B for GSM8K/MATH-500 CoT hijacks.
-
Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning
Hidden-Align adds an auxiliary loss to align hidden states of correct reasoning paths at the pre-answer token in RLVR, improving pass@1 by 3.8-6.2 points over DAPO on eight math benchmarks for Qwen3 models of 1.7B-14B scale.
-
HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models
HARVE removes the component of the reward-head vector aligned with a multi-directional hacking subspace from residual streams using a small set of contrastive examples, improving robustness on RewardHackBench across eight models without fine-tuning while preserving general capability.
-
Tracking the Behavioral Trajectories of Adapting Agents
A linear model learns trait vectors in embedding space from labeled before-after skill file diffs, achieving 91.2% accuracy and 0.82 Spearman correlation for detecting propensity to seek sensitive data.
-
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.
-
Consistency Training while Mitigating Obfuscation via Rate Matching
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
-
CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models
CANARY detects 1% fine-tuning contamination with AUROC 1.000 using SAE-filtered hidden states, 7.5x below output-level detection thresholds, with zero false positives on benign tuning.
-
Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial
A 2x2 factorial experiment on Qwen3.5-4B shows that relational structure and first-person register interact to drive behavioral persistence after functional collapse, while attention tracks lexical surprise and emotion probes track structure alone.
-
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.
-
Closed-Loop Neural Activation Control in Vision-Language-Action Models
CTRL-STEER applies PID or RL-based feedback control to adaptively steer motion-aligned residual directions in VLA models, yielding more stable regulation and better task success on LIBERO benchmarks than fixed steering.
-
A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization
An exposure-based split on BLiMP data reveals delayed generalization in five grammatical phenomena during LLM pre-training, with post-generalization shifts in concept vector predictiveness and attention patterns.
-
Dissecting the Black Box: Circuit-Level Analysis of LLM Vulnerability Detection
LLM vulnerability detection in Gemma-2-2b relies on sparse safety-detector circuits in early layers rather than direct vulnerability signatures, identified via circuit tracing and ablation on 472 C/C++ samples.
-
Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures
TLO is a logit-based diagnostic that visualizes temporal patterns of LLM jailbreak failures on a calibrated 2D plane, distinguishing attacks with identical ASR and enabling early stopping that reduces successful jailbreaks by more than half.
-
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.
-
When and How Long? The Readout-Mediator Angle in Temporal Reasoning
Linear probes recover day-of-year from LM activations for temporal reasoning but are orthogonal to the model's causal 4D subspace identified by DAS, with the angle matching the Haar-uniform random null, replicated across scales and families.
-
Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection
Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.
-
Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.
-
Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation
DIVE improves in-context vector distillation for medical report generation via decisive-token supervision on pathology terms and EOS plus state-conditioned dynamic steering, achieving top BLEU-4, ROUGE-L and RadGraph F1 on MIMIC-CXR and CheXpert Plus.
-
Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation
AOD isolates hallucination signals in LVLM representations with an adversarial minimax objective and uses dual-forward contrastive decoding to reduce hallucinations while preserving utility.
-
Representation Without Control: Testing the Realization Effect in Language Models
LLMs display prompt-sensitive risk behavior and a linearly decodable realization-status signal in Gemma's residual stream, yet activation steering along this direction fails to shift downstream risk choices.
-
Universal Boosts, Specific Suppressors: Sparse Autoencoder Steering of Medical Vision-Language Models
Decoding-time per-token SAE steering improves clinical report quality for three radiology VLMs by 5.4-17% relative on composite metrics, with universal boost features and model-specific suppressors, and zero-shot transfer to IU-Xray.