Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, James Campbell, Long Phan, Phillip Guo, Richard Ren, Sarah Chen · 2023 · cs.LG · arXiv 2310.01405

Mixed citation behavior. Most common role is background (62%).

224 Pith papers citing it

Background 62% of classified citations

abstract

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
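As a rough illustration of what "monitoring and manipulating" population-level representations can look like, the sketch below reads a concept direction from two sets of activations and then shifts a new activation along it. This is a minimal toy, assuming a difference-of-means direction over hand-made vectors; it is not the authors' implementation, and every function name, variable, and number in it is hypothetical.

```python
# Toy sketch (assumption): difference-of-means "reading" and additive "control".
# The activations below are made-up 3-d vectors, not real model hidden states.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to read")
    return [d / norm for d in diff]

def read_score(activation, direction):
    """Monitoring: project an activation onto the concept direction."""
    return sum(a * d for a, d in zip(activation, direction))

def steer(activation, direction, alpha):
    """Manipulation: shift an activation by alpha along the concept direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Hypothetical activations collected under contrasting prompts.
honest = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
dishonest = [[-1.0, 0.2, 0.0], [-0.8, 0.1, 0.1]]

direction = concept_direction(honest, dishonest)
sample = [-0.5, 0.3, 0.0]
before = read_score(sample, direction)
after = read_score(steer(sample, direction, 2.0), direction)
print(before, after)
```

Because the direction has unit length, steering with `alpha` raises the read-out score by `alpha`. In a real model the vectors would be hidden states taken from a chosen layer, and the choice of layer and scale would need to be tuned empirically.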

citation-role summary

background 17 · baseline 2 · method 2

citation-polarity summary

background 13 · unclear 3 · baseline 2 · use method 2 · support 1

representative citing papers

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 8.0

A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

The Linear Representation Hypothesis and the Geometry of Large Language Models

cs.CL · 2023-11-07 · conditional · novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

Search for Truth from Reasoning: A Dynamic Representation Editing Framework for Steering LLM Trajectories

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

cs.LG · 2026-06-13 · unverdicted · novelty 7.0

Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

SV-Detect: AI-generated Text Detection with Steering Vectors

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

Steering vectors from frozen LM layers enable a lightweight classifier to detect machine-generated text robustly across domains, source models, and editing attacks.

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

STRIDE formulates TDA as sparse recovery using steering operators that mimic subset training effects in activation space, claiming SOTA LLM pre-training attribution at 13x prior speed.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-benchmark settings.

Decomposing how prompting steers behavior

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

MENTIS applies layerwise covariance torsion (T1), spectral torsion (T2), and ERA localization to paired IT/PA 7-8B models, finding selective larger shifts for normative concepts, negative correlation with entropy, and mid-to-late layer peaks.

Subliminal Learning Is Steering Vector Distillation

cs.AI · 2026-05-31 · unverdicted · novelty 7.0

Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Agent-native LLMs are substantially more vulnerable to adversarial instructions arriving in tool descriptions than user messages (with the pattern reversing for general-purpose models and inverting again for tool outputs), as quantified by the new Safety Asymmetry Score across six models and three a

Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

Post-hoc truncation of the tail of the SVD of ΔW reduces spurious-group gaps by up to 5× with <2 pp accuracy loss across 0.5B–7B models and four benchmarks.

How's it going? Reinforcement learning in language models recruits a functional welfare axis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

cs.LG · 2026-05-24 · unverdicted · novelty 7.0

Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.

Memory-Induced Tool-Drift in LLM Agents

cs.CR · 2026-05-24 · unverdicted · novelty 7.0

Biased long-term memories in LLM agents cause measurable deviations in tool parameters across 105 scenarios, seven models, and 608 real tools, persisting under standard memory architectures.

Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.

Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.

citing papers explorer

Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · none · ref 17 · internal anchor

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

cs.AI · 2026-05-12 · unverdicted · none · ref 69 · internal anchor

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
