super hub Mixed citations

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, James Campbell, Long Phan, Phillip Guo, Richard Ren, Sarah Chen · 2023 · cs.LG · arXiv 2310.01405

Mixed citation behavior. Most common role is background (62%).

273 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 273 citing papers more from Andy Zou arXiv PDF

abstract

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 17 baseline 2 method 2

citation-polarity summary

background 13 unclear 3 baseline 2 use method 2 support 1

claims ledger

abstract In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and con

authors

Andy Zou James Campbell Long Phan Phillip Guo Richard Ren Sarah Chen

co-cited works

representative citing papers

Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders

quant-ph · 2026-07-01 · unverdicted · novelty 8.0

Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.

Rift: A Conflict Signature for Deception in Language Models

cs.LG · 2026-06-15 · conditional · novelty 8.0

Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

cs.LG · 2026-06-14 · unverdicted · novelty 8.0

Fine-tuning updates frequently stale activation monitors for language model safety while quantization does not, with degradation predictable and repairable via label-free realignment.

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 8.0

A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

The Linear Representation Hypothesis and the Geometry of Large Language Models

cs.CL · 2023-11-07 · conditional · novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

The Topology of Ill-Posed Questions: Persistent Homology for Detection and Steering in LLMs

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

Zero-dimensional persistent homology on transformer layer hidden states yields three descriptors per layer whose concatenation improves ill-posedness classification and enables topology-conditioned activation steering across three LLMs.

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

Replay pairing shows LLM agents do not persist plans in hidden states but rely on plans remaining in context, with rapid signal decay and task performance drops when plans are evicted.

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

Hidden-state convergence at step 4 predicts behavioral consistency in LLM agents on QA tasks (r=-0.35 to -0.83), enabling AUROC 0.97 detection of inconsistent trajectories but not improving accuracy on harder benchmarks.

When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

High AUC from linear probes on model activations for indirect prompt injection does not license an unqualified claim of malicious-content detection, per a Qwen2.5-VL-7B case study with text and visual controls.

Channel Location Constrains the Auditability of Subliminal Learning

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Auditability of subliminal learning is constrained by channel location, with initialization-dependent body channels allowing pre-training screens while vocabulary geometry and conditional body channels evade them.

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

Difference-in-means activation directions detect and mitigate emergent misalignment from insecure code fine-tuning across four LLM families, with effective within-model steering but non-specific cross-model transfer.

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.

FloatDoor: Platform-Triggered Backdoors in LLMs

cs.CR · 2026-06-17 · unverdicted · novelty 7.0

FloatDoor uses two LoRA adapters to create the first input-independent backdoor that triggers adversary-chosen behavior only on a target platform while remaining benign elsewhere.

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

cs.LG · 2026-06-13 · unverdicted · novelty 7.0

Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

cs.SD · 2026-06-09 · unverdicted · novelty 7.0

Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.

When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.

citing papers explorer

Showing 50 of 273 citing papers.

Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders quant-ph · 2026-07-01 · unverdicted · none · ref 62 · internal anchor
Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.
Rift: A Conflict Signature for Deception in Language Models cs.LG · 2026-06-15 · conditional · none · ref 4 · internal anchor
Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.
Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness cs.LG · 2026-06-14 · unverdicted · none · ref 108 · internal anchor
Fine-tuning updates frequently stale activation monitors for language model safety while quantization does not, with degradation predictable and repairable via label-free realignment.
Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation cs.CV · 2026-06-03 · unverdicted · none · ref 58 · internal anchor
A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models cs.CR · 2026-05-14 · conditional · none · ref 56 · internal anchor
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
SLAM: Structural Linguistic Activation Marking for Language Models cs.CL · 2026-05-06 · unverdicted · none · ref 30 · internal anchor
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens cs.LG · 2026-04-03 · accept · none · ref 31 · internal anchor
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
The Linear Representation Hypothesis and the Geometry of Large Language Models cs.CL · 2023-11-07 · conditional · none · ref 28 · internal anchor
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models cs.LG · 2026-06-30 · unverdicted · none · ref 17 · internal anchor
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
The Topology of Ill-Posed Questions: Persistent Homology for Detection and Steering in LLMs cs.AI · 2026-06-22 · unverdicted · none · ref 13 · internal anchor
Zero-dimensional persistent homology on transformer layer hidden states yields three descriptors per layer whose concatenation improves ill-posedness classification and enables topology-conditioned activation steering across three LLMs.
Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents cs.AI · 2026-06-22 · unverdicted · none · ref 19 · internal anchor
Replay pairing shows LLM agents do not persist plans in hidden states but rely on plans remaining in context, with rapid signal decay and task performance drops when plans are evicted.
When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents cs.AI · 2026-06-22 · unverdicted · none · ref 17 · internal anchor
Hidden-state convergence at step 4 predicts behavioral consistency in LLM agents on QA tasks (r=-0.35 to -0.83), enabling AUROC 0.97 detection of inconsistent trajectories but not improving accuracy on harder benchmarks.
When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents cs.LG · 2026-06-22 · unverdicted · none · ref 55 · internal anchor
High AUC from linear probes on model activations for indirect prompt injection does not license an unqualified claim of malicious-content detection, per a Qwen2.5-VL-7B case study with text and visual controls.
Channel Location Constrains the Auditability of Subliminal Learning cs.LG · 2026-06-20 · unverdicted · none · ref 6 · internal anchor
Auditability of subliminal learning is constrained by channel location, with initialization-dependent body channels allowing pre-training screens while vocabulary geometry and conditional body channels evade them.
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families cs.CL · 2026-06-18 · unverdicted · none · ref 12 · internal anchor
Difference-in-means activation directions detect and mitigate emergent misalignment from insecure code fine-tuning across four LLM families, with effective within-model steering but non-specific cross-model transfer.
Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation cs.LG · 2026-06-17 · unverdicted · none · ref 41 · internal anchor
10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.
FloatDoor: Platform-Triggered Backdoors in LLMs cs.CR · 2026-06-17 · unverdicted · none · ref 36 · internal anchor
FloatDoor uses two LoRA adapters to create the first input-independent backdoor that triggers adversary-chosen behavior only on a target platform while remaining benign elsewhere.
Size Doesn't Matter: Cosine-Scored Sparse Autoencoders cs.LG · 2026-06-13 · unverdicted · none · ref 7 · internal anchor
Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.
Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models cs.SD · 2026-06-09 · unverdicted · none · ref 22 · internal anchor
Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.
Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering cs.CL · 2026-06-09 · unverdicted · none · ref 51 · internal anchor
FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.
When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis cs.CL · 2026-06-09 · unverdicted · none · ref 40 · internal anchor
Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment cs.AI · 2026-06-09 · unverdicted · none · ref 53 · internal anchor
Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.
Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior cs.LG · 2026-06-07 · unverdicted · none · ref 33 · internal anchor
INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.
SV-Detect: AI-generated Text Detection with Steering Vectors cs.CL · 2026-06-05 · unverdicted · none · ref 1 · internal anchor
Steering vectors from frozen LM layers enable a lightweight classifier to detect machine-generated text robustly across domains, source models, and editing attacks.
Adversarial Robustness of Activation Steering in Large Language Models cs.LG · 2026-06-05 · unverdicted · none · ref 11 · internal anchor
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Model cs.AI · 2026-06-04 · unverdicted · none · ref 34 · internal anchor
Introduces a layered intervention framework for knowledge infusion in multimodal generative models and empirically demonstrates complementarity of layers in a safety-alignment task with diffusion models.
OPRD: On-Policy Representation Distillation cs.LG · 2026-06-04 · unverdicted · none · ref 49 · internal anchor
OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.
STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations cs.LG · 2026-06-03 · unverdicted · none · ref 16 · internal anchor
STRIDE formulates TDA as sparse recovery using steering operators that mimic subset training effects in activation space, claiming SOTA LLM pre-training attribution at 13x prior speed.
Toward Calibrated, Fair, and accurate Deepfake Detection cs.LG · 2026-06-03 · unverdicted · none · ref 52 · internal anchor
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning cs.LG · 2026-06-02 · unverdicted · none · ref 9 · internal anchor
Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-benchmark settings.
Decomposing how prompting steers behavior cs.AI · 2026-06-02 · unverdicted · none · ref 2 · internal anchor
A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.
MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models cs.CL · 2026-05-31 · unverdicted · none · ref 9 · internal anchor
MENTIS applies layerwise covariance torsion (T1), spectral torsion (T2), and ERA localization to paired IT/PA 7-8B models, finding selective larger shifts for normative concepts, negative correlation with entropy, and mid-to-late layer peaks.
Subliminal Learning Is Steering Vector Distillation cs.AI · 2026-05-31 · unverdicted · none · ref 42 · internal anchor
Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.
Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models cs.LG · 2026-05-30 · unverdicted · none · ref 10 · internal anchor
Agent-native LLMs are substantially more vulnerable to adversarial instructions arriving in tool descriptions than user messages (with the pattern reversing for general-purpose models and inverting again for tool outputs), as quantified by the new Safety Asymmetry Score across six models and three a
Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates cs.LG · 2026-05-29 · unverdicted · none · ref 19 · internal anchor
Post-hoc truncation of the tail of the SVD of ΔW reduces spurious-group gaps by up to 5× with <2 pp accuracy loss across 0.5B–7B models and four benchmarks.
How's it going? Reinforcement learning in language models recruits a functional welfare axis cs.LG · 2026-05-28 · unverdicted · none · ref 34 · internal anchor
Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability cs.LG · 2026-05-24 · unverdicted · none · ref 23 · internal anchor
Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.
Memory-Induced Tool-Drift in LLM Agents cs.CR · 2026-05-24 · unverdicted · none · ref 44 · internal anchor
Biased long-term memories in LLM agents cause measurable deviations in tool parameters across 105 scenarios, seven models, and 608 real tools, persisting under standard memory architectures.
Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol cs.LG · 2026-05-23 · unverdicted · none · ref 14 · internal anchor
Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.
Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m cs.LG · 2026-05-23 · unverdicted · none · ref 14 · internal anchor
Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.
Causal Physics Steering in Video World Models via Concept Activation Vectors cs.CV · 2026-05-23 · unverdicted · none · ref 23 · internal anchor
Physics steering uses CAVs from PEZ-layer probes to directionally shift VideoMAE's physical expectations on IntPhys, with effects localized to the emergence zone and distinct from motion encoding.
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions cs.CL · 2026-05-22 · unverdicted · none · ref 100 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs cs.CL · 2026-05-22 · unverdicted · none · ref 10 · internal anchor
Persona and task in role prompts decompose additively into orthogonal directions at the prompt-to-answer transition in LLM residual streams, but this local structure does not allow compressing the prompt into a single cached residual vector because generation depends on distributed attention to the原
The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering cs.LG · 2026-05-20 · conditional · none · ref 28 · internal anchor
VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing cs.LG · 2026-05-18 · unverdicted · none · ref 5 · internal anchor
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers cs.LG · 2026-05-17 · unverdicted · none · ref 50 · internal anchor
FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.
Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space cs.LG · 2026-05-15 · unverdicted · none · ref 12 · internal anchor
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
Dynamic Latent Routing cs.LG · 2026-05-14 · unverdicted · none · ref 55 · internal anchor
Dynamic Latent Routing jointly learns discrete latent codes, routing policies, and model parameters via dynamic search to match or exceed supervised fine-tuning by 6.6 points on average in low-data settings across four datasets and six models.
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use cs.AI · 2026-05-13 · unverdicted · none · ref 38 · 2 links · internal anchor
Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.

Representation Engineering: A Top-Down Approach to AI Transparency

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer