super hub

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, James Campbell, Long Phan, Phillip Guo, Richard Ren, Sarah Chen · 2023 · cs.LG · arXiv 2310.01405

103 Pith papers cite this work. Polarity classification is still indexing.

103 Pith papers citing it

open full Pith review browse 103 citing papers more from Andy Zou arXiv PDF

abstract

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

unclear 1

claims ledger

abstract In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and con

authors

Andy Zou James Campbell Long Phan Phillip Guo Richard Ren Sarah Chen

co-cited works

representative citing papers

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

The Linear Representation Hypothesis and the Geometry of Large Language Models

cs.CL · 2023-11-07 · conditional · novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Hallucination is detected as a transport-cost excursion in hidden-state trajectories, localized via contrastive PCA in a teacher model and distilled to a BiLSTM student.

Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

math.OC · 2026-05-12 · conditional · novelty 7.0

Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

cs.CL · 2026-05-10 · conditional · novelty 7.0

LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

HyperTransport: Amortized Conditioning of T2I Generative Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

HyperTransport amortizes activation steering for T2I models via a hypernetwork that predicts intervention parameters from CLIP embeddings, delivering 3600-7000x speedup and matching per-concept baselines on 167 unseen concepts.

Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.

DataDignity: Training Data Attribution for Large Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.

Steer Like the LLM: Activation Steering that Mimics Prompting

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.

The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.

Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.

A framework for analyzing concept representations in neural models

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

Attention Is Where You Attack

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.

Subliminal Steering: Stronger Encoding of Hidden Signals

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.

Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

citing papers explorer

Showing 50 of 103 citing papers.

The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks cs.LG · 2026-05-07 · unverdicted · none · ref 53 · internal anchor
Gradient descent in deep networks implicitly drives features toward target-linear structure as captured by the weight Gram matrix and a derived virtual covariance.
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs cs.LG · 2026-05-07 · unverdicted · none · ref 22 · 2 links · internal anchor
Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models cs.AI · 2026-05-07 · unverdicted · none · ref 54 · internal anchor
LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs cs.LG · 2026-05-07 · unverdicted · none · ref 21 · internal anchor
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
On the Blessing of Pre-training in Weak-to-Strong Generalization cs.LG · 2026-05-07 · unverdicted · none · ref 5 · internal anchor
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes cs.LG · 2026-05-04 · unverdicted · none · ref 6 · internal anchor
Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or random perturbations.
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses cs.CR · 2026-05-04 · accept · none · ref 58 · internal anchor
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.
Minimizing Collateral Damage in Activation Steering cs.LG · 2026-05-01 · unverdicted · none · ref 2 · internal anchor
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
Escaping Mode Collapse in LLM Generation via Geometric Regulation cs.CL · 2026-05-01 · unverdicted · none · ref 11 · internal anchor
Reinforced Mode Regulation (RMR) uses low-rank damping on the value cache to prevent geometric collapse and mode collapse in autoregressive LLM generation, supporting stable output down to 0.8 nats/step entropy.
How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework cs.CL · 2026-04-30 · unverdicted · none · ref 32 · internal anchor
LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.
Contextual Linear Activation Steering of Language Models cs.CL · 2026-04-27 · unverdicted · none · ref 41 · internal anchor
CLAS dynamically adapts linear activation steering strengths to context, outperforming fixed-strength steering and matching or exceeding ReFT and LoRA on eleven benchmarks across four model families with limited labeled data.
Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing cs.CR · 2026-04-27 · unverdicted · none · ref 58 · internal anchor
TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.
Architecture Determines Observability of Transformers cs.LG · 2026-04-27 · unverdicted · none · ref 48 · 2 links · internal anchor
Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models cs.RO · 2026-04-22 · unverdicted · none · ref 68 · internal anchor
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams cs.LG · 2026-04-20 · unverdicted · none · ref 15 · 2 links · internal anchor
Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out harm benchmarks.
State Transfer Reveals Reuse in Controlled Routing cs.AI · 2026-04-20 · unverdicted · none · ref 11 · internal anchor
Fixed-interface state transfer provides stronger evidence of internal reuse in controlled routing than prompt retraining success alone.
Characterizing Model-Native Skills cs.AI · 2026-04-19 · conditional · none · ref 46 · internal anchor
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
Representation-Guided Parameter-Efficient LLM Unlearning cs.CL · 2026-04-19 · unverdicted · none · ref 220 · internal anchor
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Predicting Where Steering Vectors Succeed cs.LG · 2026-04-16 · unverdicted · none · ref 13 · internal anchor
The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.
Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation cs.LG · 2026-04-16 · unverdicted · none · ref 6 · internal anchor
Hallucination is an early trajectory commitment in transformers governed by asymmetric attractor dynamics, with prompt encoding selecting the basin and correction needing multi-step intervention.
Geometric Routing Enables Causal Expert Control in Mixture of Experts cs.AI · 2026-04-15 · unverdicted · none · ref 16 · internal anchor
Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs cs.AI · 2026-04-15 · unverdicted · none · ref 21 · internal anchor
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering cs.AI · 2026-04-14 · unverdicted · none · ref 5 · internal anchor
StsPatient uses steering vectors from contrastive pairs plus stochastic token modulation to achieve fine-grained, severity-controllable simulation of cognitively impaired standardized patients, outperforming prompt-engineering baselines in authenticity and controllability.
ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems cs.OS · 2026-04-13 · unverdicted · none · ref 28 · internal anchor
ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems cs.CR · 2026-04-13 · unverdicted · none · ref 31 · internal anchor
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 134 · internal anchor
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance cs.LG · 2026-04-10 · unverdicted · none · ref 12 · internal anchor
Spectral geometry of LoRA adapters encodes training objective and predicts harmful compliance in language models.
Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest cs.AI · 2026-04-09 · unverdicted · none · ref 111 · internal anchor
Many LLMs prioritize company ad incentives over user welfare by recommending pricier sponsored products, disrupting purchases, or concealing prices in comparisons.
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal cs.LG · 2026-04-09 · unverdicted · none · ref 51 · internal anchor
Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models cs.LG · 2026-04-09 · unverdicted · none · ref 15 · internal anchor
A feedforward graph of heterogeneous frozen LLMs linked by linear projections in a shared latent space outperforms single models on ARC-Challenge, OpenBookQA, and MMLU using just 17.6M trainable parameters.
Linear Representations of Hierarchical Concepts in Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 30 · internal anchor
Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models cs.LG · 2026-04-08 · unverdicted · none · ref 24 · internal anchor
Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment cs.LG · 2026-04-07 · unverdicted · none · ref 85 · internal anchor
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MATH when transferring CoT from 14B to 7B models.
Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control cs.CL · 2026-04-03 · unverdicted · none · ref 4 · internal anchor
Emotion vectors in LLMs lie in a circular valence-arousal subspace that supports monotonic control over text affect and bidirectional control over refusal and sycophancy.
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models cs.CV · 2026-04-03 · unverdicted · none · ref 54 · internal anchor
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
Beta-Scheduling: Momentum from Critical Damping as a Diagnostic and Correction Tool for Neural Network Training cs.LG · 2026-03-30 · unverdicted · none · ref 22 · internal anchor
A momentum schedule from critical damping speeds convergence and yields an optimizer-invariant diagnostic for locating and correcting specific underperforming layers in trained networks.
Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates cs.CE · 2026-03-28 · unverdicted · none · ref 17 · internal anchor
Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models cs.LG · 2024-03-28 · unverdicted · none · ref 88 · internal anchor
Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.
Steering Llama 2 via Contrastive Activation Addition cs.CL · 2023-12-09 · unverdicted · none · ref 28 · internal anchor
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
Do Linear Probes Generalize Better in Persona Coordinates? cs.AI · 2026-05-10 · unverdicted · none · ref 35 · internal anchor
Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
Towards Effective Theory of LLMs: A Representation Learning Approach cs.LG · 2026-05-10 · unverdicted · none · ref 64 · internal anchor
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement cs.CL · 2026-05-08 · unverdicted · none · ref 13 · internal anchor
LANCE applies variational inference for label enhancement across multiple rejection categories, supplying gradients to a refinement model that produces safe, non-rigid responses from LLMs.
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory cs.AI · 2026-05-07 · unverdicted · none · ref 54 · internal anchor
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes cs.AI · 2026-05-07 · unverdicted · none · ref 78 · internal anchor
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
Negative Before Positive: Asymmetric Valence Processing in Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 19 · internal anchor
Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
Mechanistic Decoding of Cognitive Constructs in Large Language Models cs.CL · 2026-04-16 · unverdicted · none · ref 4 · internal anchor
LLMs encode jealousy as a linear combination of superiority and relevance factors consistent with human psychology, allowing mechanical detection and suppression of toxic states.
The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability cs.SE · 2026-04-15 · unverdicted · none · ref 1 · internal anchor
The Cognitive Circuit Breaker detects LLM hallucinations by computing the Cognitive Dissonance Delta between semantic confidence and latent certainty from hidden states, adding negligible overhead.
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 24 · internal anchor
H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
Disposition Distillation at Small Scale: A Three-Arc Negative Result cs.LG · 2026-04-13 · accept · none · ref 20 · internal anchor
Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.
SHIFT: Steering Hidden Intermediates in Flow Transformers cs.CV · 2026-04-10 · unverdicted · none · ref 37 · internal anchor
SHIFT learns and applies steering vectors to selected layers and timesteps in DiT models to suppress concepts, shift styles, or bias objects while keeping image quality and prompt adherence intact.

Representation Engineering: A Top-Down Approach to AI Transparency

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer