hub

Open Problems in Mechanistic Interpretability

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq · 2025 · cs.LG · arXiv 2501.16496

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

open full Pith review browse 22 citing papers arXiv PDF

abstract

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0

WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

cs.LG · 2026-05-10 · conditional · novelty 7.0

fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent latents than standard crosscoders on GPT2-Small, Pythia, and Gemma2 models.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

Linear-Readout Floors and Threshold Recovery in Computation in Superposition

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contradiction via distinct readout invariants.

Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts

cs.AI · 2026-05-01 · unverdicted · novelty 7.0

Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.

Diverse Dictionary Learning

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

Diverse dictionary learning identifies intersections, complements, and dependency structures of latent variables from data X = g(Z) up to indeterminacies, and full identifiability when structural diversity is sufficient.

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

cs.LG · 2026-04-13 · unverdicted · novelty 7.0 · 2 refs

The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

Bilinear autoencoders find interpretable manifolds

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation assumptions.

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Understanding the Mechanism of Altruism in Large Language Models

econ.GN · 2026-04-21 · unverdicted · novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

Stylistic rewrites of harmful prompts raise attack success rates from 3.84% to 36.8-65% across 31 frontier models, indicating weak generalization in safety refusals.

Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

cs.CL · 2026-04-06 · unverdicted · novelty 6.0

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

cs.LG · 2026-04-04 · unverdicted · novelty 6.0

A functional similarity metric for ReLU networks uses normalized activation region signatures and MinHash to overcome parametric symmetries like neuron permutation and scaling.

Metaphor Is Not All Attention Needs

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

Poetic jailbreaks succeed because they induce distinct attention patterns in LLMs that are independent of harmful-content detection, not because models fail to recognize literary formatting.

Do Linear Probes Generalize Better in Persona Coordinates?

cs.AI · 2026-05-10 · unverdicted · novelty 5.0

Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

High-Dimensional Statistics: Reflections on Progress and Open Problems

math.ST · 2026-05-06 · unverdicted · novelty 2.0

A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

citing papers explorer

Showing 22 of 22 citing papers.

Tracing Persona Vectors Through LLM Pretraining cs.CL · 2026-05-13 · unverdicted · none · ref 4 · internal anchor
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 68 · internal anchor
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery cs.LG · 2026-05-10 · conditional · none · ref 3 · internal anchor
fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent latents than standard crosscoders on GPT2-Small, Pythia, and Gemma2 models.
From Mechanistic to Compositional Interpretability cs.LG · 2026-05-09 · unverdicted · none · ref 59 · internal anchor
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior cs.LG · 2026-05-06 · unverdicted · none · ref 80 · internal anchor
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
Linear-Readout Floors and Threshold Recovery in Computation in Superposition cs.LG · 2026-05-02 · unverdicted · none · ref 15 · internal anchor
Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contradiction via distinct readout invariants.
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts cs.AI · 2026-05-01 · unverdicted · none · ref 84 · internal anchor
Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.
Diverse Dictionary Learning cs.LG · 2026-04-19 · unverdicted · none · ref 11 · internal anchor
Diverse dictionary learning identifies intersections, complements, and dependency structures of latent variables from data X = g(Z) up to indeterminacies, and full identifiability when structural diversity is sufficient.
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts cs.LG · 2026-04-13 · unverdicted · none · ref 9 · 2 links · internal anchor
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space cs.CL · 2026-05-12 · unverdicted · none · ref 128 · internal anchor
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
Bilinear autoencoders find interpretable manifolds cs.LG · 2026-05-09 · unverdicted · none · ref 54 · internal anchor
Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation assumptions.
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis cs.AI · 2026-05-05 · unverdicted · none · ref 28 · internal anchor
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.
Compared to What? Baselines and Metrics for Counterfactual Prompting cs.CL · 2026-05-01 · conditional · none · ref 69 · internal anchor
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Understanding the Mechanism of Altruism in Large Language Models econ.GN · 2026-04-21 · unverdicted · none · ref 230 · internal anchor
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety cs.CL · 2026-04-20 · unverdicted · none · ref 15 · internal anchor
Stylistic rewrites of harmful prompts raise attack success rates from 3.84% to 36.8-65% across 31 frontier models, indicating weak generalization in safety refusals.
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space cs.CL · 2026-04-06 · unverdicted · none · ref 58 · internal anchor
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
Functional Similarity Metric for Neural Networks: Overcoming Parametric Ambiguity via Activation Region Analysis cs.LG · 2026-04-04 · unverdicted · none · ref 31 · internal anchor
A functional similarity metric for ReLU networks uses normalized activation region signatures and MinHash to overcome parametric symmetries like neuron permutation and scaling.
Metaphor Is Not All Attention Needs cs.CL · 2026-05-12 · unverdicted · none · ref 32 · internal anchor
Poetic jailbreaks succeed because they induce distinct attention patterns in LLMs that are independent of harmful-content detection, not because models fail to recognize literary formatting.
Do Linear Probes Generalize Better in Persona Coordinates? cs.AI · 2026-05-10 · unverdicted · none · ref 21 · internal anchor
Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 63 · internal anchor
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
High-Dimensional Statistics: Reflections on Progress and Open Problems math.ST · 2026-05-06 · unverdicted · none · ref 79 · internal anchor
A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 164 · internal anchor
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

Open Problems in Mechanistic Interpretability

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer