pith. sign in

hub Mixed citations

Open Problems in Mechanistic Interpretability

Mixed citation behavior. Most common role is background (67%).

42 Pith papers citing it
Background 67% of classified citations
abstract

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

hub tools

citation-role summary

background 4 method 1 other 1

citation-polarity summary

years

2026 34 2025 8

clear filters

representative citing papers

Disentanglement Beyond Generative Models with Riemannian ICA

cs.LG · 2026-05-21 · unverdicted · novelty 8.0

RICA replaces ICA's global generative model with local Riemannian geometry, introducing a disentanglement tensor based on the Hessian of the log-likelihood and Ricci curvature to measure pointwise disentanglement, which recovers sources across manifolds in controlled tests.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

cs.LG · 2026-06-04 · conditional · novelty 7.0

SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.

fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

cs.LG · 2026-05-10 · conditional · novelty 7.0

fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent latents than standard crosscoders on GPT2-Small, Pythia, and Gemma2 models.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.

Linear-Readout Floors and Threshold Recovery in Computation in Superposition

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contradiction via distinct readout invariants.

Diverse Dictionary Learning

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

Diverse dictionary learning identifies intersections, complements, and dependency structures of latent variables from data X = g(Z) up to indeterminacies, and full identifiability when structural diversity is sufficient.

Bilinear autoencoders find interpretable manifolds

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation assumptions.

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

citing papers explorer

Showing 8 of 8 citing papers after filters.

  • Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary cs.LG · 2025-09-27 · unverdicted · none · ref 17 · 2 links · internal anchor

    Defines Decision Potential Surface (DPS) whose zero isohypse equals an LLM decision boundary and supplies a K-sample approximation algorithm with derived upper bounds on absolute, expected, and concentration errors.

  • Task complexity shapes internal representations and robustness in neural networks cs.LG · 2025-08-07 · unverdicted · none · ref 57 · internal anchor

    Harder classification tasks produce neural representations whose accuracy collapses under binarization and shuffling while easier tasks remain robust, defining task complexity via the performance gap between full-precision and perturbed networks.

  • Hypothesis-Driven Feature Manifold Analysis in LLMs via Supervised Multi-Dimensional Scaling cs.AI · 2025-10-01 · unverdicted · none · ref 7 · internal anchor

    SMDS analysis finds that temporal reasoning features in LLMs form distinct manifolds such as circles, lines, and clusters that reflect semantics, remain stable across models, support reasoning, and reshape with context.

  • Feature Identification via the Empirical NTK cs.LG · 2025-10-01 · unverdicted · none · ref 13 · internal anchor

    Eigenanalysis of the empirical NTK surfaces feature directions that align with Fourier features in modular addition networks and grammatical features in Gemma-3-270M, outperforming PCA baselines on activations.

  • ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack cs.AI · 2025-09-30 · unverdicted · none · ref 5 · internal anchor

    ASGuard identifies tense-vulnerable attention heads via circuit analysis, trains precise scaling vectors on those activations, and applies them in preventative fine-tuning to reduce targeted jailbreaking success across four LLMs with minimal impact on utility or over-refusal.

  • Do Activation Verbalization Methods Convey Privileged Information? cs.CL · 2025-09-16 · unverdicted · none · ref 47 · internal anchor

    Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.

  • Internal Deployment in the AI Act cs.CY · 2025-12-05 · unverdicted · none · ref 11 · internal anchor

    Interpretations of Articles 2(1), 2(6), and 2(8) of the AI Act support applying the regulation to internal AI deployment while allowing for R&D exceptions, with the provisions viewed as complementary.

  • Mechanistic Interpretability Needs Philosophy cs.CL · 2025-06-23 · unverdicted · none · ref 24 · internal anchor

    The paper claims that mechanistic interpretability needs philosophy as a partner to clarify concepts, refine methods, and navigate epistemic and ethical complexities in AI systems.