pith. machine review for the scientific record.

arxiv: 2605.05115 · v1 · submitted 2026-05-06 · 💻 cs.LG

Recognition: unknown

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

Atticus Geiger, Can Rager, Daniel Wurgaft, Ekdeep Singh Lubana, Eric Bigelow, Jack Merullo, Matthew Kowal, Noah Goodman, Owen Lewis, Raphael Sarfati, Sheridan Feucht, Tal Haklay, Thomas Fel, Thomas McGrath, Usha Bhalla, Vasudev Shyam

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:40 UTC · model grok-4.3

classification 💻 cs.LG
keywords manifold steering · neural network geometry · activation space interventions · representation and behavior · steering language models · world models

The pith

Interventions along activation manifolds produce natural behavioral trajectories while linear ones do not.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the geometric structure of neural representations causally shapes behavior by comparing different ways of intervening in activation space. It fits one manifold to the model's internal representations and another to its output behaviors, then steers the model along paths on the representation manifold. This manifold steering produces output changes that follow the behavior manifold, whereas straight-line linear steering cuts through off-manifold regions and yields unnatural outputs. The reverse holds too: paths optimized to stay on the behavior manifold trace the curvature of the representation manifold. The pattern appears in language models on reasoning and in-context learning tasks and in a video world model on a physical-dynamics task.
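The manifold fits in Figures 2-3 are cubic splines over PCA-reduced activations and output distributions, which is concrete enough to sketch. A minimal sketch using scipy's parametric splines; the PCA dimensionality, smoothing value, and the periodic flag for cyclic concepts are illustrative assumptions, not the paper's reported settings.

```python
# Sketch: fit one-dimensional cubic-spline manifolds M_h (activations) and
# M_y (output distributions). Shapes and hyperparameters are assumed.
import numpy as np
from sklearn.decomposition import PCA
from scipy.interpolate import splprep, splev

def fit_spline_manifold(points, n_components=3, smoothing=0.0, periodic=False):
    """Fit a parametric cubic spline through `points` (n_samples, dim).
    `periodic=True` would suit cyclic concepts such as weekdays or months."""
    pca = PCA(n_components=n_components)
    low = pca.fit_transform(points)                    # (n_samples, n_components)
    tck, u = splprep(low.T, s=smoothing, per=int(periodic))
    return tck, u, pca

def sample_manifold(tck, n=200):
    """Densely sample the fitted spline as a point cloud along the manifold."""
    t = np.linspace(0.0, 1.0, n)
    return np.stack(splev(t, tck), axis=-1)            # (n, n_components)

# hidden_states: one activation vector per concept prompt (e.g., per weekday)
# output_dists: the model's output distribution over concept tokens per prompt
# tck_h, u_h, pca_h = fit_spline_manifold(hidden_states, periodic=True)
# tck_y, u_y, pca_y = fit_spline_manifold(output_dists, periodic=True)
```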

Core claim

Fitting manifolds M_h to hidden activations and M_y to output distributions reveals a bidirectional geometric link: steering along M_h induces trajectories along M_y, and optimizing interventions to follow M_y recovers paths that curve like M_h. Linear steering, by contrast, deviates from these manifolds and produces unnatural outputs. This demonstrates that representation geometry is not incidental but the structure that governs how interventions affect behavior.
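Figures 2 and 3 call this link an "approximate isometry," which suggests one hedged formalization: the model's readout, restricted to the activation manifold, approximately preserves intrinsic distances. The readout map $f$ and tolerance $\varepsilon$ below are our notation, not the paper's:

$$\big|\, d_{M_y}\!\big(f(h_1), f(h_2)\big) - d_{M_h}(h_1, h_2) \,\big| \le \varepsilon \quad \text{for all } h_1, h_2 \in M_h,$$

where $f$ sends a hidden state to the model's output distribution and $d_{M_h}$, $d_{M_y}$ are geodesic distances along the fitted manifolds.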

What carries the argument

Manifold steering, the method of intervening along paths defined by the fitted activation manifold M_h instead of assuming Euclidean straight lines.
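To make the contrast concrete, here is a hedged sketch of the two path constructions: a Euclidean chord between two anchor activations versus a walk along the fitted spline between the same endpoints. The step count and the hook that would patch each point into the model are assumptions.

```python
import numpy as np
from scipy.interpolate import splev

def linear_path(h_a, h_b, n_steps=16):
    """Standard linear steering: a straight chord from h_a to h_b."""
    alphas = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - alphas) * h_a + alphas * h_b         # (n_steps, dim)

def manifold_path(tck, u_a, u_b, n_steps=16):
    """Manifold steering: move along the fitted spline M_h between the
    spline parameters u_a, u_b of the two anchor activations."""
    ts = np.linspace(u_a, u_b, n_steps)
    return np.stack(splev(ts, tck), axis=-1)           # (n_steps, n_components)

# Each intermediate point would be patched into the model at the chosen layer
# (assumed hook mechanics), and the induced output distribution recorded as
# one step of the behavioral trajectory.
```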

Load-bearing premise

The manifolds fitted to activations and behaviors capture the true causal geometry connecting them, not just superficial patterns from the data used to fit them.

What would settle it

If linear interventions in activation space produced behavioral trajectories as natural as those from manifold steering on the tested tasks, the claim that manifold geometry is necessary would be falsified.

Figures

Figures reproduced from arXiv: 2605.05115 by Atticus Geiger, Can Rager, Daniel Wurgaft, Ekdeep Singh Lubana, Eric Bigelow, Jack Merullo, Matthew Kowal, Noah Goodman, Owen Lewis, Raphael Sarfati, Sheridan Feucht, Tal Haklay, Thomas Fel, Thomas McGrath, Usha Bhalla, Vasudev Shyam.

Figure 1: How do different geometries of activation space modulate behavior? We illustrate paths through activation space (left), each defined by a different geometry. Interventions along paths in activation space induce paths in behavior space (right, illustrated on a three-concept probability simplex). Euclidean: the standard approach of linear steering assumes a flat geometry and interventions follow a straight …

Figure 2: Approximate isometry between activation and behavior manifolds for cyclic concepts. Manifolds (cubic splines) fit to activation and behavior (i.e., output distributions over concept tokens) spaces of Llama 3.1 8B. The weekdays (a) and months (b) tasks consist of simple addition questions such as: What is four days after Monday? Both activation and behavior manifolds show cyclic structure (PCA visualizatio…

Figure 3: Approximate isometry between activation and behavior manifolds for sequential concepts. Manifolds (cubic splines) fit to activation and behavior (i.e., output distributions over concept tokens) spaces of Llama 3.1 8B. The letters (a) and ages (b) tasks consist of simple addition questions such as: What letter comes four letters after M? Both activation and behavior manifolds show sequential structure (PCA…

Figure 4: Manifold steering yields smooth and ordered behavioral transitions. Using simple addition tasks which require reasoning over structured concepts (e.g., What is four days after Monday?), we compare two steering strategies in activation space: standard linear steering, which takes direct paths, and manifold steering, which takes paths along a fitted activation manifold. The bottom panel shows example output …

Figure 5: Manifold steering and pullback yield coinciding trajectories in activation and behavior space. Going in the Activations→Behavior direction, we find that steering along the activation manifold M_h (black) produces paths that lie close to the behavior manifold M_y. We then examine the reverse direction, Activations←Behavior: we start with paths along the behavior manifold and optimize for corresponding paths i…

Figure 6: Manifold steering enables factored control in multi-dimensional conceptual spaces. (a) We examine manifold steering on multidimensional spaces using Park et al. (2025b)'s in-context learning of representations (ICLR) task. In an ICLR task, arbitrary tokens are assigned to nodes along a graph, and a language model is prompted with tokens from a random walk along the graph. Park et al. (2025b) showed that wi…

Figure 7: Manifold steering on a visual world model produces smooth movement. (a) We examine whether manifold steering can generalize to a visual modality by training a recurrent network on the Mountain Car environment (Moore, 1990; Sutton & Barto, 2018) to predict the next frame x_{t+1} given the previous frame and an action. (b) We test the mapping between the activation and behavior manifolds by computing on-manifo…

Figure 8: Recurrent visual world-model architecture. A convolutional encoder f_enc maps each 128 × 128 × 3 frame x_t to a layer-normalized latent z_t ∈ R^n with n = 64. The discrete action a_t ∈ {0, 1, 2} is mapped to a learned embedding e(a_t) ∈ R^16, concatenated with z_t, and fed to a GRU together with the previous hidden state h_{t−1}. A convolutional decoder f_dec produces a residual image from the resulting hidden state…

Figure 9: Results for in-context learning of representations on a …

Figure 10: (a) Activation and behavior space paths for the 5 × 5 Grid task and 9 × 9 Cylinder. Similarly to the addition tasks with known concepts, we find that the manifold steering paths closely follow the behavior manifold M_y. (b) Multidimensional scaling (MDS) embedding for linear and manifold distances in activation space and manifold distances in behavior space. As with the addition tasks with known concepts, …

Figure 12: Pullback from M_y recovers M_h in the visual world model. Left (Activation Space): PCA visualization of the encoder representations, showing the geometric path along M_h, the linear chord, and the pullback-optimized path π⋆ between endpoints p_A and p_B. Although initialized at the chord, π⋆ converges onto M_h, closing the spiral loop traced by the encoder geometry and becoming nearly indistinguishable from t…
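Figure 8 pins the world model down concretely enough to sketch. A minimal PyTorch rendering of that caption follows; the latent size n = 64, the 16-dimensional action embedding, the three-way discrete action, and the 128 × 128 × 3 frames come from the caption, while the convolutional widths, kernel sizes, and GRU hidden size are our assumptions.

```python
# Minimal sketch of the recurrent visual world model described in Figure 8.
# Latent size n=64, action embedding dim 16, and 128x128x3 frames are from the
# caption; conv widths, kernel sizes, and GRU hidden size are assumed.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, latent_dim=64, action_dim=16, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                  # f_enc: 128x128x3 -> z_t
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # -> 64x64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> 32x32
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> 16x16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
            nn.LayerNorm(latent_dim),                  # layer-normalized latent
        )
        self.action_emb = nn.Embedding(3, action_dim)  # a_t in {0, 1, 2}
        self.gru = nn.GRUCell(latent_dim + action_dim, hidden_dim)
        self.decoder = nn.Sequential(                  # f_dec: h_t -> residual
            nn.Linear(hidden_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, frame, action, h_prev):
        z = self.encoder(frame)                        # (B, 64)
        a = self.action_emb(action)                    # (B, 16)
        h = self.gru(torch.cat([z, a], dim=-1), h_prev)
        residual = self.decoder(h)                     # predicted frame delta
        return frame + residual, h                     # x_{t+1} prediction
```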
Original abstract

Neural representations carry rich geometric structure; but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold $M_h$ to representations and a behavior manifold $M_y$ to output probability distributions. We then test the link $M_h \leftrightarrow M_y$ via interventions: we find that steering along $M_h$, which we term manifold steering, yields behavioral trajectories that follow $M_y$, while linear steering -- which assumes a Euclidean geometry -- cuts through off-manifold regions and hence produces unnatural outputs. Moreover, optimizing interventions in activation space to produce paths along $M_y$ recovers activation trajectories that trace the curvature of $M_h$. We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities. In language models, we use reasoning tasks with cyclic and sequential geometries as well as in-context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics. Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals. This recasts the core problem of steering from finding the right direction to finding the right geometry.
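One operational reading of "behavioral trajectories that follow $M_y$" is the mean distance from each induced output distribution to its nearest point on the fitted behavior spline. A minimal sketch, assuming the spline helpers above; the dense-sampling nearest-point approximation is our choice of metric, not necessarily the paper's.

```python
import numpy as np
from scipy.interpolate import splev

def distance_to_manifold(trajectory, tck, n_dense=2000):
    """Mean distance from trajectory points (n_steps, d) to a fitted spline,
    approximated by densely sampling the spline and taking nearest neighbors."""
    ts = np.linspace(0.0, 1.0, n_dense)
    dense = np.stack(splev(ts, tck), axis=-1)          # (n_dense, d)
    d = np.linalg.norm(trajectory[:, None, :] - dense[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Comparing distance_to_manifold(linear_traj, tck_y) against
# distance_to_manifold(manifold_traj, tck_y) operationalizes the claim that
# manifold steering stays near M_y while linear steering leaves it.
```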

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that neural representations and behaviors share a common geometric structure captured by manifolds M_h (fit to activations) and M_y (fit to output distributions). Interventions respecting M_h geometry ('manifold steering') produce behavioral trajectories that follow M_y, whereas linear (Euclidean) steering yields unnatural outputs by leaving the manifold; conversely, optimizing activation-space interventions to follow M_y recovers trajectories tracing M_h curvature. This bidirectional link is demonstrated on language-model reasoning, in-context learning, and video world-model physical-dynamics tasks, implying that manifold geometry, not arbitrary directions, is the proper object for causal steering.

Significance. If the central claim holds after addressing circularity concerns, the work would meaningfully advance mechanistic interpretability by providing evidence that representation geometry is causally linked to behavior and by offering a concrete method (manifold-constrained intervention) that outperforms standard linear steering. The bidirectional recovery result and the breadth of tasks (cyclic/sequential reasoning, graph-structured ICL, physical dynamics) are strengths that could support generality. The absence of machine-checked proofs or fully parameter-free derivations is noted, but the empirical framing is appropriate for the cs.LG venue.

major comments (3)
  1. [Abstract] Abstract (intervention procedure paragraph): Both M_h and M_y are constructed from the identical set of natural forward passes, so paths constrained to M_h necessarily remain near the observed data distribution while linear paths do not; this makes the reported superiority of manifold steering and the recovery of M_h trajectories when optimizing for M_y potentially tautological rather than evidence of independent causal geometry. An explicit control (e.g., permuted activation-output pairings or a synthetic model with known ground-truth geometry) is required to establish that the shared structure is not an artifact of the fitting process. A minimal sketch of such a permutation control appears after this report.
  2. [Abstract] Abstract (manifold-fitting description): No quantitative details are supplied on manifold dimensionality, fitting regularization, validation metrics, or out-of-distribution generalization of M_h and M_y; without these, it is impossible to assess whether the claimed 'intrinsic geometry' is robust or merely descriptive of the training trajectories. This directly affects the load-bearing claim that manifold steering reveals the 'proper object' for intervention.
  3. [Abstract] Abstract (bidirectional claim): The statement that 'optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h' lacks a description of the optimization objective, convergence criteria, or comparison to null models; if the recovery is driven by the model's own input-output consistency rather than geometry, the causal interpretation does not follow.
minor comments (2)
  1. [Abstract] The abstract refers to 'tasks with geometry corresponding to physical dynamics' without naming the specific video dataset or task metric, reducing reproducibility.
  2. [Abstract] Notation M_h and M_y is introduced without an explicit equation defining the manifold parametrization or distance metric used for steering.
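The permutation control proposed in major comment 1 has a direct implementation: break the activation-output pairing before measuring alignment and see whether the apparent isometry survives. A minimal sketch, where `alignment_score` is an assumed interface (fit M_h and M_y to the paired data and return an isometry score), not the paper's API.

```python
import numpy as np

def permutation_control(hidden_states, output_dists, alignment_score,
                        n_perms=100, seed=0):
    """Shuffle the activation-output pairing (numpy arrays assumed) and
    re-measure manifold alignment against the unshuffled baseline."""
    rng = np.random.default_rng(seed)
    observed = alignment_score(hidden_states, output_dists)
    null = np.array([
        alignment_score(hidden_states,
                        output_dists[rng.permutation(len(output_dists))])
        for _ in range(n_perms)
    ])
    # One-sided p-value: how often broken pairings match the observed alignment.
    p_value = (1 + (null >= observed).sum()) / (1 + n_perms)
    return observed, null, p_value
```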

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The three major comments identify important points about potential circularity, missing quantitative details, and clarity on the bidirectional optimization. We address each below and will revise the manuscript to incorporate additional controls, metrics, and expanded descriptions. We believe these changes will strengthen the empirical claims without altering the core findings.

point-by-point responses
  1. Referee: [Abstract] Abstract (intervention procedure paragraph): Both M_h and M_y are constructed from the identical set of natural forward passes, so paths constrained to M_h necessarily remain near the observed data distribution while linear paths do not; this makes the reported superiority of manifold steering and the recovery of M_h trajectories when optimizing for M_y potentially tautological rather than evidence of independent causal geometry. An explicit control (e.g., permuted activation-output pairings or a synthetic model with known ground-truth geometry) is required to establish that the shared structure is not an artifact of the fitting process.

    Authors: We agree that the shared data source for fitting M_h and M_y raises a legitimate concern about circularity, and that the superiority of manifold steering could partly reflect proximity to the training distribution rather than independent geometric structure. The linear baseline does provide one contrast (off-manifold trajectories produce unnatural outputs), but it does not fully isolate whether the manifold fit itself induces the observed alignment. To address this directly, we will add two explicit controls in the revised manuscript: (1) permuted activation-output pairings, where we randomly shuffle the correspondence between activation trajectories and output distributions before fitting, and (2) a synthetic model with known ground-truth geometry (e.g., a low-dimensional dynamical system) where we can verify whether manifold steering recovers the true structure while linear steering does not. These controls will be reported alongside the existing results to demonstrate that the bidirectional link is not an artifact of the fitting procedure. revision: yes

  2. Referee: [Abstract] Abstract (manifold-fitting description): No quantitative details are supplied on manifold dimensionality, fitting regularization, validation metrics, or out-of-distribution generalization of M_h and M_y; without these, it is impossible to assess whether the claimed 'intrinsic geometry' is robust or merely descriptive of the training trajectories. This directly affects the load-bearing claim that manifold steering reveals the 'proper object' for intervention.

    Authors: The abstract was written for brevity and therefore omitted the specific quantitative details on manifold dimensionality, regularization parameters, validation metrics (e.g., reconstruction error, held-out likelihood), and out-of-distribution generalization tests. These details appear in the methods and supplementary sections of the full manuscript, but we acknowledge that the abstract should be self-contained on this point. In the revision we will insert concise quantitative summaries (e.g., chosen intrinsic dimensions via cross-validation, regularization strength, and OOD metrics on held-out trajectories) into the abstract and ensure the main text provides explicit validation that the manifolds generalize beyond the fitting trajectories. revision: yes

  3. Referee: [Abstract] Abstract (bidirectional claim): The statement that 'optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h' lacks a description of the optimization objective, convergence criteria, or comparison to null models; if the recovery is driven by the model's own input-output consistency rather than geometry, the causal interpretation does not follow.

    Authors: We agree that the abstract's phrasing of the bidirectional result is too terse and does not specify the optimization objective (e.g., minimizing divergence from M_y while staying in activation space), convergence criteria, or null-model comparisons. The full paper contains these elements, but they must be summarized in the abstract to support the causal claim. In the revision we will expand the abstract sentence to include: (i) the precise objective (projected gradient descent on activation interventions to match M_y marginals), (ii) convergence criteria (e.g., stabilization of trajectory curvature within a tolerance), and (iii) explicit null-model results (random activation perturbations and input-output consistency baselines) showing that only geometry-respecting optimization recovers M_h curvature. This will clarify that the recovery is attributable to the shared manifold structure rather than generic consistency. revision: yes
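The Activations←Behavior optimization described in response 3 can be sketched generically: parametrize a path of activation interventions, score the induced outputs against target points on M_y, and descend. Everything here, including the hypothetical `patched_output` hook (run the model with an activation patched in and return its output distribution), is an assumed stand-in for the paper's procedure.

```python
# Hedged sketch of the pullback optimization: find an activation-space path
# whose induced outputs follow targets on M_y. `patched_output` is assumed
# differentiable with respect to the patched activation.
import torch

def optimize_pullback_path(patched_output, h_a, h_b, y_targets,
                           n_steps=16, iters=500, lr=1e-2):
    # Initialize interior points at the linear chord between the endpoints.
    alphas = torch.linspace(0, 1, n_steps)[:, None]
    path = ((1 - alphas) * h_a + alphas * h_b).clone()
    free = path[1:-1].clone().requires_grad_(True)     # endpoints stay fixed
    opt = torch.optim.Adam([free], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        full = torch.cat([path[:1], free, path[-1:]], dim=0)
        outputs = torch.stack([patched_output(h) for h in full])
        loss = ((outputs - y_targets) ** 2).sum()      # follow M_y targets
        loss.backward()
        opt.step()
    return torch.cat([path[:1], free.detach(), path[-1:]], dim=0)

# The paper's finding is that the optimized path converges onto M_h even
# though the objective only references behavior space (cf. Figure 12).
```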

Circularity Check

0 steps flagged

No significant circularity; the empirical interventions test the geometry link independently of the fits

full rationale

The paper fits M_h to activations and M_y to outputs from natural forward passes, then reports intervention results showing manifold steering follows M_y while linear steering does not, plus bidirectional recovery via optimization. These are empirical observations from steering experiments across tasks, not derivations that reduce to the fitting procedure by construction (no equations equate the steering outcome to the manifold definition itself). No self-citation chains, uniqueness theorems, or ansatz smuggling are invoked in the abstract or the described procedure to carry the central claim. The experiments provide independent content by contrasting manifold vs. Euclidean paths and measuring induced behaviors.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

Central claim depends on the recoverability of low-dimensional manifolds from activations and outputs plus the assumption that interventions along fitted manifolds are causally meaningful.

free parameters (1)
  • manifold fitting parameters
    Dimensionality, regularization, and other choices used to construct M_h from activations and M_y from output distributions.
axioms (2)
  • domain assumption: Neural activations lie on low-dimensional manifolds that can be recovered from data.
    Invoked when fitting M_h to representations.
  • domain assumption: Output distributions lie on manifolds whose geometry corresponds to behavioral structure.
    Invoked when fitting M_y and claiming steering respects it.
invented entities (2)
  • activation manifold M_h (no independent evidence)
    purpose: to represent the geometric structure of internal representations for steering
    Constructed by fitting to model activations; no external validation provided.
  • behavior manifold M_y (no independent evidence)
    purpose: to represent the geometric structure of output probability distributions
    Constructed by fitting to model outputs; no external validation provided.

pith-pipeline@v0.9.0 · 5622 in / 1392 out tokens · 90723 ms · 2026-05-08T17:40:57.531619+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

Reference graph

Works this paper leans on

299 extracted references · 91 canonical work pages · cited by 1 Pith paper · 9 internal anchors
