Probing for Representation Manifolds in Superposition

Alexander Modell

arXiv: 2605.18537 · v1 · pith:3G6TOCTDnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-20 11:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords superposition · representation manifolds · probing methods · model interpretability · steering · language models · temporal representations

The pith

The Manifold Probe discovers representation manifolds in superposition for concepts like time and space, enabling causal steering of model behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops the Manifold Probe to locate manifolds within neural representations where concepts exist in superposition. The method learns the subspace of features that can be linearly predicted from activations and identifies the directions used to represent them. Applied to time and space in Llama 2-7b, it uncovers manifolds with clear interpretability. Steering the time manifold then modifies the model's responses about when specific songs, movies, and books came out, showing that the probe can find structures that the model actually uses in its computations.
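For orientation, the standard linear regression probe that the Manifold Probe is said to generalize can be sketched as follows. This is a minimal illustration, not the author's method: the activations are synthetic stand-ins (no Llama 2-7b states are used), and the names `fit_ridge_probe`, `r2_score`, `direction`, and the noise level are assumptions made for the example.

```python
import numpy as np

def fit_ridge_probe(X, y, alpha=1.0):
    """Closed-form ridge regression on centred data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for layer activations (assumption): the year is written
# along one hidden direction, plus isotropic noise.
rng = np.random.default_rng(0)
n, d = 600, 32
years = rng.uniform(1950, 2020, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
z = (years - years.mean()) / years.std()
X = np.outer(z, direction) + 0.1 * rng.normal(size=(n, d))

# Fit on a training split and score on held-out examples.
train, test = slice(0, 400), slice(400, None)
w, b = fit_ridge_probe(X[train], years[train], alpha=1.0)
test_r2 = r2_score(years[test], X[test] @ w + b)

# How well the probe weights align with the planted direction.
cosine = abs(w @ direction) / np.linalg.norm(w)
```

A probe of this kind targets one pre-specified scalar. The paper's stated contribution is to learn which features of the concept are linearly predictable, rather than fixing the target in advance.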

Core claim

The author establishes that a generalized linear probe can recover manifolds encoding concepts in superposition by first determining the linearly predictable feature space for the concept and then learning the encoding directions, with evidence from successful steering interventions on time-related outputs in a large language model.

What carries the argument

The Manifold Probe, which identifies both the feature space of a concept predictable from representations and the linear directions encoding those features.

If this is right

Interpretable manifolds for time and space exist in the representations of Llama 2-7b.
Steering along the time manifold influences completions about release years of media.
The discovered manifolds are causally involved in the model's behavior.
The approach extends standard linear probes to handle superposition.
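The steering item in this list can be made concrete in a purely linear simplification. The sketch below shifts an activation by the smallest amount that makes a linear readout hit a chosen target. It is an assumption-laden illustration: the paper steers along a learned manifold, which this single-direction version does not reproduce, and `steer_to_target`, `w`, `b`, and `x` are hypothetical names and values.

```python
import numpy as np

def steer_to_target(x, w, b, target):
    """Minimal-norm shift of x so that the linear readout w @ x + b equals target."""
    current = w @ x + b
    return x + (target - current) * w / (w @ w)

rng = np.random.default_rng(1)
d = 16
w = rng.normal(size=d)   # hypothetical probe weights
b = 1985.0               # hypothetical probe intercept
x = rng.normal(size=d)   # hypothetical activation vector

x_steered = steer_to_target(x, w, b, target=1999.0)
readout_after = w @ x_steered + b

# The edit moves x only along w; components orthogonal to w are untouched.
delta = x_steered - x
```

Whether a shift like this changes the model's actual completions is an empirical question, which is what the paper's steering experiment is meant to test.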

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying the probe to additional concepts could map out more of the model's internal knowledge structure.
This method might be adapted to test whether other behavioral influences arise from similar manifold encodings.
Future work could examine if these manifolds persist across different model scales or training regimes.

Load-bearing premise

The assumption that the probe identifies the model's true encoding directions for the concept instead of incidental correlations.

What would settle it

A finding that steering along the manifold does not produce the expected changes in the model's year predictions for the tested items, or that the linear feature predictions do not match model behavior under perturbation.
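One way to score such a test is the quantity named in the Figure 4 caption: the probability that a completion falls within two years of the steering target. The sketch below computes it from a distribution over candidate years. The function name `prob_within_window` and the probabilities are hypothetical and are not results from the paper.

```python
def prob_within_window(year_probs, target, window=2):
    """Sum the probability mass on years within `window` years of `target`."""
    return sum(p for year, p in year_probs.items() if abs(year - target) <= window)

# Hypothetical completion probabilities over years (illustrative, not model output).
year_probs = {1995: 0.05, 1997: 0.10, 1998: 0.20, 1999: 0.30, 2001: 0.15, 2004: 0.05}

mass = prob_within_window(year_probs, target=1999)
```

Comparing this mass between steered and clean runs gives a simple check of whether steering moved the model's year predictions toward the target.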

Figures

Figures reproduced from arXiv: 2605.18537 by Alexander Modell.

**Figure 2.** A representation manifold (top left) and linear prediction (top right) from a Manifold Probe

**Figure 3.** Ranked test R² values for features fitted using the probing datasets described in Section 4 at each layer of Llama 2-7b. Left: the dotted line shows the test R² coefficient of a ridge regression fit directly to the release dates from the songs, movies and books representations. Right: the dotted and dashed lines show the test R² coefficients of ridge regression fits directly to the latitude and longitude f…

**Figure 4.** Steering experiment. Top: the mean probability a completion is within two years of the target year it was steered to at each layer, grouped by release decade (left) and target decade (right). Clean baselines are shown with dashed lines. Bottom: colour intensity (capped at 0.1) indicates the mean probability of a completion given the steering target.

**Figure 5.** The top 5 time features, and top 32 space features from layer 16 of Llama 2-7b after…

**Figure 6.** The mean probability that the model completes the prompt with a valid year in the steering…

**Figure 7.** Colour intensity indicates the standard deviation of the probability of a completion given…

Original abstract

This paper introduces the Manifold Probe, a supervised method for discovering representation manifolds in superposition. The method generalizes linear regression probes by learning the space of features of a concept that can be linearly predicted from the representations, and then learning the directions used to encode them. We demonstrate the probe on representations of time and space in Llama 2-7b, finding manifolds which linearly represent an interpretable set of features in each case. In the case of time, we show that by steering along the manifold, we can influence the model's completions about the years in which famous songs, movies and books were released, providing evidence that the Manifold Probe can discover manifolds which are causally involved in model behaviour.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Manifold Probe, a supervised method that generalizes linear regression probes to discover representation manifolds in superposition. It learns the space of features of a concept that can be linearly predicted from model representations and then identifies the directions used to encode those features. The method is demonstrated on time and space representations in Llama 2-7b, yielding manifolds with interpretable features; steering along the time manifold is shown to influence model completions about release years of songs, movies, and books, supporting the claim that the probe identifies causally relevant manifolds.

Significance. If validated with appropriate controls and metrics, the Manifold Probe could offer a useful extension of probing techniques for handling superposition in high-dimensional representations, with the steering intervention providing a direct test of causal involvement in model behavior. The work attempts to bridge discovery and intervention, which is a positive direction for mechanistic interpretability, though the current presentation leaves the strength of this bridge unclear.

major comments (2)

[Abstract] Abstract: the steering result on Llama 2-7b is presented without any reported validation metrics, error analysis, or controls (such as orthogonal steering vectors, norm-matched random directions, or post-steering performance on unrelated tasks). This is load-bearing for the central causal claim that the discovered manifold is specifically involved in year-related behavior rather than producing effects through off-manifold side effects or incidental changes.
[Method] Method description (as summarized in the abstract): the generalization of linear probes to manifolds is stated at a high level without equations or pseudocode specifying how the feature space is learned or how encoding directions are optimized. This absence makes it impossible to assess whether the procedure avoids self-referential fitting or normalization artifacts that could force the reported manifolds.
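The control directions requested in the first major comment are straightforward to construct. The sketch below builds a norm-matched random direction and a norm-matched direction orthogonal to a given steering vector. The steering vector here is a random placeholder and the function names are hypothetical; nothing in this sketch comes from the paper.

```python
import numpy as np

def norm_matched_random(v, rng):
    """Random direction rescaled to have the same Euclidean norm as v."""
    r = rng.normal(size=v.shape)
    return r * (np.linalg.norm(v) / np.linalg.norm(r))

def orthogonal_control(v, rng):
    """Random direction orthogonal to v, rescaled to the norm of v."""
    r = rng.normal(size=v.shape)
    r = r - (r @ v) / (v @ v) * v   # remove the component along v
    return r * (np.linalg.norm(v) / np.linalg.norm(r))

rng = np.random.default_rng(2)
steer = rng.normal(size=64)          # placeholder for a learned steering vector
rand_ctrl = norm_matched_random(steer, rng)
orth_ctrl = orthogonal_control(steer, rng)
```

If steering with `rand_ctrl` or `orth_ctrl` shifted year completions as much as the learned direction did, the effect could not be attributed specifically to the time manifold.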

minor comments (1)

[Abstract] The abstract refers to 'an interpretable set of features' for both time and space without specifying what those features are or how interpretability was quantified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each major comment below and have updated the paper accordingly to improve clarity and rigor.

Point-by-point responses

Referee: [Abstract] Abstract: the steering result on Llama 2-7b is presented without any reported validation metrics, error analysis, or controls (such as orthogonal steering vectors, norm-matched random directions, or post-steering performance on unrelated tasks). This is load-bearing for the central causal claim that the discovered manifold is specifically involved in year-related behavior rather than producing effects through off-manifold side effects or incidental changes.

Authors: We agree that the steering experiments would benefit from additional controls to strengthen the causal interpretation. In the revised manuscript, we will report validation metrics for the steering interventions, include error analysis, and add controls using orthogonal steering vectors, norm-matched random directions, and assessments of post-steering performance on unrelated tasks. These additions will help demonstrate that the effects are specific to the time manifold rather than off-manifold artifacts.

Revision: yes

Referee: [Method] Method description (as summarized in the abstract): the generalization of linear probes to manifolds is stated at a high level without equations or pseudocode specifying how the feature space is learned or how encoding directions are optimized. This absence makes it impossible to assess whether the procedure avoids self-referential fitting or normalization artifacts that could force the reported manifolds.

Authors: The full method section in the manuscript provides a more detailed description, but we acknowledge that the abstract summarizes it at a high level. To address this, we will include explicit equations and pseudocode in a new subsection of the Methods to specify the optimization procedure for the feature space and encoding directions. This will allow readers to evaluate potential issues such as self-referential fitting or normalization artifacts.

Revision: yes

Circularity Check

0 steps flagged

Manifold Probe method and steering validation are self-contained without circular reduction

The paper introduces the Manifold Probe as a supervised generalization of linear regression probes: it learns the space of linearly predictable features for a concept from representations and then identifies the encoding directions. This is applied to time and space manifolds in Llama 2-7b, with steering along the learned manifold used to causally influence year-related completions as external evidence of involvement in model behavior. No derivation step reduces a claimed result to its own fitted inputs by construction, nor relies on self-citation chains, uniqueness theorems from prior author work, or ansatzes imported via citation. The steering test functions as an independent intervention check rather than a tautological prediction. The derivation chain remains non-circular and externally falsifiable via the reported behavioral changes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities.

[1] [1]

The Journal of Machine Learning Research , volume=

Minimax manifold estimation , author=. The Journal of Machine Learning Research , volume=. 2012 , publisher=

work page 2012

[2] [2]

science , volume=

A global geometric framework for nonlinear dimensionality reduction , author=. science , volume=. 2000 , publisher=

work page 2000

[3] [3]

2002 , publisher=

Learning with kernels: support vector machines, regularization, optimization, and beyond , author=. 2002 , publisher=

work page 2002

[4] [4]

Understanding intermediate layers using linear classifier probes

Understanding intermediate layers using linear classifier probes , author=. arXiv preprint arXiv:1610.01644 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

2024 , url =

Lin, Johnny , title =. 2024 , url =

work page 2024

[6] [6]

AI Alignment Forum , author =

The ‘strong’ feature hypothesis could be wrong , url =. AI Alignment Forum , author =

work page

[7] [7]

Transformer Circuits Thread , author =

Feature. Transformer Circuits Thread , author =

work page

[8] [8]

Transformer Circuits Thread , author =

What is a. Transformer Circuits Thread , author =

work page

[9] [9]

AI Alignment Forum , author =

work page

[10] [10]

Less Wrong , author =

Calendar feature geometry in. Less Wrong , author =

work page

[11] [11]

Transformer Circuits Thread , author =

Toy. Transformer Circuits Thread , author =

work page

[12] [12]

Transformer Circuits Thread , author =

A. Transformer Circuits Thread , author =

work page

[13] [13]

Less Wrong , author =

Showing. Less Wrong , author =

work page

[14] [14]

Transformer Circuits Thread , author =

Scaling. Transformer Circuits Thread , author =

work page

[15] [15]

Transformer Circuits Thread , author =

Towards. Transformer Circuits Thread , author =

work page

[16] [16]

The Twelfth International Conference on Learning Representations , year=

Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. The Twelfth International Conference on Learning Representations , year=

work page

[17] [17]

URL https: //doi.org/10.1038/s41586-023-06221-2

Scientific discovery in the age of artificial intelligence , volume =. Nature , author =. 2023 , pages =. doi:10.1038/s41586-023-06221-2 , number =

work page doi:10.1038/s41586-023-06221-2 2023

[18] [18]

Superintelligence:

Bostrom, Nick , year =. Superintelligence:

work page

[19] [19]

Agent foundations for aligning machine intelligence with human interests: a technical research agenda , journal =

Soares, Nate and Fallenstein, Benya , year =. Agent foundations for aligning machine intelligence with human interests: a technical research agenda , journal =

work page

[20] [20]

Understanding intermediate layers using linear classifier probes , url =

Alain, Guillaume and Bengio, Yoshua , year =. Understanding intermediate layers using linear classifier probes , url =. 5th

work page

[21] [21]

Finding. Trans. Mach. Learn. Res. , author =

work page

[22] [22]

A course in metric geometry , publisher =

Burago, Dmitri and Burago, Yuri and Ivanov, Sergei and. A course in metric geometry , publisher =

work page

[23] [23]

Gorton, Liv , month = aug, year =. Curve

work page

[24] [24]

Munroe, Randall , year =

work page

[25] [25]

S im CSE : Simple Contrastive Learning of Sentence Embeddings

Gao, Tianyu and Yao, Xingcheng and Chen, Danqi , editor =. 2021 , pages =. doi:10.18653/V1/2021.EMNLP-MAIN.552 , booktitle =

work page doi:10.18653/v1/2021.emnlp-main.552 2021

[26] [26]

Li, Bohan and Zhou, Hao and He, Junxian and Wang, Mingxuan and Yang, Yiming and Li, Lei , editor =. On the. 2020 , pages =. doi:10.18653/V1/2020.EMNLP-MAIN.733 , booktitle =

work page doi:10.18653/v1/2020.emnlp-main.733 2020

[27] [27]

Language

Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya , year =. Language

work page

[28] [28]

Journal of the American Statistical Association , author =

A new coefficient of correlation , volume =. Journal of the American Statistical Association , author =. 2021 , note =

work page 2021

[29] [29]

, year =

Sutherland, Wilson A. , year =. Introduction to

work page

[30] [30]

and Balestriero, Randall and Brendel, Wieland and Klindt, David A

Reizinger, Patrik and Bizeul, Alice and Juhos, Attila and Vogt, Julia E. and Balestriero, Randall and Brendel, Wieland and Klindt, David A. , year =. Cross-. The

work page

[31] [31]

and Teixeira, Lucas and Oldenziel, Alexander Gietelink and Marzen, Sarah and Riechers, Paul M

Shai, Adam S. and Teixeira, Lucas and Oldenziel, Alexander Gietelink and Marzen, Sarah and Riechers, Paul M. , editor =. Transformers. Advances in

work page

[32] [32]

Progress measures for grokking via mechanistic interpretability , url =

Nanda, Neel and Chan, Lawrence and Lieberum, Tom and Smith, Jess and Steinhardt, Jacob , year =. Progress measures for grokking via mechanistic interpretability , url =. The

work page

[33] [33]

Chang, Zhuowen Tu, and Benjamin K

Chang, Tyler A. and Tu, Zhuowen and Bergen, Benjamin K. , editor =. The. 2022 , pages =. doi:10.18653/V1/2022.EMNLP-MAIN.9 , booktitle =

work page doi:10.18653/v1/2022.emnlp-main.9 2022

[34] Engels, Joshua and Michaud, Eric J. and Liao, Isaac and Gurnee, Wes and Tegmark, Max. Not. The

[35] Cunningham, Hoagy and Ewart, Aidan and Riggs, Logan and Huben, Robert and Sharkey, Lee. Sparse autoencoders find highly interpretable features in language models.

[36] Higgins, Irina and Amos, David and Pfau, David and Racaniere, Sebastien and Matthey, Loic and Rezende, Danilo and Lerchner, Alexander. Towards a definition of disentangled representations.

[37] Disentangling by subspace diffusion. Advances in Neural Information Processing Systems. 2020.

[38] Elad, Michael. Sparse and redundant representations: from theory to applications in signal and image processing.

[39] Zimmermann, Roland S and Sharma, Yash and Schneider, Steffen and Bethge, Matthias and Brendel, Wieland. Contrastive learning inverts the data generating process.

[40] Identifiability of latent-variable and structural-equation models: from linear to nonlinear. Annals of the Institute of Statistical Mathematics. 2024.

[41] Independent component analysis: algorithms and applications. Neural networks. 2000.

[42] Mikolov, Tomáš and Yih, Wen-tau and Zweig, Geoffrey. Linguistic regularities in continuous space word representations.

[43] Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, and Mohamed S

Learning. arXiv preprint arXiv:2503.17547

[44] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. arXiv preprint arXiv:2409.14507

[45] The geometry of concepts:. Entropy. 2025.

[46] Relational composition in neural networks:. arXiv preprint arXiv:2407.14662

[47] Language Models Represent Space and Time. The Twelfth International Conference on Learning Representations.

[48] The geometry of hidden representations of large transformer models. Advances in Neural Information Processing Systems. 2023.

[49] Hierarchical nucleation in deep neural networks. Advances in Neural Information Processing Systems. 2020.

[50] Friedman, Dan and Lampinen, Andrew and Dixon, Lucas and Chen, Danqi and Ghandeharioun, Asma. Interpretability illusions in the generalization of simplified models.

[51] Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems.

[52] Isotropy in the contextual embedding space:. International conference on learning representations.

[53] The clock and the pizza:. Advances in neural information processing systems. 2023.

[54] Learning to grok:. Advances in Neural Information Processing Systems. 2024.

[55] Towards understanding grokking:. Advances in Neural Information Processing Systems. 2022.

[56] Michaud, Eric J. and Liao, Isaac and Lad, Vedang and Liu, Ziming and Mudide, Anish and Loughridge, Chloe and Guo, Zifan Carl and Kheirkhah, Tara Rezaei and Vukelić, Mateja and Tegmark, Max. Opening the. doi:10.48550/arXiv.2402.05110

[57] Park, Kiho and Choe, Yo Joong and Jiang, Yibo and Veitch, Victor. The geometry of categorical and hierarchical concepts in large language models.

[58] Matrix factorisation and the interpretation of geodesic distance. Advances in Neural Information Processing Systems. 2021.

[59] Gao, Jun and He, Di and Tan, Xu and Qin, Tao and Wang, Liwei and Liu, Tie-Yan. Representation degeneration problem in training natural language generation models.

[60] Random matrices: universality of local eigenvalue statistics.

[61] Emergent world representations:. ICLR. 2023.

[62] Nanda, Neel and Lee, Andrew and Wattenberg, Martin. Emergent linear representations in world models of self-supervised sequence models.

[63] Causal abstractions of neural networks. Advances in Neural Information Processing Systems. 2021.

[64] Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems. 2020.

[65] Under the hood:. arXiv preprint arXiv:1808.08079

[66] Hanna, Michael and Zamparelli, Roberto and Mareček, David and. The functional relevance of probed information:. Proceedings of the 17th. 2023.

[67] Probing the probing paradigm:. arXiv preprint arXiv:2005.00719

[68] Amnesic probing:. Transactions of the Association for Computational Linguistics. 2021.

[69] O'Neill, Charles and Bui, Thang. Sparse autoencoders enable scalable and reliable circuit identification in language models.

[70] Lan, Michael and Torr, Philip and Meek, Austin and Khakzar, Ashkan and Krueger, David and Barez, Fazl. Sparse autoencoders reveal universal feature spaces across large language models.

[71] Sparse. Transformer Circuits Thread.

[72] Gao, Leo and others. Scaling and evaluating sparse autoencoders.

[73] Extracting concepts from. OpenAI Research.

[74] Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey. Efficient estimation of word representations in vector space.

[75] A general framework for parallel distributed processing. Parallel distributed processing: Explorations in the microstructure of cognition. 1986.

[76] Rajamanoharan, Senthooran and Conmy, Arthur and Smith, Lewis and Lieberum, Tom and Varma, Vikrant and Kramár, János and Shah, Rohin and Nanda, Neel. Improving dictionary learning with gated sparse autoencoders.

[77] Extensions of. Contemporary mathematics. 1984.

[78] Pennington, Jeffrey and Socher, Richard and Manning, Christopher D. Glove:. Proceedings of the 2014 conference on empirical methods in natural language processing (

[79] Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence. 1990.

[80] Van Gelder, Tim. What is the ‘. Philosophy and