pith. machine review for the scientific record.

arxiv: 2605.09967 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

Andrew Lee, Fernanda Viégas, Martin Wattenberg

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords tensor product representations · linear probes · Othello · neural network interpretability · board state representation · factorization · shared structure · directional representations

The pith

Linear directions for Othello board states decompose into tensor products of square and color embeddings

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains tensor product representation probes to find common structure among the linear probes for board states in an Othello model. This produces separate embeddings for board squares and for piece colors, along with a binding matrix that combines them to form the full representation. The original linear probes can then be reconstructed directly from these TPR components. A sympathetic reader would care because it suggests that the simple directional concepts discovered in models may actually stem from richer, factorized internal structures rather than standing alone. This helps address how models handle relational information beyond isolated features.

Core claim

In a model trained on the board game Othello, board states are linearly decodable but also exhibit tensor product structure. Training TPR probes to recover shared structure among the linear probes produces a factorization into square-embeddings, color-embeddings, and a binding matrix. The TPR probe weights show geometric signatures matching the board layout. The linear probes can be recovered directly from the TPR parameters, indicating that directional representations may be projections of more structured underlying representations.

What carries the argument

Tensor product representation (TPR) probes, which decompose linear directions into square-embeddings and color-embeddings composed by a binding matrix.
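
To make the factorization concrete, here is a minimal numpy sketch of how a TPR probe can compose shared square (role) and color (filler) embeddings into per-(square, color) probe directions. The dimensions, the readout map U, and the exact composition rule below are assumptions for illustration; the paper's own parameterization (R ∈ ℝ^{64×d_r}, F ∈ ℝ^{3×d_f}, and a binding matrix B ∈ ℝ^{d_r×d_f} in Figure 1) may differ in detail.

```python
import numpy as np

# Minimal TPR-probe sketch under assumed shapes (illustrative, not the paper's
# exact parameterization).
d_model = 512            # residual-stream width of the Othello model (assumed)
d_r, d_f = 30, 2         # role (square) and filler (color) embedding ranks
n_squares, n_colors = 64, 3

rng = np.random.default_rng(0)
R = rng.normal(size=(n_squares, d_r))      # role embeddings, one per square
F = rng.normal(size=(n_colors, d_f))       # filler embeddings: empty / black / white
U = rng.normal(size=(d_model, d_r * d_f))  # readout mapping bound pairs into model space

def effective_probe(s: int, c: int) -> np.ndarray:
    """Effective linear direction for (square s, color c): bind role and filler
    with an outer product, then map the bound pair into model space."""
    bound = np.outer(R[s], F[c]).reshape(-1)   # r_s ⊗ f_c
    return U @ bound                           # shape (d_model,)

# Scoring a hidden state h is then an ordinary dot product against each
# effective direction, exactly as with the original linear probes.
h = rng.normal(size=d_model)
logits = np.array([[effective_probe(s, c) @ h for c in range(n_colors)]
                   for s in range(n_squares)])  # (64, 3) board-state logits
print(logits.shape)
```

Because all 192 effective directions are built from the same R, F, and U, they share structure by construction; the empirical question the paper addresses is whether such a factorization also reproduces the probes actually trained on the model.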

If this is right

  • The weights of the TPR probe contain geometric signatures that align with the structure of the Othello board (a check of this is sketched after this list).
  • The original linear probes can be recovered directly from the parameters of the TPR probe.
  • Directional representations in the model are projections of more structured underlying representations.
  • This reveals shared structure across the linear directions for different board positions.
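
One hypothetical way to operationalize the first bullet is to measure average cosine similarity between square embeddings as a function of their board offset (Δi, Δj), in the spirit of Figure 7. The embeddings below are random placeholders; with the trained role embeddings one would look for elevated similarity along shared rows, columns, and diagonals.

```python
import numpy as np

# Pairwise board-geometry check (cf. Figure 7), with stand-in role embeddings.
rng = np.random.default_rng(0)
R = rng.normal(size=(64, 30))                      # placeholder square embeddings
R = R / np.linalg.norm(R, axis=1, keepdims=True)   # unit-normalize for cosine similarity

sims = {}  # (Δi, Δj) -> list of cosine similarities between square pairs
for s in range(64):
    for t in range(64):
        if s == t:
            continue
        di, dj = abs(s // 8 - t // 8), abs(s % 8 - t % 8)
        sims.setdefault((di, dj), []).append(float(R[s] @ R[t]))

heatmap = np.zeros((8, 8))                         # rows: Δi, columns: Δj
for (di, dj), vals in sims.items():
    heatmap[di, dj] = np.mean(vals)
print(np.round(heatmap, 3))
```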

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar TPR probe techniques could be used in language models to find structure in what appear to be isolated linear concepts.
  • The factorization might enable more precise interventions by editing embeddings or the binding matrix separately (see the sketch after this list).
  • The success in a structured game domain suggests the method can test the limits of linear probing in other settings.
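
To illustrate the second bullet, a factorized probe suggests a compact edit rule: rebind a square to a different color by adding the difference of filler embeddings, bound to that square's role. Everything below (shapes, the readout map U, the scale alpha) is a hypothetical sketch, not the intervention procedure evaluated in Figure 3.

```python
import numpy as np

# Hypothetical TPR-style activation edit: flip square s from color c_old to c_new.
rng = np.random.default_rng(1)
d_model, d_r, d_f = 512, 30, 2                 # assumed dimensions
R = rng.normal(size=(64, d_r))                 # role (square) embeddings
F = rng.normal(size=(3, d_f))                  # filler (color) embeddings
U = rng.normal(size=(d_model, d_r * d_f))      # readout into model space

def edit_square(h, s, c_old, c_new, alpha=1.0):
    """Move activation h toward encoding color c_new instead of c_old on square s."""
    delta = U @ np.outer(R[s], F[c_new] - F[c_old]).reshape(-1)
    return h + alpha * delta

h = rng.normal(size=d_model)
h_edited = edit_square(h, s=28, c_old=1, c_new=2)
print(np.linalg.norm(h_edited - h))
```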

Load-bearing premise

That training the TPR probe on the linear probes extracts genuine shared underlying structure rather than imposing an artificial factorization.
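
One control that bears on this premise, echoed in Figure 5, is a parameter-matched baseline: if a generic rank-k truncated SVD of the stacked probe weights reconstructs them about as well as the TPR probe, the factorization adds little beyond ordinary low-rank structure. The sketch below computes only the SVD side, with placeholder weights; the TPR side would be fit separately and compared at an equal parameter budget.

```python
import numpy as np

# Rank-k truncated SVD baseline for the stacked linear-probe matrix W.
# W is random here for illustration; in practice it would be the trained probe
# weights, stacked as (64 squares x 3 colors, d_model).
rng = np.random.default_rng(0)
n_probes, d_model = 64 * 3, 512
W = rng.normal(size=(n_probes, d_model))

def svd_reconstruction_error(W, k):
    """Relative Frobenius error of the best rank-k approximation of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_k = (U[:, :k] * S[:k]) @ Vt[:k]
    return np.linalg.norm(W - W_k) / np.linalg.norm(W)

for k in (10, 30, 80):
    print(k, round(svd_reconstruction_error(W, k), 3))
```

A shuffled-board control (refitting the TPR probe after permuting square labels) would play a similar role for the geometric signatures.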

What would settle it

If the linear probes reconstructed from the TPR probe parameters do not closely match the original linear probes in direction or accuracy on board state prediction, the claim of shared structure would fail.
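
A minimal sketch of that check: rebuild "effective" probe directions from the fitted TPR parameters and compare them row by row against the original linear probes via cosine similarity (the comparison reported in Figure 4), alongside board-state accuracy when the reconstructed directions are substituted for the originals. The matrices below are placeholders standing in for the trained and reconstructed weights.

```python
import numpy as np

# Row-wise cosine similarity between original probe directions (W_orig) and
# directions reconstructed from TPR parameters (W_tpr).  Both are placeholders.
rng = np.random.default_rng(0)
n_probes, d_model = 64 * 3, 512
W_orig = rng.normal(size=(n_probes, d_model))
W_tpr = W_orig + 0.1 * rng.normal(size=(n_probes, d_model))  # stand-in reconstruction

def rowwise_cosine(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.sum(A * B, axis=1)

cos = rowwise_cosine(W_orig, W_tpr)
print(float(cos.mean()), float(cos.min()))
# Low or widely varying similarities, or degraded board-state accuracy with the
# reconstructed directions, would undercut the shared-structure claim.
```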

Figures

Figures reproduced from arXiv: 2605.09967 by Andrew Lee, Fernanda Viégas, Martin Wattenberg.

Figure 1
Figure 1: Tensor product representation probes recover a structured factorization of OthelloGPT's board-state representation, including (i) role (square) embeddings R ∈ ℝ^{64×d_r} (top-left, Isomap), (ii) filler (color) embeddings F ∈ ℝ^{3×d_f} (top-right, PCA), and (iii) a binding matrix B ∈ ℝ^{d_r×d_f} that binds the two (bottom-left, PCA) to represent a board state (bottom-right). view at source ↗
Figure 2
Figure 2: TPR probe accuracy. The board state can be reconstructed using low-rank role embeddings (rank d_r, or d_u, d_v) and filler embeddings (rank d_f). The binding matrix B encodes which roles (squares) are occupied by which fillers (colors), and unbinding with the learned role r_s and filler f_c recovers the correct color c for square s. view at source ↗
Figure 3
Figure 3: Intervention results for linear and TPR probes. view at source ↗
Figure 4
Figure 4: Cosine similarity scores between linear probes and "effective linear probes" derived from TPR probes. Linear probes can be recovered from the parameters of the TPR probes, suggesting that linear directions may be a projection of more structured underlying components. view at source ↗
Figure 5
Figure 5: Rank-k truncated SVD accuracy. At rank 80, SVD_k(W) matches the parameter count of the TPR probe but achieves only 85% accuracy. view at source ↗
Figure 6
Figure 6: Local k-NN-based classification of neighbors. view at source ↗
Figure 7
Figure 7: Pairwise board geometry. Each entry shows the average cosine similarity between pairs of square embeddings that are Δi rows and Δj columns apart on the board. Pairs that are close on the same row (Δi = 0), column (Δj = 0), or diagonal (Δi = Δj) exhibit higher cosine similarity. view at source ↗
Figure 8
Figure 8: Representations of digits in a Transformer trained on multi-digit multiplication may appear… view at source ↗
Figure 9
Figure 9: TPR probe accuracy per layer. view at source ↗
Figure 10
Figure 10: Cosine similarity scores between linear probes and "effective linear probes" derived from trilinear TPR probes. view at source ↗
Figure 11
Figure 11: In a "full-dimensional" case, row embeddings behave like an effective orthonormal basis. Given row embeddings U ∈ ℝ^{8×d_u} with enough dimensions (d_u = 8), the row-normalized Gram matrix UU⊤ shows that the rows of U form an orthogonal set of basis vectors encoding each row of the board, and the singular values of U are all close to 1. view at source ↗
Figure 12
Figure 12: Gram matrix and singular values of column embeddings. view at source ↗
read the original abstract

While researchers are finding concepts represented as linear directions in language models, a bag of linear directions fails to capture relational structure. To better understand this dichotomy, we study a model with known linear representations, but trained in a highly structured domain -- the board game Othello. While the model's internal board-state representation is linearly decodable, we find additional structure in the form of tensor product representations (TPRs). We train TPR probes to recover shared structure amongst the linear probes, yielding a factorization into square-embeddings, color-embeddings, and a binding matrix that composes them to construct the model's board-state representation. We find geometric signatures within the weights of our TPR probe that align with the structure of the board, but perhaps more importantly, that the linear probes can be recovered directly from the parameters of our TPR probe. Our findings suggest that directional representations may be projections of more structured underlying representations.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines an Othello-playing transformer model whose board-state representations are known to be linearly decodable. It introduces Tensor Product Representation (TPR) probes trained on the outputs of these linear probes to extract a shared factorization consisting of square-embeddings, color-embeddings, and a binding matrix. The authors report that the original linear probes can be recovered directly from the learned TPR parameters and that the TPR weights exhibit geometric alignments with the physical board layout. They conclude that linear directions may be projections of more structured underlying representations.

Significance. If the central claim is substantiated, the work offers a concrete method for moving beyond bag-of-directions analyses toward relational structure in model representations, using a controlled domain where ground-truth board geometry is available. The direct recoverability result and the reported geometric signatures are potentially useful contributions to mechanistic interpretability. However, the significance is currently limited by the absence of controls that would distinguish probe-imposed structure from model-intrinsic structure.

major comments (2)
  1. [Section 3.2] Section 3.2 (TPR probe training): The probe is optimized to reconstruct the linear-probe outputs via the tensor-product decomposition. Consequently, the reported recoverability of the linear probes from the TPR parameters (abstract and §4.1) is guaranteed by a successful fit and does not constitute independent evidence that the factorization reflects genuine shared structure inside the model rather than the inductive bias of the TPR architecture.
  2. [Section 4.3] Section 4.3 (geometric signatures): The alignment of embedding and binding-matrix weights with board geometry is presented as supporting evidence, yet no quantitative comparison to random embeddings, shuffled board targets, or alternative low-rank factorizations is provided. Without such baselines it remains possible that the observed geometry is a consequence of the Othello-specific probe targets rather than an intrinsic property of the model.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a concise statement of the quantitative metrics (e.g., reconstruction MSE, alignment scores) used to evaluate the TPR probe.
  2. [Section 3] Notation for the binding matrix and the two embedding sets should be introduced once and used consistently; occasional reuse of the same symbol for different quantities appears in §3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each of the major comments point-by-point below and describe the revisions we plan to make to strengthen the paper.

read point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (TPR probe training): The probe is optimized to reconstruct the linear-probe outputs via the tensor-product decomposition. Consequently, the reported recoverability of the linear probes from the TPR parameters (abstract and §4.1) is guaranteed by a successful fit and does not constitute independent evidence that the factorization reflects genuine shared structure inside the model rather than the inductive bias of the TPR architecture.

    Authors: We acknowledge the validity of this observation: successful reconstruction does imply recoverability by construction. Our contribution lies in demonstrating that the linear probes for Othello board states can be accurately decomposed into a shared tensor-product structure consisting of square embeddings, color embeddings, and a binding matrix. This decomposition is not arbitrary; it reveals a compositional organization that aligns with the game's rules. To address potential concerns regarding the TPR's inductive bias, we will include additional experiments in the revision comparing the TPR fit to other factorization techniques, such as non-negative matrix factorization or principal component analysis applied directly to the probe weights. We will also revise the abstract and Section 4.1 to emphasize that we are showing the linear directions admit such a factorization, providing evidence for shared structure among them. revision: partial

  2. Referee: [Section 4.3] Section 4.3 (geometric signatures): The alignment of embedding and binding-matrix weights with board geometry is presented as supporting evidence, yet no quantitative comparison to random embeddings, shuffled board targets, or alternative low-rank factorizations is provided. Without such baselines it remains possible that the observed geometry is a consequence of the Othello-specific probe targets rather than an intrinsic property of the model.

    Authors: We agree that quantitative baselines are necessary to substantiate the geometric signatures. In the revised version of the manuscript, we will add comparisons of the observed alignments against those from random embeddings, embeddings trained on shuffled board targets, and alternative low-rank decompositions. These controls will quantify whether the alignments with the physical board layout exceed what would be expected from the probe targets alone, thereby providing stronger evidence that the structure is intrinsic to the model's representations. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper trains TPR probes on existing linear probes for Othello board states to extract a factorization (square-embeddings, color-embeddings, binding matrix). Recovery of the original linear probes from TPR parameters is an outcome of this training, but it does not reduce the central claim to a tautology: the factorization introduces additional interpretable structure, and the paper reports independent geometric signatures in the TPR weights that align with board geometry. No equations equate a prediction directly to fitted inputs, no load-bearing self-citations are used to justify uniqueness or ansatzes, and the overall argument builds on the model's known linear decodability without forcing the result from its inputs alone.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that the model's board-state representation is linearly decodable and that the learned TPR probe parameters capture genuine shared structure; the embeddings and binding matrix are fitted quantities introduced by the probe training.

free parameters (1)
  • square-embeddings, color-embeddings, and binding matrix
    These are learned parameters of the TPR probe trained to recover structure from the linear probes.
axioms (1)
  • domain assumption: The model's internal board-state representation is linearly decodable
    Invoked in the abstract as the starting point for studying additional TPR structure.
invented entities (1)
  • TPR probe (no independent evidence)
    purpose: To recover shared structure amongst linear probes and produce the factorization
    New probe architecture introduced in the work; no independent evidence outside this paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5452 in / 1408 out tokens · 72862 ms · 2026-05-12T03:26:39.711614+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 1 internal anchor

  1. [1]

    Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

    Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viégas, Martin Wattenberg, and Andrew Lee. Why can't transformers learn multiplication? Reverse-engineering reveals long-range dependency pitfalls. arXiv preprint arXiv:2510.00184.

  2. [2]

    Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability

    Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, and Flavio P Calmon. Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability. arXiv preprint arXiv:2511.05541,

  3. [3]

    Learning Multi-Level Features with Matryoshka Sparse Autoencoders

    Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. Learning multi-level features with matryoshka sparse autoencoders. arXiv preprint arXiv:2503.17547.

  4. [4]

    Designing a Dashboard for Transparency and Control of Conversational AI

    Yida Chen, Aoyu Wu, Trevor DePodesta, Catherine Yeh, Kenneth Li, Nicholas Castillo Marin, Oam Patel, Jan Riecke, Shivam Raval, Olivia Seow, et al. Designing a dashboard for transparency and control of conversational ai. arXiv preprint arXiv:2406.07882, 2024.

  5. [5]

    From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

    Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, and Demba Ba. From flat to hierarchical: Extracting sparse representations with matching pursuit. arXiv preprint arXiv:2506.03093.

  6. [6]

    Not All Language Model Features Are One-Dimensionally Linear

    Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear. arXiv preprint arXiv:2405.14860.

  7. [7]

    Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

    Sai Sumedh R Hindupur, Ekdeep Singh Lubana, Thomas Fel, and Demba Ba. Projecting assumptions: The duality between sparse autoencoders and concept geometry. arXiv preprint arXiv:2503.01822.

  8. [8]

    Tensor Product Generation Networks for Deep NLP Modeling

    Qiuyuan Huang, Paul Smolensky, Xiaodong He, Li Deng, and Dapeng Wu. Tensor product generation networks for deep nlp modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1263–1273.

  9. [9]

    Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization

    Yichen Jiang, Asli Celikyilmaz, Paul Smolensky, Paul Soulos, Sudha Rao, Hamid Palangi, Roland Fernandez, Caitlin Smith, Mohit Bansal, and Jianfeng Gao. Enriching transformers with structured tensor-product representations for abstractive summarization. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard... Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.381. URL https://aclanthology.org/2021.naacl-main.381/.

  10. [10]

    Language Models Use Trigonometry to Do Addition

    Subhash Kantamneni and Max Tegmark. Language models use trigonometry to do addition. arXiv preprint arXiv:2502.00873.

  11. [11]

    A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

    Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K Kummerfeld, and Rada Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967, 2024.

  12. [12]

    The Geometry of Self-Verification in a Task-Specific Reasoning Model

    Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, and Martin Wattenberg. The geometry of self-verification in a task-specific reasoning model. arXiv preprint arXiv:2504.14379, 2025a. Andrew Lee, Melanie Weber, Fernanda Viégas, and Martin Wattenberg. Shared global and local geometry of language model embeddings. arXiv preprint arXiv:2503.21073, 2025b.

  13. [13]

    Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task

    Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viegas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382.

  14. [14]

    Learning a Generative Meta-Model of LLM Activations

    Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford, and Jacob Steinhardt. Learning a generative meta-model of llm activations. arXiv preprint arXiv:2602.06964.

  15. [15]

    The Origins of Representation Manifolds in Large Language Models

    Alexander Modell, Patrick Rubin-Delanchy, and Nick Whiteley. The origins of representation manifolds in large language models. arXiv preprint arXiv:2505.18235.

  16. [16]

    Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures

    Mark Muchane, Sean Richardson, Kiho Park, and Victor Veitch. Incorporating hierarchical semantics in sparse autoencoder architectures. arXiv preprint arXiv:2506.01197.

  17. [17]

    From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

    Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, and Patrik Reizinger. From isolation to entanglement: When do interpretability methods identify and disentangle known concepts? arXiv preprint arXiv:2512.15134.

  18. [18]

    ICLR: In-Context Learning of Representations

    Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, and Hidenori Tanaka. ICLR: In-context learning of representations. arXiv preprint arXiv:2501.00070, 2024a. Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of...

  19. [19]

    Attention-Based Iterative Decomposition for Tensor Product Representation

    Taewon Park, Inchul Choi, and Minho Lee. Attention-based iterative decomposition for tensor product representation. arXiv preprint arXiv:2406.01012, 2024b. Raphaël Sarfati, Eric Bigelow, Daniel Wurgaft, Jack Merullo, Atticus Geiger, Owen Lewis, Tom McGrath, and Ekdeep Singh Lubana. The shape of beliefs: Geometry, dynamics, and interventions along represent...

  20. [20]

    Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving

    Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611, 2019.

  21. [21]

    Transformers Learn Factored Representations

    Adam Shai, Loren Amdahl-Culleton, Casper L Christensen, Henry R Bigelow, Fernando E Rosas, Alexander B Boyd, Eric A Alt, Kyle J Ray, and Paul M Riechers. Transformers learn factored representations. arXiv preprint arXiv:2602.02385.

  22. [22]

    Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

    Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, and Jing Shao. Valence-arousal subspace in llms: Circular emotion geometry and multi-behavioral control. arXiv preprint arXiv:2604.03147.

  23. [23]

    Learning Distributed Representations of Symbolic Structure Using Binding and Unbinding Operations

    Shuai Tang, Paul Smolensky, and Virginia de Sa. Learning distributed representations of symbolic structure using binding and unbinding operations. arXiv preprint arXiv:1810.12456.

  24. [24]

    Linear Representations of Sentiment in Large Language Models

    Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154.

  25. [25]

    Relational Composition in Neural Networks: A Survey and Call to Action

    Martin Wattenberg and Fernanda B Viégas. Relational composition in neural networks: A survey and call to action. arXiv preprint arXiv:2407.14662.

  26. [26]

    Neural Manifold Geometry Encodes Feature Fields

    Julian Yocum, Cameron Allen, Bruno Olshausen, and Stuart Russell. Neural manifold geometry encodes feature fields. In NeurIPS 2025 Workshop on Symmetry and Geometry in Neural Representations.

  27. [27]

    URL https://openreview.net/forum?id=MwU86qfCTW. Representations of digits in a Transformer trained on multi-digit multiplication may appear as linear directions, but a closer look reveals structure in the form of a pentagonal prism [Bai et al., ...

  28. [28]

    correctness

    show that RNN hidden-states can be reconstructed using Tensor Product Decomposition Networks. Feature geometry. In recent years, a large body of interpretability work has found numerous concepts that are encoded as linear directions, and that these representations often generalize across models [Lee et al., 2025b]. Examples include sentiment [Tigges et al...

  29. [29]

    Taken together, these works suggest that linear probes may sometimes only recover local readouts of a richer underlying structure

    similarly show that features such as word count and token position can lie on manifolds that are aligned to produce high attention scores. Taken together, these works suggest that linear probes may sometimes only recover local readouts of a richer underlying structure. A good example might be of Bai et al. [2025], who study a toy Transformer trained on mu...

  30. [30]

    posterior beliefs

    build on this idea to recover a manifold of “posterior beliefs”: by training a family of linear probes across different latent parameter settings of a controlled in-context learning task, and by “tiling” the linear probes together, they are able to recover a manifold over inferred latent parameter values, similarly suggesting that linear readouts stem fro...