Recognition: 2 theorem links · Lean theorem
Finding Belief Geometries with Sparse Autoencoders
Pith reviewed 2026-05-13 20:30 UTC · model grok-4.3
The pith
Sparse autoencoders uncover five clusters in Gemma-2-9B whose subspaces show simplex geometry with predictive belief-state coordinates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A pipeline of sparse autoencoders, k-subspace clustering, and AANet simplex fitting identifies 13 candidate clusters in Gemma-2-9B. Five of these pass barycentric prediction tests with Wilcoxon p-values below 10^{-14} on near-vertex samples or simplex-interior samples, while no null clusters pass either split. Cluster 768_596 additionally records the highest causal steering score, making it the only case where passive prediction and active intervention align.
What carries the argument
The barycentric prediction test applied to simplex-fitted subspaces extracted from SAE features, which checks whether mixture coordinates add predictive signal beyond individual latents.
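A minimal sketch of the test's logic, assuming simplex vertices V have already been fitted (e.g., by AANet). The ridge probe, the squared-error pairing, and the choice of prediction target are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def barycentric_coords(X, V):
    """Least-squares barycentric coordinates of rows of X (n x d)
    with respect to simplex vertices V (K x d), constrained to sum to 1."""
    K = V.shape[0]
    A = np.vstack([V.T, np.ones((1, K))])        # (d+1) x K
    B = np.vstack([X.T, np.ones((1, len(X)))])   # (d+1) x n
    coords, *_ = np.linalg.lstsq(A, B, rcond=None)
    return coords.T                              # n x K

def barycentric_advantage(X, y, V, latents, mask):
    """Paired per-sample errors: a probe on the mixture coordinates versus
    a probe on the single best latent, compared with a one-sided Wilcoxon
    signed-rank test restricted to a split (e.g., near-vertex samples)."""
    bary = barycentric_coords(X, V)
    err_mix = (cross_val_predict(Ridge(), bary, y, cv=10) - y) ** 2
    single = [
        (cross_val_predict(Ridge(), latents[:, [j]], y, cv=10) - y) ** 2
        for j in range(latents.shape[1])
    ]
    err_best = min(single, key=np.mean)  # strongest individual latent
    return wilcoxon(err_mix[mask], err_best[mask], alternative="less")
```

If the mixture coordinates carry signal beyond any single latent, err_mix should be systematically smaller on the masked split, driving the p-value down.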
If this is right
- If the interpretation holds, large language models encode uncertainty over latent generative states as geometric mixtures inside their residual stream.
- Barycentric prediction offers a concrete criterion that can separate real belief geometries from accidental simplex shapes produced by the autoencoder.
- The same pipeline can be applied to other models and datasets to map additional belief-like structures.
- One cluster already shows both passive and active evidence, suggesting a concrete target for further causal experiments.
- Confirmation would link controlled HMM results to naturalistic text training in a single geometric framework.
Where Pith is reading between the lines
- Extending the method to smaller or larger models could test whether simplex belief geometry scales with model size or training data diversity.
- If the coordinates truly represent beliefs, targeted interventions on them should produce predictable changes in the model's uncertainty about specific facts.
- The approach might reveal how models maintain multiple possible world states simultaneously rather than committing to single interpretations.
Load-bearing premise
That passing the barycentric prediction test on near-vertex or interior samples reliably distinguishes genuine belief-state encoding from tiling artifacts or spurious correlations in the SAE features.
What would settle it
Run the same pipeline on a model trained without the relevant latent structure and observe that the identified clusters lose their prediction advantage on near-vertex and interior samples.
Original abstract
Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations, combining sparse autoencoders (SAEs), $k$-subspace clustering of SAE features, and simplex fitting using AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applied to Gemma-2-9B, we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. We therefore adopt barycentric prediction as our primary discriminating test. Among the 13 priority clusters, 3 exhibit a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$) and 4 on simplex-interior samples. Together 5 distinct real clusters pass at least one split, while no null cluster passes either. One cluster, 768_596, additionally achieves the highest causal steering score in the dataset. This is the only case where passive prediction and active intervention converge. We present these findings as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space, and identify the structured evaluation that would be required to confirm this interpretation.
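For concreteness, the clustering stage the abstract names could look like the following minimal k-subspace sketch over SAE decoder directions. The alternation (assign each feature to the subspace with the smallest projection residual, refit each subspace by SVD) is the standard algorithm, while k, the subspace dimension, and the initialization are illustrative rather than the paper's settings:

```python
import numpy as np

def k_subspaces(F, k=8, dim=3, iters=50, seed=0):
    """Cluster feature directions F (n x d) into k linear subspaces
    through the origin by alternating assignment and SVD refitting."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(F))
    for _ in range(iters):
        bases = []
        for c in range(k):
            pts = F[labels == c]
            if len(pts) < dim:  # re-seed an empty or degenerate cluster
                pts = F[rng.choice(len(F), size=dim, replace=False)]
            _, _, Vt = np.linalg.svd(pts, full_matrices=False)
            bases.append(Vt[:dim])  # top principal directions
        # Projection residual of every feature against each subspace.
        resid = np.stack(
            [np.linalg.norm(F - (F @ B.T) @ B, axis=1) for B in bases]
        )
        new = resid.argmin(axis=0)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

Clusters from this stage would then go to simplex fitting, and only afterward to the barycentric test.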
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a pipeline that combines sparse autoencoders, k-subspace clustering of SAE features, and AANet simplex fitting to discover candidate simplex-structured subspaces in transformer residual streams. The pipeline is validated on a transformer trained on a multipartite hidden Markov model with known belief-state geometry, then applied to Gemma-2-9B, identifying 13 priority clusters (K ≥ 3). Among these, 5 clusters exhibit statistically significant barycentric prediction advantages (Wilcoxon p < 10^{-14} on near-vertex or interior samples) while no null clusters do; one cluster (768_596) additionally shows the highest causal steering score. The authors frame the results as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space.
Significance. If the barycentric prediction test reliably isolates genuine belief-state encodings rather than SAE-induced correlations or tiling artifacts, the work would supply a concrete, scalable method for detecting structured probabilistic representations in large language models trained on naturalistic data, extending prior HMM results. The synthetic validation and the single cluster where passive prediction and active steering converge are clear strengths. The preliminary framing and the open question of test specificity, however, limit the immediate interpretive weight.
major comments (3)
- [Results] Barycentric prediction test: The central claim that the test distinguishes genuine belief-state encoding from tiling artifacts rests on 5 of 13 clusters passing while null clusters do not, yet no analysis is supplied showing that the advantage cannot arise from residual-stream correlations or the SAE sparsity penalty alone when the features span a low-dimensional subspace.
- [§4] Gemma-2-9B application: The selectivity of the result (5/13 real clusters pass, 0 null clusters pass) is load-bearing for the interpretation, but the manuscript reports neither the full distribution of Wilcoxon statistics across all 13 clusters nor any multiple-comparison correction, making it impossible to assess whether the observed pattern exceeds what would be expected by chance under the null.
- [Validation] Synthetic HMM validation: The experiment confirms that the pipeline can recover known simplex geometry, but lacks quantitative metrics (e.g., vertex alignment error, reconstruction fidelity, or out-of-sample prediction accuracy) that would establish the reliability of the barycentric test when ground-truth states are absent, as in the Gemma-2-9B case.
minor comments (2)
- [Abstract] 'AANet' is used without expansion or citation on first appearance.
- [Throughout] Exact sample sizes, effect sizes, and full p-value tables for the Wilcoxon tests are not reported, reducing reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major point below with targeted revisions that strengthen the statistical controls, reporting, and validation metrics while preserving the preliminary framing of the results.
Point-by-point responses
Referee: [Results] Barycentric prediction test: The central claim that the test distinguishes genuine belief-state encoding from tiling artifacts rests on 5 of 13 clusters passing while null clusters do not, yet no analysis is supplied showing that the advantage cannot arise from residual-stream correlations or the SAE sparsity penalty alone when the features span a low-dimensional subspace.
Authors: We agree that the null clusters alone do not fully isolate the contribution of residual-stream correlations or the SAE sparsity penalty in low-dimensional subspaces. In the revised manuscript we will add an explicit ablation: (i) barycentric prediction on random low-dimensional linear subspaces drawn from the residual stream (matched for dimensionality), and (ii) the same test after randomly permuting activations within each SAE feature while preserving sparsity statistics. These controls will quantify whether the observed advantages are specific to the recovered simplex geometries rather than generic correlation or sparsity effects. Revision: yes.
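A sketch of what the two proposed controls could look like, assuming residual-stream activations H (n x d) and an SAE activation matrix A (n x features); these are generic implementations of the stated idea, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_subspace_control(H, dim):
    """Control (i): coordinates of activations H in a random orthonormal
    subspace of matched dimension, drawn from the residual stream."""
    Q, _ = np.linalg.qr(rng.standard_normal((H.shape[1], dim)))
    return H @ Q  # n x dim

def permuted_feature_control(A):
    """Control (ii): permute each SAE feature column independently,
    preserving per-feature marginals (including sparsity statistics)
    while destroying the joint geometry across features."""
    A_perm = A.copy()
    for j in range(A.shape[1]):
        A_perm[:, j] = rng.permutation(A_perm[:, j])
    return A_perm
```

Running the full simplex-fitting and barycentric pipeline on these controls should erase the prediction advantage if it is specific to the recovered geometry.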
Referee: [§4] Gemma-2-9B application: The selectivity of the result (5/13 real clusters pass, 0 null clusters pass) is load-bearing for the interpretation, but the manuscript reports neither the full distribution of Wilcoxon statistics across all 13 clusters nor any multiple-comparison correction, making it impossible to assess whether the observed pattern exceeds what would be expected by chance under the null.
Authors: We acknowledge that the full distribution of Wilcoxon statistics and multiple-comparison correction were omitted. The revised version will include a supplementary table listing the exact Wilcoxon p-values (near-vertex and interior splits) for all 13 priority clusters and all null clusters. We will apply Bonferroni correction across the 13 tests and report both uncorrected and corrected significance; we will also discuss the implications for the observed selectivity (5/13 passing) under the corrected threshold. Revision: yes.
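The promised correction is mechanical; a minimal version, where the alpha value is our assumption rather than a threshold stated in the response:

```python
import numpy as np

def bonferroni(pvals, alpha=0.001):
    """Bonferroni correction across m per-cluster Wilcoxon tests: a
    cluster passes only if its raw p-value clears alpha / m."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    return p < alpha / m, np.minimum(p * m, 1.0)

# Hypothetical usage on the 13 priority clusters' raw p-values:
# passed, p_adjusted = bonferroni(raw_pvals)
```

At the reported p < 10^{-14} level, a corrected threshold of alpha/13 is cleared by many orders of magnitude, so the headline clusters would likely survive.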
Referee: [Validation] Synthetic HMM validation: The synthetic HMM experiment confirms that the pipeline can recover known simplex geometry, but lacks quantitative metrics (e.g., vertex alignment error, reconstruction fidelity, or out-of-sample prediction accuracy) that would establish the reliability of the barycentric test when ground-truth states are absent, as in the Gemma-2-9B case.
Authors: The referee is correct that quantitative metrics are missing from the synthetic validation. In revision we will report three additional metrics on the HMM data: (1) vertex alignment error (mean Euclidean distance between recovered and ground-truth vertices after Procrustes alignment), (2) reconstruction fidelity (average cosine similarity between original residual-stream activations and those reconstructed from the fitted simplex), and (3) out-of-sample barycentric prediction accuracy on held-out sequences. These numbers will provide a concrete benchmark for the reliability of the barycentric test in the absence of ground truth. Revision: yes.
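Metric (1) could be operationalized along these lines; the orthogonal Procrustes alignment and the minimization over vertex relabelings are our assumptions about how "alignment error" would be computed:

```python
import numpy as np
from itertools import permutations
from scipy.linalg import orthogonal_procrustes

def vertex_alignment_error(V_hat, V_true):
    """Mean Euclidean distance between recovered vertices V_hat and
    ground-truth vertices V_true (both K x d) after centering, an
    orthogonal Procrustes rotation, and the best vertex relabeling.
    Brute force over K! labelings is fine for the small K used here."""
    A = V_hat - V_hat.mean(axis=0)
    best = np.inf
    for perm in permutations(range(len(V_true))):
        B = V_true[list(perm)] - V_true.mean(axis=0)
        R, _ = orthogonal_procrustes(A, B)  # rotation minimizing ||AR - B||
        best = min(best, np.linalg.norm(A @ R - B, axis=1).mean())
    return best
```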
Circularity Check
Barycentric prediction test remains independent of simplex fitting
Full rationale
The paper describes a pipeline of SAE feature extraction, k-subspace clustering, AANet simplex fitting, and subsequent barycentric prediction on near-vertex and interior samples. No equations are shown that reduce the reported Wilcoxon advantage to a fitted parameter by construction, nor does any self-citation chain supply the central discriminating test. The synthetic HMM validation and null-cluster controls are presented as external checks. The derivation therefore retains independent content and is scored as having only minor (non-load-bearing) self-citation risk at most.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: "We introduce a pipeline for discovering candidate simplex-structured subspaces... combining sparse autoencoders (SAEs), k-subspace clustering of SAE features, and simplex fitting using AANet... barycentric prediction as our primary discriminating test."
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (tag: echoes)
Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: "Prior work demonstrates this concretely in controlled settings. Transformers trained on sequences from hidden Markov models develop simplex-shaped geometries in their residual stream, with vertices corresponding to latent states."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- echoes: The passage shares the mathematical shape or conceptual pattern of a theorem, without a direct formal dependency.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. 2023. URL: https://transformer-circuits.pub/2023/monosemanticity/
- [2] doi: 10.1093/imaiai/iay008. Advance access published 27 June.
- [3] Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557.
- [4] David van Dijk, Daniel B. Burkhardt, Matthew Amodio, Alexander Tong, Guy Wolf, and Smita Krishnaswamy. Finding archetypal spaces using neural networks. arXiv:2405.15943.
- [5] Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning. doi:10.1101/2024.05.11.593705.
- [6] Riechers et al., 2025 (Tom Quantum / Bloch Walk generalized HMMs).
- [7] van Dijk et al., 2019 (AANet implementation).