pith. machine review for the scientific record.

arxiv: 2604.02685 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


Finding Belief Geometries with Sparse Autoencoders

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords sparse autoencoders · belief states · simplex geometry · mechanistic interpretability · transformer representations · Gemma-2-9B · barycentric coordinates · hidden Markov models

The pith

Sparse autoencoders uncover five clusters in Gemma-2-9B whose subspaces show simplex geometry with predictive belief-state coordinates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a pipeline that extracts features with sparse autoencoders, clusters them into candidate subspaces, and fits simplices to look for structures that encode probabilistic beliefs over hidden states. It first validates the method on a transformer trained on a known hidden Markov model, then applies it to Gemma-2-9B to find 13 priority clusters with simplex-like shapes. The central test checks whether the barycentric coordinates within these subspaces carry extra predictive power for the underlying states on near-vertex and interior samples. Five distinct clusters pass this test at high statistical significance while control clusters do not, and one also responds strongly to causal steering. The authors treat the results as preliminary evidence that large language models can develop genuine belief-like geometries.
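
The clustering step in the middle of that pipeline, grouping SAE decoder directions by the low-dimensional subspace they appear to share, is the least standard piece. Below is a minimal NumPy sketch of alternating k-subspace clustering; the paper's actual algorithm, initialization, and hyperparameters (number of clusters, subspace dimension) are not specified in this review, so everything in the sketch is an illustrative assumption.

```python
import numpy as np

def k_subspace_cluster(directions, k=4, dim=3, iters=20, seed=0):
    """Toy alternating k-subspace clustering of unit-normalized SAE
    decoder directions: refit each cluster's subspace by SVD, then
    reassign every direction to the subspace that reconstructs it best."""
    rng = np.random.default_rng(seed)
    n, _ = directions.shape
    labels = rng.integers(k, size=n)
    for _ in range(iters):
        bases = []
        for j in range(k):
            pts = directions[labels == j]
            if len(pts) < dim:  # re-seed clusters that collapsed
                pts = directions[rng.choice(n, size=dim, replace=False)]
            _, _, vt = np.linalg.svd(pts, full_matrices=False)
            bases.append(vt[:dim])  # top-dim right singular vectors
        # distance from each direction to its projection onto each subspace
        residuals = np.stack([
            np.linalg.norm(directions - (directions @ b.T) @ b, axis=1)
            for b in bases])
        labels = residuals.argmin(axis=0)
    return labels, bases
```

Feeding in the (n_latents, d_model) matrix of normalized decoder rows yields candidate clusters whose members can then be passed to the simplex-fitting stage.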

Core claim

A pipeline of sparse autoencoders, k-subspace clustering, and AANet simplex fitting identifies 13 candidate clusters in Gemma-2-9B. Five of these pass the barycentric prediction test on near-vertex or simplex-interior samples (three on the near-vertex split, four on the interior split, with Wilcoxon p-values below 10^{-14}), while no null cluster passes either split. Cluster 768_596 additionally records the highest causal steering score, making it the only case where passive prediction and active intervention align.

What carries the argument

The barycentric prediction test applied to simplex-fitted subspaces extracted from SAE features, which checks whether mixture coordinates add predictive signal beyond individual latents.
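
A minimal sketch of that test's logic, assuming the cluster's SAE activations, fitted barycentric coordinates, and a matrix of probe targets are already in hand. For brevity it uses in-sample linear-probe R² and ignores the paper's near-vertex/interior sample splits; the array shapes and probe choice are assumptions, not the authors' protocol.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def barycentric_advantage(latents, bary, targets):
    """latents: (n, m) SAE activations in the cluster; bary: (n, K)
    fitted mixture coordinates; targets: (n, T), one column per probe
    target. Pairs each target's barycentric R^2 against its best
    single-latent R^2 and tests the difference with a signed-rank test."""
    r2_bary, r2_best = [], []
    for t in range(targets.shape[1]):
        y = targets[:, t]
        fit = LinearRegression().fit(bary, y)
        r2_bary.append(r2_score(y, fit.predict(bary)))
        r2_best.append(max(
            r2_score(y, LinearRegression().fit(x[:, None], y).predict(x[:, None]))
            for x in latents.T))
    stat, p = wilcoxon(r2_bary, r2_best, alternative="greater")
    return float(np.mean(r2_bary)), float(np.mean(r2_best)), float(p)
```

The key design point is the pairing: each probe target contributes one (barycentric, best-latent) pair, so the Wilcoxon test asks whether the mixture coordinates win consistently across targets, not just on average.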

If this is right

  • If the interpretation holds, large language models encode uncertainty over latent generative states as geometric mixtures inside their residual stream.
  • Barycentric prediction offers a concrete criterion that can separate real belief geometries from accidental simplex shapes produced by the autoencoder.
  • The same pipeline can be applied to other models and datasets to map additional belief-like structures.
  • One cluster already shows both passive and active evidence, suggesting a concrete target for further causal experiments.
  • Confirmation would link controlled HMM results to naturalistic text training in a single geometric framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the method to smaller or larger models could test whether simplex belief geometry scales with model size or training data diversity.
  • If the coordinates truly represent beliefs, targeted interventions on them should produce predictable changes in the model's uncertainty about specific facts.
  • The approach might reveal how models maintain multiple possible world states simultaneously rather than committing to single interpretations.

Load-bearing premise

That passing the barycentric prediction test on near-vertex or interior samples reliably distinguishes genuine belief-state encoding from tiling artifacts or spurious correlations in the SAE features.

What would settle it

Run the same pipeline on a model trained without the relevant latent structure and observe that the identified clusters lose their prediction advantage on near-vertex and interior samples.

Figures

Figures reproduced from arXiv: 2604.02685 by Matthew Levinson.

Figure 1. Overview of the belief geometry discovery pipeline. Text sequences are processed through Gemma-2-9B, and residual-stream activations at layer 20 are encoded by a GemmaScope JumpReLU SAE. Decoder directions of SAE latents are clustered into candidate latent groups using k-subspace clustering. Each candidate group is then fit with AANet to test for simplex structure and recover barycentric coordinates. …
Figure 2. Cluster 768_596: per-latent centroid positions. Mean barycentric centroid of each of the six latents. The latents partition across the three vertices, consistent with vertex-specialized feature coding.
Figure 3. Per-sub-component PCA projections, toy model layer 1. Each panel shows the PCA subspace that best reveals one sub-component's geometry, colored by that component's true discrete output token. Every panel also exhibits clear separation for at least one other sub-component, demonstrating cross-component entanglement.
Figure 4. Two representative latents from Cluster 4 (multipartite toy model, layer 1, TopK K = 12). Each panel shows KDE-smoothed activation density over all five component belief geometries. The assigned component is marked ⋆ (R² = 0.89). Top: Latent 4 fires near the top vertex of each Mess3 simplex and at the centre of the Tom Quantum disks. Bottom: Latent 26 fires near the base vertices and along the annular rim…
Figure 5. Cluster 512_181: barycentric vs. best-latent R². Barycentric coordinates (mean R² = 0.612) outperform the best individual latent (mean R² = 0.539) for every one of the 50 tokens (Wilcoxon p < 10^{-15}).
Figure 6. Cluster 512_229: barycentric vs. best-latent R². Barycentric coordinates (mean R² = 0.378) outperform the best individual latent (mean R² = 0.244) for all 50 tokens (Wilcoxon p < 10^{-15}), with a ~55% relative improvement.
Figure 7. Cluster 768_596: barycentric vs. best-latent R². Barycentric coordinates (mean R² = 0.269) outperform the best individual latent (mean R² = 0.217) for 98% of tokens (Wilcoxon p < 10^{-14}). This cluster also has the highest causal steering score (0.419).
Original abstract

Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations, combining sparse autoencoders (SAEs), $k$-subspace clustering of SAE features, and simplex fitting using AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applied to Gemma-2-9B, we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. We therefore adopt barycentric prediction as our primary discriminating test. Among the 13 priority clusters, 3 exhibit a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$) and 4 on simplex-interior samples. Together 5 distinct real clusters pass at least one split, while no null cluster passes either. One cluster, 768_596, additionally achieves the highest causal steering score in the dataset. This is the only case where passive prediction and active intervention converge. We present these findings as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space, and identify the structured evaluation that would be required to confirm this interpretation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a pipeline that combines sparse autoencoders, k-subspace clustering of SAE features, and AANet simplex fitting to discover candidate simplex-structured subspaces in transformer residual streams. The pipeline is validated on a transformer trained on a multipartite hidden Markov model with known belief-state geometry, then applied to Gemma-2-9B, identifying 13 priority clusters (K ≥ 3). Among these, 5 clusters exhibit statistically significant barycentric prediction advantages (Wilcoxon p < 10^{-14} on near-vertex or interior samples) while no null clusters do; one cluster (768_596) additionally shows the highest causal steering score. The authors frame the results as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space.

Significance. If the barycentric prediction test reliably isolates genuine belief-state encodings rather than SAE-induced correlations or tiling artifacts, the work would supply a concrete, scalable method for detecting structured probabilistic representations in large language models trained on naturalistic data, extending prior HMM results. The synthetic validation and the single cluster where passive prediction and active steering converge are clear strengths. The preliminary framing and the open question of test specificity, however, limit the immediate interpretive weight.

major comments (3)
  1. [Results section] The central claim that the barycentric prediction test distinguishes genuine belief-state encoding from tiling artifacts rests on 5 of 13 clusters passing while null clusters do not, yet no analysis is supplied showing that the advantage cannot arise from residual-stream correlations or the SAE sparsity penalty alone when the features span a low-dimensional subspace.
  2. [§4, Gemma-2-9B application] The selectivity of the result (5/13 clusters pass, 0 null clusters pass) is load-bearing for the interpretation, but the manuscript reports neither the full distribution of Wilcoxon statistics across all 13 clusters nor any multiple-comparison correction, making it impossible to assess whether the observed pattern exceeds what would be expected by chance under the null.
  3. [Validation section] The synthetic HMM experiment confirms that the pipeline can recover known simplex geometry, but lacks quantitative metrics (e.g., vertex alignment error, reconstruction fidelity, or out-of-sample prediction accuracy) that would establish the reliability of the barycentric test when ground-truth states are absent, as in the Gemma-2-9B case.
minor comments (2)
  1. [Abstract] 'AANet' is used without expansion or citation on first appearance.
  2. [Throughout] Exact sample sizes, effect sizes, and full p-value tables for the Wilcoxon tests are not reported, reducing reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below with targeted revisions that strengthen the statistical controls, reporting, and validation metrics while preserving the preliminary framing of the results.

Point-by-point responses
  1. Referee: [Results section] The central claim that the barycentric prediction test distinguishes genuine belief-state encoding from tiling artifacts rests on 5 of 13 clusters passing while null clusters do not, yet no analysis is supplied showing that the advantage cannot arise from residual-stream correlations or the SAE sparsity penalty alone when the features span a low-dimensional subspace.

    Authors: We agree that the null clusters alone do not fully isolate the contribution of residual-stream correlations or the SAE sparsity penalty in low-dimensional subspaces. In the revised manuscript we will add an explicit ablation: (i) barycentric prediction on random low-dimensional linear subspaces drawn from the residual stream (matched for dimensionality), and (ii) the same test after randomly permuting activations within each SAE feature while preserving sparsity statistics. These controls will quantify whether the observed advantages are specific to the recovered simplex geometries rather than generic correlation or sparsity effects; both controls are sketched in code after this list. revision: yes

  2. Referee: [§4, Gemma-2-9B application] The selectivity of the result (5/13 clusters pass, 0 null clusters pass) is load-bearing for the interpretation, but the manuscript reports neither the full distribution of Wilcoxon statistics across all 13 clusters nor any multiple-comparison correction, making it impossible to assess whether the observed pattern exceeds what would be expected by chance under the null.

    Authors: We acknowledge that the full distribution of Wilcoxon statistics and multiple-comparison correction were omitted. The revised version will include a supplementary table listing the exact Wilcoxon p-values (near-vertex and interior splits) for all 13 priority clusters and all null clusters. We will apply Bonferroni correction across the 13 tests and report both uncorrected and corrected significance; we will also discuss the implications for the observed selectivity (5/13 passing) under the corrected threshold. revision: yes

  3. Referee: [Validation section] The synthetic HMM experiment confirms that the pipeline can recover known simplex geometry, but lacks quantitative metrics (e.g., vertex alignment error, reconstruction fidelity, or out-of-sample prediction accuracy) that would establish the reliability of the barycentric test when ground-truth states are absent, as in the Gemma-2-9B case.

    Authors: The referee is correct that quantitative metrics are missing from the synthetic validation. In revision we will report three additional metrics on the HMM data: (1) vertex alignment error (mean Euclidean distance between recovered and ground-truth vertices after Procrustes alignment), (2) reconstruction fidelity (average cosine similarity between original residual-stream activations and those reconstructed from the fitted simplex), and (3) out-of-sample barycentric prediction accuracy on held-out sequences. These numbers will provide a concrete benchmark for the reliability of the barycentric test in the absence of ground truth. revision: yes
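
The two controls promised in response 1 reduce to a few lines each. A minimal sketch, assuming SAE activations arrive as an (n_samples, n_latents) array and residual-stream states as (n_samples, d_model); the function names are illustrative, and either output can be pushed through the same barycentric prediction test described above.

```python
import numpy as np

def permutation_control(latents, seed=0):
    """Control (ii): permute each SAE latent's activations independently
    across samples. Per-latent marginals (hence sparsity statistics)
    survive, but any joint structure tying the latents into a shared
    simplex is destroyed."""
    rng = np.random.default_rng(seed)
    permuted = latents.copy()
    for j in range(permuted.shape[1]):
        permuted[:, j] = rng.permutation(permuted[:, j])
    return permuted

def random_subspace_control(resid, dim, seed=0):
    """Control (i): coordinates of residual-stream states in a random
    linear subspace matched in dimensionality to a candidate cluster."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((resid.shape[1], dim)))
    return resid @ q  # (n_samples, dim) coordinates
```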
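
The first two metrics promised in response 3 are similarly small. A sketch assuming the recovered and ground-truth vertices are (K, d) arrays with matched row order; the unmatched case and the third metric (out-of-sample barycentric prediction, i.e. the test above run on held-out sequences) are noted rather than implemented.

```python
import numpy as np
from scipy.spatial import procrustes

def vertex_alignment_error(v_true, v_fit):
    """Metric (1): mean per-vertex distance after Procrustes alignment
    (translation, scale, rotation removed). Assumes matched row order;
    with unknown correspondence, run an assignment over pairwise
    distances (e.g. scipy.optimize.linear_sum_assignment) first."""
    m_true, m_fit, _ = procrustes(v_true, v_fit)
    return float(np.linalg.norm(m_true - m_fit, axis=1).mean())

def reconstruction_fidelity(acts, recon):
    """Metric (2): mean cosine similarity between original residual-stream
    activations and their reconstructions from the fitted simplex."""
    num = (acts * recon).sum(axis=1)
    den = np.linalg.norm(acts, axis=1) * np.linalg.norm(recon, axis=1) + 1e-9
    return float((num / den).mean())
```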

Circularity Check

0 steps flagged

Barycentric prediction test remains independent of simplex fitting

Full rationale

The paper describes a pipeline of SAE feature extraction, k-subspace clustering, AANet simplex fitting, and subsequent barycentric prediction on near-vertex and interior samples. No equations are shown that reduce the reported Wilcoxon advantage to a fitted parameter by construction, nor does any self-citation chain supply the central discriminating test. The synthetic HMM validation and null-cluster controls are presented as external checks. The derivation therefore retains independent content and is scored as having only minor (non-load-bearing) self-citation risk at most.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.0 · 5594 in / 1076 out tokens · 38000 ms · 2026-05-13T20:30:16.810291+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "We introduce a pipeline for discovering candidate simplex-structured subspaces... combining sparse autoencoders (SAEs), k-subspace clustering of SAE features, and simplex fitting using AANet... barycentric prediction as our primary discriminating test."

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: echoes

    Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    "Prior work demonstrates this concretely in controlled settings. Transformers trained on sequences from hidden Markov models develop simplex-shaped geometries in their residual stream, with vertices corresponding to latent states."

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. URL: https://transformer-circuits.pub/2023/monosemanticity/.

  2. [2] doi:10.1093/imaiai/iay008. Advance access published 27 June.

  3. [3] Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv preprint arXiv:2406.17557.

  4. [4] David van Dijk, Daniel B. Burkhardt, Matthew Amodio, Alexander Tong, Guy Wolf, and Smita Krishnaswamy. Finding archetypal spaces using neural networks. arXiv:2405.15943.

  5. [5] Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning.

  6. [6] Internal anchor (paper body): "...and two Tom Quantum (Bloch Walk) generalized HMMs (Riechers et al., 2025). The three Mess3 instances have parameters (x, a) ∈ {(0.05, 0.85), (0.075, 0.90), (0.10, 0.95)}; the two Tom Quantum instances have parameters (α, β) ∈ {(1.51, 3.07), (1.99, 2.51)}. The joint vocabulary is 4 × 4 × 3 × 3 × 3 = 432 tokens. This setup is intentionally challenging. Ea..."

  7. [7] Appendix D, AANet Architecture and Training: "We use the AANet implementation from van Dijk et al. (2019). The encoder consists of two hidden layers (widths 256, 128 by default), a bottleneck of dimension K − 1 (to enforce the simplex constraint via softmax normalization), and an analogous decoder. The loss combines L2 reconstruction, a simplex penalty λ1(1 − ∥E(a)∥1..."
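
Reference [7] quotes enough of the AANet setup to sketch the fitting module. A minimal PyTorch rendering under stated assumptions: the hidden widths (256, 128) follow the quoted text, but the softmax-over-K-logits parameterization below enforces the simplex constraint exactly, which makes the quoted λ1 penalty redundant; it is a simplification for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimplexAutoencoder(nn.Module):
    """AANet-style sketch: encode an activation to K barycentric weights
    via softmax, decode as a convex combination of K learned vertices."""
    def __init__(self, d_in, k):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, k))                               # K logits
        self.vertices = nn.Parameter(torch.randn(k, d_in))   # simplex vertices

    def forward(self, a):
        w = torch.softmax(self.encoder(a), dim=-1)  # barycentric coordinates
        return w @ self.vertices, w                 # reconstruction, coords

def reconstruction_loss(a, model):
    # Plain L2 reconstruction; with a hard softmax constraint the quoted
    # simplex penalty lambda_1 * (1 - ||E(a)||_1 ...) is unnecessary.
    recon, _ = model(a)
    return ((a - recon) ** 2).mean()
```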