pith. machine review for the scientific record.

arxiv: 2605.15183 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: 2 Lean theorem links

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords: tensor similarity · mechanistic interpretability · functional equivalence · network similarity · weight symmetries · recursive algorithm · grokking · backdoor insertion

The pith

Tensor similarity is a weight-based metric that algebraically determines when two neural networks implement the same computation by factoring out function-preserving weight-space symmetries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes tensor similarity to check whether two model components perform identical computations without depending on observed outputs or fixed parameter bases. Prior approaches either test empirical behavior, which misses out-of-distribution mechanisms, or compare raw weights, which treats equivalent rotated representations as different. The new metric uses a recursive algorithm to match tensors while accounting for cross-layer interactions and weight-space symmetries. This converts similarity verification into an exact algebraic computation. A reader would care because it offers a more reliable foundation for dissecting how models actually work during training.
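The symmetry problem the paper targets can be made concrete with a toy example. The sketch below (ours, not the paper's construction) builds a tiny ReLU MLP, permutes its hidden units so the function is unchanged, and shows that naive flattened-weight cosine similarity nevertheless reports the two parameterisations as different:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer ReLU MLP: f(x) = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

# Permuting the hidden units leaves the function unchanged:
# W2 P^T @ relu(P W1 x) = W2 @ relu(W1 x), since ReLU is elementwise.
perm = np.r_[1:8, 0]          # a fixed non-identity permutation
P = np.eye(8)[perm]
W1p, W2p = P @ W1, W2 @ P.T

relu = lambda z: np.maximum(z, 0.0)
x = rng.normal(size=(4, 100))
out_a = W2 @ relu(W1 @ x)
out_b = W2p @ relu(W1p @ x)
assert np.allclose(out_a, out_b)   # identical function

# ...yet raw-weight cosine similarity treats the two networks as different.
def flat_cos(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(flat_cos(W1, W1p))  # far from 1 despite exact functional equality
```

This is exactly the failure mode the paper attributes to basis-dependent parameter comparisons; the proposed metric is designed to score such pairs as identical.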

Core claim

Tensor similarity is introduced as a weight-based metric for tensor-based models that remains invariant to basis changes and other symmetries. It captures global functional equivalence, including cross-layer mechanisms, through an efficient recursive algorithm. The metric tracks functional training dynamics such as grokking and backdoor insertion with higher fidelity than existing measures, reducing the task of measuring similarity and verifying faithfulness to a solved algebraic problem.

What carries the argument

Tensor similarity metric computed by a recursive algorithm that matches tensors while respecting weight-space symmetries to detect functional equivalence.
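The review above does not spell out the recursive algorithm, so as a stand-in here is a much simpler symmetry-invariant weight comparison: matching singular-value spectra, which are unchanged under orthogonal basis changes. This is a minimal illustration of what "invariant weight-based similarity" means, not the paper's tensor-matching procedure:

```python
import numpy as np

def spectral_similarity(A, B):
    """Crude invariant proxy: cosine between normalised singular-value
    spectra. Singular values are invariant to orthogonal transforms
    (A -> U A V^T), so rotations and permutations that defeat raw-weight
    cosine similarity do not affect this score. NOT the paper's
    recursive algorithm; an illustrative sketch only."""
    sa = np.linalg.svd(A, compute_uv=False)
    sb = np.linalg.svd(B, compute_uv=False)
    sa, sb = sa / np.linalg.norm(sa), sb / np.linalg.norm(sb)
    return float(sa @ sb)

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 6))
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))  # random orthogonal matrix

# The score is exactly 1 for an orthogonally transformed copy.
assert abs(spectral_similarity(W, Q @ W) - 1.0) < 1e-6
```

Note the limitation: a spectrum match is necessary but not sufficient for functional equivalence, which is presumably why the paper needs a richer recursive construction that also tracks cross-layer interactions.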

If this is right

  • Tensor similarity tracks grokking and backdoor insertion during training with higher fidelity than behavior-based or parameter-based alternatives.
  • Verifying that two model parts implement the same mechanism becomes an algebraic calculation rather than an empirical approximation.
  • Cross-layer mechanisms are incorporated directly into the similarity computation.
  • The measure remains unchanged under weight-space symmetries that leave the implemented function intact.
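The invariance and equivalence properties in the list above can be stated compactly. Here $G$ denotes the group of function-preserving weight-space symmetries (permutations, scalings, basis changes) and $f_\theta$ the function implemented by parameters $\theta$; the notation is ours, not the paper's:

```latex
% Invariance: the metric is constant on symmetry orbits.
\forall g \in G:\quad \operatorname{sim}(\theta,\, g \cdot \theta') = \operatorname{sim}(\theta,\, \theta')

% Claimed characterisation of functional equivalence:
\operatorname{sim}(\theta,\, \theta') = 1 \;\Longleftrightarrow\; f_{\theta} = f_{\theta'}
```

The first line is the safety property (no false negatives from symmetries); the second, stronger biconditional is the load-bearing claim the referee report below asks to see proved.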

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the algebraic check holds, it could support automated circuit extraction by confirming when a candidate circuit matches a known reference.
  • The approach might extend to comparing checkpoints across training runs to detect when specific computations emerge or disappear.
  • It opens a route to exact equivalence checks between models trained on different random seeds or architectures that realize the same mapping.

Load-bearing premise

The recursive algorithm identifies every instance of functional equivalence in tensor-based models, missing neither non-linear interactions nor symmetries beyond weight-space basis changes.

What would settle it

Finding two networks that produce identical outputs for every input yet receive a low tensor similarity score, or two networks with high tensor similarity that implement demonstrably different functions.
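Either counterexample could be hunted for mechanically: sample inputs, check behavioural agreement, and compare against the similarity score. The helper below is an illustrative Monte-Carlo check (our choices of distribution, tolerance, and sample size, not the paper's protocol); a falsifying pair would pass this check on every distribution tried while scoring low tensor similarity, or vice versa:

```python
import numpy as np

def behaviour_match(f, g, n=1000, dim=4, tol=1e-6, seed=0):
    """Monte-Carlo check that two functions agree on sampled inputs.
    Agreement on samples is only evidence, not proof, of functional
    equality -- which is precisely the gap an exact algebraic metric
    is meant to close."""
    x = np.random.default_rng(seed).normal(size=(n, dim))
    return bool(np.max(np.abs(f(x) - g(x))) < tol)

# Two parameterisations of the same linear map agree everywhere:
A = np.arange(8.0).reshape(2, 4)
f = lambda x: x @ A.T
g = lambda x: (x @ (2 * A).T) / 2
assert behaviour_match(f, g)
```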

Figures

Figures reproduced from arXiv: 2605.15183 by Jacob Meyer Cohen, Laurence Wroe, Logan Riggs Smith, Melwina Albuquerque, ML Nissen Gonzalez, Thomas Dooms.

Figure 1: Overview of our method. The model learns a backdoor that existing similarity metrics …
Figure 2: Matrix similarity computes the direct cosine similarity between (flattened) weight matrices, but is sensitive to weight-space symmetries such as permutation and scaling, severely limiting it as an accurate proxy for functional equivalence, especially for models that are trained differently. Behavioural similarity (or output similarity) mitigates this by comparing model outputs directly, making it invariant …
Figure 3: Further setup details and additional plots are reported in Appendix B.
Figure 3: We report four similarity measures across the training trajectory of an SVHN model.
Figure 4: Modular addition training tracked by accuracy and loss (top), the Fourier components of the …
Figure 5: Five similarity measures across 101 log-spaced checkpoints of a two-layer bilinear attention …
Figure 6: Model evolution across the progressive training setup on SVHN described in subsection 3.1.
Figure 7: Tensor diff similarities across the progressive SVHN training setup (subsection 3.1).
Figure 8: Five one-layer bilinear models were trained for each of nine input data distributions, and …
original abstract

Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to out-of-distribution mechanisms, or basis-dependent parameters, meaning they disregard weight-space symmetries. To address these issues for the class of tensor-based models, we introduce a weight-based metric, tensor similarity, that is invariant to such symmetries. This metric captures global functional equivalence and accounts for cross-layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness into a solved algebraic problem rather than one of empirical approximation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces tensor similarity, a weight-based metric for tensor-based neural networks that is invariant to basis changes and other weight-space symmetries. It uses an efficient recursive algorithm to capture global functional equivalence, including cross-layer mechanisms, and claims this reduces similarity measurement and faithfulness verification to an algebraic problem. Empirically, the metric tracks training dynamics such as grokking and backdoor insertion with higher fidelity than existing metrics.

Significance. If the recursive algorithm is shown to correctly compute exact functional equivalence invariant to relevant symmetries and without omissions from non-linearities, the work would provide a principled algebraic alternative to empirical similarity measures in mechanistic interpretability, potentially improving reliability in verifying that model components implement equivalent computations.

major comments (2)
  1. [Abstract] The central claim that the recursive algorithm computes global functional equivalence for tensor models (including cross-layer mechanisms) without missing non-linear interactions rests on unshown correctness; no formal proof, derivation details, or exhaustive case analysis is referenced, with validation limited to correlation with training dynamics rather than direct equivalence checks.
  2. [Abstract] The assertion of higher fidelity than existing metrics on training dynamics (grokking, backdoor insertion) lacks any error analysis, validation setup, or quantitative comparison details, leaving the empirical support for the algebraic reduction unassessable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and valuable feedback on our work. Below we provide point-by-point responses to the major comments, indicating where revisions will be made to address the concerns raised.

point-by-point responses
  1. Referee: The central claim that the recursive algorithm computes global functional equivalence for tensor models (including cross-layer mechanisms) without missing non-linear interactions rests on unshown correctness; no formal proof, derivation details, or exhaustive case analysis is referenced, with validation limited to correlation with training dynamics rather than direct equivalence checks.

    Authors: We agree that the manuscript would be strengthened by including a formal proof or detailed derivation of the recursive algorithm. In the revised version, we will add this in an appendix, providing a step-by-step derivation based on the tensor contraction rules and symmetry groups. We will also include an exhaustive case analysis for linear and common non-linear layers to demonstrate no omissions in cross-layer mechanisms. Additionally, we will supplement the empirical results with direct equivalence tests on controlled examples. revision: yes

  2. Referee: The assertion of higher fidelity than existing metrics on training dynamics (grokking, backdoor insertion) lacks any error analysis, validation setup, or quantitative comparison details, leaving the empirical support for the algebraic reduction unassessable.

    Authors: We acknowledge the need for more rigorous empirical validation details. The revised manuscript will include a comprehensive description of the experimental setup, including model architectures, training procedures, the specific existing metrics used for comparison, and quantitative results with error analysis from multiple seeds. This will allow for a clear assessment of the higher fidelity claims. revision: yes

Circularity Check

0 steps flagged

Tensor similarity is defined as an independent algebraic metric, with no circular reduction to its own inputs and no load-bearing self-citations.

full rationale

The paper introduces tensor similarity as a weight-based metric invariant to symmetries, using an efficient recursive algorithm to capture global functional equivalence for tensor models. The abstract presents this directly as an algebraic construction that addresses limitations of empirical or basis-dependent measures, without any equations or steps that reduce by construction to fitted parameters, self-referential predictions, or load-bearing self-citations. No self-definitional loops, uniqueness theorems imported from prior author work, or renaming of known results appear in the derivation chain. The central claim remains an original proposal whose correctness is tested empirically on dynamics like grokking, keeping the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that neural networks are tensor-based and that functional equivalence is fully captured by symmetry-invariant weight comparisons; no free parameters or invented entities are mentioned.

axioms (2)
  • domain assumption Neural network computations are fully represented by tensor weights up to basis symmetries.
    Required for the invariance property to hold.
  • domain assumption A recursive algorithm can efficiently match cross-layer mechanisms without loss of equivalence information.
    Core to the global functional equivalence claim.

pith-pipeline@v0.9.0 · 5437 in / 1073 out tokens · 46762 ms · 2026-05-15T03:13:32.984659+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
