pith. machine review for the scientific record.

arxiv: 2605.15183 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: 2 Lean theorem links

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords: tensor similarity · mechanistic interpretability · functional equivalence · network similarity · weight symmetries · recursive algorithm · grokking · backdoor insertion

The pith

Tensor similarity is a weight-based metric that algebraically determines when two neural networks implement the same computation by factoring out function-preserving weight-space symmetries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes tensor similarity to check whether two model components perform identical computations without depending on observed outputs or fixed parameter bases. Prior approaches either test empirical behavior, which misses out-of-distribution mechanisms, or compare raw weights, which treats equivalent rotated representations as different. The new metric uses a recursive algorithm to match tensors while accounting for cross-layer interactions and weight-space symmetries. This converts similarity verification into an exact algebraic computation. A reader would care because it offers a more reliable foundation for dissecting how models actually work during training.
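The symmetry problem the paper targets can be made concrete with a toy example. The sketch below (ours, not the paper's construction) builds a tiny ReLU MLP, permutes its hidden units so the function is unchanged, and shows that naive flattened-weight cosine similarity nevertheless reports the two parameterisations as different:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer ReLU MLP: f(x) = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

# Permuting the hidden units leaves the function unchanged:
# W2 P^T @ relu(P W1 x) = W2 @ relu(W1 x), since ReLU is elementwise.
perm = np.r_[1:8, 0]          # a fixed non-identity permutation
P = np.eye(8)[perm]
W1p, W2p = P @ W1, W2 @ P.T

relu = lambda z: np.maximum(z, 0.0)
x = rng.normal(size=(4, 100))
out_a = W2 @ relu(W1 @ x)
out_b = W2p @ relu(W1p @ x)
assert np.allclose(out_a, out_b)   # identical function

# ...yet raw-weight cosine similarity treats the two networks as different.
def flat_cos(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(flat_cos(W1, W1p))  # far from 1 despite exact functional equality
```

This is exactly the failure mode the paper attributes to basis-dependent parameter comparisons; the proposed metric is designed to score such pairs as identical.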

Core claim

Tensor similarity is introduced as a weight-based metric for tensor-based models that remains invariant to basis changes and other symmetries. It captures global functional equivalence, including cross-layer mechanisms, through an efficient recursive algorithm. The metric tracks functional training dynamics such as grokking and backdoor insertion with higher fidelity than existing measures, reducing the task of measuring similarity and verifying faithfulness to a solved algebraic problem.

What carries the argument

Tensor similarity metric computed by a recursive algorithm that matches tensors while respecting weight-space symmetries to detect functional equivalence.
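The review above does not spell out the recursive algorithm, so as a stand-in here is a much simpler symmetry-invariant weight comparison: matching singular-value spectra, which are unchanged under orthogonal basis changes. This is a minimal illustration of what "invariant weight-based similarity" means, not the paper's tensor-matching procedure:

```python
import numpy as np

def spectral_similarity(A, B):
    """Crude invariant proxy: cosine between normalised singular-value
    spectra. Singular values are invariant to orthogonal transforms
    (A -> U A V^T), so rotations and permutations that defeat raw-weight
    cosine similarity do not affect this score. NOT the paper's
    recursive algorithm; an illustrative sketch only."""
    sa = np.linalg.svd(A, compute_uv=False)
    sb = np.linalg.svd(B, compute_uv=False)
    sa, sb = sa / np.linalg.norm(sa), sb / np.linalg.norm(sb)
    return float(sa @ sb)

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 6))
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))  # random orthogonal matrix

# The score is exactly 1 for an orthogonally transformed copy.
assert abs(spectral_similarity(W, Q @ W) - 1.0) < 1e-6
```

Note the limitation: a spectrum match is necessary but not sufficient for functional equivalence, which is presumably why the paper needs a richer recursive construction that also tracks cross-layer interactions.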

If this is right

  • Tensor similarity tracks grokking and backdoor insertion during training with higher fidelity than behavior-based or parameter-based alternatives.
  • Verifying that two model parts implement the same mechanism becomes an algebraic calculation rather than an empirical approximation.
  • Cross-layer mechanisms are incorporated directly into the similarity computation.
  • The measure remains unchanged under weight-space symmetries that leave the implemented function intact.
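The invariance and equivalence properties in the list above can be stated compactly. Here $G$ denotes the group of function-preserving weight-space symmetries (permutations, scalings, basis changes) and $f_\theta$ the function implemented by parameters $\theta$; the notation is ours, not the paper's:

```latex
% Invariance: the metric is constant on symmetry orbits.
\forall g \in G:\quad \operatorname{sim}(\theta,\, g \cdot \theta') = \operatorname{sim}(\theta,\, \theta')

% Claimed characterisation of functional equivalence:
\operatorname{sim}(\theta,\, \theta') = 1 \;\Longleftrightarrow\; f_{\theta} = f_{\theta'}
```

The first line is the safety property (no false negatives from symmetries); the second, stronger biconditional is the load-bearing claim the referee report below asks to see proved.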

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the algebraic check holds, it could support automated circuit extraction by confirming when a candidate circuit matches a known reference.
  • The approach might extend to comparing checkpoints across training runs to detect when specific computations emerge or disappear.
  • It opens a route to exact equivalence checks between models trained on different random seeds or architectures that realize the same mapping.

Load-bearing premise

The recursive algorithm identifies every instance of functional equivalence in tensor-based models, missing neither non-linear interactions nor symmetries beyond weight-space basis changes.

What would settle it

Finding two networks that produce identical outputs for every input yet receive a low tensor similarity score, or two networks with high tensor similarity that implement demonstrably different functions.
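Either counterexample could be hunted for mechanically: sample inputs, check behavioural agreement, and compare against the similarity score. The helper below is an illustrative Monte-Carlo check (our choices of distribution, tolerance, and sample size, not the paper's protocol); a falsifying pair would pass this check on every distribution tried while scoring low tensor similarity, or vice versa:

```python
import numpy as np

def behaviour_match(f, g, n=1000, dim=4, tol=1e-6, seed=0):
    """Monte-Carlo check that two functions agree on sampled inputs.
    Agreement on samples is only evidence, not proof, of functional
    equality -- which is precisely the gap an exact algebraic metric
    is meant to close."""
    x = np.random.default_rng(seed).normal(size=(n, dim))
    return bool(np.max(np.abs(f(x) - g(x))) < tol)

# Two parameterisations of the same linear map agree everywhere:
A = np.arange(8.0).reshape(2, 4)
f = lambda x: x @ A.T
g = lambda x: (x @ (2 * A).T) / 2
assert behaviour_match(f, g)
```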

Figures

Figures reproduced from arXiv: 2605.15183 by Jacob Meyer Cohen, Laurence Wroe, Logan Riggs Smith, Melwina Albuquerque, ML Nissen Gonzalez, Thomas Dooms.

Figure 1: Overview of our method. The model learns a backdoor that existing similarity metrics …
Figure 2: Matrix similarity computes the direct cosine similarity between (flattened) weight matrices, but is sensitive to weight-space symmetries such as permutation and scaling, severely limiting it as an accurate proxy for functional equivalence, especially for models that are trained differently. Behavioural similarity (or output similarity) mitigates this by comparing model outputs directly, making it invariant …
Figure 3: Further setup details and additional plots are reported in Appendix B.
Figure 3: We report four similarity measures across the training trajectory of an SVHN model.
Figure 4: Modular addition training tracked by accuracy and loss (top), the Fourier components of the …
Figure 5: Five similarity measures across 101 log-spaced checkpoints of a two-layer bilinear attention …
Figure 6: Model evolution across the progressive training setup on SVHN described in subsection 3.1.
Figure 7: Tensor diff similarities across the progressive SVHN training setup (subsection 3.1).
Figure 8: Five one-layer bilinear models were trained for each of nine input data distributions, and …
original abstract

Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to out-of-distribution mechanisms, or basis-dependent parameters, meaning they disregard weight-space symmetries. To address these issues for the class of tensor-based models, we introduce a weight-based metric, tensor similarity, that is invariant to such symmetries. This metric captures global functional equivalence and accounts for cross-layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness into a solved algebraic problem rather than one of empirical approximation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces tensor similarity, a weight-based metric for tensor-based neural networks that is invariant to basis changes and other weight-space symmetries. It uses an efficient recursive algorithm to capture global functional equivalence, including cross-layer mechanisms, and claims this reduces similarity measurement and faithfulness verification to an algebraic problem. Empirically, the metric tracks training dynamics such as grokking and backdoor insertion with higher fidelity than existing metrics.

Significance. If the recursive algorithm is shown to correctly compute exact functional equivalence invariant to relevant symmetries and without omissions from non-linearities, the work would provide a principled algebraic alternative to empirical similarity measures in mechanistic interpretability, potentially improving reliability in verifying that model components implement equivalent computations.

major comments (2)
  1. [Abstract] The central claim that the recursive algorithm computes global functional equivalence for tensor models (including cross-layer mechanisms) without missing non-linear interactions rests on unshown correctness; no formal proof, derivation details, or exhaustive case analysis is referenced, with validation limited to correlation with training dynamics rather than direct equivalence checks.
  2. [Abstract] The assertion of higher fidelity than existing metrics on training dynamics (grokking, backdoor insertion) lacks any error analysis, validation setup, or quantitative comparison details, leaving the empirical support for the algebraic reduction unassessable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and valuable feedback on our work. Below we provide point-by-point responses to the major comments, indicating where revisions will be made to address the concerns raised.

point-by-point responses
  1. Referee: The central claim that the recursive algorithm computes global functional equivalence for tensor models (including cross-layer mechanisms) without missing non-linear interactions rests on unshown correctness; no formal proof, derivation details, or exhaustive case analysis is referenced, with validation limited to correlation with training dynamics rather than direct equivalence checks.

    Authors: We agree that the manuscript would be strengthened by including a formal proof or detailed derivation of the recursive algorithm. In the revised version, we will add this in an appendix, providing a step-by-step derivation based on the tensor contraction rules and symmetry groups. We will also include an exhaustive case analysis for linear and common non-linear layers to demonstrate no omissions in cross-layer mechanisms. Additionally, we will supplement the empirical results with direct equivalence tests on controlled examples. revision: yes

  2. Referee: The assertion of higher fidelity than existing metrics on training dynamics (grokking, backdoor insertion) lacks any error analysis, validation setup, or quantitative comparison details, leaving the empirical support for the algebraic reduction unassessable.

    Authors: We acknowledge the need for more rigorous empirical validation details. The revised manuscript will include a comprehensive description of the experimental setup, including model architectures, training procedures, the specific existing metrics used for comparison, and quantitative results with error analysis from multiple seeds. This will allow for a clear assessment of the higher fidelity claims. revision: yes

Circularity Check

0 steps flagged

Tensor similarity is defined as an independent algebraic metric, with no circular reduction to its own inputs and no load-bearing self-citations.

full rationale

The paper introduces tensor similarity as a weight-based metric invariant to symmetries, using an efficient recursive algorithm to capture global functional equivalence for tensor models. The abstract presents this directly as an algebraic construction that addresses limitations of empirical or basis-dependent measures, without any equations or steps that reduce by construction to fitted parameters, self-referential predictions, or load-bearing self-citations. No self-definitional loops, uniqueness theorems imported from prior author work, or renaming of known results appear in the derivation chain. The central claim remains an original proposal whose correctness is tested empirically on dynamics like grokking, keeping the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that neural networks are tensor-based and that functional equivalence is fully captured by symmetry-invariant weight comparisons; no free parameters or invented entities are mentioned.

axioms (2)
  • domain assumption Neural network computations are fully represented by tensor weights up to basis symmetries.
    Required for the invariance property to hold.
  • domain assumption A recursive algorithm can efficiently match cross-layer mechanisms without loss of equivalence information.
    Core to the global functional equivalence claim.

pith-pipeline@v0.9.0 · 5437 in / 1073 out tokens · 46762 ms · 2026-05-15T03:13:32.984659+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
