Pith · machine review for the scientific record

arxiv: 2605.08934 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

From Mechanistic to Compositional Interpretability

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords compositional interpretability · mechanistic interpretability · category theory · minimum description length · compressive refinement · neural network explanations · model decomposition · parsimony criterion

The pith

Compositional interpretability defines explanations as pairs of syntactic and semantic mappings that must commute, scored under minimum description length, making explanations verifiable and optimizable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a category-theoretic framework that formalizes mechanistic interpretability by requiring pairs of syntactic and semantic mappings to commute, thereby enforcing consistency between a model's internal decomposition and its actual behavior. It measures explanation quality through faithfulness to observed outputs and complexity measured by minimum description length, recasting interpretability as a constrained optimization task. The work introduces compressive refinement as a systematic way to break models into simpler functional parts and proves a parsimony criterion showing that syntactic compression produces more concise explanations aligned with human understanding. If the framework holds, existing mechanistic techniques become special cases of this refinement process, providing an objective basis for comparing and automating explanations.

Core claim

Compositional interpretations are pairs of syntactic and semantic mappings that must commute to ensure a model's decomposition matches its observed behavior. Explanation quality decomposes into faithfulness and complexity, turning interpretability into constrained optimization. Compressive refinement restructures a model into simpler parts that preserve function exactly, and a parsimony criterion proves that syntactic compression under minimum description length yields more concise, human-aligned explanations. Prominent mechanistic methods appear as subclasses of this refinement, explaining why their heuristics often match human interpretability preferences.
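The commuting condition at the heart of this claim can be made concrete: an interpretation supplies a syntactic decomposition plus a semantic assignment, and the square commutes when composing the semantics reproduces the model exactly. A minimal sketch, assuming a two-part decomposition of a tiny ReLU network; every name and shape here is illustrative, not the paper's notation.

```python
# Illustrative sketch only: a toy "commuting square" check for a
# compositional interpretation. Names and shapes are ours, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))

def model(x):
    # original network: f(x) = W2 · relu(W1 · x)
    return W2 @ np.maximum(W1 @ x, 0.0)

# syntactic decomposition: an ordered list of named parts
syntax = ["encoder", "readout"]

# semantic mapping [[·]]: assigns each syntactic part a function
semantics = {
    "encoder": lambda x: np.maximum(W1 @ x, 0.0),
    "readout": lambda h: W2 @ h,
}

def interpret(x):
    # compose the semantics in syntactic order
    h = x
    for part in syntax:
        h = semantics[part](h)
    return h

# commuting condition: the interpretation agrees with the model
# (checked here on random probes rather than proved)
probes = rng.normal(size=(100, 2))
commutes = all(np.allclose(model(x), interpret(x)) for x in probes)
print(commutes)  # True
```

An incomplete decomposition (say, dropping the ReLU from the encoder's semantics) would fail this check, which is what makes commutativity a substantive constraint rather than a labeling convention.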

What carries the argument

Commuting pairs of syntactic and semantic mappings, enforced by category theory and minimum description length, with compressive refinement as the process that simplifies decompositions while preserving exact function.

If this is right

  • Existing mechanistic interpretability methods become subclasses of compressive refinement within the same formal structure.
  • Explanation creation reduces to a measurable optimization balancing faithfulness against description length.
  • Syntactic compression is guaranteed to produce more concise explanations that remain aligned with human interpretability.
  • Interpretations can be composed and verified objectively because the commuting condition enforces consistency with observed behavior.
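The second bullet, explanation creation as a measurable optimization, can be sketched as a penalized search over candidate decompositions. The scoring rule below (squared-error faithfulness plus a nonzero-parameter count as a crude description-length proxy, weighted by an invented λ) is our illustration of the MDL trade-off, not the paper's objective.

```python
# Illustrative sketch: faithfulness-vs-complexity as a scored search
# over candidate explanations of a linear model.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.0, 0.0])
y = X @ true_w

def description_length(w, tol=1e-8):
    # crude proxy: number of nonzero parameters in the explanation
    return int(np.sum(np.abs(w) > tol))

def faithfulness_loss(w):
    # divergence between model behaviour and the explanation's predictions
    return float(np.mean((X @ w - y) ** 2))

def score(w, lam=0.1):
    # constrained optimization recast as a penalized objective:
    # faithful AND short explanations win
    return faithfulness_loss(w) + lam * description_length(w)

candidates = {
    "full":   true_w + rng.normal(scale=1e-3, size=4),  # faithful but dense
    "sparse": np.array([1.0, -2.0, 0.0, 0.0]),          # faithful and short
    "empty":  np.zeros(4),                              # short but unfaithful
}
best = min(candidates, key=lambda k: score(candidates[k]))
print(best)  # "sparse"
```

The exactly faithful sparse candidate beats both the noisy dense explanation (longer) and the empty one (unfaithful), which is the shape of the parsimony argument.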

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same commuting-map structure could be applied to non-neural systems such as symbolic programs or hybrid models to generate comparable explanations.
  • If the optimization can be performed efficiently in practice, automated pipelines might discover decompositions directly from weights without manual intervention.
  • The emphasis on minimum description length opens a direct link to information-theoretic measures of model complexity already used in compression research.

Load-bearing premise

That requiring syntactic and semantic mappings to commute, combined with minimum description length, yields explanations that are faithful to model behavior and aligned with human understanding, without discarding essential functional detail or making the optimization intractable.

What would settle it

The core refinement guarantee would be falsified by a concrete counterexample: a model on which applying compressive refinement changes the output for some input, despite the framework's claim that function is preserved exactly.

Figures

Figures reproduced from arXiv: 2605.08934 by Geraint A. Wiggins, Kola Ayonrinde, Steven T. Holmer, Thomas Dooms, Ward Gauderis.

Figure 1. A commutative diagram illustrating compositional interpretability through compressive refinement for a model that classifies animals and their colour. In the original decomposition, a string diagram in S, the mechanisms appear structurally entangled even though their learned representations [[·]] in C are not. Through a compressive refinement R, a new model decomposition S′ is discovered that clearly se…
Original abstract

Mechanistic interpretability aims to explain neural model behaviour by reverse-engineering learned computational structure into human-understandable components. Without a formal framework, however, mechanistic explanations cannot be objectively verified, compared, or composed. We introduce compositional interpretability, a category-theoretic framework grounded in the principles of compositionality and minimum description length. Compositional interpretations are pairs of syntactic and semantic mappings that must commute to enforce consistency between a model's decomposition and its observed behaviour. We deconstruct explanation quality into measures of faithfulness and complexity to cast interpretability as a constrained optimisation problem, and introduce compressive refinement to systematically restructure models into simpler parts without altering their function. Finally, we prove a parsimony criterion under which syntactic compression theoretically guarantees more concise, human-aligned explanations. Our framework situates prominent mechanistic methods as subclasses of refinement, and clarifies why their compressibility heuristics tend to align with human interpretability. Our work provides a measurable, optimisable foundation for automating the discovery and evaluation of mechanistic explanations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces compositional interpretability, a category-theoretic framework for mechanistic interpretability. Compositional interpretations consist of syntactic and semantic mappings that must commute to ensure consistency between model decompositions and observed behavior. Grounded in compositionality and minimum description length (MDL), the work frames interpretability as optimizing faithfulness and complexity, proposes compressive refinement to simplify decompositions without functional change, and proves a parsimony criterion where syntactic compression yields more concise, human-aligned explanations. Existing mechanistic methods are positioned as subclasses of this refinement process.

Significance. If the claims hold, this provides a formal, measurable foundation for interpretability, allowing objective evaluation and automation of explanations. The use of category theory and MDL could unify disparate methods and explain their alignment with human understanding. Strengths include the attempt at proofs and situating prior work. However, without concrete implementations or examples on neural models, the practical significance remains to be demonstrated.

major comments (3)
  1. [Framework Definition] The syntactic category for neural architectures is under-specified. The manuscript does not detail how objects and morphisms are chosen to correspond to components such as attention patterns or residual streams (see the section introducing compositional interpretations). This leaves open whether commutativity is a substantive constraint or can be met by incomplete decompositions, which is load-bearing for the faithfulness claims.
  2. [Parsimony Criterion Proof] The proof of the parsimony criterion relies on MDL without providing the full derivation, error analysis, or concrete examples. It is unclear how the optimization avoids circularity where faithfulness is defined in terms of the mappings being optimized (see the section on the parsimony criterion and the abstract's claim of proofs).
  3. [Compressive Refinement] The definition of compressive refinement and how it preserves function while restructuring into simpler parts lacks explicit construction rules for categories on actual models, undermining the claim that it systematically produces human-aligned explanations.
minor comments (2)
  1. [Abstract] The abstract is quite dense with technical terms introduced without prior definition, which may hinder accessibility for readers unfamiliar with category theory.
  2. [Notation] Some notation for mappings and categories could be clarified with early examples to improve readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and indicate the revisions we will make to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Framework Definition] The syntactic category for neural architectures is under-specified. The manuscript does not detail how objects and morphisms are chosen to correspond to components such as attention patterns or residual streams (see the section introducing compositional interpretations). This leaves open whether commutativity is a substantive constraint or can be met by incomplete decompositions, which is load-bearing for the faithfulness claims.

    Authors: We agree that greater specification is needed. In the revised manuscript we will expand the section on compositional interpretations with explicit rules for selecting objects and morphisms corresponding to standard neural components (attention patterns, residual streams, MLPs). We will add a worked example on a two-layer transformer block that shows how commutativity fails for incomplete decompositions but holds for faithful ones, thereby confirming it is a substantive constraint. revision: yes

  2. Referee: [Parsimony Criterion Proof] The proof of the parsimony criterion relies on MDL without providing the full derivation, error analysis, or concrete examples. It is unclear how the optimization avoids circularity where faithfulness is defined in terms of the mappings being optimized (see the section on the parsimony criterion and the abstract's claim of proofs).

    Authors: The referee correctly notes that the main-text proof is abbreviated. We will relocate the complete derivation to an appendix, add an error analysis, and include a concrete numerical example on a small synthetic model. Faithfulness is defined independently as the expected divergence between the original network output and the output of the composed semantic mapping; the MDL term is applied only afterward to select among already-faithful decompositions. We will make this separation explicit in the revised text. revision: yes
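The separation the simulated authors invoke, faithfulness computed first as an expected divergence and MDL applied only to rank candidates that already pass, can be sketched as a two-stage selection. The candidate functions, threshold, and description-length integers below are all invented for illustration.

```python
# Illustrative two-stage selection: faithfulness filter first, MDL second.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
w = np.array([0.5, -1.0, 2.0])

def model(x):
    return np.tanh(x @ w)

# candidate explanations: (composed semantic map, description length)
candidates = {
    "exact":  (lambda x: np.tanh(x @ w), 6),
    "padded": (lambda x: np.tanh(x @ w) + 0.0, 12),           # same function, longer
    "linear": (lambda x: x @ np.array([0.4, -0.8, 1.1]), 3),  # short, unfaithful
}

def faithfulness(fn):
    # expected divergence between model output and explanation output;
    # defined with no reference to description length (no circularity)
    return float(np.mean((model(X) - fn(X)) ** 2))

# stage 1: keep only faithful candidates
faithful = {k: v for k, v in candidates.items() if faithfulness(v[0]) < 1e-6}

# stage 2: MDL picks the shortest among the already-faithful
best = min(faithful, key=lambda k: faithful[k][1])
print(best)  # "exact"
```

Because the unfaithful "linear" candidate is eliminated before description length is ever consulted, the MDL term cannot smuggle unfaithfulness back in, which is the non-circularity claim in miniature.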

  3. Referee: [Compressive Refinement] The definition of compressive refinement and how it preserves function while restructuring into simpler parts lacks explicit construction rules for categories on actual models, undermining the claim that it systematically produces human-aligned explanations.

    Authors: We accept that explicit construction rules are required. The revision will add a dedicated subsection containing algorithmic steps for applying compressive refinement to neural categories, including how to identify compressible morphisms while preserving the commuting diagram. Pseudocode and a small-scale transformer example will be provided to illustrate the process and its connection to human-aligned explanations. revision: yes
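A toy instance of what such construction rules might look like, assuming (our assumption, not the paper's definition) that one refinement step merges adjacent linear morphisms: the computed function is preserved exactly by associativity while a parameter-count proxy for description length strictly drops.

```python
# Illustrative compressive-refinement step on a toy linear model.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 5))   # first morphism: R^5 -> R^4
B = rng.normal(size=(2, 4))   # second morphism: R^4 -> R^2

original = [A, B]             # decomposition with two parts
refined = [B @ A]             # refinement R merges them into one

def run(parts, x):
    for W in parts:
        x = W @ x
    return x

# function preserved exactly (matrix multiplication is associative)
x = rng.normal(size=5)
assert np.allclose(run(original, x), run(refined, x))

# description-length proxy: total parameter count shrinks 28 -> 10
dl = lambda parts: sum(W.size for W in parts)
print(dl(original), dl(refined))  # 28 10
```

Real refinement would also have to preserve the commuting diagram across nonlinear blocks, where no such free merge exists; that gap is exactly what the requested construction rules must fill.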

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper grounds its framework in external principles of compositionality and minimum description length, then defines compositional interpretations as commuting syntactic-semantic mapping pairs that enforce consistency. Explanation quality is deconstructed into faithfulness and complexity measures to form an optimization problem, with a claimed proof of a parsimony criterion. These steps formalize interpretability without reducing the central claims to tautological redefinitions or fitted inputs by construction; the commuting condition and MDL are applied as independent constraints rather than self-referential loops. No load-bearing self-citations or ansatzes imported from prior author work appear in the provided text, and the framework offers measurable structure beyond renaming known results. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on category-theoretic compositionality, the minimum description length principle, and the new definitions of syntactic and semantic mappings; no explicit free parameters are stated in the abstract, and the invented entity is the commuting interpretation pair itself.

axioms (1)
  • domain assumption Principles of compositionality and minimum description length
    Explicitly stated as grounding the framework in the abstract.
invented entities (1)
  • Compositional interpretation as a pair of syntactic and semantic mappings that must commute no independent evidence
    purpose: To enforce consistency between model decomposition and observed behavior
    Introduced as the core object of the framework; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5478 in / 1399 out tokens · 49623 ms · 2026-05-12T02:35:13.663169+00:00 · methodology

