Pith · machine review for the scientific record

arxiv: 2605.08934 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

From Mechanistic to Compositional Interpretability

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords compositional interpretability · mechanistic interpretability · category theory · minimum description length · compressive refinement · neural network explanations · model decomposition · parsimony criterion

The pith

Compositional interpretability defines explanations as pairs of syntactic and semantic mappings that must commute, scored under minimum description length, making explanations verifiable and optimizable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a category-theoretic framework that formalizes mechanistic interpretability by requiring pairs of syntactic and semantic mappings to commute, thereby enforcing consistency between a model's internal decomposition and its actual behavior. It measures explanation quality through faithfulness to observed outputs and complexity measured by minimum description length, recasting interpretability as a constrained optimization task. The work introduces compressive refinement as a systematic way to break models into simpler functional parts and proves a parsimony criterion showing that syntactic compression produces more concise explanations aligned with human understanding. If the framework holds, existing mechanistic techniques become special cases of this refinement process, providing an objective basis for comparing and automating explanations.

Core claim

Compositional interpretations are pairs of syntactic and semantic mappings that must commute to ensure a model's decomposition matches its observed behavior. Explanation quality decomposes into faithfulness and complexity, turning interpretability into constrained optimization. Compressive refinement restructures a model into simpler parts that preserve function exactly, and a parsimony criterion proves that syntactic compression under minimum description length yields more concise, human-aligned explanations. Prominent mechanistic methods appear as subclasses of this refinement, explaining why their heuristics often match human interpretability preferences.
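The commuting condition at the heart of this claim can be made concrete: an interpretation supplies a syntactic decomposition plus a semantic assignment, and the square commutes when composing the semantics reproduces the model exactly. A minimal sketch, assuming a two-part decomposition of a tiny ReLU network; every name and shape here is illustrative, not the paper's notation.

```python
# Illustrative sketch only: a toy "commuting square" check for a
# compositional interpretation. Names and shapes are ours, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))

def model(x):
    # original network: f(x) = W2 · relu(W1 · x)
    return W2 @ np.maximum(W1 @ x, 0.0)

# syntactic decomposition: an ordered list of named parts
syntax = ["encoder", "readout"]

# semantic mapping [[·]]: assigns each syntactic part a function
semantics = {
    "encoder": lambda x: np.maximum(W1 @ x, 0.0),
    "readout": lambda h: W2 @ h,
}

def interpret(x):
    # compose the semantics in syntactic order
    h = x
    for part in syntax:
        h = semantics[part](h)
    return h

# commuting condition: the interpretation agrees with the model
# (checked here on random probes rather than proved)
probes = rng.normal(size=(100, 2))
commutes = all(np.allclose(model(x), interpret(x)) for x in probes)
print(commutes)  # True
```

An incomplete decomposition (say, dropping the ReLU from the encoder's semantics) would fail this check, which is what makes commutativity a substantive constraint rather than a labeling convention.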

What carries the argument

Commuting pairs of syntactic and semantic mappings, enforced by category theory and minimum description length, with compressive refinement as the process that simplifies decompositions while preserving exact function.

If this is right

  • Existing mechanistic interpretability methods become subclasses of compressive refinement within the same formal structure.
  • Explanation creation reduces to a measurable optimization balancing faithfulness against description length.
  • Syntactic compression is guaranteed to produce more concise explanations that remain aligned with human interpretability.
  • Interpretations can be composed and verified objectively because the commuting condition enforces consistency with observed behavior.
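The second bullet, explanation creation as a measurable optimization, can be sketched as a penalized search over candidate decompositions. The scoring rule below (squared-error faithfulness plus a nonzero-parameter count as a crude description-length proxy, weighted by an invented λ) is our illustration of the MDL trade-off, not the paper's objective.

```python
# Illustrative sketch: faithfulness-vs-complexity as a scored search
# over candidate explanations of a linear model.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.0, 0.0])
y = X @ true_w

def description_length(w, tol=1e-8):
    # crude proxy: number of nonzero parameters in the explanation
    return int(np.sum(np.abs(w) > tol))

def faithfulness_loss(w):
    # divergence between model behaviour and the explanation's predictions
    return float(np.mean((X @ w - y) ** 2))

def score(w, lam=0.1):
    # constrained optimization recast as a penalized objective:
    # faithful AND short explanations win
    return faithfulness_loss(w) + lam * description_length(w)

candidates = {
    "full":   true_w + rng.normal(scale=1e-3, size=4),  # faithful but dense
    "sparse": np.array([1.0, -2.0, 0.0, 0.0]),          # faithful and short
    "empty":  np.zeros(4),                              # short but unfaithful
}
best = min(candidates, key=lambda k: score(candidates[k]))
print(best)  # "sparse"
```

The exactly faithful sparse candidate beats both the noisy dense explanation (longer) and the empty one (unfaithful), which is the shape of the parsimony argument.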

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same commuting-map structure could be applied to non-neural systems such as symbolic programs or hybrid models to generate comparable explanations.
  • If the optimization can be performed efficiently in practice, automated pipelines might discover decompositions directly from weights without manual intervention.
  • The emphasis on minimum description length opens a direct link to information-theoretic measures of model complexity already used in compression research.

Load-bearing premise

That requiring syntactic and semantic mappings to commute, combined with minimum description length, yields explanations that are faithful to model behavior and aligned with human understanding, without discarding essential functional detail or making the optimization intractable.

What would settle it

The core refinement guarantee would be falsified by a concrete counterexample: a model on which applying compressive refinement changes the output for some input, despite the framework's claim that function is preserved exactly.

Figures

Figures reproduced from arXiv: 2605.08934 by Geraint A. Wiggins, Kola Ayonrinde, Steven T. Holmer, Thomas Dooms, Ward Gauderis.

Figure 1. A commutative diagram illustrating compositional interpretability through compressive refinement for a model that classifies animals and their colour. In the original decomposition, a string diagram in S, the mechanisms appear structurally entangled even though their learned representations [[·]] in C are not. Through a compressive refinement R, a new model decomposition S′ is discovered that clearly se…
Original abstract

Mechanistic interpretability aims to explain neural model behaviour by reverse-engineering learned computational structure into human-understandable components. Without a formal framework, however, mechanistic explanations cannot be objectively verified, compared, or composed. We introduce compositional interpretability, a category-theoretic framework grounded in the principles of compositionality and minimum description length. Compositional interpretations are pairs of syntactic and semantic mappings that must commute to enforce consistency between a model's decomposition and its observed behaviour. We deconstruct explanation quality into measures of faithfulness and complexity to cast interpretability as a constrained optimisation problem, and introduce compressive refinement to systematically restructure models into simpler parts without altering their function. Finally, we prove a parsimony criterion under which syntactic compression theoretically guarantees more concise, human-aligned explanations. Our framework situates prominent mechanistic methods as subclasses of refinement, and clarifies why their compressibility heuristics tend to align with human interpretability. Our work provides a measurable, optimisable foundation for automating the discovery and evaluation of mechanistic explanations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces compositional interpretability, a category-theoretic framework for mechanistic interpretability. Compositional interpretations consist of syntactic and semantic mappings that must commute to ensure consistency between model decompositions and observed behavior. Grounded in compositionality and minimum description length (MDL), the work frames interpretability as optimizing faithfulness and complexity, proposes compressive refinement to simplify decompositions without functional change, and proves a parsimony criterion where syntactic compression yields more concise, human-aligned explanations. Existing mechanistic methods are positioned as subclasses of this refinement process.

Significance. If the claims hold, this provides a formal, measurable foundation for interpretability, allowing objective evaluation and automation of explanations. The use of category theory and MDL could unify disparate methods and explain their alignment with human understanding. Strengths include the attempt at proofs and situating prior work. However, without concrete implementations or examples on neural models, the practical significance remains to be demonstrated.

major comments (3)
  1. [Framework Definition] The syntactic category for neural architectures is under-specified. The manuscript does not detail how objects and morphisms are chosen to correspond to components such as attention patterns or residual streams (see the section introducing compositional interpretations). This leaves open whether commutativity is a substantive constraint or can be met by incomplete decompositions, which is load-bearing for the faithfulness claims.
  2. [Parsimony Criterion Proof] The proof of the parsimony criterion relies on MDL without providing the full derivation, error analysis, or concrete examples. It is unclear how the optimization avoids circularity where faithfulness is defined in terms of the mappings being optimized (see the section on the parsimony criterion and the abstract's claim of proofs).
  3. [Compressive Refinement] The definition of compressive refinement and how it preserves function while restructuring into simpler parts lacks explicit construction rules for categories on actual models, undermining the claim that it systematically produces human-aligned explanations.
minor comments (2)
  1. [Abstract] The abstract is quite dense with technical terms introduced without prior definition, which may hinder accessibility for readers unfamiliar with category theory.
  2. [Notation] Some notation for mappings and categories could be clarified with early examples to improve readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and indicate the revisions we will make to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Framework Definition] The syntactic category for neural architectures is under-specified. The manuscript does not detail how objects and morphisms are chosen to correspond to components such as attention patterns or residual streams (see the section introducing compositional interpretations). This leaves open whether commutativity is a substantive constraint or can be met by incomplete decompositions, which is load-bearing for the faithfulness claims.

    Authors: We agree that greater specification is needed. In the revised manuscript we will expand the section on compositional interpretations with explicit rules for selecting objects and morphisms corresponding to standard neural components (attention patterns, residual streams, MLPs). We will add a worked example on a two-layer transformer block that shows how commutativity fails for incomplete decompositions but holds for faithful ones, thereby confirming it is a substantive constraint. revision: yes

  2. Referee: [Parsimony Criterion Proof] The proof of the parsimony criterion relies on MDL without providing the full derivation, error analysis, or concrete examples. It is unclear how the optimization avoids circularity where faithfulness is defined in terms of the mappings being optimized (see the section on the parsimony criterion and the abstract's claim of proofs).

    Authors: The referee correctly notes that the main-text proof is abbreviated. We will relocate the complete derivation to an appendix, add an error analysis, and include a concrete numerical example on a small synthetic model. Faithfulness is defined independently as the expected divergence between the original network output and the output of the composed semantic mapping; the MDL term is applied only afterward to select among already-faithful decompositions. We will make this separation explicit in the revised text. revision: yes
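The separation the simulated authors invoke, faithfulness computed first as an expected divergence and MDL applied only to rank candidates that already pass, can be sketched as a two-stage selection. The candidate functions, threshold, and description-length integers below are all invented for illustration.

```python
# Illustrative two-stage selection: faithfulness filter first, MDL second.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
w = np.array([0.5, -1.0, 2.0])

def model(x):
    return np.tanh(x @ w)

# candidate explanations: (composed semantic map, description length)
candidates = {
    "exact":  (lambda x: np.tanh(x @ w), 6),
    "padded": (lambda x: np.tanh(x @ w) + 0.0, 12),           # same function, longer
    "linear": (lambda x: x @ np.array([0.4, -0.8, 1.1]), 3),  # short, unfaithful
}

def faithfulness(fn):
    # expected divergence between model output and explanation output;
    # defined with no reference to description length (no circularity)
    return float(np.mean((model(X) - fn(X)) ** 2))

# stage 1: keep only faithful candidates
faithful = {k: v for k, v in candidates.items() if faithfulness(v[0]) < 1e-6}

# stage 2: MDL picks the shortest among the already-faithful
best = min(faithful, key=lambda k: faithful[k][1])
print(best)  # "exact"
```

Because the unfaithful "linear" candidate is eliminated before description length is ever consulted, the MDL term cannot smuggle unfaithfulness back in, which is the non-circularity claim in miniature.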

  3. Referee: [Compressive Refinement] The definition of compressive refinement and how it preserves function while restructuring into simpler parts lacks explicit construction rules for categories on actual models, undermining the claim that it systematically produces human-aligned explanations.

    Authors: We accept that explicit construction rules are required. The revision will add a dedicated subsection containing algorithmic steps for applying compressive refinement to neural categories, including how to identify compressible morphisms while preserving the commuting diagram. Pseudocode and a small-scale transformer example will be provided to illustrate the process and its connection to human-aligned explanations. revision: yes
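A toy instance of what such construction rules might look like, assuming (our assumption, not the paper's definition) that one refinement step merges adjacent linear morphisms: the computed function is preserved exactly by associativity while a parameter-count proxy for description length strictly drops.

```python
# Illustrative compressive-refinement step on a toy linear model.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 5))   # first morphism: R^5 -> R^4
B = rng.normal(size=(2, 4))   # second morphism: R^4 -> R^2

original = [A, B]             # decomposition with two parts
refined = [B @ A]             # refinement R merges them into one

def run(parts, x):
    for W in parts:
        x = W @ x
    return x

# function preserved exactly (matrix multiplication is associative)
x = rng.normal(size=5)
assert np.allclose(run(original, x), run(refined, x))

# description-length proxy: total parameter count shrinks 28 -> 10
dl = lambda parts: sum(W.size for W in parts)
print(dl(original), dl(refined))  # 28 10
```

Real refinement would also have to preserve the commuting diagram across nonlinear blocks, where no such free merge exists; that gap is exactly what the requested construction rules must fill.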

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper grounds its framework in external principles of compositionality and minimum description length, then defines compositional interpretations as commuting syntactic-semantic mapping pairs that enforce consistency. Explanation quality is deconstructed into faithfulness and complexity measures to form an optimization problem, with a claimed proof of a parsimony criterion. These steps formalize interpretability without reducing the central claims to tautological redefinitions or fitted inputs by construction; the commuting condition and MDL are applied as independent constraints rather than self-referential loops. No load-bearing self-citations or ansatzes imported from prior author work appear in the provided text, and the framework offers measurable structure beyond renaming known results. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on category-theoretic compositionality, the minimum description length principle, and the new definitions of syntactic and semantic mappings; no explicit free parameters are stated in the abstract, and the invented entity is the commuting interpretation pair itself.

axioms (1)
  • domain assumption Principles of compositionality and minimum description length
    Explicitly stated as grounding the framework in the abstract.
invented entities (1)
  • Compositional interpretation as a pair of syntactic and semantic mappings that must commute no independent evidence
    purpose: To enforce consistency between model decomposition and observed behavior
    Introduced as the core object of the framework; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5478 in / 1399 out tokens · 49623 ms · 2026-05-12T02:35:13.663169+00:00 · methodology

