arxiv: 2606.09653 · v1 · pith:CMSKJMZXnew · submitted 2026-06-08 · 💻 cs.LG

A Unifying Framework for Concept-Based Representational Similarity

Pith reviewed 2026-06-27 17:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords concept alignment · representational similarity · multi-objective optimization · translation consistency · distributional alignment · sparse autoencoder · intervention benchmark · unifying framework

The pith

Concept alignment decomposes into four distinct properties along two axes and requires joint optimization of all of them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decomposes concept alignment into two axes—what is aligned (representations versus concepts) and the level of alignment (instance-wise versus distributional)—which together induce four properties: instance-wise and distributional versions of translation and concept consistency. It demonstrates through theory and a new intervention benchmark that commonly assumed equivalences between these properties do not hold, that purely unsupervised methods fail to recover instance-level alignment, and that a model jointly enforcing the properties succeeds where single-objective approaches do not. The results establish that meaningful alignment emerges only when the objectives are treated as complementary rather than interchangeable, with as little as 0.1 percent paired data sufficing when distributional anchors are present.

Core claim

Alignment along the representation-to-concept axis and the instance-to-distribution axis produces four independent guarantees—instance-wise translation, distributional translation, instance-wise concept consistency, and distributional concept consistency. Existing methods each guarantee only a subset of these, and optimizing any one subset does not recover the remaining guarantees. Only a model that couples the objectives across both axes recovers all four simultaneously; purely unsupervised objectives leave instance-level matches unrecoverable, while anchoring the distributional objectives allows 0.1 percent paired data to restore instance-level alignment.
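A toy illustration of why a distributional guarantee need not imply an instance-wise one is sketched below. It is not the paper's construction: the arrays `x` and `y` and both helper functions are hypothetical, and the distributional check only compares per-coordinate marginals. Shuffling the rows of a sample leaves its empirical distribution untouched while destroying the row-to-row pairing.

```python
import numpy as np

def empirical_distribution_gap(a, b):
    """Largest gap between sorted samples, per coordinate.

    A value of 0 means the empirical marginals are identical
    (this does not inspect the joint distribution).
    """
    return float(np.max(np.abs(np.sort(a, axis=0) - np.sort(b, axis=0))))

def instance_gap(a, b):
    """Mean Euclidean distance between row i of `a` and row i of `b`."""
    return float(np.mean(np.linalg.norm(a - b, axis=1)))

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 4))
y = x[rng.permutation(len(x))]  # same points, pairing destroyed

dist_gap = empirical_distribution_gap(x, y)  # distribution-level view
inst_gap = instance_gap(x, y)                # instance-level view
print(f"distributional gap: {dist_gap:.3f}, instance-wise gap: {inst_gap:.3f}")
```

An objective that only sees the first quantity cannot tell `y` apart from `x`, which is consistent with the claim that instance-level matches need some paired signal.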

What carries the argument

The two-axis decomposition (representations vs. concepts; instance-wise vs. distributional) that induces four alignment properties—instance-wise and distributional translation together with instance-wise and distributional concept consistency.
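The 2 x 2 structure can be spelled out as a small lookup, sketched here under one assumption: that the "representations" end of the first axis corresponds to translation and the "concepts" end to concept consistency, which is how this page reads the abstract. The names `WHAT`, `LEVEL`, and `alignment_properties` are illustrative only.

```python
from itertools import product

# Axis 1: what is aligned, mapped to the property family it induces (assumed reading).
WHAT = {"representations": "translation", "concepts": "concept consistency"}
# Axis 2: the level at which alignment is required.
LEVEL = ("instance-wise", "distributional")

def alignment_properties():
    """Enumerate the four properties induced by crossing the two axes."""
    return [
        {"what": what, "level": level, "property": f"{level} {name}"}
        for (what, name), level in product(WHAT.items(), LEVEL)
    ]

for row in alignment_properties():
    print(f"{row['what']:<16} x {row['level']:<14} -> {row['property']}")
```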

If this is right

  • Optimizing one alignment property does not reliably recover the others.
  • Purely unsupervised objectives fail to recover meaningful instance-level alignment.
  • Joint enforcement of the four complementary properties is required for strong alignment across both axes.
  • As little as 0.1 percent paired data recovers instance-level alignment when distributional objectives are anchored.
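The last bullet can be made concrete with a toy objective that adds a small paired term to a distributional term. This is a sketch under loud assumptions, not the CoSAE loss: the translator is a plain linear map `W`, the distributional term is crude moment matching (swapped in for whatever measure the paper uses), and every name below is hypothetical. It shows a sign-flipped map that moment matching barely penalizes but that five paired rows (0.1 percent of 5000) expose.

```python
import numpy as np

def moment_gap(a, b):
    """Crude distributional discrepancy: squared gaps in means and covariances."""
    mean_gap = np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)
    cov_gap = np.sum((np.cov(a, rowvar=False) - np.cov(b, rowvar=False)) ** 2)
    return float(mean_gap + cov_gap)

def paired_gap(a, b):
    """Instance-wise discrepancy: mean squared distance between matched rows."""
    return float(np.mean(np.sum((a - b) ** 2, axis=1)))

def joint_loss(W, xa, xb, pair_idx, weight=1.0):
    """Distributional term on all rows plus a paired term on `pair_idx` only."""
    translated = xa @ W
    return moment_gap(translated, xb) + weight * paired_gap(
        translated[pair_idx], xb[pair_idx]
    )

rng = np.random.default_rng(0)
n, d = 5000, 3
xa = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden orthogonal map (toy ground truth)
xb = xa @ Q
pair_idx = rng.choice(n, size=max(1, n // 1000), replace=False)  # 0.1% paired rows

loss_true = joint_loss(Q, xa, xb, pair_idx)    # correct map
loss_flip = joint_loss(-Q, xa, xb, pair_idx)   # wrong map with matching covariance
flip_moment_only = moment_gap(xa @ (-Q), xb)   # what the distributional term alone sees
print(loss_true, loss_flip, flip_moment_only)
```

The design choice here is only to show the division of labour: the distributional term constrains the bulk of the data, and the few paired rows break the symmetry it leaves open.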

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decomposition could be tested on cross-modal settings such as vision-language models to check whether the same four properties remain independent.
  • The benchmark might be extended to measure whether the four properties predict downstream task performance when representations are transferred.
  • Additional axes, such as temporal or hierarchical structure, could be added if the current four properties leave certain misalignment cases unexplained.
  • The finding that minimal paired data suffices under distributional anchoring suggests similar low-supervision regimes may work for other multi-objective alignment problems in representation learning.

Load-bearing premise

The two axes and the four properties they induce capture all relevant dimensions of alignment without missing important failure modes or alternative decompositions.

What would settle it

An experiment in which a single-objective method achieves high scores on all four properties simultaneously, or in which the benchmark reveals a systematic failure mode outside the four properties.

Figures

Figures reproduced from arXiv: 2606.09653 by Agustin Martin Picard, Grégoire Dhimoïla, Thomas Fel, Thomas Serre, Victor Boutin.

Figure 1: Summary of the proposed framework. Left: Illustration of a generative process where observations stem from a shared space through modality-specific generators $g_i$. Feature extractors ($\phi_i$) learn to invert this generative process up to some transform $\psi_i$. Top right: Different properties of $\psi_i$ that achieve different types of alignment. Bottom right: summary of previous SAE-based methods for concept extract…

Figure 2: Element-wise alignment. Concept Alignment. Let $I$ be an index set whose elements correspond to representation instances. Each $i \in I$ may, for example, index the same modality embedded in different models, different layers of a single model, checkpoints of a single model at different training steps, modality-specific instantiations of an abstract idea, etc. Let $X$ be a set of data points equipped with a mea…

Figure 3: Distributional alignment. If the $X_i$ are such that instance-wise correspondences are not available, i.e., we do not have access to the underlying $X$, we can translate these into distributional alignment properties as follows. Let $(\psi_j^{-1} \circ \psi_i)_{\#}\mu_i$ define a measure on $X_j$, and $\psi_{i\#}\mu_i$ define a measure on the concept space $\mathbb{R}^K$. We define the distributional alignment properties corresponding to translation and c…

Figure 4: Self reconstruction: demi and full cycle.

Figure 5: Concept alignment in our CoSAE is strong enough to support competitive downstream behavior. ImageNet zero-shot accuracy of our CoSAE trained on unimodal backbones. As a final validation, we evaluate whether the alignment learned by our CoSAE translates into competitive cross-modal transfer. We consider a setting analogous to CLIP training, replacing the vision–vision alignment of Section 4.1.2 with a visio…
Original abstract

Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties -- instance-wise and distributional variants of translation and concept consistency -- and reveals precisely which of these guarantees existing methods provide. We further introduce InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that concept alignment between learned representations is poorly defined because existing methods optimize different objectives under the same name. It introduces a two-axis decomposition (representations vs. concepts; instance-wise vs. distributional) that induces four properties (instance-wise and distributional variants of translation and concept consistency). The framework is used to analyze which guarantees existing methods provide. The authors introduce the InterVenchA intervention-based benchmark and the CoSAE model, which jointly optimizes the complementary objectives. Experiments show that unsupervised objectives fail to recover instance-level alignment, that the four properties are not interchangeable, and that 0.1% paired data suffices when distributional objectives are anchored.

Significance. If the decomposition and empirical results hold, the work supplies a clear taxonomy for existing alignment techniques, demonstrates that single-objective optimization is insufficient, and offers both a diagnostic benchmark and a practical method that achieves strong alignment with minimal paired supervision. These contributions would help standardize evaluation and optimization in representational similarity and mechanistic interpretability.

major comments (3)
  1. §3 (two-axis decomposition): the manuscript presents the axes as inducing exactly four independent properties whose joint satisfaction is necessary, but provides no argument or test that the axes are exhaustive; dimensions such as hierarchical concept structure or causal intervention alignment are not ruled out, which directly affects whether the four properties must be optimized jointly.
  2. §5 (CoSAE and 0.1% paired-data result): the claim that anchoring distributional objectives allows instance-level alignment with 0.1% paired data is load-bearing for the multi-objective conclusion, yet the paper does not report ablation on the choice of distributional measure or on whether the result generalizes beyond the tested concept sets.
  3. Table 2 (InterVenchA results): the reported failures of unsupervised objectives are central to the claim that the properties are non-redundant, but the table does not include variance across random seeds or alternative unsupervised baselines, making it difficult to assess whether the failures are general or setup-specific.
minor comments (2)
  1. Notation for the four properties is introduced without an explicit summary table mapping each property to its axis combination; adding such a table would improve readability.
  2. The abstract states that 'commonly assumed equivalences between alignment objectives fail in practice' but does not cite the specific prior works that make those assumptions.
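Major comment 3 asks for variance across random seeds. One way such a table column could be produced is sketched below; the function name and the method labels are hypothetical, and the scores are placeholders, not results from the paper.

```python
import numpy as np

def summarize_over_seeds(scores_by_method):
    """Return {method: (mean, sample standard deviation)} from per-seed scores."""
    summary = {}
    for method, scores in scores_by_method.items():
        arr = np.asarray(scores, dtype=float)
        # ddof=1 gives the sample standard deviation; undefined for a single seed.
        std = arr.std(ddof=1) if arr.size > 1 else float("nan")
        summary[method] = (float(arr.mean()), float(std))
    return summary

# Placeholder numbers for illustration only, NOT taken from the paper.
placeholder_scores = {
    "method_a": [0.50, 0.52, 0.48, 0.51, 0.49],
    "method_b": [0.10, 0.30, 0.05, 0.25, 0.20],
}
for method, (mean, std) in summarize_over_seeds(placeholder_scores).items():
    print(f"{method}: {mean:.3f} ± {std:.3f}")
```

Reporting the spread alongside the mean is what lets a reader judge whether a failure is systematic or an artifact of one initialization.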

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: §3 (two-axis decomposition): the manuscript presents the axes as inducing exactly four independent properties whose joint satisfaction is necessary, but provides no argument or test that the axes are exhaustive; dimensions such as hierarchical concept structure or causal intervention alignment are not ruled out, which directly affects whether the four properties must be optimized jointly.

    Authors: The two axes are chosen because they capture the primary ambiguities observed across existing alignment methods in the literature. We do not claim the decomposition is exhaustive; rather, it provides a minimal set of independent properties that existing techniques can be shown to satisfy or violate. Hierarchical structure and causal alignment can be viewed as special cases or extensions within the concept-consistency axis. In revision we will add an explicit scope paragraph acknowledging these additional dimensions and noting that the current framework focuses on the core instance/distributional distinction. revision: partial

  2. Referee: §5 (CoSAE and 0.1% paired-data result): the claim that anchoring distributional objectives allows instance-level alignment with 0.1% paired data is load-bearing for the multi-objective conclusion, yet the paper does not report ablation on the choice of distributional measure or on whether the result generalizes beyond the tested concept sets.

    Authors: The distributional measure used is the standard one employed in prior distributional-alignment literature. We agree that an ablation would strengthen the result. In the revised version we will add experiments that vary the distributional objective (MMD, Wasserstein distance, and correlation-based measures) and repeat the 0.1% paired-data protocol on two additional concept sets drawn from different domains. revision: yes

  3. Referee: Table 2 (InterVenchA results): the reported failures of unsupervised objectives are central to the claim that the properties are non-redundant, but the table does not include variance across random seeds or alternative unsupervised baselines, making it difficult to assess whether the failures are general or setup-specific.

    Authors: We will augment Table 2 with standard deviations computed across five random seeds. The unsupervised baselines already span the main families of methods; we will add one further linear-alignment baseline if space permits. The failures of unsupervised objectives were reproducible in our internal replication checks. revision: yes
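The second response names MMD as one candidate distributional objective for the planned ablation. For readers unfamiliar with it, a minimal sketch of the standard biased estimator of squared MMD with an RBF kernel follows. The bandwidth, sample sizes, and function names are assumptions made for illustration; nothing here is taken from the paper's implementation.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix between rows of `a` and rows of `b`."""
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 2))
same = rng.normal(size=(300, 2))              # drawn from the same distribution
shifted = rng.normal(loc=2.0, size=(300, 2))  # mean shifted in both coordinates

mmd_same = mmd2_biased(x, same)
mmd_shifted = mmd2_biased(x, shifted)
print(f"same distribution: {mmd_same:.4f}, shifted: {mmd_shifted:.4f}")
```

Like any distributional measure, this compares samples as sets and ignores which row of `x` corresponds to which row of `y`, so it cannot by itself certify instance-level alignment.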

Circularity Check

0 steps flagged

No circularity: framework decomposition and empirical disproof of equivalences are independent of self-referential inputs

full rationale

The paper proposes a two-axis decomposition (representations vs. concepts; instance-wise vs. distributional) that induces four properties, then uses theory plus a new intervention benchmark (InterVenchA) to show that optimizing one property does not recover the others and that unsupervised objectives fail at instance-level alignment. It further introduces CoSAE to jointly enforce objectives. None of these steps reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations; the multi-objective conclusion follows directly from the proposed analysis and external experimental results rather than from re-deriving the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claims rest on the new framework, benchmark, and model being meaningfully distinct from prior work; no free parameters are described in the abstract, but two new entities are introduced without external validation.

invented entities (2)
  • InterVenchA (no independent evidence)
    purpose: Intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency
    New benchmark introduced to evaluate the four alignment properties
  • CoSAE (no independent evidence)
    purpose: Coupled Sparse Autoencoder that jointly enforces complementary alignment objectives
    New model proposed to achieve strong alignment by combining objectives

pith-pipeline@v0.9.1-grok · 5767 in / 1283 out tokens · 18559 ms · 2026-06-27T17:09:02.331427+00:00 · methodology

Reference graph

Works this paper leans on

75 extracted references · 7 linked inside Pith

  1. [1]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InInternational conference on machine learning, pages 3519–3529. PMlR, 2019

  2. [2]

    Revisiting model stitching to compare neural representations.Advances in neural information processing systems, 34:225–236, 2021

    Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations.Advances in neural information processing systems, 34:225–236, 2021

  3. [3]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  4. [4]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  5. [5]

    Sigmoid loss for language image pre-training.Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 11975–11986, 2023

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training.Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 11975–11986, 2023

  6. [6]

    Relations between two sets of variates.Biometrika, 28(3/4):321–377, 1936

    Harold Hotelling. Relations between two sets of variates.Biometrika, 28(3/4):321–377, 1936. ISSN 00063444. URLhttp://www.jstor.org/stable/2333955

  7. [7]

    A kernel statistical test of independence.Advances in neural information processing systems, 20, 2007

    Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Sch¨olkopf, and Alex Smola. A kernel statistical test of independence.Advances in neural information processing systems, 20, 2007

  8. [8]

    Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability.Advances in neural information processing systems, 30, 2017

    Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability.Advances in neural information processing systems, 30, 2017

  9. [9]

    Grounding representation similarity through statistical testing.Advances in Neural Information Processing Systems, 34:1556–1568, 2021

    Frances Ding, Jean-Stanislas Denain, and Jacob Steinhardt. Grounding representation similarity through statistical testing.Advances in Neural Information Processing Systems, 34:1556–1568, 2021

  10. [10]

    Sparse autoencoders find highly interpretable features in language models.ArXiv e-print, 2023

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.ArXiv e-print, 2023

  11. [11]

    Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Con- erly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

  12. [12]

    A holistic approach to unifying automatic concept extraction and concept importance estimation.Advances in Neural Information Processing Systems (NeurIPS), 36: 54805–54818, 2023

    Thomas Fel, Victor Boutin, Mazda Moayeri, Remi Cadene, Louis Bethune, Mathieu Chalvidal, and Thomas Serre. A holistic approach to unifying automatic concept extraction and concept importance estimation.Advances in Neural Information Processing Systems (NeurIPS), 36: 54805–54818, 2023

  13. [13]

    Sparse crosscoders for cross-layer features and model diffing.Transformer Circuits Thread, 2024

    Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christo- pher Olah. Sparse crosscoders for cross-layer features and model diffing.Transformer Circuits Thread, 2024

  14. [14]

    Universal sparse autoencoders: Interpretable cross-model concept alignment.ArXiv e-print, 2025

    Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, and Konstantinos Derpanis. Universal sparse autoencoders: Interpretable cross-model concept alignment.ArXiv e-print, 2025

  15. [15]

    Cross-modal redundancy and the geometry of vision–language embeddings

    Gr´egoire Dhimo¨ıla, Thomas Fel, Victor Boutin, and Agustin Martin Picard. Cross-modal redundancy and the geometry of vision–language embeddings. InThe Fourteenth International Conference on Learning Representations, 2026. 10

  16. [16]

    Similarity of neural networks: A survey of functional and representational measures.Journal of Machine Learning Research, 25(87):1–77, 2024

    Max Klabunde, Tobias Schumacher, Alexander H ¨agele, Markus Bernstein, Patrick van der Smagt, and Marcus M ¨artens. Similarity of neural networks: A survey of functional and representational measures.Journal of Machine Learning Research, 25(87):1–77, 2024

  17. [17]

    Transcoders find interpretable llm feature circuits.Advances in Neural Information Processing Systems, 37:24375–24410, 2024

    Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable llm feature circuits.Advances in Neural Information Processing Systems, 37:24375–24410, 2024

  18. [18]

    Saebench: A comprehen- sive benchmark for sparse autoencoders in language model interpretability.arXiv preprint arXiv:2503.09532, 2025

    Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, et al. Saebench: A comprehen- sive benchmark for sparse autoencoders in language model interpretability.arXiv preprint arXiv:2503.09532, 2025

  19. [19]

    Representational similarity analysis-connecting the branches of systems neuroscience.Frontiers in systems neuroscience, 2:249, 2008

    Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. Representational similarity analysis-connecting the branches of systems neuroscience.Frontiers in systems neuroscience, 2:249, 2008

  20. [20]

    The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

  21. [21]

    Back into plato’s cave: Examining cross-modal representational convergence at scale.arXiv preprint arXiv:2604.18572, 2026

    A Koepke, Daniil Zverev, Shiry Ginosar, and Alexei A Efros. Back into plato’s cave: Examining cross-modal representational convergence at scale.arXiv preprint arXiv:2604.18572, 2026

  22. [22]

    Revisiting the platonic representation hypothesis: An aristotelian view, 2026

    Fabian Gr¨oger, Shuo Wen, and Maria Brbi´c. Revisiting the platonic representation hypothesis: An aristotelian view, 2026. URLhttps://arxiv.org/abs/2602.14486

  23. [23]

    David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6541–6549, 2017

  24. [24]

    Feature visualization.Distill, 2017

    Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization.Distill, 2017

  25. [25]

    Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal

    Leilani H. Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning.Proceedings of the IEEE International Conference on data science and advanced analytics (DSAA), pages 80–89, 2018

  26. [26]

    Unlocking feature visualization for deeper networks with magnitude constrained optimization.Advances in Neural Information Processing Systems (NeurIPS), 2023

    Thomas Fel, Thibaut Boissin, Victor Boutin, Agustin Picard, Paul Novello, Julien Colin, Drew Linsley, Tom Rousseau, R´emi Cad`ene, Laurent Gardes, and Thomas Serre. Unlocking feature visualization for deeper networks with magnitude constrained optimization.Advances in Neural Information Processing Systems (NeurIPS), 2023

  27. [27]

    Visualizing and understanding convolutional networks

    Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. Proceedings of the IEEE European Conference on Computer Vision (ECCV), pages 818–833, 2014

  28. [28]

    Rise: Randomized input sampling for explanation of black-box models.Proceedings of the British Machine Vision Conference (BMVC), page 151, 2018

    Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models.Proceedings of the British Machine Vision Conference (BMVC), page 151, 2018

  29. [29]

    Axiomatic attribution for deep networks

    Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. Proceedings of the International Conference on Machine Learning (ICML), pages 3319–3328, 2017

  30. [30]

    Towards automatic concept- based explanations.Advances in Neural Information Processing Systems (NeurIPS), 32, 2019

    Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. Towards automatic concept- based explanations.Advances in Neural Information Processing Systems (NeurIPS), 32, 2019

  31. [31]

    Ruihan Zhang, Prashan Madumal, Tim Miller, Krista A Ehinger, and Benjamin IP Rubinstein. Invertible concept-based explanations for cnn models with non-negative concept activation vectors.Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 35(13):11682– 11690, 2021

  32. [32]

    Toy models of superposition.Transformer Circuits Thread, 2022

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Transformer Circuits Thread, 2022. 11

  33. [33]

    Emergence of simple-cell receptive field properties by learning a sparse code for natural images.Nature, 381(6583):607–609, 1996

    Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.Nature, 381(6583):607–609, 1996

  34. [34]

    Sparse coding with an overcomplete basis set: A strategy employed by v1?Vision research, 37(23):3311–3325, 1997

    Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1?Vision research, 37(23):3311–3325, 1997

  35. [35]

    Sparse coding of sensory inputs.Current Opin- ion in Neurobiology, 14(4):481–487, 2004

    Bruno A Olshausen and David J Field. Sparse coding of sensory inputs.Current Opin- ion in Neurobiology, 14(4):481–487, 2004. ISSN 0959-4388. doi: https://doi.org/10. 1016/j.conb.2004.07.007. URL https://www.sciencedirect.com/science/article/pii/ S0959438804001035

  36. [36]

    Efficient sparse coding algorithms

    Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms. Advances in Neural Information Processing Systems (NeurIPS), 19, 2006

  37. [37]

    Beyond l1 sparse coding in v1.PLoS Computational Biology, 19(9):e1011459, 2023

    Ilias Rentzeperis, Luca Calatroni, Laurent U Perrinet, and Dario Prandi. Beyond l1 sparse coding in v1.PLoS Computational Biology, 19(9):e1011459, 2023

  38. [38]

    Extensions of lipschitz mappings into a hilbert space.Contemporary mathematics, 26(189-206):1, 1984

    William B Johnson, Joram Lindenstrauss, et al. Extensions of lipschitz mappings into a hilbert space.Contemporary mathematics, 26(189-206):1, 1984

  39. [39]

    Optimality of the johnson-lindenstrauss lemma

    Kasper Green Larsen and Jelani Nelson. Optimality of the johnson-lindenstrauss lemma. In 2017 IEEE 58th annual symposium on foundations of computer science (FOCS), pages 633–638. IEEE, 2017

  40. [40]

    Dictionaries for sparse representation modeling.Proceedings of the IEEE, 98(6):1045–1057, 2010

    Ron Rubinstein, Alfred M Bruckstein, and Michael Elad. Dictionaries for sparse representation modeling.Proceedings of the IEEE, 98(6):1045–1057, 2010

  41. [41]

    Springer International Publishing, 2010

    Michael Elad.Sparse and redundant representations: from theory to applications in signal and image processing. Springer International Publishing, 2010

  42. [42]

    Dictionary learning.IEEE Signal Processing Magazine, 28(2): 27–38, 2011

    Ivana Toˇsi´c and Pascal Frossard. Dictionary learning.IEEE Signal Processing Magazine, 28(2): 27–38, 2011

  43. [43]

    Springer, 2018

    Bogdan Dumitrescu and Paul Irofti.Dictionary learning algorithms and applications. Springer, 2018

  44. [44]

    Scaling and evaluating sparse autoencoders.Proceedings of the International Conference on Learning Representations (ICLR), 2025

    Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.Proceedings of the International Conference on Learning Representations (ICLR), 2025

  45. [45]

    Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders.ArXiv e-print, 2024

    Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, Janos Kramar, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders.ArXiv e-print, 2024

  46. [46]

    Batchtopk sparse autoencoders.ArXiv e-print, 2024

    Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders.ArXiv e-print, 2024

  47. [47]

    Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models

    Thomas Fel, Ekdeep Singh Lubana, Jacob S Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, and Talia Konkle. Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models. Proceedings of the International Conference on Machine Learning (ICML), 2025

  48. [48]

    From flat to hierarchical: Extracting sparse representations with matching pursuit.arXiv preprint arXiv:2506.03093, 2025

    Val´erie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, and Demba Ba. From flat to hierarchical: Extracting sparse representations with matching pursuit.arXiv preprint arXiv:2506.03093, 2025

  49. [49]

    The missing curve detectors of inceptionv1: Applying sparse autoencoders to inceptionv1 early vision.ArXiv e-print, 2024

    Liv Gorton. The missing curve detectors of inceptionv1: Applying sparse autoencoders to inceptionv1 early vision.ArXiv e-print, 2024

  50. [50]

    Into the rabbit hull: From task- relevant concepts in dino to minkowski geometry.Proceedings of the International Conference on Learning Representations (ICLR), 2026

    Thomas Fel, Binxu Wang, Michael A Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S Lubana, Talia Konkle, Demba Ba, et al. Into the rabbit hull: From task- relevant concepts in dino to minkowski geometry.Proceedings of the International Conference on Learning Representations (ICLR), 2026. 12

  51. [51]

    Unpacking sdxl turbo: Interpreting text-to-image models with sparse au- toencoders

    Viacheslav Surkov, Chris Wendler, Mikhail Terekhov, Justin Deschenaux, Robert West, and Caglar Gulcehre. Unpacking sdxl turbo: Interpreting text-to-image models with sparse au- toencoders. InMechanistic Interpretability for Vision at CVPR 2025 (Non-proceedings Track), 2025

  52. [52]

    Semi-supervised multimodal representation learning through a global workspace. IEEE Transactions on Neural Networks and Learning Systems, 36(5):7843–7857, 2024

    Benjamin Devillers, Léopold Maytié, and Rufin VanRullen. Semi-supervised multimodal representation learning through a global workspace. IEEE Transactions on Neural Networks and Learning Systems, 36(5):7843–7857, 2024

  53. [53]

    Unpaired image-to-image translation using cycle-consistent adversarial networks

    Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017

  54. [54]

    Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017

    Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017

  55. [55]

    On the rate of convergence in wasserstein distance of the empirical measure. Probability theory and related fields, 162(3):707–738, 2015

    Nicolas Fournier and Arnaud Guillin. On the rate of convergence in wasserstein distance of the empirical measure. Probability theory and related fields, 162(3):707–738, 2015

  56. [56]

    Computational optimal transport. Foundations and Trends in Machine Learning, 2018

    Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 2018

  57. [57]

    Lejepa: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2025

    Randall Balestriero and Yann LeCun. Lejepa: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2025

  58. [58]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghan, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  59. [59]

    Dinov2: Learning robust visual features without supervision. ArXiv e-print, 2023

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. ArXiv e-print, 2023

  60. [60]

    Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

  61. [61]

    Reproducible scaling laws for contrastive language-image learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, 2023

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, 2023

  62. [62]

    Microsoft coco: Common objects in context. Proceedings of the IEEE European Conference on Computer Vision (ECCV), pages 740–755, 2014

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. Proceedings of the IEEE European Conference on Computer Vision (ECCV), pages 740–755, 2014

  63. [63]

    Harnessing frozen unimodal encoders for flexible multi-modal alignment

    Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, and Noel E O’Connor. Harnessing frozen unimodal encoders for flexible multi-modal alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29847–29857, 2025

  64. [64]

    Overcoming sparsity artifacts in crosscoders to interpret chat-tuning

    Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, and Neel Nanda. Overcoming sparsity artifacts in crosscoders to interpret chat-tuning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=yFdNygEryH

  65. [65]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015

  66. [66]

    Distributionally robust neural networks

    Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryxGuJrFvS

  67. [67]

    Invariant risk minimization. ArXiv e-print, 2019

    Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ArXiv e-print, 2019

  68. [68]

    A kernel two-sample test. The journal of machine learning research, 13(1):723–773, 2012

    Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The journal of machine learning research, 13(1):723–773, 2012

  69. [69]

    Learning transferable features with deep adaptation networks

    Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015

  70. [70]

    Generative moment matching networks

    Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International conference on machine learning, pages 1718–1727. PMLR, 2015

  71. [71]

    Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018

    Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018

  72. [72]

    Enhancing neural network interpretability with feature-aligned sparse autoencoders, 2024

    Luke Marks, Alasdair Paren, David Krueger, and Fazl Barez. Enhancing neural network interpretability with feature-aligned sparse autoencoders, 2024. URL https://arxiv.org/abs/2411.01220

  73. [73]

    Projecting assumptions: The duality between sparse autoencoders and concept geometry. ArXiv e-print, 2025

    Sai Sumedh R Hindupur, Ekdeep Singh Lubana, Thomas Fel, and Demba Ba. Projecting assumptions: The duality between sparse autoencoders and concept geometry. ArXiv e-print, 2025

  74. [74]

    Imagenet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009

  75. [75]

    Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. ArXiv e-print, 2021

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. ArXiv e-print, 2021.

A Extended Related Work

A.1 Unifying existing methods

Figure 1 summarizes the different methods for concept e...