
arxiv: 2605.20337 · v1 · pith:5TTWNIS5new · submitted 2026-05-19 · 💻 cs.CV

Capability ≠ Interpretability: Human Interpretability of Vision Foundation Models

Pith reviewed 2026-05-21 07:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision transformers, human interpretability, psychophysics, foundation models, sparse autoencoders, feature localization, model evaluation

The pith

Foundation models produce less human-interpretable features than supervised vision transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework to measure human interpretability of vision model features through two psychophysics protocols. One tests whether people can predict where a recovered feature will activate on a new image. The other tests whether people can accurately describe what the feature represents. Features come from sparse autoencoders and scores are normalized against chance to allow direct comparison across models. When applied to supervised ViTs and four foundation models, the data from over 13,000 valid responses show foundation models rank lower on interpretability. This shortfall does not track differences in task performance, but instead tracks how focal the activations are and how well they match broad human categories.

Core claim

Foundation models are consistently less interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of model

What carries the argument

Two psychophysics protocols (localizability and nameability) applied to sparse autoencoder features, combined with chance-anchored scoring to rank models on one scale.
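The paper's exact scoring function is not reproduced on this page. As a minimal sketch, assuming the common chance-correction form (accuracy - chance) / (1 - chance), a score of 0 corresponds to chance-level performance and 1 to perfect performance, which is what lets tasks with different chance levels share one scale. The function name `chance_anchored_score` and the example numbers are hypothetical.

```python
def chance_anchored_score(accuracy: float, chance: float) -> float:
    """Rescale raw accuracy so that 0 is chance level and 1 is perfect.

    Assumed form, not taken from the paper: (accuracy - chance) / (1 - chance).
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return (accuracy - chance) / (1.0 - chance)


# Hypothetical numbers: a 4-alternative task (chance 0.25) and a
# 2-alternative task (chance 0.5) become comparable after rescaling.
score_4afc = chance_anchored_score(accuracy=0.625, chance=0.25)
score_2afc = chance_anchored_score(accuracy=0.75, chance=0.5)
print(score_4afc, score_2afc)
```

Under this assumed form, below-chance performance yields a negative score, which may or may not match how the paper treats such cases.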

Load-bearing premise

The localizability and nameability protocols together provide a valid and generalizable measure of human interpretability for the recovered features.

What would settle it

Demonstrating a positive correlation between the measured interpretability scores and performance on a new downstream benchmark not tested in the study would undermine the claim that interpretability is independent of capability.

Figures

Figures reproduced from arXiv: 2605.20337 by Julien Colin, Lore Goetschalckx, Nuria Oliver, Thomas Serre.

Figure 1: Models trained under different objectives learn different internal representations and, …
Figure 2: Interpretability is uncorrelated with downstream task performance. Each panel plots an interpretability score against a capability benchmark across the six models. Top row: localizability; bottom row: nameability. Columns: ImageNet-1k top-1 accuracy, ADE20K semantic segmentation, and perceptual grouping. Spearman ρ and p-values are shown in each panel. Correlations are non-significant for both interpretab…
Figure 3: Locality and coarse-grained semantic alignment track interpretability. Each panel plots an interpretability score against a representational property across the six models. Top row: localizability; bottom row: nameability. Columns: locality of the representation (mean Hoyer sparsity over feature heatmaps), and coarse-grained alignment with human similarity judgments on THINGS [43] and Levels [39] (odd-one-…
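The locality measure named in the Figure 3 caption, Hoyer sparsity [35], is defined for a vector x with n entries as (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1). The sketch below computes it on flattened heatmaps and averages over features. How the paper flattens, normalizes, or aggregates its heatmaps is not stated here, so the helper names and toy heatmaps are assumptions for illustration only.

```python
import math
from typing import Sequence


def hoyer_sparsity(heatmap: Sequence[float]) -> float:
    """Hoyer (2004) sparsity of a flattened, non-negative heatmap.

    Returns 1.0 when a single location carries all the activation and
    0.0 when activation is spread uniformly over all locations.
    """
    n = len(heatmap)
    if n < 2:
        raise ValueError("need at least two locations")
    l1 = sum(abs(v) for v in heatmap)
    l2 = math.sqrt(sum(v * v for v in heatmap))
    if l2 == 0.0:
        raise ValueError("heatmap is all zeros")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


def mean_hoyer(heatmaps: Sequence[Sequence[float]]) -> float:
    """Average Hoyer sparsity over a collection of feature heatmaps."""
    return sum(hoyer_sparsity(h) for h in heatmaps) / len(heatmaps)


# Toy 2x2 heatmaps, flattened (hypothetical values).
focal = [0.0, 0.0, 0.0, 1.0]          # one active patch
diffuse = [0.25, 0.25, 0.25, 0.25]    # activation spread evenly
print(hoyer_sparsity(focal), hoyer_sparsity(diffuse), mean_hoyer([focal, diffuse]))
```

A higher mean value indicates more focal activations, the property the caption reports as tracking interpretability.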
Original abstract

How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than 15,000 behavioral responses, analyzing the 13,400 responses from the 377 participants who passed our pre-specified quality checks. Foundation models are consistently less interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a framework to measure human interpretability of features in vision transformers using sparse autoencoders combined with two psychophysics protocols: localizability (predicting where a feature activates on novel images) and nameability (describing the feature). Based on 13,400 valid responses from 377 participants who passed pre-specified quality checks, it reports that foundation models (DINOv2, DINOv3, CLIP, SigLIP) are consistently less interpretable than supervised ViTs, that this gap does not reflect a capability tradeoff (no correlation with downstream benchmarks), and that interpretability instead correlates with activation locality and coarse-grained semantic alignment with humans.

Significance. If the protocols prove robust to image selection and participant criteria, the work would establish interpretability as a measurable, independent dimension of representation quality separate from capability. The large behavioral dataset provides solid empirical grounding for the model-type comparisons and identifies actionable predictors (locality, coarse alignment). This has direct relevance for high-stakes vision deployments where human-understandable features matter.

major comments (3)
  1. [Methods (psychophysics protocols and participant screening)] The central claim that foundation models are less interpretable than supervised ViTs, and that the gap is not a capability tradeoff, depends on the psychophysics protocols capturing intrinsic feature properties rather than artifacts. The image selection process and the quality filters that retained 377 participants (from the initial pool yielding more than 15,000 responses) could systematically favor focal activations more common in supervised models; without a sensitivity analysis varying image distributions or screening criteria, the observed gap and lack of benchmark correlation may be setup-dependent.
  2. [Results (correlation with downstream performance)] The assertion that interpretability does not correlate with downstream task performance is load-bearing for the no-tradeoff conclusion. The manuscript should report the exact benchmarks examined, the correlation coefficients (with confidence intervals), and any multiple-comparison corrections, as the null result could be sensitive to benchmark choice or statistical power.
  3. [Results (protocol agreement)] The statement that the two protocols yield strongly correlated rankings and share the same predictors underpins the claim that interpretability is a coherent dimension. The specific Pearson or Spearman correlation value, sample size, and p-value for the protocol agreement should be provided explicitly.
minor comments (2)
  1. [Abstract] The abstract states 'more than 15,000 behavioral responses' but analyzes 13,400 from 377 participants; state the exact initial participant count and response total in the main text for transparency.
  2. [Throughout] Ensure consistent terminology when referring to the four foundation models versus the two supervised ViTs across figures and tables.
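The statistical-power concern in major comment 2 can be made concrete. With only six models, a rank correlation has just 6! = 720 possible orderings, so an exact permutation p-value can be enumerated directly, and even a perfectly monotone relation cannot yield a two-sided p-value below 2/720 (about 0.0028). The sketch below is an illustration under stated assumptions: the helper names and the six example scores are hypothetical and are not taken from the paper.

```python
import math
from itertools import permutations
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman_exact(x: Sequence[float], y: Sequence[float]) -> tuple[float, float]:
    """Spearman rho with an exact two-sided permutation p-value."""
    rx, ry = ranks(x), ranks(y)
    rho = pearson(rx, ry)
    perms = list(permutations(ry))  # 720 orderings when len(y) == 6
    extreme = sum(1 for p in perms if abs(pearson(rx, p)) >= abs(rho) - 1e-12)
    return rho, extreme / len(perms)


# Hypothetical scores for six models (not taken from the paper).
interpretability = [0.62, 0.58, 0.41, 0.37, 0.45, 0.33]
benchmark_accuracy = [0.81, 0.79, 0.84, 0.86, 0.80, 0.83]
rho, p_value = spearman_exact(interpretability, benchmark_accuracy)
print(f"rho={rho:.3f}, exact two-sided p={p_value:.4f}")

# A perfectly monotone relation gives the smallest attainable p-value.
_, min_p = spearman_exact([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])
print(f"smallest attainable p={min_p:.4f}")
```

Because the p-value grid is this coarse, a non-significant result across six models is weak evidence of no relationship, which is why the referee asks for coefficients and intervals rather than significance alone.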

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review of our manuscript. We address each major comment below and commit to revisions that strengthen the presentation of our methods and results without altering the core findings.

Point-by-point responses
  1. Referee: [Methods (psychophysics protocols and participant screening)] The central claim that foundation models are less interpretable than supervised ViTs, and that the gap is not a capability tradeoff, depends on the psychophysics protocols capturing intrinsic feature properties rather than artifacts. The image selection process and the quality filters that retained 377 participants (from the initial pool yielding more than 15,000 responses) could systematically favor focal activations more common in supervised models; without a sensitivity analysis varying image distributions or screening criteria, the observed gap and lack of benchmark correlation may be setup-dependent.

    Authors: We agree that robustness to these design choices is important to establish. In the revised manuscript we will add a dedicated sensitivity analysis section that re-runs the key comparisons under alternative image sampling distributions and under relaxed or stricter participant quality thresholds. This will directly test whether the interpretability gap and its predictors remain stable. revision: yes

  2. Referee: [Results (correlation with downstream performance)] The assertion that interpretability does not correlate with downstream task performance is load-bearing for the no-tradeoff conclusion. The manuscript should report the exact benchmarks examined, the correlation coefficients (with confidence intervals), and any multiple-comparison corrections, as the null result could be sensitive to benchmark choice or statistical power.

    Authors: We will expand the relevant results section to list every benchmark examined, report the Spearman correlation coefficients (the statistic shown in Figure 2) together with 95% confidence intervals, and state the multiple-comparison correction applied. These additions will make the statistical basis for the null result fully transparent. revision: yes

  3. Referee: [Results (protocol agreement)] The statement that the two protocols yield strongly correlated rankings and share the same predictors underpins the claim that interpretability is a coherent dimension. The specific Pearson or Spearman correlation value, sample size, and p-value for the protocol agreement should be provided explicitly.

    Authors: We will insert the requested quantitative details into the results section, reporting the Spearman rank correlation between the two protocol scores, the number of features on which it is computed, and the associated p-value. This will give explicit support to the claim that the protocols measure a coherent dimension. revision: yes
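The rebuttal does not name the multiple-comparison correction it will apply. As one possibility, the Holm-Bonferroni step-down procedure could be applied to the six tests implied by Figure 2 (two protocols by three benchmarks). The sketch below is illustrative only: the choice of Holm, the function name `holm_adjust`, and the p-values are assumptions, not the authors' stated method or data.

```python
from typing import Sequence


def holm_adjust(p_values: Sequence[float]) -> list[float]:
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for 2 protocols x 3 benchmarks = 6 tests.
raw = [0.01, 0.04, 0.03, 0.20, 0.50, 0.005]
adjusted = holm_adjust(raw)
print(adjusted)
```

A test is declared significant at level alpha when its adjusted p-value is at most alpha; Holm controls the family-wise error rate while being no more conservative than plain Bonferroni.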

Circularity Check

0 steps flagged

No significant circularity: empirical psychophysics measurements are self-contained

full rationale

The paper derives its claims from direct human behavioral data collected via two psychophysics protocols (localizability and nameability) applied to features recovered by sparse autoencoders from six vision transformers. Over 15,000 responses were gathered and filtered to 13,400 from 377 participants using pre-specified quality checks, with rankings and correlations computed from these participant judgments rather than from any fitted parameters, self-definitions, or load-bearing self-citations. The reported gap in interpretability between foundation models and supervised ViTs, along with the lack of correlation to downstream benchmarks and the role of locality and coarse-grained alignment, emerges from the external human responses and does not reduce to the inputs by construction. This is a standard empirical study whose central results are falsifiable against new participant cohorts or image sets and therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that sparse autoencoders recover human-meaningful features and that the chosen psychophysics tasks validly capture interpretability; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: Human observers can reliably perform localization and naming tasks on model features when quality controls are applied.
    Invoked to justify the behavioral data as a measure of interpretability.

pith-pipeline@v0.9.0 · 5852 in / 1227 out tokens · 27304 ms · 2026-05-21T07:32:55.592693+00:00 · methodology



Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 5 internal anchors

  1. [1]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representation...

  2. [2]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

  3. [3]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 11975–11986, 2023

  4. [4]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  5. [5]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025

  6. [6]

    Fine-grained classification for poisonous fungi identification with transfer learning

    Christopher Chiu, Maximilian Heil, Teresa Kim, and Anthony Miyaguchi. Fine-grained classification for poisonous fungi identification with transfer learning. arXiv preprint arXiv:2407.07492, 2024

  7. [7]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Liu Shilong, Zeng Zhaoyang, Ren Tianhe, Li Feng, Zhang Hao, Yang Jie, Jiang Qing, Li Chunyuan, Yang Jianwei, Su Hang, Zhu Jun, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2024

  8. [8]

    Leveraging foundation models via knowledge distillation in multi-object tracking: Distilling dinov2 features to fairmot

    Niels G Faber, Seyed Sahand Mohammadi Ziabari, and Fatemeh Karimi Nejadasl. Leveraging foundation models via knowledge distillation in multi-object tracking: Distilling dinov2 features to fairmot. arXiv preprint arXiv:2407.18288, 2024

  9. [9]

    Dino-tracker: Taming dino for self-supervised point tracking in a single video

    Narek Tumanyan, Assaf Singer, Shai Bagon, and Tali Dekel. Dino-tracker: Taming dino for self-supervised point tracking in a single video. In European Conference on Computer Vision, pages 367–385. Springer, 2024

  10. [10]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025

  11. [11]

    Evaluating general purpose vision foundation models for medical image analysis: An experimental study of dinov2 on radiology benchmarks

    Mohammed Baharoon, Waseem Qureshi, Jiahong Ouyang, Yanwu Xu, Abdulrhman Aljouie, and Wei Peng. Evaluating general purpose vision foundation models for medical image analysis: An experimental study of dinov2 on radiology benchmarks. arXiv preprint arXiv:2312.02366, 2023

  12. [12]

    Advancing human-centric ai for robust x-ray analysis through holistic self-supervised learning

    Théo Moutakanni, Piotr Bojanowski, Guillaume Chassagnon, Céline Hudelot, Armand Joulin, Yann LeCun, Matthew Muckley, Maxime Oquab, Marie-Pierre Revel, and Maria Vakalopoulou. Advancing human-centric ai for robust x-ray analysis through holistic self-supervised learning. arXiv preprint arXiv:2405.01469, 2024

  13. [13]

    Foundation models meet medical image interpretation

    Licheng Jiao, Jiayao Hao, Ruiyang Li, Lingling Li, Xu Liu, Fang Liu, Wenping Ma, Puhua Chen, Zhongjian Huang, Jingyi Yang, Jiaxuan Zhao, and Qigong Sun. Foundation models meet medical image interpretation. Research, 9:1024, 2026. doi: 10.34133/research.1024. URL https://spj.science.org/doi/abs/10.34133/research.1024

  14. [14]

    Foundation models for radiology: fundamentals, applications, opportunities, challenges, risks, and prospects

    Tugba Akinci D’Antonoli, Christian Bluethgen, Renato Cuocolo, Michail E Klontzas, Andrea Ponsiglione, and Burak Kocak. Foundation models for radiology: fundamentals, applications, opportunities, challenges, risks, and prospects. Diagnostic and Interventional Radiology, 2025

  15. [15]

    A survey for foundation models in autonomous driving

    Haoxiang Gao, Zhongruo Wang, Yaqian Li, Kaiwen Long, Ming Yang, and Yiqing Shen. A survey for foundation models in autonomous driving. In 2025 6th International Conference on Computer Vision and Data Mining (ICCVDM), pages 63–71. IEEE, 2025

  16. [16]

    A survey on vision-language-action models for autonomous driving

    Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, et al. A survey on vision-language-action models for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4524–4536, 2025

  17. [17]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  18. [18]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024

  19. [19]

    Scale alone does not improve mechanistic interpretability in vision models

    Roland S. Zimmermann, Thomas Klein, and Wieland Brendel. Scale alone does not improve mechanistic interpretability in vision models. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

  20. [20]

    Toy models of superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022

  21. [21]

    Scaling monosema...

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

  22. [22]

    Archetypal SAE: Adaptive and stable dictionary learning for concept extraction in large vision models

    Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba E. Ba, and Talia Konkle. Archetypal SAE: Adaptive and stable dictionary learning for concept extraction in large vision models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proc...

  23. [23]

    Network dissection: Quantifying interpretability of deep visual representations

    David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6541–6549, 2017

  24. [24]

    Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization

    Judy Borowski, Roland S. Zimmermann, Judith Schepers, Robert Geirhos, Thomas S. A. Wallis, Matthias Bethge, and Wieland Brendel. Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization. In Proceedings of the International Conference on Learning Representations (ICLR), 2021

  25. [25]

    Measuring per-unit interpretability at scale without humans

    Roland S. Zimmermann, David Klindt, and Wieland Brendel. Measuring per-unit interpretability at scale without humans. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 48448–48483. Curran Associates, Inc., 2024. doi: 10.52202/07901...

  26. [26]

    Choosing the right basis for interpretability: Psychophysical comparison between neuron-based and dictionary-based representations

    Julien Colin, Lore Goetschalckx, Thomas Fel, Victor Boutin, Thomas Serre, and Nuria Oliver. Choosing the right basis for interpretability: Psychophysical comparison between neuron-based and dictionary-based representations. arXiv preprint arXiv:2411.03993, 2024

  27. [27]

    Learning important features through propagating activation differences

    Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning (ICML), 2017

  28. [28]

    A holistic approach to unifying automatic concept extraction and concept importance estimation

    Thomas Fel, Victor Boutin, Louis Béthune, Rémi Cadène, Mazda Moayeri, Léo Andéol, Mathieu Chalvidal, and Thomas Serre. A holistic approach to unifying automatic concept extraction and concept importance estimation. Advances in Neural Information Processing Systems, 36:54805–54818, 2023

  29. [29]

    Negative results for saes on downstream tasks and deprioritising sae research (gdm mech interp team progress update #2)

    Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah, and Neel Nanda. Negative results for saes on downstream tasks and deprioritising sae research (gdm mech interp team progress update #2). AI Alignment Forum, 2025

  30. [30]

    Dense sae latents are features, not bugs

    Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Peng Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, and Max Tegmark. Dense sae latents are features, not bugs. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025

  31. [31]

    Unlocking Feature Visualization for Deeper Networks with Magnitude Constrained Optimization

    Thomas Fel, Thibaut Boissin, Victor Boutin, Agustin Picard, Paul Novello, Julien Colin, Drew Linsley, Tom Rousseau, Rémi Cadène, Lore Goetschalckx, Laurent Gardes, and Thomas Serre. Unlocking Feature Visualization for Deeper Networks with Magnitude Constrained Optimization. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 37...

  32. [32]

    RISE: Randomized Input Sampling for Explanation of Black-box Models

    Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized Input Sampling for Explanation of Black-box Models. In Proceedings of the British Machine Vision Conference (BMVC), 2018

  33. [33]

    Pure: Turning polysemantic neurons into pure features by identifying relevant circuits

    Maximilian Dreyer, Erblina Purelku, Johanna Vielhaben, Wojciech Samek, and Sebastian Lapuschkin. Pure: Turning polysemantic neurons into pure features by identifying relevant circuits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8212–8217, 2024

  34. [34]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017

  35. [35]

    Non-negative matrix factorization with sparseness constraints

    Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research (JMLR), 5(Nov):1457–1469, 2004

  36. [36]

    Alignment with human representations supports robust few-shot learning

    Ilia Sucholutsky and Tom Griffiths. Alignment with human representations supports robust few-shot learning. Advances in Neural Information Processing Systems (NeurIPS), 36:73464–73479, 2023

  37. [37]

    Improving neural network representations using human similarity judgments

    Lukas Muttenthaler, Lorenz Linhardt, Jonas Dippel, Robert A Vandermeulen, Katherine Hermann, Andrew Lampinen, and Simon Kornblith. Improving neural network representations using human similarity judgments. Advances in Neural Information Processing Systems (NeurIPS), 36:50978–51007, 2023

  38. [38]

    When does perceptual alignment benefit vision representations?

    Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, and Phillip Isola. When does perceptual alignment benefit vision representations? Advances in Neural Information Processing Systems (NeurIPS), 37:55314–55341, 2024

  39. [39]

    Aligning machine and human visual representations across abstraction levels

    Lukas Muttenthaler, Klaus Greff, Frieda Born, Bernhard Spitzer, Simon Kornblith, Michael C Mozer, Klaus-Robert Müller, Thomas Unterthiner, and Andrew K Lampinen. Aligning machine and human visual representations across abstraction levels. Nature, 647(8089):349–355, 2025. doi: 10.1038/s41586-025-09631-6

  40. [40]

    How aligned are different alignment metrics?

    Jannis Ahlert, Thomas Klein, Felix A. Wichmann, and Robert Geirhos. How aligned are different alignment metrics? In ICLR 2024 Workshop on Representational Alignment (Re-Align), 2024. URL https://openreview.net/forum?id=cHlKB28bjV

  41. [41]

    Learning What and Where to Attend

    Drew Linsley, Dan Shiebler, Sven Eberhardt, and Thomas Serre. Learning What and Where to Attend. In Proceedings of the International Conference on Learning Representations (ICLR), 2019

  42. [42]

    Harmonizing the object recognition strategies of deep neural networks with humans

    Thomas Fel, Ivan F Rodriguez Rodriguez, Drew Linsley, and Thomas Serre. Harmonizing the object recognition strategies of deep neural networks with humans. Advances in Neural Information Processing Systems (NeurIPS), 35:9432–9446, 2022

  43. [43]

    Revealing the multidimensional mental representations of natural objects underlying human similarity judgements

    Martin N. Hebart, Charles Y. Zheng, Francisco Pereira, and Chris I. Baker. Revealing the multidimensional mental representations of natural objects underlying human similarity judgements. Nature Human Behaviour, 4(11):1173–1185, 2020

  44. [44]

    THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior

    Martin N. Hebart, Oliver Contier, Lina Teichmann, Adam H. Rockter, Charles Y. Zheng, Alexis Kidder, Anna Corriveau, Maryam Vaziri-Pashkam, and Chris I. Baker. THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife, 12:e82580, 2023. doi: 10.7554/eLife.82580

  45. [45]

    DreamSim: Learning new dimensions of human visual similarity using synthetic data

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344, 2023

  46. [46]

    Revisiting the platonic representation hypothesis: An aristotelian view

    Fabian Gröger, Shuo Wen, and Maria Brbić. Revisiting the platonic representation hypothesis: An aristotelian view. arXiv preprint arXiv:2602.14486, 2026

  47. [47]

    Getting aligned on representational alignment

    Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Iris Groen, Jascha Achterberg, et al. Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018, 2023

  48. [48]

    Natural language descriptions of deep visual features

    Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In Proceedings of the International Conference on Learning Representations (ICLR), 2022

  49. [49]

    Language models can explain neurons in language models

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI Blog, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

  50. [50]

    CLIP-Dissect: Automatic description of neuron representations in deep vision networks

    Tuomas Oikarinen and Tsui-Wei Weng. CLIP-Dissect: Automatic description of neuron representations in deep vision networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

  51. [51]

    The similarity metric

    Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul MB Vitányi. The similarity metric. IEEE transactions on Information Theory, 50(12):3250–3264, 2004.

A Limitations and broader impact. Limitations. We acknowledge several limitations. First, our model-level analyses span only six architectures, which limits statistical power and makes it difficult to disentangle...