Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

Julien Colin; Lore Goetschalckx; Nuria Oliver; Thomas Serre

arxiv: 2605.20337 · v1 · pith:5TTWNIS5new · submitted 2026-05-19 · 💻 cs.CV

Pith reviewed 2026-05-21 07:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords: vision transformers · human interpretability · psychophysics · foundation models · sparse autoencoders · feature localization · model evaluation

The pith

Foundation models produce less human-interpretable features than supervised vision transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework to measure human interpretability of vision model features through two psychophysics protocols. One tests whether people can predict where a recovered feature will activate on a new image. The other tests whether people can accurately describe what the feature represents. Features come from sparse autoencoders and scores are normalized against chance to allow direct comparison across models. When applied to supervised ViTs and four foundation models, the data from over 13,000 valid responses show foundation models rank lower on interpretability. This shortfall does not track differences in task performance, but instead tracks how focal the activations are and how well they match broad human categories.

Core claim

Foundation models are consistently less interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of model

What carries the argument

Two psychophysics protocols (localizability and nameability) applied to sparse autoencoder features, combined with chance-anchored scoring to rank models on one scale.
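The exact scoring function is not reproduced on this page. As a minimal sketch, assuming the common convention of rescaling accuracy so that chance maps to 0 and perfect performance to 1, chance anchoring could look like the following. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def chance_anchored_score(accuracy: float, chance: float) -> float:
    """Rescale raw accuracy so chance maps to 0 and perfect accuracy to 1.

    Assumed convention for illustration; the paper's own function may differ.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return (accuracy - chance) / (1.0 - chance)


# Hypothetical numbers: the same raw accuracy on tasks with different
# chance levels lands at different points on the common scale.
two_choice = chance_anchored_score(0.75, chance=0.5)
four_choice = chance_anchored_score(0.75, chance=0.25)
```

Under this convention a score of 0 means observers did no better than guessing, which is what lets tasks with different chance levels share one scale.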

Load-bearing premise

The localizability and nameability protocols together provide a valid and generalizable measure of human interpretability for the recovered features.

What would settle it

Demonstrating a positive correlation between the measured interpretability scores and performance on a new downstream benchmark not tested in the study would undermine the claim that interpretability is independent of capability.

Figures

Figures reproduced from arXiv: 2605.20337 by Julien Colin, Lore Goetschalckx, Nuria Oliver, Thomas Serre.

**Figure 2.** Interpretability is uncorrelated with downstream task performance. Each panel plots an interpretability score against a capability benchmark across the six models. Top row: localizability; bottom row: nameability. Columns: ImageNet-1k top-1 accuracy, ADE20K semantic segmentation, and perceptual grouping. Spearman ρ and p-values are shown in each panel. Correlations are non-significant for both interpretab…

**Figure 3.** Locality and coarse-grained semantic alignment track interpretability. Each panel plots an interpretability score against a representational property across the six models. Top row: localizability; bottom row: nameability. Columns: locality of the representation (mean Hoyer sparsity over feature heatmaps), and coarse-grained alignment with human similarity judgments on THINGS [43] and Levels [39] (odd-one-…
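The Figure 3 caption measures locality as the mean Hoyer sparsity over feature heatmaps. The sketch below uses the standard Hoyer sparseness definition from [35], (√n − ‖x‖₁/‖x‖₂) / (√n − 1). How the paper builds or preprocesses its heatmaps is not stated on this page, so the nested-list input format and the helper names are assumptions.

```python
import math


def hoyer_sparsity(values):
    """Hoyer sparseness of a flat sequence: 0 for uniform, 1 for one non-zero entry."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two entries")
    l1 = sum(abs(v) for v in values)
    l2 = math.sqrt(sum(v * v for v in values))
    if l2 == 0.0:
        raise ValueError("sparsity is undefined for an all-zero map")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


def mean_hoyer(heatmaps):
    """Average Hoyer sparsity over 2-D heatmaps given as lists of rows (assumed format)."""
    scores = [hoyer_sparsity([v for row in h for v in row]) for h in heatmaps]
    return sum(scores) / len(scores)


# Toy heatmaps: one activation concentrated in a single cell, one spread evenly.
focal = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
diffuse = [[1.0] * 3 for _ in range(3)]
locality = mean_hoyer([focal, diffuse])
```

A focal map scores near 1 and a diffuse map near 0, so a higher mean indicates more spatially concentrated features.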

Original abstract

How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than $15{,}000$ behavioral responses, analyzing the $13{,}400$ responses from the $377$ participants who passed our pre-specified quality checks. Foundation models are consistently *less* interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Foundation models score lower than supervised ViTs on two human psychophysics measures of feature interpretability, and the gap shows no link to task performance.

This paper reports that foundation models (DINOv2, DINOv3, CLIP, SigLIP) produce less human-interpretable features than supervised ViTs on localizability and nameability tasks, with no apparent tradeoff against downstream benchmarks. The gap instead tracks locality of activations and coarse semantic alignment. The authors recover features via sparse autoencoders, then run two protocols: one where participants predict feature locations on new images, and one where they name the feature. Scores are anchored to chance so that all models sit on the same scale.

The 13,400 valid responses from 377 screened participants give the comparison real weight, and the two protocols agree on model rankings. That is the concrete advance over prior interpretability work. The large behavioral sample and the direct cross-model comparison on the same tasks are the parts that hold up cleanly. The absence of correlation with capability metrics is also a useful negative result if the statistics check out.

The soft spots sit in the experimental controls. The stress-test note is fair: results could shift with different image selections or stricter participant filters, since supervised models might produce more focal maps that fit the chosen stimuli better. Without the full methods on how features were extracted and how quality checks were applied, it is hard to rule out setup artifacts driving the supervised advantage.

The paper is for interpretability researchers and anyone evaluating vision models for settings where humans need to understand what the features mean. Readers who want behavioral data rather than just saliency maps will find the protocols and the scale of the study useful. It deserves peer review because the empirical comparison is new and the dataset is substantial, even though the methods section will need close scrutiny on image sampling and statistical details.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a framework to measure human interpretability of features in vision transformers using sparse autoencoders combined with two psychophysics protocols: localizability (predicting where a feature activates on novel images) and nameability (describing the feature). Based on 13,400 valid responses from 377 participants who passed pre-specified quality checks, it reports that foundation models (DINOv2, DINOv3, CLIP, SigLIP) are consistently less interpretable than supervised ViTs, that this gap does not reflect a capability tradeoff (no correlation with downstream benchmarks), and that interpretability instead correlates with activation locality and coarse-grained semantic alignment with humans.

Significance. If the protocols prove robust to image selection and participant criteria, the work would establish interpretability as a measurable, independent dimension of representation quality separate from capability. The large behavioral dataset provides solid empirical grounding for the model-type comparisons and identifies actionable predictors (locality, coarse alignment). This has direct relevance for high-stakes vision deployments where human-understandable features matter.

major comments (3)

[Methods (psychophysics protocols and participant screening)] The central claim that foundation models are less interpretable than supervised ViTs, and that the gap is not a capability tradeoff, depends on the psychophysics protocols capturing intrinsic feature properties rather than artifacts. The image selection process and the quality filters that retained 377 participants (from an initial pool yielding more than 15,000 responses) could systematically favor focal activations more common in supervised models; without a sensitivity analysis varying image distributions or screening criteria, the observed gap and lack of benchmark correlation may be setup-dependent.
[Results (correlation with downstream performance)] The assertion that interpretability does not correlate with downstream task performance is load-bearing for the no-tradeoff conclusion. The manuscript should report the exact benchmarks examined, the correlation coefficients (with confidence intervals), and any multiple-comparison corrections, as the null result could be sensitive to benchmark choice or statistical power.
[Results (protocol agreement)] The statement that the two protocols yield strongly correlated rankings and share the same predictors underpins the claim that interpretability is a coherent dimension. The specific Pearson or Spearman correlation value, sample size, and p-value for the protocol agreement should be provided explicitly.
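The requests above for explicit correlation values and attention to statistical power can be made concrete with a small sketch. It computes Spearman's ρ and an exact two-sided permutation p-value in pure Python. The six score pairs are hypothetical placeholders, not the paper's data, and the function names are illustrative.

```python
import math
from itertools import permutations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def exact_two_sided_p(x, y):
    """Exact permutation p-value: share of rank orderings with |rho| >= observed."""
    rx, ry = average_ranks(x), average_ranks(y)
    observed = abs(pearson(rx, ry))
    hits = total = 0
    for shuffled in permutations(ry):
        total += 1
        if abs(pearson(rx, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical scores for six models (illustrative values, not from the paper).
localizability = [0.62, 0.58, 0.41, 0.37, 0.33, 0.30]
nameability = [0.55, 0.60, 0.39, 0.35, 0.28, 0.31]

rho = spearman_rho(localizability, nameability)
p_value = exact_two_sided_p(localizability, nameability)
```

With six models there are only 720 possible rank orderings, so the exact test is cheap, and a rank correlation has to be close to ±0.9 before it reaches p < 0.05. That limited resolution is why the power concern raised for the null capability result carries weight.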

minor comments (2)

[Abstract] The abstract states 'more than 15,000 behavioral responses' but analyzes 13,400 from 377 participants; state the exact initial participant count and response total in the main text for transparency.
[Throughout] Ensure consistent terminology when referring to the four foundation models versus the two supervised ViTs across figures and tables.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review of our manuscript. We address each major comment below and commit to revisions that strengthen the presentation of our methods and results without altering the core findings.

Referee: [Methods (psychophysics protocols and participant screening)] The central claim that foundation models are less interpretable than supervised ViTs, and that the gap is not a capability tradeoff, depends on the psychophysics protocols capturing intrinsic feature properties rather than artifacts. The image selection process and the quality filters that retained 377 participants (from an initial pool yielding more than 15,000 responses) could systematically favor focal activations more common in supervised models; without a sensitivity analysis varying image distributions or screening criteria, the observed gap and lack of benchmark correlation may be setup-dependent.

Authors: We agree that robustness to these design choices is important to establish. In the revised manuscript we will add a dedicated sensitivity analysis section that re-runs the key comparisons under alternative image sampling distributions and under relaxed or stricter participant quality thresholds. This will directly test whether the interpretability gap and its predictors remain stable. revision: yes
Referee: [Results (correlation with downstream performance)] The assertion that interpretability does not correlate with downstream task performance is load-bearing for the no-tradeoff conclusion. The manuscript should report the exact benchmarks examined, the correlation coefficients (with confidence intervals), and any multiple-comparison corrections, as the null result could be sensitive to benchmark choice or statistical power.

Authors: We will expand the relevant results section to list every benchmark examined, report the Pearson correlation coefficients together with 95% confidence intervals, and state the multiple-comparison correction applied. These additions will make the statistical basis for the null result fully transparent. revision: yes
Referee: [Results (protocol agreement)] The statement that the two protocols yield strongly correlated rankings and share the same predictors underpins the claim that interpretability is a coherent dimension. The specific Pearson or Spearman correlation value, sample size, and p-value for the protocol agreement should be provided explicitly.

Authors: We will insert the requested quantitative details into the results section, reporting the Spearman rank correlation between the two protocol scores, the number of features on which it is computed, and the associated p-value. This will give explicit support to the claim that the protocols measure a coherent dimension. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical psychophysics measurements are self-contained

The paper derives its claims from direct human behavioral data collected via two psychophysics protocols (localizability and nameability) applied to features recovered by sparse autoencoders from six vision transformers. Over 15,000 responses were gathered and filtered to 13,400 from 377 participants using pre-specified quality checks, with rankings and correlations computed from these participant judgments rather than from any fitted parameters, self-definitions, or load-bearing self-citations. The reported gap in interpretability between foundation models and supervised ViTs, along with the lack of correlation to downstream benchmarks and the role of locality and coarse-grained alignment, emerges from the external human responses and does not reduce to the inputs by construction. This is a standard empirical study whose central results are falsifiable against new participant cohorts or image sets and therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that sparse autoencoders recover human-meaningful features and that the chosen psychophysics tasks validly capture interpretability; no new physical entities or ad-hoc constants are introduced.

axioms (1)

domain assumption: Human observers can reliably perform localization and naming tasks on model features when quality controls are applied.
Invoked to justify the behavioral data as a measure of interpretability.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: "We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability... and (2) nameability..."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Locality of the representation... Hoyer metric... correlates strongly with localizability (ρ=0.91) and nameability (ρ=0.99)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 5 internal anchors

[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representation...
[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.
[3] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 11975–11986, 2023.
[4] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[5] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
[6] Christopher Chiu, Maximilian Heil, Teresa Kim, and Anthony Miyaguchi. Fine-grained classification for poisonous fungi identification with transfer learning. arXiv preprint arXiv:2407.07492, 2024.
[7] Liu Shilong, Zeng Zhaoyang, Ren Tianhe, Li Feng, Zhang Hao, Yang Jie, Jiang Qing, Li Chunyuan, Yang Jianwei, Su Hang, Zhu Jun, and Lei Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2024.
[8] Niels G Faber, Seyed Sahand Mohammadi Ziabari, and Fatemeh Karimi Nejadasl. Leveraging foundation models via knowledge distillation in multi-object tracking: Distilling DINOv2 features to FairMOT. arXiv preprint arXiv:2407.18288, 2024.
[9] Narek Tumanyan, Assaf Singer, Shai Bagon, and Tali Dekel. DINO-Tracker: Taming DINO for self-supervised point tracking in a single video. In European Conference on Computer Vision, pages 367–385. Springer, 2024.
[10] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025.
[11] Mohammed Baharoon, Waseem Qureshi, Jiahong Ouyang, Yanwu Xu, Abdulrhman Aljouie, and Wei Peng. Evaluating general purpose vision foundation models for medical image analysis: An experimental study of DINOv2 on radiology benchmarks. arXiv preprint arXiv:2312.02366, 2023.
[12] Théo Moutakanni, Piotr Bojanowski, Guillaume Chassagnon, Céline Hudelot, Armand Joulin, Yann LeCun, Matthew Muckley, Maxime Oquab, Marie-Pierre Revel, and Maria Vakalopoulou. Advancing human-centric AI for robust X-ray analysis through holistic self-supervised learning. arXiv preprint arXiv:2405.01469, 2024.
[13] Licheng Jiao, Jiayao Hao, Ruiyang Li, Lingling Li, Xu Liu, Fang Liu, Wenping Ma, Puhua Chen, Zhongjian Huang, Jingyi Yang, Jiaxuan Zhao, and Qigong Sun. Foundation models meet medical image interpretation. Research, 9:1024, 2026. doi: 10.34133/research.1024. URL https://spj.science.org/doi/abs/10.34133/research.1024
[14] Tugba Akinci D’Antonoli, Christian Bluethgen, Renato Cuocolo, Michail E Klontzas, Andrea Ponsiglione, and Burak Kocak. Foundation models for radiology: fundamentals, applications, opportunities, challenges, risks, and prospects. Diagnostic and Interventional Radiology, 2025.
[15] Haoxiang Gao, Zhongruo Wang, Yaqian Li, Kaiwen Long, Ming Yang, and Yiqing Shen. A survey for foundation models in autonomous driving. In 2025 6th International Conference on Computer Vision and Data Mining (ICCVDM), pages 63–71. IEEE, 2025.
[16] Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, et al. A survey on vision-language-action models for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4524–4536, 2025.
[17] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
[18] Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.
[19] Roland S. Zimmermann, Thomas Klein, and Wieland Brendel. Scale alone does not improve mechanistic interpretability in vision models. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.
[20] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022.
[21] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...
[22] Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba E. Ba, and Talia Konkle. Archetypal SAE: Adaptive and stable dictionary learning for concept extraction in large vision models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proc...
[23] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6541–6549, 2017.
[24] Judy Borowski, Roland S. Zimmermann, Judith Schepers, Robert Geirhos, Thomas S. A. Wallis, Matthias Bethge, and Wieland Brendel. Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[25] Roland S. Zimmermann, David Klindt, and Wieland Brendel. Measuring per-unit interpretability at scale without humans. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 48448–48483. Curran Associates, Inc., 2024. doi: 10.52202/079017-1535
[26] Julien Colin, Lore Goetschalckx, Thomas Fel, Victor Boutin, Thomas Serre, and Nuria Oliver. Choosing the right basis for interpretability: Psychophysical comparison between neuron-based and dictionary-based representations. arXiv preprint arXiv:2411.03993, 2024.
[27] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
[28] Thomas Fel, Victor Boutin, Louis Béthune, Rémi Cadène, Mazda Moayeri, Léo Andéol, Mathieu Chalvidal, and Thomas Serre. A holistic approach to unifying automatic concept extraction and concept importance estimation. Advances in Neural Information Processing Systems, 36:54805–54818, 2023.
[29] Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah, and Neel Nanda. Negative results for SAEs on downstream tasks and deprioritising SAE research (GDM mech interp team progress update #2). AI Alignment Forum, 2025.
[30] Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Peng Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, and Max Tegmark. Dense SAE latents are features, not bugs. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025.
[31] Thomas Fel, Thibaut Boissin, Victor Boutin, Agustin Picard, Paul Novello, Julien Colin, Drew Linsley, Tom Rousseau, Rémi Cadène, Lore Goetschalckx, Laurent Gardes, and Thomas Serre. Unlocking Feature Visualization for Deeper Networks with Magnitude Constrained Optimization. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 37...
[32] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized Input Sampling for Explanation of Black-box Models. In Proceedings of the British Machine Vision Conference (BMVC), 2018.
[33] Maximilian Dreyer, Erblina Purelku, Johanna Vielhaben, Wojciech Samek, and Sebastian Lapuschkin. Pure: Turning polysemantic neurons into pure features by identifying relevant circuits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8212–8217, 2024.
[34] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.
[35] Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research (JMLR), 5(Nov):1457–1469, 2004.
[36] Ilia Sucholutsky and Tom Griffiths. Alignment with human representations supports robust few-shot learning. Advances in Neural Information Processing Systems (NeurIPS), 36:73464–73479, 2023.
[37] Lukas Muttenthaler, Lorenz Linhardt, Jonas Dippel, Robert A Vandermeulen, Katherine Hermann, Andrew Lampinen, and Simon Kornblith. Improving neural network representations using human similarity judgments. Advances in Neural Information Processing Systems (NeurIPS), 36:50978–51007, 2023.
[38]

When does perceptual alignment benefit vision representations?Advances in Neural Information Processing Systems (NeurIPS), 37:55314– 55341, 2024

Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, and Phillip Isola. When does perceptual alignment benefit vision representations?Advances in Neural Information Processing Systems (NeurIPS), 37:55314– 55341, 2024

[39] Lukas Muttenthaler, Klaus Greff, Frieda Born, Bernhard Spitzer, Simon Kornblith, Michael C Mozer, Klaus-Robert Müller, Thomas Unterthiner, and Andrew K Lampinen. Aligning machine and human visual representations across abstraction levels. Nature, 647(8089):349–355, 2025. doi: 10.1038/s41586-025-09631-6

[40] Jannis Ahlert, Thomas Klein, Felix A. Wichmann, and Robert Geirhos. How aligned are different alignment metrics? In ICLR 2024 Workshop on Representational Alignment (Re-Align), 2024. URL https://openreview.net/forum?id=cHlKB28bjV

[41] Drew Linsley, Dan Shiebler, Sven Eberhardt, and Thomas Serre. Learning What and Where to Attend. In Proceedings of the International Conference on Learning Representations (ICLR), 2019

[42] Thomas Fel, Ivan F Rodriguez Rodriguez, Drew Linsley, and Thomas Serre. Harmonizing the object recognition strategies of deep neural networks with humans. Advances in Neural Information Processing Systems (NeurIPS), 35:9432–9446, 2022

[43] Martin N. Hebart, Charles Y. Zheng, Francisco Pereira, and Chris I. Baker. Revealing the multi-dimensional mental representations of natural objects underlying human similarity judgements. Nature Human Behaviour, 4(11):1173–1185, 2020

[44] Martin N. Hebart, Oliver Contier, Lina Teichmann, Adam H. Rockter, Charles Y. Zheng, Alexis Kidder, Anna Corriveau, Maryam Vaziri-Pashkam, and Chris I. Baker. THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife, 12:e82580, 2023. doi: 10.7554/eLife.82580

[45] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344, 2023

[46] Fabian Gröger, Shuo Wen, and Maria Brbić. Revisiting the platonic representation hypothesis: An aristotelian view. arXiv preprint arXiv:2602.14486, 2026

[47] Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Iris Groen, Jascha Achterberg, et al. Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018, 2023

[48] Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In Proceedings of the International Conference on Learning Representations (ICLR), 2022

[49] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI Blog, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

[50] Tuomas Oikarinen and Tsui-Wei Weng. CLIP-Dissect: Automatic description of neuron representations in deep vision networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

[51] Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul MB Vitányi. The similarity metric. IEEE transactions on Information Theory, 50(12):3250–3264, 2004

A Limitations and broader impact

Limitations. We acknowledge several limitations. First, our model-level analyses span only six architectures, which limits statistical power and makes it difficult to disentangle...

