Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

Mengmeng Ma; Tang Li; Xi Peng; Yanlin Chen

arXiv: 2606.06664 · v1 · pith:SQY5E4SUnew · submitted 2026-06-04 · 💻 cs.CV · cs.AI · cs.LG

Pith reviewed 2026-06-28 01:59 UTC · model grok-4.3

keywords vision transformers · mechanistic interpretability · concept circuits · sparse autoencoders · spurious correlations · model editing · WaterBirds · concept probing

The pith

ViSAE recovers concept circuits inside Vision Transformers to interpret their decisions and edit away spurious cues, raising worst-group accuracy on WaterBirds by 48.2%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViSAE as a toolbox that maps Vision Transformer activations onto a fixed vocabulary of 16,000 visually grounded concepts using a 64,000-image probe. Top-down reading and bottom-up tracing algorithms then assemble these concepts into circuits that show how features combine layer by layer. The circuits support auditing for unwanted shortcuts and direct editing of the model to suppress them. On the WaterBirds benchmark the editing step lifts worst-group accuracy by 48.2 percent and beats prior methods by 23.8 percent. The work therefore supplies both an interpretation method and a practical intervention for bias reduction in ViTs.

Core claim

ViSAE supplies a 64K-image probing suite and 16K-concept vocabulary that together decompose ViT representations, then uses top-down concept reading and bottom-up circuit tracing to recover the internal pathways that link concepts; targeted editing of these circuits removes spurious correlations and produces a 48.2 percent gain in worst-group accuracy on WaterBirds while outperforming existing editing techniques by 23.8 percent.
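The headline number is stated in terms of worst-group accuracy, which is the minimum accuracy over subgroups rather than the average over all samples. On WaterBirds the groups are conventionally the four (bird type, background) combinations. The following is a minimal sketch of the metric only; the function name and the toy labels are illustrative assumptions and are not taken from the paper or its code.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (worst per-group accuracy, dict of per-group accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Toy, made-up data: label 0 = landbird, 1 = waterbird;
# background 0 = land, 1 = water. A group is (label, background).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
backgrounds = [0, 0, 1, 1, 0, 0, 1, 1]
preds = [0, 0, 1, 0, 0, 0, 1, 1]
groups = list(zip(labels, backgrounds))

worst, per_group = worst_group_accuracy(preds, labels, groups)
average = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
```

In this toy case the average accuracy looks moderate while the waterbird-on-land group is always wrong, which is the kind of gap the metric is meant to expose.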

What carries the argument

Concept circuits recovered by top-down reading and bottom-up tracing algorithms that operate over activations aligned to a 16K-concept vocabulary.

If this is right

Concept editing can suppress spurious correlations without retraining the full model.
The probing suite provides 20 times higher concept coverage efficiency than ImageNet-based sets.
Automated circuit tracing scales interpretation beyond manual feature labeling.
The same circuits support both auditing and steering of model behavior.
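The first point, editing without retraining, can be illustrated with a generic sparse-autoencoder ablation: encode an activation, zero the latents tagged as spurious, and add the resulting change back to the original activation. This is a hedged sketch of that general idea, not ViSAE's editing procedure; the weights are random, the sizes are toy values, and the names `encode`, `decode`, `ablate_concepts`, and the `spurious` indices are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 16, 128  # toy sizes (assumption), not the paper's

# Hypothetical SAE parameters. Random here; a real SAE would be trained.
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(n_latents)
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)


def encode(x):
    """ReLU encoder: activation -> non-negative concept latents."""
    return np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)


def decode(z):
    """Linear decoder: concept latents -> reconstructed activation."""
    return z @ W_dec + b_dec


def ablate_concepts(x, concept_ids):
    """Remove the chosen latents' decoded contribution from x.

    Adding the difference of the two reconstructions (instead of replacing
    x with the edited reconstruction) keeps the part of x the SAE does not
    explain untouched.
    """
    z = encode(x)
    z_edit = z.copy()
    z_edit[..., concept_ids] = 0.0
    return x + decode(z_edit) - decode(z)


x = rng.normal(size=(4, d_model))  # stand-in for ViT token activations
spurious = [3, 7]                  # hypothetical latents tagged as background cues
x_edit = ablate_concepts(x, spurious)
```

In a real pipeline the edited activation would be written back into the forward pass at the layer the SAE was trained on; that hook is model-specific and is left out here.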

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same probing and tracing pipeline could be applied to other transformer families to check whether similar circuits appear.
If the recovered circuits generalize across datasets, they could guide data collection that reduces spurious cues before training.
Failures of the editing step on new tasks would indicate which concepts the current vocabulary still misses.
The neuroscience motivation could be tested by comparing circuit structure against known visual cortex pathways in biological data.

Load-bearing premise

The 16K-concept vocabulary and 64K-image suite are assumed to capture the actual internal features the Vision Transformer uses.

What would settle it

A controlled experiment that applies the reported concept edits to a held-out ViT and measures no improvement in worst-group accuracy on WaterBirds, or that shows a different concept vocabulary yields equal or larger gains.

Figures

Figures reproduced from arXiv: 2606.06664 by Mengmeng Ma, Tang Li, Xi Peng, Yanlin Chen.

**Figure 1.** Existing interpretable machine learning methods (IML) mainly identify where the evidence is, while our concept circuits reveal how concepts interact across layers to support a prediction.

**Figure 2.** Overview of our ViSAE toolbox for interpreting ViT inner workings. Left: Motivated by the human visual cortex hierarchy, we construct a probing suite (64K images + 16K concepts) for SAE training and interpretation. Middle: Our top-down concept reading and bottom-up concept circuit tracing algorithms. Right: Our mechanistic view of ViT inner workings enables various downstream applications, such as concept …

**Figure 3.** Our design mirrors the hierarchy of human visual cortex.

**Figure 4.** Benchmark results for expansion factor (8×). As shown, BatchTopK-SAE strikes a better trade-off across all metrics. Therefore, in subsequent experiments we use the BatchTopK-SAE with expansion factor = 8× and L0 sparsity = 128. Full tables in Appendix G.

**Figure 5.** Visualization of concept circuits. For an input image containing both a dog and a cat, our method traces the unique causal pathways leading to each prediction. The circuit for “dog” composes primitive and intermediate concepts (e.g., “orange” and “marble”) into high-level semantics (e.g., “bark” and “dog”). In contrast, the circuit for “cat” relies on a different set of concepts (e.g., “dotted fur”). Our m…

**Figure 6.** Localize concepts in the pixel space. Notably, our method can even localize highly abstract semantics, such as “looking at”, by highlighting both the subject (i.e., the person) and the object involved (i.e., the paper).

**Figure 7.** Failure mode analysis. We identify seven failure modes of CLIP on the ImageNet-val set. For example, CLIP tends to misclassify parallel bar images as high bars when the gymnast is “flying” or oriented “downward”.

**Figure 8.** Full benchmark results for CLS tokens.

**Figure 9.** Full benchmark results for image tokens.
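The Figure 4 caption names a BatchTopK sparse autoencoder with expansion factor 8× and L0 sparsity 128. As a rough illustration of what those two settings mean, the sketch below applies the BatchTopK rule as it is usually described for training: keep the k × batch_size largest activations across the whole batch, so the average (not per-sample) number of active latents is k. The weights are random, `d_model = 64` is a toy width chosen for the example, and `batch_topk` is a hypothetical helper, so this is not the authors' implementation.

```python
import numpy as np


def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations over the whole batch."""
    acts = np.maximum(pre_acts, 0.0)  # latents are non-negative
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Indices of the n_keep largest entries (order among them is irrelevant).
    keep = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)


rng = np.random.default_rng(0)
d_model = 64                      # toy width (assumption); real ViT layers are wider
expansion, k = 8, 128             # values quoted in the Figure 4 caption
n_latents = expansion * d_model   # dictionary size = expansion factor x width

# Random stand-ins for trained encoder/decoder weights.
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(n_latents)

x = rng.normal(size=(32, d_model))  # stand-in for a batch of token activations
z = batch_topk(x @ W_enc, k)        # sparse concept codes
x_hat = z @ W_dec                   # reconstruction
avg_l0 = np.count_nonzero(z) / x.shape[0]
```

Because the budget is shared across the batch, individual samples may use more or fewer than k latents; only the batch average is pinned to k.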

Original abstract

Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: https://github.com/deep-real/ViSAE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ViSAE scales up concept vocabularies and tracing for ViTs and reports large steering gains on WaterBirds, but the causal mapping from concepts to model features is not yet shown.

The main takeaway is that this paper gives a concrete toolkit for finding and editing concepts inside Vision Transformers. It uses a 16K grounded vocabulary probed on 64K images, adds top-down reading and bottom-up tracing to recover circuits, and then demonstrates steering that lifts worst-group accuracy on WaterBirds by 48.2 percent, beating prior methods by 23.8 percent.

What is actually new is the scale of the vocabulary and the dual-direction tracing applied specifically to ViT bias problems. Earlier SAE work on vision models existed, but the combination of this coverage level, the automatic circuit recovery, and the measured editing result on a standard spurious-cue benchmark is fresh. The public code and data are also a plus.

The paper does the application part cleanly. It moves from interpretation to a usable intervention and reports a sizable practical effect. That is the part worth paying attention to if the numbers hold.

The soft spot is the missing link between the recovered concepts and the model's actual decision process. The abstract gives the accuracy and coverage numbers but no error bars, no ablation on the tracing steps, and no direct test that the edited concepts are the ones the ViT relies on for the WaterBirds boundary. If the 16K set mostly matches human labels rather than the model's internal features, the steering gains could come from a different mechanism. That concern from the stress-test note still stands on the given text.

This is for people working on mechanistic interpretability in vision who need auditing or editing tools. A reader who wants to try concept-level control on ViTs would get usable methods and a benchmark result to check. It deserves a serious referee because the claims are specific, the code is released, and the empirical target is clear even if revisions on validation are needed.

Referee Report

2 major / 1 minor

Summary. The paper introduces ViSAE, a neuroscience-motivated toolbox for mechanistic interpretability of Vision Transformers. It comprises (1) a 64K-image probing suite paired with a 16K visually grounded concept vocabulary claimed to improve coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over prior sets, (2) top-down concept reading and bottom-up circuit tracing algorithms to recover concept circuits, and (3) applications to auditing and steering ViT behavior. The central empirical result is that concept editing via ViSAE raises worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%.

Significance. If the recovered concepts and circuits are verifiably aligned with the model's internal representations of spurious cues, the framework could offer a scalable route to auditing and controlling ViT robustness. The scale of the probing suite and the reported accuracy gains on an external benchmark constitute concrete strengths, but the absence of direct validation that the 16K concepts match the features actually used by the target ViT leaves the causal link between circuit tracing and the observed editing gains unestablished.

major comments (2)

[Abstract] The central claim that concept editing improves worst-group accuracy on WaterBirds by 48.2% (outperforming baselines by 23.8%) is presented without error bars, baseline specifications, ablation controls, or statistical tests. This information is load-bearing for the steering application and cannot be evaluated from the given text.
[Abstract] The 16K-concept vocabulary is asserted to enable faithful recovery of ViT inner workings, yet no direct test (e.g., intervention on held-out spurious features or comparison against model-derived feature attributions) is described to confirm that the selected concepts correspond to the representations the ViT actually uses for the WaterBirds decision boundary. This mapping is required for the editing procedure to produce genuine causal control rather than an artifact of the intervention method.

minor comments (1)

[Abstract] The abstract states a 20x coverage improvement and 28.7% accuracy gain but does not specify the exact metrics or the prior concept sets used for comparison; adding these details would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for clearer validation of concept alignment. We address each major comment below. The full experimental details (including error bars, baselines, ablations, and statistical tests) appear in the results section of the manuscript; we will revise the abstract for better self-containment while preserving its brevity. On the second point, we will add explicit discussion of the indirect nature of our validation and acknowledge the value of more direct tests.

Referee: [Abstract] The central claim that concept editing improves worst-group accuracy on WaterBirds by 48.2% (outperforming baselines by 23.8%) is presented without error bars, baseline specifications, ablation controls, or statistical tests. This information is load-bearing for the steering application and cannot be evaluated from the given text.

Authors: We agree the abstract is too terse on this load-bearing claim. The main text (Experiments section) reports the 48.2% improvement with standard deviations over 5 random seeds, specifies all baselines (including the 23.8% margin over the strongest prior method), includes ablation controls on circuit components, and reports p-values from paired t-tests. We will revise the abstract to include a parenthetical note on statistical significance and the use of multiple runs, while directing readers to the full results for complete specifications.

Revision: yes

Referee: [Abstract] The 16K-concept vocabulary is asserted to enable faithful recovery of ViT inner workings, yet no direct test (e.g., intervention on held-out spurious features or comparison against model-derived feature attributions) is described to confirm that the selected concepts correspond to the representations the ViT actually uses for the WaterBirds decision boundary. This mapping is required for the editing procedure to produce genuine causal control rather than an artifact of the intervention method.

Authors: The manuscript provides indirect but quantitative evidence via the large, consistent gains on the external WaterBirds benchmark after targeted editing; spurious-cue interventions that do not align with model representations would not be expected to produce a 48.2% worst-group lift. Nevertheless, we acknowledge that direct tests (held-out feature interventions or attribution comparisons) are absent. We will expand the discussion section to explicitly note this limitation and outline how such tests could be performed in follow-up work.

Revision: partial
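The first response refers to standard deviations over 5 random seeds and paired t-tests. For readers who want to check such a claim once per-seed numbers are available, the sketch below computes the paired t statistic from two matched lists of scores. The numbers are placeholders invented for illustration and are not results from the paper; `paired_t_statistic` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder per-seed worst-group accuracies. NOT taken from the paper.
edited = [0.80, 0.82, 0.79, 0.81, 0.83]
baseline = [0.60, 0.61, 0.58, 0.62, 0.59]

t, dof = paired_t_statistic(edited, baseline)
edited_mean, edited_std = mean(edited), stdev(edited)
```

With 5 seeds there are 4 degrees of freedom, so the two-sided 5% critical value is about 2.776; a p-value would come from the t distribution with `dof` degrees of freedom.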

Circularity Check

0 steps flagged

No circularity; results on external benchmark with independent validation

The paper reports empirical gains on the external WaterBirds benchmark via concept editing, with the 16K-concept vocabulary and 64K-image suite constructed and validated separately through coverage and interpretation accuracy metrics. No equations or claims reduce a prediction to a fitted input by construction, and no load-bearing step relies on self-citation chains or self-definitional mappings. The derivation chain from probing to circuit tracing to editing is presented as a sequence of independent algorithmic steps evaluated against baselines and held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the unverified assumption that the neuroscience-motivated vocabulary faithfully aligns with ViT internals and that the reported accuracy lifts are not artifacts of dataset construction or metric choice.

axioms (1)

Domain assumption: Neuroscience-inspired principles translate directly into effective concept decomposition and circuit tracing for Vision Transformers.
Stated as motivation in the abstract without further justification or validation.


Reference graph

Works this paper leans on

170 extracted references · 1 canonical work pages

[1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[3]

M. J. Kearns , title =
[4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[6]

Suppressed for Anonymity , author=
[7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959
[10]

Sanity checks for saliency maps

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. Advances in neural information processing systems, 31, 2018

2018
[12]

Barack, D. L. and Krakauer, J. W. Two views on the cognitive brain. Nature Reviews Neuroscience, 22 0 (6): 0 359--371, 2021

2021
[13]

Network dissection: Quantifying interpretability of deep visual representations

Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6541--6549, 2017

2017
[14]

and Gavves, S

Bereska, L. and Gavves, S. Mechanistic interpretability for ai safety-a review. Transactions on Machine Learning Research, 2024

2024
[15]

Interpreting clip with sparse linear concept embeddings (splice)

Bhalla, U., Oesterling, A., Srinivas, S., Calmon, F., and Lakkaraju, H. Interpreting clip with sparse linear concept embeddings (splice). Advances in Neural Information Processing Systems, 37: 0 84298--84328, 2024

2024
[16]

Language models can explain neurons in language models

Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023

2023
[17]

E., Hume, T., Carter, S., Henighan, T., and Olah, C

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing languag...

2023
[18]

Batchtopk sparse autoencoders

Bussmann, B., Leask, P., and Nanda, N. Batchtopk sparse autoencoders. In NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024

2024
[19]

Learning multi-level features with matryoshka sparse autoencoders

Bussmann, B., Nabeshima, N., Karvonen, A., and Nanda, N. Learning multi-level features with matryoshka sparse autoencoders. In International Conference on Machine Learning, pp.\ 6077--6101. PMLR, 2025

2025
[20]

B., Mante, V., Tolhurst, D

Carandini, M., Demb, J. B., Mante, V., Tolhurst, D. J., Dan, Y., Olshausen, B. A., Gallant, J. L., and Rust, N. C. Do we know what the early visual system does? Journal of Neuroscience, 25 0 (46): 0 10577--10597, 2005

2005
[22]

Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers

Chefer, H., Gur, S., and Wolf, L. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 397--406, 2021

2021
[23]

Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., and Su, J. K. This looks like that: deep learning for interpretable image recognition. Advances in neural information processing systems, 32, 2019

2019
[24]

Describing textures in the wild

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., , and Vedaldi, A. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition ( CVPR ) , 2014

2014
[25]

Towards automated circuit discovery for mechanistic interpretability

Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36: 0 16318--16352, 2023

2023
[26]

Imagenet: A large-scale hierarchical image database

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255. Ieee, 2009

2009
[27]

J., Zoccolan, D., and Rust, N

DiCarlo, J. J., Zoccolan, D., and Rust, N. C. How does the brain solve visual object recognition? Neuron, 73 0 (3): 0 415--434, 2012

2012
[28]

An image is worth 16x16 words: Transformers for image recognition at scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020

2020
[29]

The P ile: An 800gb dataset of diverse text for language modeling

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The P ile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

Pith/arXiv arXiv 2020
[30]

and Zou, J

Ghorbani, A. and Zou, J. Y. Neuron shapley: Discovering the responsible neurons. Advances in neural information processing systems, 33: 0 5922--5932, 2020

2020
[31]

Goodale, M. A. and Milner, A. D. Separate visual pathways for perception and action. Trends in neurosciences, 15 0 (1): 0 20--25, 1992

1992
[32]

o m, A., Weber, L., Krakowczyk, D., Bareeva, D., Motzkus, F., Samek, W., Lapuschkin, S., and H \

Hedstr \"o m, A., Weber, L., Krakowczyk, D., Bareeva, D., Motzkus, F., Samek, W., Lapuschkin, S., and H \"o hne, M. M.-C. Quantus: An explainable ai toolkit for responsible evaluation of neural network explanations and beyond. Journal of Machine Learning Research, 24 0 (34): 0 1--11, 2023

2023
[33]

R., Ewart, A., and Sharkey, L

Huben, R., Cunningham, H., Smith, L. R., Ewart, A., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2023

2023
[34]

Joseph, S., Suresh, P., Goldfarb, E., Hufe, L., Gandelsman, Y., Graham, R., Bzdok, D., Samek, W., and Richards, B. A. Steering clip's vision transformer with sparse autoencoders. In Mechanistic Interpretability for Vision at CVPR 2025 (Non-proceedings Track), 2025

2025
[35]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav)

Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pp.\ 2668--2677. PMLR, 2018

2018
[36]

u tt, K. T., D \

Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Sch \"u tt, K. T., D \"a hne, S., Erhan, D., and Kim, B. The (un) reliability of saliency methods. Explainable AI: Interpreting, explaining and visualizing deep learning, pp.\ 267--280, 2019

2019
[37]

W., Nguyen, T., Tang, Y

Koh, P. W., Nguyen, T., Tang, Y. S., Mussmann, S., Pierson, E., Kim, B., and Liang, P. Concept bottleneck models. In International conference on machine learning, pp.\ 5338--5348. PMLR, 2020

2020
[38]

A., et al

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123: 0 32--73, 2017

2017
[39]

Are data-driven explanations robust against out-of-distribution data? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3821--3831, 2023

Li, T., Qiao, F., Ma, M., and Peng, X. Are data-driven explanations robust against out-of-distribution data? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3821--3831, 2023

2023
[40]

Beyond accuracy: ensuring correct predictions with correct rationales

Li, T., Ma, M., and Peng, X. Beyond accuracy: ensuring correct predictions with correct rationales. Advances in Neural Information Processing Systems, 37: 0 43164--43188, 2024 a

2024
[41]

Deal: Disentangle and localize concept-level explanations for vlms

Li, T., Ma, M., and Peng, X. Deal: Disentangle and localize concept-level explanations for vlms. In European Conference on Computer Vision, pp.\ 383--401. Springer, 2024 b

2024
[42]

Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2

Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kram \'a r, J., Dragan, A., Shah, R., and Nanda, N. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp.\ 278--300, 2024

2024
[43]

Sparse autoencoders reveal selective remapping of visual concepts during adaptation

Lim, H., Choi, J., Choo, J., and Schneider, S. Sparse autoencoders reveal selective remapping of visual concepts during adaptation. In International Conference on Learning Representations, volume 2025, pp.\ 24444--24469, 2025

2025
[44]

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll \'a r, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer vision--ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pp.\ 740--755. Springer, 2014

2014
[45]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pp.\ 38--55. Springer, 2024

2024
[46]

Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017

2017
[47]

why is there a tumor?

Ma, M., Li, T., Peng, Y., Lin, L., Beylergil, V., Zhao, B., Akin, O., and Peng, X. “why is there a tumor?”: Tell me the reason, show me the evidence. Proceedings of machine learning research, 267: 0 41992, 2025

2025
[50]

Locating and editing factual associations in gpt

Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35: 0 17359--17372, 2022

2022
[51]

Progress measures for grokking via mechanistic interpretability

Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023

2023
[52]

Ng, A. et al. Sparse autoencoder. CS294A Lecture notes, 72 0 (2011): 0 1--19, 2011

2011
[53]

X., Li, T., and Peng, X

Nguyen, K. X., Li, T., and Peng, X. Interpretable failure detection with human-level concepts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp.\ 26326--26334, 2025

2025
[54]

and Weng, T.-W

Oikarinen, T. and Weng, T.-W. Clip-dissect: Automatic description of neuron representations in deep vision networks. In ICLR 2022 Workshop on PAIR \ textasciicircum \ 2Struct: Privacy, Accountability, Interpretability, Robustness, Reasoning on Structured Data , 2022

2022
[55]

Zoom in: An introduction to circuits

Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 2020. doi:10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in

work page doi:10.23915/distill.00024.001 2020
[56]

Olshausen, B. A. and Field, D. J. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37 0 (23): 0 3311--3325, 1997

1997
[57]

Gpt-5 system card

OpenAI. Gpt-5 system card. https://cdn.openai.com/gpt-5-system-card.pdf, August 2025

2025
[58]

Sparse autoencoders learn monosemantic features in vision-language models

Pach, M., Karthik, S., Bouniot, Q., Belongie, S., and Akata, Z. Sparse autoencoders learn monosemantic features in vision-language models. Advances in Neural Information Processing Systems, 38: 0 95706--95742, 2026

2026
[59]

Direct and indirect effects

Pearl, J. Direct and indirect effects. In Probabilistic and causal inference: the works of Judea Pearl, pp.\ 373--392. 2001

2001
[60]

Inside-out: Measuring generalization in vision transformers through inner workings

Peng, Y., Ma, M., Yao, Z., and Peng, X. Inside-out: Measuring generalization in vision transformers through inner workings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

2026
[61]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PmLR, 2021

2021
[64]

Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery

Rao, S., Mahajan, S., B \"o hle, M., and Schiele, B. Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery. In European Conference on Computer Vision, pp.\ 444--461. Springer, 2024

2024
[65]

why should i trust you?

Ribeiro, M. T., Singh, S., and Guestrin, C. " why should i trust you?" explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp.\ 1135--1144, 2016

2016
[66]

S., Hughes, M

Ross, A. S., Hughes, M. C., and Doshi-Velez, F. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp.\ 2662--2670. International Joint Conferences on Artificial Intelligence Organization, 2017

2017
[67]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence, 1 0 (5): 0 206--215, 2019

2019
[68]

W., Hashimoto, T

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks. In International Conference on Learning Representations, 2019

[69] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.

[70] Shah, H., Ilyas, A., and Madry, A. Decomposing and editing predictions by modeling model computation. In Proceedings of the 41st International Conference on Machine Learning, pp. 44244–44292, 2024.

[73] Thasarathan, H., Forsyth, J., Fel, T., Kowal, M., and Derpanis, K. G. Universal sparse autoencoders: Interpretable cross-model concept alignment. In Forty-second International Conference on Machine Learning, 2025.

[74] Wang, Q., Li, T., Nguyen, K. X., and Peng, X. Beyond accuracy: On the effects of fine-tuning towards vision-language model's prediction rationality. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 21225–21233, 2025.

[75] Yang, Y., Panagopoulou, A., Zhou, S., Jin, D., Callison-Burch, C., and Yatskar, M. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19187–19197, 2023.

[76] Yuksekgonul, M., Wang, M., and Zou, J. Post-hoc concept bottleneck models. In The Eleventh International Conference on Learning Representations, 2022.

[77] Zaigrajew, V., Baniecki, H., and Biecek, P. Interpreting CLIP with hierarchical sparse autoencoders. In International Conference on Machine Learning, pp. 73918–73956. PMLR, 2025.

[78] Zhang, K., Shen, Y., Li, B., and Liu, Z. Large multi-modal models can interpret features in large multi-modal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3650–3661, 2025.

[79] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

