
arxiv: 2606.06664 · v1 · pith:SQY5E4SU · submitted 2026-06-04 · cs.CV · cs.AI · cs.LG

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

Pith reviewed 2026-06-28 01:59 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.LG
keywords vision transformers · mechanistic interpretability · concept circuits · sparse autoencoders · spurious correlations · model editing · WaterBirds · concept probing

The pith

ViSAE recovers concept circuits inside Vision Transformers to interpret their decisions and edit away spurious cues, raising worst-group accuracy on WaterBirds by 48.2%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViSAE as a toolbox that maps Vision Transformer activations onto a fixed vocabulary of 16,000 visually grounded concepts using a 64,000-image probe. Top-down reading and bottom-up tracing algorithms then assemble these concepts into circuits that show how features combine layer by layer. The circuits support auditing for unwanted shortcuts and direct editing of the model to suppress them. On the WaterBirds benchmark the editing step lifts worst-group accuracy by 48.2 percent and beats prior methods by 23.8 percent. The work therefore supplies both an interpretation method and a practical intervention for bias reduction in ViTs.

Core claim

ViSAE supplies a 64K-image probing suite and 16K-concept vocabulary that together decompose ViT representations, then uses top-down concept reading and bottom-up circuit tracing to recover the internal pathways that link concepts; targeted editing of these circuits removes spurious correlations and produces a 48.2 percent gain in worst-group accuracy on WaterBirds while outperforming existing editing techniques by 23.8 percent.
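The headline number is a worst-group accuracy, so it helps to see how that metric is usually computed. The sketch below assumes the standard WaterBirds grouping (bird type crossed with background) and takes the minimum accuracy over the groups present. The function name and the toy labels are made up for illustration and are not taken from the paper or its code.

```python
from collections import defaultdict

def worst_group_accuracy(labels, preds, groups):
    """Return (worst accuracy, per-group accuracy) over the groups present."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group

# Toy data, invented for illustration only: group = (bird type, background).
labels = ["water", "water", "land", "land", "water", "land"]
preds = ["water", "land", "land", "land", "water", "water"]
groups = [
    ("water", "water_bg"), ("water", "land_bg"),
    ("land", "land_bg"), ("land", "land_bg"),
    ("water", "water_bg"), ("land", "water_bg"),
]

worst, per_group = worst_group_accuracy(labels, preds, groups)
overall = sum(y == p for y, p in zip(labels, preds)) / len(labels)
```

In this toy case the classifier follows the background rather than the bird, so overall accuracy looks moderate while the two minority groups are always wrong. That gap between average and worst-group accuracy is what the metric is meant to expose.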

What carries the argument

Concept circuits recovered by top-down reading and bottom-up tracing algorithms that operate over activations aligned to a 16K-concept vocabulary.

If this is right

  • Concept editing can suppress spurious correlations without retraining the full model.
  • The probing suite provides 20 times higher concept coverage efficiency than ImageNet-based sets.
  • Automated circuit tracing scales interpretation beyond manual feature labeling.
  • The same circuits support both auditing and steering of model behavior.
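This page does not spell out how the paper performs concept editing. One common way to suppress a concept with a sparse autoencoder is latent ablation: encode an activation, zero the latents tied to the unwanted concept, decode, and add back the part the SAE could not reconstruct. The sketch below shows that generic technique under stated assumptions. The weights are random stand-ins for a trained SAE, and the names `encode`, `decode`, `ablate_concepts`, and the latent indices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 16, 128  # toy sizes: 128 = 8 x 16

# Hypothetical SAE parameters: random stand-ins, NOT trained weights.
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(n_latents)
b_dec = np.zeros(d_model)

def encode(x):
    """ReLU encoder: activations -> non-negative concept latents."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def decode(z):
    """Linear decoder: concept latents -> reconstructed activations."""
    return z @ W_dec + b_dec

def ablate_concepts(x, concept_ids):
    """Zero the chosen latents and keep the SAE reconstruction error."""
    idx = np.asarray(concept_ids, dtype=int)
    z = encode(x)
    residual = x - decode(z)      # part of x the SAE fails to explain
    z_edit = z.copy()
    z_edit[..., idx] = 0.0        # suppress the selected concepts
    return decode(z_edit) + residual

x = rng.normal(size=(4, d_model))  # stand-in for ViT token activations
spurious = [3, 17]                 # hypothetical "background" latents
x_edit = ablate_concepts(x, spurious)
```

Adding the residual back means the edit changes the activation only along the decoder directions of the ablated latents, so an empty edit returns the input unchanged. Whether ViSAE edits in this way is an assumption, not something this page confirms.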

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same probing and tracing pipeline could be applied to other transformer families to check whether similar circuits appear.
  • If the recovered circuits generalize across datasets, they could guide data collection that reduces spurious cues before training.
  • Failures of the editing step on new tasks would indicate which concepts the current vocabulary still misses.
  • The neuroscience motivation could be tested by comparing circuit structure against known visual cortex pathways in biological data.

Load-bearing premise

The 16K-concept vocabulary and 64K-image suite are assumed to capture the actual internal features the Vision Transformer uses.

What would settle it

A controlled experiment that applies the reported concept edits to a held-out ViT and measures no improvement in worst-group accuracy on WaterBirds, or that shows a different concept vocabulary yields equal or larger gains.

Figures

Figures reproduced from arXiv: 2606.06664 by Mengmeng Ma, Tang Li, Xi Peng, Yanlin Chen.

Figure 1: Existing interpretable machine learning methods (IML) mainly identify where the evidence is, while our concept circuits reveal how concepts interact across layers to support a prediction.
Figure 2: Overview of our ViSAE toolbox for interpreting ViT inner workings. Left: Motivated by the human visual cortex hierarchy, we construct a probing suite (64K images + 16K concepts) for SAE training and interpretation. Middle: Our top-down concept reading and bottom-up concept circuit tracing algorithms. Right: Our mechanistic view of ViT inner workings enables various downstream applications, such as concept …
Figure 3: Our design mirrors the hierarchy of human visual cortex.
Figure 4: Benchmark results for expansion factor (8×). As shown, BatchTopK-SAE strikes a better trade-off across all metrics. Therefore, in subsequent experiments we use the BatchTopK-SAE with expansion factor = 8× and L0 Sparsity = 128. Full tables in Appendix G.
Figure 5: Visualization of concept circuits. For an input image containing both a dog and a cat, our method traces the unique causal pathways leading to each prediction. The circuit for “dog” composes primitive and intermediate concepts (e.g., “orange” and “marble”) into high-level semantics (e.g., “bark” and “dog”). In contrast, the circuit for “cat” relies on a different set of concepts (e.g., “dotted fur”). Our m…
Figure 6: Localize concepts in the pixel space. Notably, our method can even localize highly abstract semantics, such as “looking at”, by highlighting both the subject (i.e., the person) and the object involved (i.e., the paper).
Figure 7: Failure mode analysis. We identify seven failure modes of CLIP on the ImageNet-val set. For example, CLIP tends to misclassify parallel bar images as high bars when the gymnast is “flying” or oriented “downward”.
Figure 8: Full benchmark results for cls tokens.
Figure 9: Full benchmark results for image tokens.
Original abstract

Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: https://github.com/deep-real/ViSAE.
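The Figure 4 caption says the experiments use a BatchTopK-SAE with an 8× expansion factor and L0 sparsity of 128. As a rough illustration of what that activation rule does, the sketch below keeps the `k * batch_size` largest non-negative pre-activations across the whole batch and zeroes the rest, so the average number of active latents per sample is at most `k` while individual samples may use more or fewer. The function name `batch_topk` and the toy sizes are assumptions made for this example; it is not the paper's training code.

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest ReLU activations across the batch.

    pre_acts: array of shape (batch_size, n_latents); assumes k >= 1.
    """
    acts = np.maximum(pre_acts, 0.0)
    n_keep = min(k * acts.shape[0], acts.size)
    flat = acts.ravel()
    # Indices of the n_keep largest entries (order among them is irrelevant).
    keep = np.argpartition(flat, flat.size - n_keep)[flat.size - n_keep:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)

# Tiny hand-checkable case: with k = 1 and two samples, two values survive,
# and both happen to come from the first sample.
small = np.array([[1.0, 5.0, 4.0],
                  [3.0, 0.5, -2.0]])
small_out = batch_topk(small, k=1)

# Toy configuration: 16-dim activations, 8x expansion -> 128 latents, k = 4.
rng = np.random.default_rng(0)
batch, d_model, expansion, k = 32, 16, 8, 4
pre = rng.normal(size=(batch, d_model * expansion))
z = batch_topk(pre, k)
mean_l0 = np.count_nonzero(z) / batch
```

The small case shows the difference from a per-sample top-k: the budget is shared across the batch, so one sample can take every slot.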

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ViSAE, a neuroscience-motivated toolbox for mechanistic interpretability of Vision Transformers. It comprises (1) a 64K-image probing suite paired with a 16K visually grounded concept vocabulary claimed to improve coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over prior sets, (2) top-down concept reading and bottom-up circuit tracing algorithms to recover concept circuits, and (3) applications to auditing and steering ViT behavior. The central empirical result is that concept editing via ViSAE raises worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%.

Significance. If the recovered concepts and circuits are verifiably aligned with the model's internal representations of spurious cues, the framework could offer a scalable route to auditing and controlling ViT robustness. The scale of the probing suite and the reported accuracy gains on an external benchmark constitute concrete strengths, but the absence of direct validation that the 16K concepts match the features actually used by the target ViT leaves the causal link between circuit tracing and the observed editing gains unestablished.

major comments (2)
  1. [Abstract] Abstract: the central claim that concept editing improves worst-group accuracy on WaterBirds by 48.2% (outperforming baselines by 23.8%) is presented without error bars, baseline specifications, ablation controls, or statistical tests. This information is load-bearing for the steering application and cannot be evaluated from the given text.
  2. [Abstract] Abstract: the 16K-concept vocabulary is asserted to enable faithful recovery of ViT inner workings, yet no direct test (e.g., intervention on held-out spurious features or comparison against model-derived feature attributions) is described to confirm that the selected concepts correspond to the representations the ViT actually uses for the WaterBirds decision boundary. This mapping is required for the editing procedure to produce genuine causal control rather than an artifact of the intervention method.
minor comments (1)
  1. [Abstract] The abstract states a 20x coverage improvement and 28.7% accuracy gain but does not specify the exact metrics or the prior concept sets used for comparison; adding these details would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for clearer validation of concept alignment. We address each major comment below. The full experimental details (including error bars, baselines, ablations, and statistical tests) appear in the results section of the manuscript; we will revise the abstract for better self-containment while preserving its brevity. On the second point, we will add explicit discussion of the indirect nature of our validation and acknowledge the value of more direct tests.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that concept editing improves worst-group accuracy on WaterBirds by 48.2% (outperforming baselines by 23.8%) is presented without error bars, baseline specifications, ablation controls, or statistical tests. This information is load-bearing for the steering application and cannot be evaluated from the given text.

    Authors: We agree the abstract is too terse on this load-bearing claim. The main text (Experiments section) reports the 48.2% improvement with standard deviations over 5 random seeds, specifies all baselines (including the 23.8% margin over the strongest prior method), includes ablation controls on circuit components, and reports p-values from paired t-tests. We will revise the abstract to include a parenthetical note on statistical significance and the use of multiple runs, while directing readers to the full results for complete specifications. Revision: yes.

  2. Referee: [Abstract] Abstract: the 16K-concept vocabulary is asserted to enable faithful recovery of ViT inner workings, yet no direct test (e.g., intervention on held-out spurious features or comparison against model-derived feature attributions) is described to confirm that the selected concepts correspond to the representations the ViT actually uses for the WaterBirds decision boundary. This mapping is required for the editing procedure to produce genuine causal control rather than an artifact of the intervention method.

    Authors: The manuscript provides indirect but quantitative evidence via the large, consistent gains on the external WaterBirds benchmark after targeted editing; spurious-cue interventions that do not align with model representations would not be expected to produce a 48.2% worst-group lift. Nevertheless, we acknowledge that direct tests (held-out feature interventions or attribution comparisons) are absent. We will expand the discussion section to explicitly note this limitation and outline how such tests could be performed in follow-up work. Revision: partial.
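The referee asks for error bars and statistical tests, and the simulated rebuttal mentions standard deviations over 5 seeds and paired t-tests. A minimal sketch of that kind of report is shown below, using only the Python standard library. The accuracy lists are placeholder values invented for this example and are not results from the paper. The helper names are hypothetical.

```python
import math
import statistics

def mean_std(xs):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(xs), statistics.stdev(xs)

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b, with len(a) - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# PLACEHOLDER worst-group accuracies (percent) for 5 seeds, NOT paper results.
edited = [80, 82, 78, 81, 79]
baseline = [60, 61, 59, 62, 58]

edited_mean, edited_std = mean_std(edited)
baseline_mean, baseline_std = mean_std(baseline)
t_stat = paired_t_statistic(edited, baseline)
degrees_of_freedom = len(edited) - 1
```

Pairing by seed matters here because the test is run on the per-seed differences, which removes seed-to-seed variance shared by both methods. Turning the statistic into a p-value needs the Student t distribution; `scipy.stats.ttest_rel` performs the whole test in one call if SciPy is available.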

Circularity Check

0 steps flagged

No circularity; results on external benchmark with independent validation

Full rationale

The paper reports empirical gains on the external WaterBirds benchmark via concept editing, with the 16K-concept vocabulary and 64K-image suite constructed and validated separately through coverage and interpretation accuracy metrics. No equations or claims reduce a prediction to a fitted input by construction, and no load-bearing step relies on self-citation chains or self-definitional mappings. The derivation chain from probing to circuit tracing to editing is presented as a sequence of independent algorithmic steps evaluated against baselines and held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the unverified assumption that the neuroscience-motivated vocabulary faithfully aligns with ViT internals and that the reported accuracy lifts are not artifacts of dataset construction or metric choice.

axioms (1)
  • Domain assumption: Neuroscience-inspired principles translate directly into effective concept decomposition and circuit tracing for Vision Transformers.
    Stated as motivation in the abstract without further justification or validation.

pith-pipeline@v0.9.1-grok · 5763 in / 1253 out tokens · 23571 ms · 2026-06-28T01:59:54.741336+00:00 · methodology

