pith. machine review for the scientific record.

arxiv: 2605.12952 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: no theorem link

Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

Xiaohui Fan, Yongjin Cui


Pith reviewed 2026-05-14 19:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords Grad-ECLIP · model interpretation · transformer · attention mechanisms · intermediate features · interpretation principles · vision transformers · feature attribution

The pith

Grad-ECLIP produces model interpretations that do not match the original model's behavior or performance because its method is equivalent to a simpler attention-based route.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that Grad-ECLIP, presented as a new intermediate feature-based route for interpreting transformer models, is mathematically equivalent to an attention-based method. The authors construct Attention-ECLIP as a direct equivalent with simpler computation and establish the match through both formal derivation and experiments. They then show that Grad-ECLIP's outputs are not the interpretations of the original model and are misaligned with how well the model actually performs. The paper ends by stating two fundamental principles any model interpretation technique must follow to avoid this kind of detachment from the model.

Core claim

The central claim is that the intermediate feature-based technical route represented by Grad-ECLIP is actually an equivalent variant of the attention-based route, as demonstrated by developing Attention-ECLIP which yields identical results with reduced computation. Both formal derivation and experimental validation confirm this equivalence. In addition, the interpretation results from Grad-ECLIP are not those produced by the original model and are misaligned with the model's performance on the task, revealing flaws in the method that violate core requirements for faithful model interpretation.

What carries the argument

The mathematical equivalence between Grad-ECLIP's intermediate-feature computations and the transformer's attention maps carries the argument: both routes generate the same, and equally incorrect, interpretations.
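
To make the claimed reduction concrete, here is a minimal sketch (ours, not the paper's implementation) of the mechanism: for a single attention readout out_i = Σⱼ A_ij V_j, gradient-times-input computed on the attention weights A coincides with gradient-times-input computed on the intermediate per-token contributions A_ij · V_j once the feature dimension is summed out. The toy shapes and the scalar readout w are illustrative assumptions.

    # Minimal sketch (not the paper's code): grad ⊙ input on the attention
    # weights equals grad ⊙ input on the intermediate contributions
    # C[i, j] = A_ij * V_j, summed over the feature dimension.
    import torch

    torch.manual_seed(0)
    n, d = 5, 8                                   # tokens, feature dim
    A = torch.rand(n, n).softmax(dim=-1)          # stand-in attention map
    V = torch.randn(n, d)                         # stand-in value vectors
    w = torch.randn(d)                            # stand-in scalar readout

    # Route 1: attribute on the attention weights.
    A1 = A.clone().requires_grad_(True)
    ((A1 @ V) @ w).sum().backward()
    attn_route = A1.grad * A                      # ∇A ⊙ A

    # Route 2: attribute on the intermediate per-token contributions.
    C = (A.unsqueeze(-1) * V.unsqueeze(0)).clone().requires_grad_(True)
    (C.sum(dim=1) @ w).sum().backward()
    feat_route = (C.grad * C.detach()).sum(dim=-1)  # ∇C ⊙ C, summed over d

    print(torch.allclose(attn_route, feat_route, atol=1e-6))  # True

The equality here is exact up to floating point; the paper's stronger claim is that Grad-ECLIP's full pipeline reduces to the attention route in the same way.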

If this is right

  • Any interpretation method based on intermediate features can be reduced to attention computations without loss of information.
  • Using Grad-ECLIP would lead to explanations that do not reflect the features the model actually uses for its predictions.
  • Model interpretation results must be validated against both the original model's outputs and its measured performance on the data.
  • New technical routes for interpretation should first be checked for equivalence to existing attention-based methods before being presented as distinct.
  • The two stated principles for model interpretation would block similar detachment between the explanation and the model's actual decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention-based routes may be sufficient for all faithful transformer interpretations, rendering separate intermediate-feature variants unnecessary.
  • The same equivalence and misalignment pattern could appear in other feature-attribution methods applied to vision or language transformers.
  • New interpretation tools should include routine checks for both equivalence to attention maps and direct alignment with model accuracy before release.

Load-bearing premise

The load-bearing premise is that Attention-ECLIP is exactly equivalent to Grad-ECLIP across all practical cases and that the experiments correctly isolate misalignment without selection bias or implementation differences.

What would settle it

A side-by-side test on a held-out transformer model and dataset where Grad-ECLIP and Attention-ECLIP produce different saliency maps, or where Grad-ECLIP maps align with the model's accuracy while the paper claims they do not.
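
A minimal harness for that test might look like the sketch below; grad_eclip_map and attention_eclip_map are hypothetical stand-ins for the two methods' saliency functions, not real APIs.

    # Hypothetical settling-test harness: report the worst per-sample gap
    # between the two saliency routes on a held-out loader.
    import torch

    def worst_gap(model, loader, grad_eclip_map, attention_eclip_map):
        gap = 0.0
        for images, texts in loader:
            m1 = grad_eclip_map(model, images, texts)
            m2 = attention_eclip_map(model, images, texts)
            gap = max(gap, (m1 - m2).abs().max().item())
        # Anything beyond floating-point noise would contradict exact equivalence.
        return gap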

Figures

Figures reproduced from arXiv: 2605.12952 by Xiaohui Fan, Yongjin Cui.

Figure 1. The overall idea of this work.
Figure 2. Grad-ECLIP and GAE. (Q, K, V are the Query, Key, and Value matrices; A is the attention map; Ā is the mean attention map across the multi-head attention maps; 1/√dk is the scaling factor.)
Figure 3. Interpretation results of Grad-ECLIP w/o wᵢ, Attention-ECLIP w/o wᵢ, Grad-ECLIP, and Attention-ECLIP.
Figure 4. Comparison of interpretation effects among different components. (∇A ⊙ w is equivalent to Grad-ECLIP.)
Figure 5. Experiment on updating the attention map. (As A is updated, the model's performance continues to improve, and the interpretation results from ∇A ⊙ A align well with that performance; because the computation of w is not directly associated with A, w remains unchanged as A updates.)
Figure 6. Performance of the modified CLIP (the last attention layer of the image encoder is modified to single-head attention).
Figure 7. Performance of the modified CLIP (the last attention layer of the text encoder is modified to single-head attention).
Figure 8. Performance of the modified CLIP (the last eight attention layers of the text encoder are modified to single-head attention).
Figure 9. Performance of the original CLIP.
Original abstract

Grad-ECLIP is published at ICML 2024 and represents a new Transformer interpretation technical route (intermediate features-based). First, this paper demonstrates that the intermediate features-based technical route is not a novel one. Based on the existing attention-based route, we have developed Attention-ECLIP, which is completely equivalent to Grad-ECLIP but with simpler computation. Both through formal derivation and experimental validation, we prove that the intermediate feature-based route represented by Grad-ECLIP is actually an equivalent variant of the attention-based route. Next, this paper demonstrates that the Grad-ECLIP method is flawed. The model interpretation results obtained by Grad-ECLIP are not those of the original model, and the interpretation results are misaligned with the model's performance. We analyze the causes of Grad-ECLIP's flaws and propose, or rather, explicitly emphasize two fundamental principles that model interpretation should adhere to in order to avoid similar errors.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that Grad-ECLIP, an intermediate feature-based interpretation method for Transformers published at ICML 2024, is formally equivalent to a simpler attention-based method, termed Attention-ECLIP, which the authors develop from the existing attention-based route. Both formal derivation and experimental validation are presented to establish this equivalence. The paper further argues that Grad-ECLIP produces interpretations that are misaligned with the original model's performance and not representative of the model itself, analyzes the causes, and proposes two fundamental principles for model interpretation to avoid such errors.

Significance. If the equivalence is shown to be exact (including bit-identical outputs) and the misalignment experiments cleanly isolate the method from implementation differences, the work would be significant for Transformer interpretability in vision-language models. It would provide a simpler equivalent baseline, highlight risks in intermediate-feature routes, and supply explicit principles that could steer future methods away from similar flaws. The combination of derivation and validation strengthens the contribution relative to purely empirical critiques.

major comments (3)
  1. [Formal Derivation] Formal Derivation section: the claim of complete equivalence between Grad-ECLIP and Attention-ECLIP requires explicit verification that the two implementations produce bit-identical outputs on identical inputs; without this, differences in gradient routing, normalization order, or head aggregation could confound the subsequent misalignment results and prevent clean attribution to the intermediate-feature route.
  2. [Experimental Validation] Experimental Validation section: the misalignment experiments must include controls confirming that any performance-interpretation gap is not an artifact of implementation discrepancies between the two routes; reporting only aggregate metrics without per-sample or per-head output identity checks leaves the central claim vulnerable to the concern that observed misalignment stems from code differences rather than a fundamental flaw.
  3. [Flaw Analysis] Flaw Analysis section: the identification of causes for Grad-ECLIP's misalignment should reference specific equations or components from the original Grad-ECLIP paper (e.g., the intermediate feature computation steps) to demonstrate precisely where the deviation from the original model occurs, rather than relying solely on post-hoc comparison with Attention-ECLIP.
minor comments (2)
  1. [Abstract] Abstract: the two fundamental principles for model interpretation should be stated explicitly rather than summarized, to allow readers to evaluate their generality immediately.
  2. [References] References: add citations to prior attention-based interpretation methods to strengthen the claim that Attention-ECLIP is a direct, non-novel extension of existing routes.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments point by point below, and will make revisions to incorporate the suggestions where they strengthen the presentation of our results.

Point-by-point responses
  1. Referee: Formal Derivation section: the claim of complete equivalence between Grad-ECLIP and Attention-ECLIP requires explicit verification that the two implementations produce bit-identical outputs on identical inputs; without this, differences in gradient routing, normalization order, or head aggregation could confound the subsequent misalignment results and prevent clean attribution to the intermediate-feature route.

    Authors: We agree that demonstrating bit-identical outputs would provide the strongest possible evidence for equivalence. Our formal derivation shows that the methods are mathematically equivalent, and our current experiments report high numerical agreement (e.g., cosine similarity >0.99). However, to fully address this concern, we will add explicit verification in the revised manuscript, including code snippets or results showing identical outputs at the bit level for the same inputs, controlling for floating-point precision. revision: yes

  2. Referee: Experimental Validation section: the misalignment experiments must include controls confirming that any performance-interpretation gap is not an artifact of implementation discrepancies between the two routes; reporting only aggregate metrics without per-sample or per-head output identity checks leaves the central claim vulnerable to the concern that observed misalignment stems from code differences rather than a fundamental flaw.

    Authors: We will enhance the Experimental Validation section by adding per-sample and per-head output identity checks between Grad-ECLIP and Attention-ECLIP. This will include reporting metrics such as exact match rates or L2 differences per sample to confirm that any observed misalignment with model performance is attributable to the method itself rather than implementation artifacts; a sketch of such a check follows these responses. revision: yes

  3. Referee: Flaw Analysis section: the identification of causes for Grad-ECLIP's misalignment should reference specific equations or components from the original Grad-ECLIP paper (e.g., the intermediate feature computation steps) to demonstrate precisely where the deviation from the original model occurs, rather than relying solely on post-hoc comparison with Attention-ECLIP.

    Authors: We will revise the Flaw Analysis section to include direct references to specific equations and components from the original Grad-ECLIP paper, such as the steps for computing intermediate features and gradients. This will precisely identify where the deviation occurs, complementing our comparison with Attention-ECLIP. revision: yes
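
A minimal sketch of the promised per-sample checks (saliency_a and saliency_b are hypothetical callables for the two routes; bit-exact matching would additionally require fixed seeds, dtypes, and op order):

    # Per-sample identity report: exact-match rate, L2 gaps, cosine similarity.
    import torch

    def identity_report(saliency_a, saliency_b, samples, atol=1e-6):
        matches, l2_gaps, cosines = 0, [], []
        for x in samples:
            ma, mb = saliency_a(x).flatten(), saliency_b(x).flatten()
            matches += int(torch.allclose(ma, mb, atol=atol))
            l2_gaps.append((ma - mb).norm().item())
            cosines.append(
                torch.nn.functional.cosine_similarity(ma, mb, dim=0).item())
        return matches / len(samples), l2_gaps, cosines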

Circularity Check

0 steps flagged

No significant circularity; equivalence derived independently from attention mechanisms

full rationale

The paper constructs Attention-ECLIP explicitly from the existing attention-based route and provides a formal derivation showing equivalence to Grad-ECLIP's intermediate-feature path. No step reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation chain. The equivalence claim rests on explicit equations comparing the two routes, and misalignment is shown via separate experiments on model performance. The derivation chain is grounded in standard, external attention formulations and does not rename known results or smuggle in ansatzes via prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard mathematical reformulations of attention operations and empirical checks of alignment; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • standard math Attention mechanisms in Transformers permit equivalent reformulations between feature-based and attention-weight computations
    Invoked to establish that Grad-ECLIP equals Attention-ECLIP; made concrete in the identity sketched below
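
The invoked reformulation can be written as a one-line identity (our notation, under the simplifying assumption of a single attention readout and a scalar output s):

    \[
      \frac{\partial s}{\partial A_{ij}}\, A_{ij}
        = \Big(\sum_d \frac{\partial s}{\partial o_{id}}\, v_{jd}\Big) A_{ij}
        = \sum_d \frac{\partial s}{\partial o_{id}}\,\big(A_{ij} v_{jd}\big),
      \qquad o_i = \sum_j A_{ij} v_j,
    \]

so gradient-times-input on an attention weight A_ij coincides with gradient-times-input on the intermediate contribution A_ij v_j once the feature dimension d is summed out.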

pith-pipeline@v0.9.0 · 5464 in / 1209 out tokens · 59793 ms · 2026-05-14T19:49:53.312777+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 35 canonical work pages · 5 internal anchors

  1. [1]

    Quantifying attention flow in transformers

    Abnar, S. and Zuidema, W. H. Quantifying attention flow in transformers. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. R. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 4190--4197. Association for Computational Linguistics, 2020. doi:10.18653/V1/2020.AC...

  2. [2]

    Deep integrated explanations

    Barkan, O., Elisha, Y., Weill, J., Asher, Y., Eshel, A., and Koenigstein, N. Deep integrated explanations. In Frommholz, I., Hopfgartner, F., Lee, M., Oakes, M., Lalmas, M., Zhang, M., and Santos, R. L. T. (eds.), Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21...

  3. [3]

    Interpretability via Model Extraction

    Bastani, O., Kim, C., and Bastani, H. Interpretability via model extraction. CoRR, abs/1706.09773, 2017. URL http://arxiv.org/abs/1706.09773

  4. [4]

    Layer-wise relevance propagation for neural networks with local renormalization layers

    Binder, A., Montavon, G., Lapuschkin, S., Müller, K., and Samek, W. Layer-wise relevance propagation for neural networks with local renormalization layers. In Villa, A. E. P., Masulli, P., and Rivero, A. J. P. (eds.), Artificial Neural Networks and Machine Learning - ICANN 2016 - 25th International Conference on Artificial Neural Networks, Barcelona,...

  5. [5]

    Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Changpinyo, S., Sharma, P., Ding, N., and Soricut, R. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 3558--3568. Computer Vision Foundation / IEEE, 2021. doi:10.1109/CVPR46437.2021.00356. URL https:...

  6. [6]

    Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers

    Chefer, H., Gur, S., and Wolf, L. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 387--396. IEEE, 2021a. doi:10.1109/ICCV48922.2021.00045. URL https://doi.org/10.1109/ICCV48922.2021.00045

  7. [7]

    Transformer interpretability beyond attention visualization

    Chefer, H., Gur, S., and Wolf, L. Transformer interpretability beyond attention visualization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 782--791. Computer Vision Foundation / IEEE, 2021b. doi:10.1109/CVPR46437.2021.00084. URL https://openaccess.thecvf.com/content/CVPR2021/html/Chefer_Tr...

  8. [8]

    Beyond intuition: Rethinking token attributions inside transformers

    Chen, J., Li, X., Yu, L., Dou, D., and Xiong, H. Beyond intuition: Rethinking token attributions inside transformers. Trans. Mach. Learn. Res., 2023, 2023. URL https://openreview.net/forum?id=rm0zIzlhcX

  9. [9]

    Developing real-time streaming transformer transducer for speech recognition on large-scale dataset

    Chen, X., Wu, Y., Wang, Z., Liu, S., and Li, J. Developing real-time streaming transformer transducer for speech recognition on large-scale dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021, pp. 5904--5908. IEEE, 2021. doi:10.1109/ICASSP39728.2021.9413535. URL https:/...

  10. [10]

    Efficient and effective text encoding for chinese llama and alpaca

    Cui, Y., Yang, Z., and Yao, X. Efficient and effective text encoding for chinese llama and alpaca. CoRR, abs/2304.08177, 2023. doi:10.48550/ARXIV.2304.08177. URL https://doi.org/10.48550/arXiv.2304.08177

  11. [11]

    Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition

    Dong, L., Xu, S., and Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pp. 5884--5888. IEEE, 2018. doi:10.1109/ICASSP.2018.8462506. URL https://doi.org/10.1109/ICASSP.2018.8462506

  12. [12]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 ...

  13. [13]

    Explaining through transformer input sampling

    Englebert, A., Stassin, S., Nanfack, G., Mahmoudi, S. A., Siebert, X., Cornu, O., and Vleeschouwer, C. D. Explaining through transformer input sampling. In IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Workshops, Paris, France, October 2-6, 2023, pp. 806--815. IEEE, 2023. doi:10.1109/ICCVW60793.2023.00088. URL https://doi.org/10.110...

  14. [14]

    Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA

    Hu, R., Singh, A., Darrell, T., and Rohrbach, M. Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 9989--9999. Computer Vision Foundation / IEEE, 2020. doi:10.1109/CVPR42600.2020.01001. URL htt...

  15. [15]

    Multi-compound transformer for accurate biomedical image segmentation

    Ji, Y., Zhang, R., Wang, H., Li, Z., Wu, L., Zhang, S., and Luo, P. Multi-compound transformer for accurate biomedical image segmentation. In de Bruijne, M., Cattin, P. C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., and Essert, C. (eds.), Medical Image Computing and Computer Assisted Intervention - MICCAI 2021 - 24th International Conference, Strasbour...

  16. [16]

    Layercam: Exploring hierarchical class activation maps for localization

    Jiang, P., Zhang, C., Hou, Q., Cheng, M., and Wei, Y. Layercam: Exploring hierarchical class activation maps for localization. IEEE Trans. Image Process., 30: 5875--5888, 2021. doi:10.1109/TIP.2021.3089943. URL https://doi.org/10.1109/TIP.2021.3089943

  17. [17]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Li, J., Li, D., Xiong, C., and Hoi, S. C. H. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 ...

  18. [18]

    VisualBERT: A simple and performant baseline for vision and language

    Li, L. H., Yatskar, M., Yin, D., Hsieh, C., and Chang, K. Visualbert: A simple and performant baseline for vision and language. CoRR, abs/1908.03557, 2019. URL http://arxiv.org/abs/1908.03557

  19. [19]

    UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning

    Li, W., Gao, C., Niu, G., Xiao, X., Liu, H., Liu, J., Wu, H., and Wang, H. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference ...

  20. [20]

    Microsoft COCO: Common objects in context

    Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: common objects in context. In Fleet, D. J., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of L...

  21. [21]

    Swin Transformer: Hierarchical vision transformer using shifted windows

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 9992--10002. IEEE, 2021. doi:10.1109/ICCV48922.2021.00986. URL https://doi.org/10.110...

  22. [22]

    Clip4clip: An empirical study of CLIP for end to end video clip retrieval and captioning

    Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., and Li, T. Clip4clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 508: 293--304, 2022. doi:10.1016/J.NEUCOM.2022.07.028. URL https://doi.org/10.1016/j.neucom.2022.07.028

  23. [23]

    Beyond prediction similarity: ShapGAP for evaluating faithful surrogate models in XAI

    Mariotti, E., Sivaprasad, A., and Alonso-Moral, J. M. Beyond prediction similarity: Shapgap for evaluating faithful surrogate models in XAI. In Longo, L. (ed.), Explainable Artificial Intelligence - First World Conference, xAI 2023, Lisbon, Portugal, July 26-28, 2023, Proceedings, Part I, Communications in Computer and Information Science, pp. 160--1...

  24. [24]

    Are sixteen heads really better than one?

    Michel, P., Levy, O., and Neubig, G. Are sixteen heads really better than one? In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC...

  25. [25]

    Learning transferable visual models from natural language supervision

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Vi...

  26. [26]

    Do ImageNet Classifiers Generalize to ImageNet?

    Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do imagenet classifiers generalize to imagenet? CoRR, abs/1902.10811, 2019. URL http://arxiv.org/abs/1902.10811

  27. [27]

    "Why Should

    Ribeiro, M. T., Singh, S., and Guestrin, C. "why should I trust you?": Explaining the predictions of any classifier. In Krishnapuram, B., Shah, M., Smola, A. J., Aggarwal, C. C., Shen, D., and Rastogi, R. (eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016 ,...

  28. [28]

    ImageNet large scale visual recognition challenge

    Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Fei-Fei, L. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3): 211--252, 2015. doi:10.1007/S11263-015-0816-Y. URL https://doi.org/10.1007/s11263-015-0816-y

  29. [29]

    Grad-CAM: Visual explanations from deep networks via gradient-based localization

    Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 618--626. IEEE Computer Society, 2017. doi:10.1109/ICCV.2017.74. URL https://doi.org/10....

  30. [30]

    TreeView: Peeking into Deep Neural Networks Via Feature-Space Partitioning

    Thiagarajan, J. J., Kailkhura, B., Sattigeri, P., and Ramamurthy, K. N. Treeview: Peeking into deep neural networks via feature-space partitioning. CoRR, abs/1611.07429, 2016. URL http://arxiv.org/abs/1611.07429

  31. [31]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. doi:10.48550/ARXIV.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971

  32. [32]

    Llama 2: Open foundation and fine-tuned chat models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton-Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., ...

  33. [33]

    Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processi...

  34. [34]

    Tfnet: Transformer fusion network for ultrasound image segmentation

    Wang, T., Lai, Z., and Kong, H. Tfnet: Transformer fusion network for ultrasound image segmentation. In Wallraven, C., Liu, Q., and Nagahara, H. (eds.), Pattern Recognition - 6th Asian Conference, ACPR 2021, Jeju Island, South Korea, November 9-12, 2021, Revised Selected Papers, Part I, volume 13188 of Lecture Notes in Computer Science, pp. 314--325. Sp...

  35. [35]

    CRIS: CLIP-driven referring image segmentation

    Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., and Liu, T. CRIS: clip-driven referring image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 11676--11685. IEEE, 2022. doi:10.1109/CVPR52688.2022.01139. URL https://doi.org/10.1109/CVPR52688.2022.01139

  36. [36]

    ViT-CX: Causal explanation of vision transformers

    Xie, W., Li, X., Cao, C. C., and Zhang, N. L. Vit-cx: Causal explanation of vision transformers. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, pp. 1569--1577. ijcai.org, 2023. doi:10.24963/IJCAI.2023/174. URL https://doi.org/10.24963/ijcai.2023/174

  37. [37]

    Cotr: Efficiently bridging CNN and transformer for 3d medical image segmentation

    Xie, Y., Zhang, J., Shen, C., and Xia, Y. Cotr: Efficiently bridging CNN and transformer for 3d medical image segmentation. In de Bruijne, M., Cattin, P. C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., and Essert, C. (eds.), Medical Image Computing and Computer Assisted Intervention - MICCAI 2021 - 24th International Conference, Strasbourg, France, Sept...

  38. [38]

    GroupViT: Semantic segmentation emerges from text supervision

    Xu, J., Mello, S. D., Liu, S., Byeon, W., Breuel, T. M., Kautz, J., and Wang, X. Groupvit: Semantic segmentation emerges from text supervision. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 18113--18123. IEEE, 2022. doi:10.1109/CVPR52688.2022.01760. URL https://doi.org/10.1109/...

  39. [39]

    Explaining information flow inside vision transformers using markov chain

    Yuan, T., Li, X., Xiong, H., Cao, H., and Dou, D. Explaining information flow inside vision transformers using markov chain. In eXplainable AI approaches for debugging and diagnosis., 2021. URL https://openreview.net/forum?id=TT-cf6QSDaQ

  40. [40]

    Gradient-based visual explanation for transformer-based CLIP

    Zhao, C., Wang, K., Zeng, X., Zhao, R., and Chan, A. B. Gradient-based visual explanation for transformer-based CLIP. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=WT4X3QYopC

  41. [41]

    RegionCLIP: Region-based language-image pretraining

    Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L. H., Zhou, L., Dai, X., Yuan, L., Li, Y., and Gao, J. Regionclip: Region-based language-image pretraining. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 16772--16782. IEEE, 2022. doi:10.1109/CVPR52688.2022.01629. URL ht...

  42. [42]

    Learning deep features for discriminative localization

    Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2921--2929. IEEE Computer Society, 2016. doi:10.1109/CVPR.2016.319. URL https://doi.org/10.1109/CVPR.2016.319

  43. [43]

    Learning to prompt for vision-language models

    Zhou, K., Yang, J., Loy, C. C., and Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis., 130(9): 2337--2348, 2022. doi:10.1007/S11263-022-01653-1. URL https://doi.org/10.1007/s11263-022-01653-1