pith. machine review for the scientific record.

arxiv: 2605.10142 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI


Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality

Marcin Chlebus, Mateusz Cedro

Pith reviewed 2026-05-12 04:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords model scaling · vision models · explainable AI · post-hoc explanations · localisation metrics · ResNet · Vision Transformer · explanation quality

The pith

Scaling up vision models does not consistently improve the quality of their post-hoc explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether larger computer vision models yield higher-quality post-hoc explanations than smaller ones. Eleven models from the ResNet, DenseNet, and Vision Transformer families are evaluated on three datasets that include ground-truth segmentation masks. Five standard explanation methods are applied to each model, and alignment is scored with two localisation metrics that check how attributions match the annotated object regions. Results across most comparisons show that added depth and parameters bring no reliable gain, while smaller models frequently perform as well or better. Pretraining improves accuracy but does not produce consistent gains in explanation alignment, and some high-accuracy models show near-zero localisation scores.
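
To make the pipeline concrete, here is a minimal, hypothetical sketch of how one model could be scored under this setup, assuming the Captum attribution library the paper cites (reference [45]). The data loader, the restriction to two of the five methods, and metric_fn are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical evaluation loop: attribute each test image, then score the
# attribution map against the ground-truth mask with a localisation metric.
import torch
from captum.attr import Saliency, IntegratedGradients

def score_model(model, loader, metric_fn):
    """Average one localisation metric over (image, label, mask) triples."""
    model.eval()
    methods = {"saliency": Saliency(model),
               "integrated_gradients": IntegratedGradients(model)}
    scores = {name: [] for name in methods}
    for image, label, mask in loader:                 # batch size 1 assumed
        image.requires_grad_(True)
        for name, method in methods.items():
            attr = method.attribute(image, target=label)            # (1, C, H, W) attributions
            attr_map = attr.sum(dim=1).squeeze(0).detach().numpy()  # collapse colour channels
            scores[name].append(metric_fn(attr_map, mask.squeeze(0).numpy()))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}
```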

Core claim

Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. Pretraining typically improves predictive performance and increases dependence on learned weights, yet does not consistently increase localisation scores. Models can reach strong predictive performance while localisation precision remains near zero.

What carries the argument

The localisation metrics Relevance Rank Accuracy and Dual-Polarity Precision, which measure how well positive attributions fall inside class masks and negative attributions fall outside them.
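
For concreteness, a minimal NumPy sketch of the two metrics. Relevance Rank Accuracy follows the published definition (Arras et al., 2022): the fraction of the K highest-attributed pixels that fall inside the mask, with K equal to the mask size. The Dual-Polarity Precision function is only an assumed reading of the verbal description above and of the normalisation mentioned in the simulated rebuttal; the paper's exact equation is not reproduced on this page.

```python
import numpy as np

def relevance_rank_accuracy(attribution: np.ndarray, mask: np.ndarray) -> float:
    """RRA (Arras et al., 2022): fraction of the K highest-attributed pixels
    that fall inside the ground-truth mask, where K is the mask size."""
    k = int(mask.sum())
    if k == 0:
        return 0.0
    top_k = np.argpartition(attribution.ravel(), -k)[-k:]  # indices of the K largest attributions
    return float(mask.ravel()[top_k].sum() / k)

def dual_polarity_precision(attribution: np.ndarray, mask: np.ndarray) -> float:
    """Assumed form of Dual-Polarity Precision: positive attribution mass inside
    the mask plus negative attribution mass outside it, divided by total
    attribution mass; 0.5 by convention when no attribution of either polarity exists."""
    pos = np.clip(attribution, 0, None)
    neg = np.clip(-attribution, 0, None)
    total = pos.sum() + neg.sum()
    if total == 0:
        return 0.5
    inside = mask.astype(bool)
    return float((pos[inside].sum() + neg[~inside].sum()) / total)
```

Both functions take a 2-D attribution map and a binary mask of the same shape and return a score in [0, 1].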

If this is right

  • Model selection for safety-sensitive uses cannot rely on scale alone and must include explicit checks of explanation quality.
  • High predictive accuracy does not ensure that a model bases its decisions on the annotated object regions.
  • Pretraining benefits for explanations are separate from its benefits for accuracy and must be verified independently.
  • Smaller models remain competitive options when explanation alignment is a priority.
  • Explainability should be assessed as a distinct criterion during deployment decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Scaling trends observed for accuracy may not apply to interpretability, so future scaling studies should track explanation metrics separately.
  • Current post-hoc methods may require adaptation to maintain effectiveness as model complexity grows.
  • The results motivate targeted comparisons of architectural choices other than size on explanation behaviour.
  • Similar patterns could appear in non-vision domains where scale is assumed to improve trustworthiness.

Load-bearing premise

Alignment with ground-truth segmentation masks, as measured by the two localisation metrics, provides a valid general indicator of explanation quality for downstream safety or human understanding.

What would settle it

A new experiment on additional datasets or with further explanation methods in which larger models produce statistically higher scores on both Relevance Rank Accuracy and Dual-Polarity Precision would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.10142 by Marcin Chlebus, Mateusz Cedro.

Figure 1
Figure 1: Schematic overview of the experimental methodology. Deep learning models are trained or fine-tuned on the image datasets and then used to generate saliency maps for test samples. The resulting explanations are compared with ground-truth masks and assessed using Relevance Rank Accuracy and the proposed Dual-Polarity Precision to quantify the precision of the XAI outputs.
Figure 2
Figure 2: Examples of images from the datasets with corresponding ground-truth binary masks. Masks show the specific segments of interest (disease or an animal) used to evaluate the quality of the XAI explanations.
Figure 3
Figure 3: Qualitative saliency examples on chest radiographs. (a) COVID-QU-Ex example and (b) Pneumothorax example. For each case, the top row shows the input X-ray, the corresponding expert-annotated binary ground-truth mask, and the model prediction with confidence as shown in the panel header. The bottom row shows attribution heatmaps generated by Saliency, Integrated Gradients, GradientSHAP, Grad-CAM, and Feature Permutation.
Figure 4
Figure 4: Sanity checks for an ImageNet-pretrained ResNet-50 across five XAI methods. Columns show attribution maps from Saliency, Integrated Gradients, GradientSHAP, Feature Permutation, and Grad-CAM. For each method, the scores above the maps report DPP and RRA, and the colour bar indicates normalised attribution magnitude. Panels (a) and (b) define the weight and input randomisation settings and report the corresponding scores.
Figure 5
Figure 5: Layer-wise randomisation of ImageNet-pretrained ResNet-50.
Figure 6
Figure 6: Input-wise randomisation of ResNet-18.
Figure 7
Figure 7: Saliency-based explanations on the Chest X-ray Pneumothorax dataset using a ResNet-50 pretrained on ImageNet. Each row shows the original X-ray, the ground-truth mask, and attributions from five methods: Saliency, Integrated Gradients, GradientSHAP, Grad-CAM, and Feature Permutation. DPP and RRA scores are reported below each attribution map.
Figure 8
Figure 8: Saliency-based explanations on the Chest X-ray Pneumothorax dataset using a DenseNet-121 pretrained on CheXpert. Each row shows the original X-ray, the ground-truth mask, and attributions from five methods: Saliency, Integrated Gradients, GradientSHAP, Grad-CAM, and Feature Permutation. DPP and RRA scores are reported below each attribution map.
read the original abstract

Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper conducts a comprehensive empirical evaluation of 11 vision models from ResNet, DenseNet, and Vision Transformer families with varying depths and parameter counts on three datasets equipped with ground-truth segmentation masks. Using five post-hoc XAI methods, it measures explanation quality via Relevance Rank Accuracy and the newly proposed Dual-Polarity Precision metric, which assesses positive attributions within the mask and negative attributions outside it. The main conclusion is that increasing model scale does not consistently enhance localisation-based explanation quality, with smaller models frequently performing on par or better, and pretraining improving predictive accuracy but not reliably boosting localisation scores. The study also identifies instances where high predictive performance is accompanied by near-zero localisation, underscoring the need for explicit explainability assessment.

Significance. If these findings hold, they challenge the prevailing assumption in the field that larger, more accurate models inherently produce better explanations. This has significant implications for the deployment of AI in safety-critical domains, where both performance and interpretability are crucial. The broad scope across architectures, datasets, and methods strengthens the generalizability of the results. The introduction of Dual-Polarity Precision provides a new tool for the community. The paper's own acknowledgment of the disconnect between performance and localisation adds nuance and prevents overinterpretation. The stress-test concern about mask validity does not land as a reason to doubt the central claim, because the work is explicitly framed around localisation-based metrics and already flags the performance-localisation disconnect.

major comments (3)
  1. The claim that scaling does not improve explanation quality 'in most statistical comparisons' is load-bearing for the title and abstract conclusions. The specific tests (e.g., paired t-tests or Wilcoxon), effect sizes, and any multiple-comparison corrections applied across the 11 models × 5 methods × 3 datasets × 2 metrics grid should be reported explicitly with raw distributions or supplementary tables to allow assessment of robustness.
  2. Dual-Polarity Precision is introduced as a contribution that captures both polarities of attribution. Its exact formula, normalization, and handling of zero-attribution cases (likely in the methods) must be given as an equation, together with a direct comparison or correlation analysis against Relevance Rank Accuracy on the same explanations to demonstrate incremental value.
  3. The observation of strong predictive performance with near-zero localisation is highlighted in the abstract. A quantitative breakdown by dataset and model family (e.g., percentage of cases or average localisation scores for top-accuracy models) would strengthen the interpretation that performance metrics alone are insufficient.
minor comments (3)
  1. Figure captions should explicitly state the number of runs or seeds used and whether error bars represent standard deviation or standard error to improve clarity of the scaling trends.
  2. The reference list should confirm that Arras et al. (2022) and all five XAI method citations are complete and consistently formatted.
  3. A short paragraph in the discussion on potential dataset-specific biases in the segmentation masks (e.g., annotation quality or class-specific challenges) would aid readers in generalizing the results.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive suggestions for improvement. We address each major comment below with clarifications and revisions.

read point-by-point responses
  1. Referee: The claim that scaling does not improve explanation quality 'in most statistical comparisons' is load-bearing for the title and abstract conclusions. The specific tests (e.g., paired t-tests or Wilcoxon), effect sizes, and any multiple-comparison corrections applied across the 11 models × 5 methods × 3 datasets × 2 metrics grid should be reported explicitly with raw distributions or supplementary tables to allow assessment of robustness.

    Authors: We agree that greater transparency on the statistical procedures is warranted. The manuscript already employs Wilcoxon signed-rank tests (chosen for non-normality of localisation scores) with Bonferroni correction across the full grid of comparisons, and reports effect sizes via rank-biserial correlation. To strengthen the claim, the revised version will add an explicit methods subsection detailing these choices and will include supplementary tables with all raw p-values, effect sizes, and per-comparison score distributions (see the sketch after these responses). revision: yes

  2. Referee: Dual-Polarity Precision is introduced as a contribution that captures both polarities of attribution. Its exact formula, normalization, and handling of zero-attribution cases (likely in the methods) must be given as an equation, together with a direct comparison or correlation analysis against Relevance Rank Accuracy on the same explanations to demonstrate incremental value.

    Authors: We accept this request for formalization. The revised manuscript will present the Dual-Polarity Precision formula as an explicit equation in the methods, including the normalization (division by total attribution mass) and the convention for zero-attribution cases (score set to 0.5 when both polarities are absent). We will also add a supplementary correlation analysis (Pearson r and agreement rates) between Dual-Polarity Precision and Relevance Rank Accuracy across all generated explanations to quantify its added value. revision: yes

  3. Referee: The observation of strong predictive performance with near-zero localisation is highlighted in the abstract. A quantitative breakdown by dataset and model family (e.g., percentage of cases or average localisation scores for top-accuracy models) would strengthen the interpretation that performance metrics alone are insufficient.

    Authors: We will incorporate this quantitative support. The revised results section will include a new table (or supplementary table) reporting, for each dataset and model family, (i) the percentage of top-quartile accuracy models that exhibit near-zero localisation (<0.1) and (ii) the mean localisation scores for the highest-accuracy models. This will provide concrete numbers backing the abstract statement. revision: yes
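
Editorial aside: a hypothetical sketch of one cell of the comparison grid under the procedure described in response 1, assuming SciPy's Wilcoxon signed-rank test, a Bonferroni correction over the number of comparisons, and a rank-biserial effect size computed from the signed ranks of the paired differences. Names and data layout are illustrative, not the authors' code.

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

def compare_pair(scores_small, scores_large, n_comparisons):
    """Paired per-image localisation scores (RRA or DPP) for a smaller and a
    larger model under the same XAI method and dataset."""
    scores_small = np.asarray(scores_small)
    scores_large = np.asarray(scores_large)
    _, p = wilcoxon(scores_small, scores_large)    # non-parametric paired test
    p_bonferroni = min(1.0, p * n_comparisons)     # correct across the full comparison grid
    diffs = scores_large - scores_small
    nonzero = diffs[diffs != 0]
    ranks = rankdata(np.abs(nonzero))              # ranks of absolute differences (ties averaged)
    # Rank-biserial correlation: (positive rank sum - negative rank sum) / total rank sum.
    r_rb = (ranks[nonzero > 0].sum() - ranks[nonzero < 0].sum()) / ranks.sum()
    return p_bonferroni, r_rb
```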

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with direct measurements

full rationale

This paper performs an empirical comparison of explanation quality across model scales using two localisation metrics (one cited, one newly proposed) applied to generated attributions on datasets with segmentation masks. No mathematical derivations, parameter fits, or predictions are claimed; results consist of statistical comparisons of measured scores. No self-citations are load-bearing for any central claim, and the analysis does not reduce any quantity to a definition or fit by construction. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that localisation metrics validly proxy explanation quality and on the introduction of one new metric; no free parameters are fitted to support the main negative finding on scaling.

axioms (1)
  • domain assumption Localisation to ground-truth segmentation masks is a meaningful proxy for explanation quality
    Invoked when using Relevance Rank Accuracy and Dual-Polarity Precision to quantify explanation quality.
invented entities (1)
  • Dual-Polarity Precision no independent evidence
    purpose: Quantifies positive attributions inside the class mask and negative attributions outside it
    New metric proposed to complement existing localisation measures.

pith-pipeline@v0.9.0 · 5524 in / 1398 out tokens · 61426 ms · 2026-05-12T04:11:00.128965+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 2 internal anchors

  1. [1] Arras, L., Osman, A. & Samek, W. Ground Truth Evaluation of Neural Network Explanations with CLEVR-XAI. Inf. Fusion 81, 14–40, DOI: 10.1016/j.inffus.2021.11.008 (2022). ArXiv:2003.07258 [cs, eess]
  2. [2] Morch, N. J. S. et al. Visualization of neural networks using saliency maps. In Proceedings of ICNN'95 - International Conference on Neural Networks, vol. 4, 2085–2090, DOI: 10.1109/ICNN.1995.488997 (IEEE, Perth, WA, Australia, 1995)
  3. [3] Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M. & Hansen, K. How to explain individual classification decisions (2010). ArXiv preprint
  4. [4] Simonyan et al. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps (2013). ArXiv:1312.6034 [cs]
  5. [5] Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In International conference on machine learning, 3319–3328 (PMLR, 2017)
  6. [6] Dobrzycka, M. et al. Machine learning-based identification of small RNA signatures in aqueous humor as a step toward precision diagnosis of glaucoma. Annals of Medicine 57, 2568119, DOI: 10.1080/07853890.2025.2568119 (2025). PMID: 41047921
  7. [7] Guidotti, R. et al. A Survey of Methods for Explaining Black Box Models. ACM Comput. Surv. 51, 1–42, DOI: 10.1145/3236009 (2019)
  8. [8] Ribeiro, M. T., Singh, S. & Guestrin, C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier (2016). ArXiv:1602.04938 [cs, stat]
  9. [9] Apley, D. W. & Zhu, J. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. J. Royal Stat. Soc. Ser. B: Stat. Methodol. 82, 1059–1086, DOI: 10.1111/rssb.12377 (2020)
  10. [10] Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  11. [11] Hedström, A. et al. Quantus: An explainable AI toolkit for responsible evaluation of neural network explanations and beyond. J. Mach. Learn. Res. 24, 1–11 (2023)
  12. [12] Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K. & Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, vol. 11700 of Lecture Notes in Computer Science (Springer International Publishing, Cham, 2019)
  13. [13] Molnar, C., Casalicchio, G. & Bischl, B. Interpretable machine learning – a brief history, state-of-the-art and challenges. vol. 1323, 417–431, DOI: 10.1007/978-3-030-65965-3_28 (2020). ArXiv:2010.09337 [cs, stat]
  14. [14] Cedro, M. & Martens, D. GraphXAIN: Narratives to Explain Graph Neural Networks. In Guidotti, R., Schmid, U. & Longo, L. (eds.) Explainable Artificial Intelligence, 91–114 (Springer Nature Switzerland, Cham, 2026)
  15. [15] Shrikumar, A., Greenside, P. & Kundaje, A. Learning Important Features Through Propagating Activation Differences (2019). ArXiv:1704.02685 [cs]
  16. [16] Bach, S. et al. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE 10, e0130140, DOI: 10.1371/journal.pone.0130140 (2015)
  17. [17] Smilkov, D., Thorat, N., Kim, B., Viégas, F. & Wattenberg, M. SmoothGrad: removing noise by adding noise (2017). ArXiv:1706.03825 [cs, stat]
  18. [18] Bykov, K., Hedström, A., Nakajima, S. & Höhne, M. M.-C. NoiseGrad: Enhancing Explanations by Introducing Stochasticity to Model Weights (2022). ArXiv:2106.10185 [cs]
  19. [19] Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 618–626 (2017)
  20. [20] Hooker, S., Erhan, D., Kindermans, P.-J. & Kim, B. A Benchmark for Interpretability Methods in Deep Neural Networks (2019). ArXiv:1806.10758 [cs, stat]
  21. [21] Holzinger, A., Langs, G., Denk, H., Zatloukal, K. & Müller, H. Causability and explainability of artificial intelligence in medicine. WIREs Data Min. Knowl. Discov. 9, e1312, DOI: 10.1002/widm.1312 (2019)
  22. [22] Rajkomar, A., Dean, J. & Kohane, I. Machine Learning in Medicine. New Engl. J. Medicine 380, 1347–1358, DOI: 10.1056/NEJMra1814259 (2019)
  23. [23] Hinton, G. Deep Learning—A Technology With the Potential to Transform Health Care. JAMA 320, 1101, DOI: 10.1001/jama.2018.11100 (2018)
  24. [24] Wu, Z., Shen, C. & Hengel, A. v. d. Wider or Deeper: Revisiting the ResNet Model for Visual Recognition (2016). ArXiv:1611.10080 [cs]
  25. [25] Menghani, G. Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Comput. Surv. 55, 1–37, DOI: 10.1145/3578938 (2023)
  26. [26] Secinaro, S., Calandra, D., Secinaro, A., Muthurangu, V. & Biancone, P. The role of artificial intelligence in healthcare: a structured literature review. BMC Med. Informatics Decis. Mak. 21, 125, DOI: 10.1186/s12911-021-01488-9 (2021)
  27. [27] Degerli, A. et al. COVID-19 infection map generation and detection from chest X-ray images. Heal. Inf. Sci. Syst. 9, 15, DOI: 10.1007/s13755-021-00146-8 (2021)
  28. [28] Rahman, T. et al. Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Comput. Biol. Medicine 132, 104319, DOI: 10.1016/j.compbiomed.2021.104319 (2021)
  29. [29] Showkat, S. & Qureshi, S. Efficacy of Transfer Learning-based ResNet models in Chest X-ray image classification for detecting COVID-19 Pneumonia. Chemom. Intell. Lab. Syst. 224, 104534, DOI: 10.1016/j.chemolab.2022.104534 (2022)
  30. [30] Saporta, A. et al. Benchmarking saliency methods for chest X-ray interpretation. Nat. Mach. Intell. 4, 867–878, DOI: 10.1038/s42256-022-00536-x (2022)
  31. [31] Eigen, D., Rolfe, J., Fergus, R. & LeCun, Y. Understanding Deep Architectures using a Recursive Convolutional Network (2014). ArXiv:1312.1847 [cs]
  32. [32] Khan, R. U., Zhang, X., Kumar, R. & Aboagye, E. O. Evaluating the Performance of ResNet Model Based on Image Recognition. In Proceedings of the 2018 International Conference on Computing and Artificial Intelligence, 86–90, DOI: 10.1145/3194452.3194461 (ACM, Chengdu, China, 2018)
  33. [33] Sarwinda, D., Paradisa, R. H., Bustamam, A. & Anggia, P. Deep learning in image classification using residual network (ResNet) variants for detection of colorectal cancer. Procedia Comput. Sci. 179, 423–431, DOI: 10.1016/j.procs.2021.01.025 (2021)
  34. [34] Brigato, L. & Iocchi, L. A close look at deep learning with small data. In 2020 25th international conference on pattern recognition (ICPR), 2490–2497 (IEEE, 2021)
  35. [35] Guo, M. & Du, Y. Classification of Thyroid Ultrasound Standard Plane Images using ResNet-18 Networks. In 2019 IEEE 13th International Conference on Anti-counterfeiting, Security, and Identification (ASID), 324–328, DOI: 10.1109/ICASID.2019.8925267 (IEEE, Xiamen, China, 2019)
  36. [36] Bahri, Y., Dyer, E., Kaplan, J., Lee, J. & Sharma, U. Explaining neural scaling laws. Proc. Natl. Acad. Sci. 121, e2311878121 (2024)
  37. [38] Rajpurkar, P. et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
  38. [39] Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, vol. 33, 590–597 (2019)
  39. [40] Fisher, A., Rudin, C. & Dominici, F. All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res. 20, 1–81 (2019)
  40. [41] Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700–4708 (2017)
  41. [42] Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255 (IEEE, 2009)
  42. [43] Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031 (2019)
  43. [44] Chefer, H., Gur, S. & Wolf, L. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 782–791 (2021)
  44. [45] Kokhlikyan, N. et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896 (2020)
  45. [46] Balogh, G., McLaughlin, N. & Rainer, A. Evaluation of explainable AI localisation performance using relevance F-score. In 25th Irish Machine Vision and Image Processing Conference 2023, 96–103 (Irish Pattern Recognition & Classification Society, 2023)
  46. [47] Kim, B. et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International conference on machine learning, 2668–2677 (PMLR, 2018)
  47. [48] Chen, C. et al. This looks like that: deep learning for interpretable image recognition. Adv. neural information processing systems 32 (2019)
  48. [49] Armato III, S. G. et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. physics 38, 915–931 (2011)