Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality
Pith reviewed 2026-05-12 04:11 UTC · model grok-4.3
The pith
Scaling up vision models does not consistently improve the quality of their post-hoc explanations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. Pretraining typically improves predictive performance and increases dependence on learned weights, yet does not consistently increase localisation scores. Models can reach strong predictive performance while localisation precision remains near zero.
What carries the argument
The localisation metrics Relevance Rank Accuracy, which measures how well the highest-ranked attributions fall inside the ground-truth class mask, and Dual-Polarity Precision, which measures how well positive attributions fall inside the mask and negative attributions fall outside it.
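The paper's exact formulas are not reproduced in this review, but the idea behind a dual-polarity score can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the equal weighting of the two polarities, and the handling of an explanation with no mass in one polarity are assumptions of this sketch.

```python
import numpy as np

def dual_polarity_precision_sketch(attribution: np.ndarray, mask: np.ndarray) -> float:
    """Illustrative dual-polarity localisation score (not the paper's exact metric).

    `attribution` is a signed attribution map of shape (H, W); `mask` is a boolean
    ground-truth segmentation mask of the same shape. The score rewards positive
    attribution mass inside the mask and negative attribution mass outside it.
    """
    pos = np.clip(attribution, 0.0, None)   # positive evidence for the class
    neg = np.clip(-attribution, 0.0, None)  # magnitude of negative evidence

    pos_total, neg_total = pos.sum(), neg.sum()
    # Fraction of positive mass that falls inside the class mask.
    pos_inside = pos[mask].sum() / pos_total if pos_total > 0 else 0.0
    # Fraction of negative mass that falls outside the class mask.
    neg_outside = neg[~mask].sum() / neg_total if neg_total > 0 else 0.0

    # Equal weighting of the two polarities is an assumption of this sketch.
    return 0.5 * (pos_inside + neg_outside)

# Tiny usage example with random stand-in data.
rng = np.random.default_rng(0)
attr = rng.normal(size=(8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
print(dual_polarity_precision_sketch(attr, mask))
```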
If this is right
- Model selection for safety-sensitive uses cannot rely on scale alone and must include explicit checks of explanation quality.
- High predictive accuracy does not ensure that a model bases its decisions on the annotated object regions.
- Pretraining benefits for explanations are separate from its benefits for accuracy and must be verified independently.
- Smaller models remain competitive options when explanation alignment is a priority.
- Explainability should be assessed as a distinct criterion during deployment decisions.
Where Pith is reading between the lines
- Scaling trends observed for accuracy may not apply to interpretability, so future scaling studies should track explanation metrics separately.
- Current post-hoc methods may require adaptation to maintain effectiveness as model complexity grows.
- The results motivate targeted comparisons of architectural choices other than size on explanation behaviour.
- Similar patterns could appear in non-vision domains where scale is assumed to improve trustworthiness.
Load-bearing premise
Alignment with ground-truth segmentation masks, as measured by the two localisation metrics, provides a valid general indicator of explanation quality for downstream safety or human understanding.
What would settle it
A new experiment on additional datasets or with further explanation methods in which larger models produce statistically significantly higher scores on both Relevance Rank Accuracy and Dual-Polarity Precision would falsify the central claim.
original abstract
Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a comprehensive empirical evaluation of 11 vision models from ResNet, DenseNet, and Vision Transformer families with varying depths and parameter counts on three datasets equipped with ground-truth segmentation masks. Using five post-hoc XAI methods, it measures explanation quality via Relevance Rank Accuracy and the newly proposed Dual-Polarity Precision metric, which assesses positive attributions within the mask and negative attributions outside it. The main conclusion is that increasing model scale does not consistently enhance localisation-based explanation quality, with smaller models frequently performing on par or better, and pretraining improving predictive accuracy but not reliably boosting localisation scores. The study also identifies instances where high predictive performance is accompanied by near-zero localisation, underscoring the need for explicit explainability assessment.
Significance. If these findings hold, they challenge the prevailing assumption in the field that larger, more accurate models inherently produce better explanations. This has significant implications for the deployment of AI in safety-critical domains, where both performance and interpretability are crucial. The broad scope across architectures, datasets, and methods strengthens the generalizability of the results. The introduction of Dual-Polarity Precision provides a new tool for the community. The paper's own acknowledgment of the disconnect between performance and localisation adds nuance and prevents overinterpretation. The stress-test concern about mask validity does not land as a reason to doubt the central claim, because the work is explicitly framed around localisation-based metrics and already flags the performance-localisation disconnect.
major comments (3)
- The claim that scaling does not improve explanation quality 'in most statistical comparisons' is load-bearing for the title and abstract conclusions. The specific tests (e.g., paired t-tests or Wilcoxon), effect sizes, and any multiple-comparison corrections applied across the 11 models × 5 methods × 3 datasets × 2 metrics grid should be reported explicitly with raw distributions or supplementary tables to allow assessment of robustness.
- Dual-Polarity Precision is introduced as a contribution that captures both polarities of attribution. Its exact formula, normalization, and handling of zero-attribution cases (likely in the methods) must be given as an equation, together with a direct comparison or correlation analysis against Relevance Rank Accuracy on the same explanations to demonstrate incremental value (a minimal illustration of such a check follows this list).
- The observation of strong predictive performance with near-zero localisation is highlighted in the abstract. A quantitative breakdown by dataset and model family (e.g., percentage of cases or average localisation scores for top-accuracy models) would strengthen the interpretation that performance metrics alone are insufficient.
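To make the second major comment concrete, the sketch below shows one way a correlation and agreement analysis between the two metrics could look, using synthetic stand-in scores. The variable names, the median-split agreement rate, and all numbers are illustrative assumptions, not results or procedures from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in scores: one value per generated explanation.
rng = np.random.default_rng(0)
dpp_scores = rng.uniform(0.0, 1.0, size=500)
rra_scores = np.clip(dpp_scores + rng.normal(0.0, 0.2, size=500), 0.0, 1.0)

# Linear association between the two localisation metrics.
r, p = pearsonr(dpp_scores, rra_scores)

# Simple agreement rate: do both metrics place an explanation on the same
# side of their respective medians?
agree = np.mean(
    (dpp_scores > np.median(dpp_scores)) == (rra_scores > np.median(rra_scores))
)
print(f"Pearson r = {r:.2f} (p = {p:.2g}), median-split agreement = {agree:.2f}")
```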
minor comments (3)
- Figure captions should explicitly state the number of runs or seeds used and whether error bars represent standard deviation or standard error to improve clarity of the scaling trends.
- The reference list should confirm that Arras et al. (2022) and all five XAI method citations are complete and consistently formatted.
- A short paragraph in the discussion on potential dataset-specific biases in the segmentation masks (e.g., annotation quality or class-specific challenges) would aid readers in generalizing the results.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the constructive suggestions for improvement. We address each major comment below with clarifications and revisions.
point-by-point responses
- Referee: The claim that scaling does not improve explanation quality 'in most statistical comparisons' is load-bearing for the title and abstract conclusions. The specific tests (e.g., paired t-tests or Wilcoxon), effect sizes, and any multiple-comparison corrections applied across the 11 models × 5 methods × 3 datasets × 2 metrics grid should be reported explicitly with raw distributions or supplementary tables to allow assessment of robustness.
Authors: We agree that greater transparency on the statistical procedures is warranted. The manuscript already employs Wilcoxon signed-rank tests (chosen because localisation scores are not normally distributed) with Bonferroni correction across the full grid of comparisons, and reports effect sizes via rank-biserial correlation. To strengthen the claim, the revised version will add an explicit methods subsection detailing these choices and will include supplementary tables with all raw p-values, effect sizes, and per-comparison score distributions (an illustrative sketch of such a comparison appears after these responses). revision: yes
- Referee: Dual-Polarity Precision is introduced as a contribution that captures both polarities of attribution. Its exact formula, normalization, and handling of zero-attribution cases (likely in the methods) must be given as an equation, together with a direct comparison or correlation analysis against Relevance Rank Accuracy on the same explanations to demonstrate incremental value.
Authors: We accept this request for formalization. The revised manuscript will present the Dual-Polarity Precision formula as an explicit equation in the methods, including the normalization (division by total attribution mass) and the convention for zero-attribution cases (score set to 0.5 when both polarities are absent). We will also add a supplementary correlation analysis (Pearson r and agreement rates) between Dual-Polarity Precision and Relevance Rank Accuracy across all generated explanations to quantify its added value. revision: yes
- Referee: The observation of strong predictive performance with near-zero localisation is highlighted in the abstract. A quantitative breakdown by dataset and model family (e.g., percentage of cases or average localisation scores for top-accuracy models) would strengthen the interpretation that performance metrics alone are insufficient.
Authors: We will incorporate this quantitative support. The revised results section will include a new table (or supplementary table) reporting, for each dataset and model family, (i) the percentage of top-quartile accuracy models that exhibit near-zero localisation (<0.1) and (ii) the mean localisation scores for the highest-accuracy models. This will provide concrete numbers backing the abstract statement. revision: yes
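As a concrete illustration of the procedure named in the first response, the sketch below runs a paired Wilcoxon signed-rank test between a smaller and a larger model's per-image localisation scores, applies a Bonferroni-corrected threshold over the comparison grid, and computes a matched-pairs rank-biserial effect size. The scores are synthetic stand-ins and the exact number of pairwise comparisons in the paper may differ; this is not the authors' code.

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

# Synthetic stand-in data: paired per-image localisation scores for two models.
rng = np.random.default_rng(0)
small_model = rng.uniform(0.2, 0.8, size=200)
large_model = small_model + rng.normal(0.0, 0.05, size=200)

# Paired, non-parametric comparison.
stat, p_value = wilcoxon(large_model, small_model)

# Bonferroni correction over an 11 models x 5 methods x 3 datasets x 2 metrics grid
# (an assumption about how the correction is sized; the paper may count comparisons differently).
n_comparisons = 11 * 5 * 3 * 2
alpha_corrected = 0.05 / n_comparisons

# Matched-pairs rank-biserial correlation: (W+ - W-) / (W+ + W-).
diffs = large_model - small_model
nonzero = diffs[diffs != 0]
ranks = rankdata(np.abs(nonzero))
w_plus = ranks[nonzero > 0].sum()
w_minus = ranks[nonzero < 0].sum()
rank_biserial = (w_plus - w_minus) / (w_plus + w_minus)

significant = p_value < alpha_corrected
print(f"p = {p_value:.3g}, corrected alpha = {alpha_corrected:.2e}, "
      f"rank-biserial r = {rank_biserial:+.2f}, significant = {significant}")
```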
Circularity Check
No circularity: purely empirical evaluation with direct measurements
full rationale
This paper performs an empirical comparison of explanation quality across model scales using two localisation metrics (one cited, one newly proposed) applied to generated attributions on datasets with segmentation masks. No mathematical derivations, parameter fits, or predictions are claimed; results consist of statistical comparisons of measured scores. No self-citations are load-bearing for any central claim, and the analysis does not reduce any quantity to a definition or fit by construction. The study is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: localisation to ground-truth segmentation masks is a meaningful proxy for explanation quality
invented entities (1)
- Dual-Polarity Precision (no independent evidence)
Reference graph
Works this paper leans on
- [1] Arras, L., Osman, A. & Samek, W. Ground Truth Evaluation of Neural Network Explanations with CLEVR-XAI. Inf. Fusion 81, 14–40, DOI: 10.1016/j.inffus.2021.11.008 (2022). ArXiv:2003.07258 [cs, eess]
- [2] Morch, N. J. S. et al. Visualization of neural networks using saliency maps. In Proceedings of ICNN'95 - International Conference on Neural Networks, vol. 4, 2085–2090, DOI: 10.1109/ICNN.1995.488997 (IEEE, Perth, WA, Australia, 1995)
- [3] Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M. & Hansen, K. How to explain individual classification decisions (2010). ArXiv preprint
- [4] Simonyan et al. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps (2013). ArXiv:1312.6034 [cs]
- [5] Lundberg, S. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions (2017). ArXiv:1705.07874 [cs, stat]
- [6] Dobrzycka, M. et al. Machine learning-based identification of small RNA signatures in aqueous humor as a step toward precision diagnosis of glaucoma. Annals of Medicine 57, 2568119, DOI: 10.1080/07853890.2025.2568119 (2025). PMID: 41047921
- [7] Guidotti, R. et al. A Survey of Methods for Explaining Black Box Models. ACM Comput. Surv. 51, 1–42, DOI: 10.1145/3236009 (2019)
- [8] Ribeiro, M. T., Singh, S. & Guestrin, C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier (2016). ArXiv:1602.04938 [cs, stat]
- [9] Apley, D. W. & Zhu, J. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. J. Royal Stat. Soc. Ser. B: Stat. Methodol. 82, 1059–1086, DOI: 10.1111/rssb.12377 (2020)
- [10] Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929 (2020)
- [11] Hedström, A. et al. Quantus: An explainable AI toolkit for responsible evaluation of neural network explanations and beyond. J. Mach. Learn. Res. 24, 1–11 (2023)
- [12] Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K. & Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, vol. 11700 of Lecture Notes in Computer Science (Springer International Publishing, Cham, 2019)
- [13] Molnar, C., Casalicchio, G. & Bischl, B. Interpretable machine learning – a brief history, state-of-the-art and challenges. vol. 1323, 417–431, DOI: 10.1007/978-3-030-65965-3_28 (2020). ArXiv:2010.09337 [cs, stat]
- [14] Cedro, M. & Martens, D. GraphXAIN: Narratives to Explain Graph Neural Networks. In Guidotti, R., Schmid, U. & Longo, L. (eds.) Explainable Artificial Intelligence, 91–114 (Springer Nature Switzerland, Cham, 2026)
- [15] Shrikumar, A., Greenside, P. & Kundaje, A. Learning Important Features Through Propagating Activation Differences (2019). ArXiv:1704.02685 [cs]
- [16] Bach, S. et al. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE 10, e0130140, DOI: 10.1371/journal.pone.0130140 (2015)
- [17] Smilkov, D., Thorat, N., Kim, B., Viégas, F. & Wattenberg, M. SmoothGrad: removing noise by adding noise (2017). ArXiv:1706.03825 [cs, stat]
- [18] Bykov, K., Hedström, A., Nakajima, S. & Höhne, M. M.-C. NoiseGrad: Enhancing Explanations by Introducing Stochasticity to Model Weights (2022). ArXiv:2106.10185 [cs]
- [19] Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626 (2017)
- [20] Hooker, S., Erhan, D., Kindermans, P.-J. & Kim, B. A Benchmark for Interpretability Methods in Deep Neural Networks (2019). ArXiv:1806.10758 [cs, stat]
- [21] Holzinger, A., Langs, G., Denk, H., Zatloukal, K. & Müller, H. Causability and explainability of artificial intelligence in medicine. WIREs Data Min. Knowl. Discov. 9, e1312, DOI: 10.1002/widm.1312 (2019)
- [22] Rajkomar, A., Dean, J. & Kohane, I. Machine Learning in Medicine. New Engl. J. Medicine 380, 1347–1358, DOI: 10.1056/NEJMra1814259 (2019)
- [23] Hinton, G. Deep Learning—A Technology With the Potential to Transform Health Care. JAMA 320, 1101, DOI: 10.1001/jama.2018.11100 (2018)
- [24] Wu, Z., Shen, C. & Hengel, A. v. d. Wider or Deeper: Revisiting the ResNet Model for Visual Recognition (2016). ArXiv:1611.10080 [cs]
- [25] Menghani, G. Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. ACM Comput. Surv. 55, 1–37, DOI: 10.1145/3578938 (2023)
- [26] Secinaro, S., Calandra, D., Secinaro, A., Muthurangu, V. & Biancone, P. The role of artificial intelligence in healthcare: a structured literature review. BMC Med. Informatics Decis. Mak. 21, 125, DOI: 10.1186/s12911-021-01488-9 (2021)
- [27] Degerli, A. et al. COVID-19 infection map generation and detection from chest X-ray images. Heal. Inf. Sci. Syst. 9, 15, DOI: 10.1007/s13755-021-00146-8 (2021)
- [28] Rahman, T. et al. Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Comput. Biol. Medicine 132, 104319, DOI: 10.1016/j.compbiomed.2021.104319 (2021)
- [29] Showkat, S. & Qureshi, S. Efficacy of Transfer Learning-based ResNet models in Chest X-ray image classification for detecting COVID-19 Pneumonia. Chemom. Intell. Lab. Syst. 224, 104534, DOI: 10.1016/j.chemolab.2022.104534 (2022)
- [30] Saporta, A. et al. Benchmarking saliency methods for chest X-ray interpretation. Nat. Mach. Intell. 4, 867–878, DOI: 10.1038/s42256-022-00536-x (2022)
- [31] Eigen, D., Rolfe, J., Fergus, R. & LeCun, Y. Understanding Deep Architectures using a Recursive Convolutional Network (2014). ArXiv:1312.1847 [cs]
- [32] Khan, R. U., Zhang, X., Kumar, R. & Aboagye, E. O. Evaluating the Performance of ResNet Model Based on Image Recognition. In Proceedings of the 2018 International Conference on Computing and Artificial Intelligence, 86–90, DOI: 10.1145/3194452.3194461 (ACM, Chengdu, China, 2018)
- [33] Sarwinda, D., Paradisa, R. H., Bustamam, A. & Anggia, P. Deep learning in image classification using residual network (ResNet) variants for detection of colorectal cancer. Procedia Comput. Sci. 179, 423–431, DOI: 10.1016/j.procs.2021.01.025 (2021)
- [34] Brigato, L. & Iocchi, L. A close look at deep learning with small data. In 2020 25th International Conference on Pattern Recognition (ICPR), 2490–2497 (IEEE, 2021)
- [35] Guo, M. & Du, Y. Classification of Thyroid Ultrasound Standard Plane Images using ResNet-18 Networks. In 2019 IEEE 13th International Conference on Anti-counterfeiting, Security, and Identification (ASID), 324–328, DOI: 10.1109/ICASID.2019.8925267 (IEEE, Xiamen, China, 2019)
- [36] Bahri, Y., Dyer, E., Kaplan, J., Lee, J. & Sharma, U. Explaining neural scaling laws. Proc. Natl. Acad. Sci. 121, e2311878121 (2024)
- [38]
- [39] Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (2019)
- [40] Fisher, A., Rudin, C. & Dominici, F. All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res. 20, 1–81 (2019)
- [41] Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708 (2017)
- [42] Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (IEEE, 2009)
- [43] Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031 (2019)
- [44]
- [45]
- [46] Balogh, G., McLaughlin, N. & Rainer, A. Evaluation of explainable AI localisation performance using relevance F-score. In 25th Irish Machine Vision and Image Processing Conference 2023, 96–103 (Irish Pattern Recognition & Classification Society, 2023)
- [47] Kim, B. et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, 2668–2677 (PMLR, 2018)
- [48] Chen, C. et al. This looks like that: deep learning for interpretable image recognition. Adv. Neural Information Processing Systems 32 (2019)
- [49] Armato III, S. G. et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Physics 38, 915–931 (2011)