pith. machine review for the scientific record. sign in

arxiv: 2605.14641 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI

Recognition: no theorem link

How to Evaluate and Refine your CAM

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords class attribution mapsCAM evaluationsynthetic datasethigh-resolution explanationsconvolutional neural networksexplainable AIARCC metricRefineCAM
0
0 comments X

The pith

RefineCAM produces higher-resolution attribution maps for CNN decisions by aggregating multiple layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the dual problems of unreliable evaluation for class attribution maps in CNNs and their typically low spatial resolution. It creates a synthetic dataset with exact ground-truth attributions to allow objective comparison of evaluation metrics. From tests on this dataset the authors derive ARCC as a composite metric that more reliably ranks faithful explanations. They then introduce RefineCAM, which combines attribution maps computed at several layers of the network to raise resolution without retraining. Experiments show RefineCAM scores higher than prior methods under the new evaluation protocol.

Core claim

Using a synthetic dataset whose images come with precisely known ground-truth attribution maps, the authors show that standard CAM evaluation metrics can be compared for soundness. They propose ARCC as a composite metric that better identifies faithful explanations than existing single metrics. On this foundation they present RefineCAM, a post-processing technique that aggregates class activation maps from multiple convolutional layers to produce higher-resolution attribution maps. The resulting maps are shown to outperform standard single-layer CAMs when measured by the proposed evaluation on the synthetic data.

What carries the argument

RefineCAM, the aggregation of class activation maps computed independently at several layers of the same convolutional network to increase spatial resolution while preserving decision faithfulness.

If this is right

  • CAM methods can be refined for detailed visual explanations without modifying or retraining the underlying convolutional network.
  • Evaluation of new attribution techniques becomes more objective once a ground-truth dataset is available for calibration.
  • ARCC can be adopted as a standard benchmark score when comparing future CAM variants or explanation algorithms.
  • Higher-resolution maps support finer localization tasks such as identifying which pixels within an object most influence the class score.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthetic evaluation protocol could be applied to other explanation families such as gradient-based or perturbation-based methods to test whether layer aggregation helps them as well.
  • The multi-layer aggregation pattern may transfer to vision transformers by combining attention heads or layers in an analogous way.
  • In practice, RefineCAM could be inserted into existing interpretability pipelines for domains like medical imaging where pixel-level detail matters for trust.

Load-bearing premise

The ground-truth attributions supplied with the synthetic images match the features that actually drive the network's decisions on real photographs.

What would settle it

Human raters viewing real-world images consistently prefer the regions highlighted by a baseline CAM over those highlighted by RefineCAM, or ARCC rankings diverge from human judgments of explanation quality.

Figures

Figures reproduced from arXiv: 2605.14641 by Alessandra Stramiglio, Luca Domeniconi, Michele Lombardi, Samuele Salti.

Figure 1
Figure 1. Figure 1: Results of various metrics computed on a random sample of [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example images (top row) from our proposed synthetic dataset paired [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pearson correlation between the studied metrics and: (a) Cosine Similarity [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of GradCAM++ (center) and its [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visual comparison of attribution maps generated on ImageNet. (a) shows [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
read the original abstract

Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that CAM evaluation is hindered by lack of ground-truth explanations and low-resolution outputs from standard methods. It introduces a synthetic dataset with constructed ground-truth attributions to enable rigorous metric comparison, proposes ARCC as a new composite metric that better identifies faithful explanations, and presents RefineCAM, which aggregates CAMs across multiple network layers to produce high-resolution maps. Results indicate RefineCAM outperforms baselines under the proposed ARCC evaluation.

Significance. If the synthetic ground-truth attributions prove to be a reliable proxy for faithfulness on natural images, the work would supply a much-needed controlled benchmark for CAM methods and a practical technique for higher-resolution explanations. The contribution hinges on whether ARCC and the dataset avoid favoring layer-aggregation heuristics by construction; absent explicit validation against real-image entanglement, the significance remains provisional.

major comments (2)
  1. [Abstract] Abstract: the central claim that RefineCAM 'consistently outperforms existing methods according to the proposed evaluation' rests on ARCC scores computed against a synthetic dataset, yet the abstract supplies no details on how ground-truth attributions are generated (additive patterns, spatial separability, or feature placement rules). Without this, it is impossible to determine whether the dataset systematically advantages multi-layer aggregation over single-layer baselines.
  2. [Evaluation] Evaluation and metric definition: the assertion that ARCC 'more reliably identifies faithful explanations' requires explicit formulas, comparison tables against prior metrics (e.g., deletion/insertion AUC, pointing game), and statistical controls on the synthetic data. The absence of these elements in the manuscript makes the superiority claim load-bearing but unverifiable from the provided description.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'our results show' should reference specific quantitative improvements (e.g., ARCC deltas or table numbers) to allow readers to gauge effect size without reading the full results section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our contributions. We address each major point below and have revised the manuscript to improve verifiability while preserving the core claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that RefineCAM 'consistently outperforms existing methods according to the proposed evaluation' rests on ARCC scores computed against a synthetic dataset, yet the abstract supplies no details on how ground-truth attributions are generated (additive patterns, spatial separability, or feature placement rules). Without this, it is impossible to determine whether the dataset systematically advantages multi-layer aggregation over single-layer baselines.

    Authors: We agree that the abstract, being concise, omitted key details on synthetic ground-truth construction. The dataset uses additive patterns with explicit spatial separability constraints and controlled feature placement to ensure no inherent bias toward layer aggregation; single-layer baselines are evaluated identically. We have revised the abstract to include a brief clause describing these generation rules, directing readers to Section 4.1 for full specification. revision: yes

  2. Referee: [Evaluation] Evaluation and metric definition: the assertion that ARCC 'more reliably identifies faithful explanations' requires explicit formulas, comparison tables against prior metrics (e.g., deletion/insertion AUC, pointing game), and statistical controls on the synthetic data. The absence of these elements in the manuscript makes the superiority claim load-bearing but unverifiable from the provided description.

    Authors: The manuscript already presents the ARCC formula as a weighted composite in Equation (4) and includes comparison tables (Table 3) against deletion/insertion AUC and pointing game. To address the referee's concern about prominence, we have expanded Section 5.2 with explicit formulas, additional side-by-side tables, and statistical controls (variance and significance tests across synthetic configurations). These revisions make the superiority claim directly verifiable without altering results. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's chain introduces an independent synthetic dataset whose ground-truth attributions are generated separately from any CAM method or metric. ARCC is proposed after analyzing existing metrics against this external GT; RefineCAM is defined as explicit layer aggregation. The outperformance result is a direct comparison to the synthetic GT and does not reduce to a fitted parameter, self-definition, or self-citation load-bearing step. No equation or claim equates a prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claims rest on the domain assumption that synthetic attributions serve as valid proxies for real faithfulness; no free parameters or invented entities are described.

axioms (1)
  • domain assumption Synthetic data can simulate real attribution faithfulness
    Invoked to justify using the dataset for metric validation.

pith-pipeline@v0.9.0 · 5442 in / 976 out tokens · 32353 ms · 2026-05-15T05:38:43.919454+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

  1. [1]

    Nature Machine Intelligence5(9), 1006– 1019 (2023)

    Achtibat, R., Dreyer, M., Eisenbraun, I., Bosse, S., Wiegand, T., Samek, W., Lapuschkin, S.: From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence5(9), 1006– 1019 (2023)

  2. [2]

    Bohle, M., Fritz, M., Schiele, B.: Convolutional dynamic alignment networks for interpretableclassifications.In:ProceedingsoftheIEEE/CVFConferenceonCom- puter Vision and Pattern Recognition. pp. 10029–10038 (2021)

  3. [3]

    The Visual Computer 41(10), 7249–7267 (Jan 2025)

    Cai, H., Yang, Y., Tang, Y., Sun, Z., Zhang, W.: Shapley value-based class activa- tion mapping for improved explainability in neural networks. The Visual Computer 41(10), 7249–7267 (Jan 2025). https://doi.org/10.1007/s00371-025-03803-1

  4. [4]

    https://universe.roboflow.com/carddataset/ whereswaldy (2023)

    carddataset: Whereswaldy dataset. https://universe.roboflow.com/carddataset/ whereswaldy (2023)

  5. [5]

    In: 2018 IEEE winter conference on applications of computer vision (WACV)

    Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: Grad- cam++: Generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE winter conference on applications of computer vision (WACV). pp. 839–847. IEEE (2018)

  6. [6]

    Ihongbe, I., Fouad, S., F

    E. Ihongbe, I., Fouad, S., F. Mahmoud, T., Rajasekaran, A., Bhatia, B.: Evalu- ating explainable artificial intelligence (xai) techniques in chest radiology imaging through a human-centered lens. Plos one19(10), e0308758 (2024)

  7. [7]

    In: Proceedings of the IEEE international conference on computer vision

    Fong, R.C., Vedaldi, A.: Interpretable explanations of black boxes by meaningful perturbation. In: Proceedings of the IEEE international conference on computer vision. pp. 3429–3437 (2017)

  8. [8]

    Nature Machine In- telligence2(11), 665–673 (2020)

    Geirhos, R., Jacobsen, J.H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., Wichmann, F.A.: Shortcut learning in deep neural networks. Nature Machine In- telligence2(11), 665–673 (2020)

  9. [9]

    He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

  10. [10]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Hesse, R., Schaub-Meyer, S., Roth, S.: Funnybirds: A synthetic vision dataset for a part-based analysis of explainable ai methods. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3981–3991 (2023)

  11. [11]

    IEEE Transactions on Image Processing30, 5875–5888 (2021)

    Jiang, P.T., Zhang, C.B., Hou, Q., Cheng, M.M., Wei, Y.: Layercam: Exploring hierarchical class activation maps for localization. IEEE Transactions on Image Processing30, 5875–5888 (2021)

  12. [12]

    Kolmogorov, A.N., Castelnuovo, G.: Sur la notion de la moyenne. G. Bardi, tip. della R. Accad. dei Lincei (1930)

  13. [13]

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer:Hierarchicalvisiontransformerusingshiftedwindows.In:Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

  14. [14]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976–11986 (2022) How to Evaluate and Refine your CAM 15

  15. [15]

    Advances in neural information processing systems30(2017)

    Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. Advances in neural information processing systems30(2017)

  16. [16]

    Molnar, C.: Interpretable machine learning. Lulu. com (2020)

  17. [17]

    RISE: Randomized Input Sampling for Explanation of Black-box Models

    Petsiuk, V., Das, A., Saenko, K.: Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421 (2018)

  18. [18]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Poppi, S., Cornia, M., Baraldi, L., Cucchiara, R.: Revisiting the evaluation of class activation mapping for explainability: A novel metric and experimental analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2299–2304 (2021)

  19. [19]

    In: proceedings of the IEEE/CVF winter conference on applications of computer vision

    Ramaswamy,H.G.,etal.:Ablation-cam:Visualexplanationsfordeepconvolutional network via gradient-free localization. In: proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 983–991 (2020)

  20. [20]

    why should i trust you?

    Ribeiro, M.T., Singh, S., Guestrin, C.: " why should i trust you?" explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD interna- tional conference on knowledge discovery and data mining. pp. 1135–1144 (2016)

  21. [21]

    In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S

    Rong, Y., Leemann, T., Borisov, V., Kasneci, G., Kasneci, E.: A consistent and efficient evaluation strategy for attribution methods. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Proceedings of the 39th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 162, pp. 18770–18795....

  22. [22]

    International journal of computer vision115(3), 211–252 (2015)

    Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recog- nition challenge. International journal of computer vision115(3), 211–252 (2015)

  23. [23]

    In- ternational journal of computer vision128, 336–359 (2020)

    Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad- cam: visual explanations from deep networks via gradient-based localization. In- ternational journal of computer vision128, 336–359 (2020)

  24. [24]

    In: International conference on machine learn- ing

    Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: International conference on machine learn- ing. pp. 3145–3153. PMlR (2017)

  25. [25]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional net- works: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)

  26. [26]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  27. [27]

    In: International conference on machine learning

    Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: International conference on machine learning. pp. 3319–3328. PMLR (2017)

  28. [28]

    IEEE Transactions on Neural Networks and Learning Systems32(11), 4793–4813 (2021)

    Tjoa, E., Guan, C.: A survey on explainable artificial intelligence (xai): Toward medical xai. IEEE Transactions on Neural Networks and Learning Systems32(11), 4793–4813 (2021). https://doi.org/10.1109/TNNLS.2020.3027314

  29. [29]

    In: Proceedings of the IEEE/CVF conference on computer vision and pat- tern recognition workshops

    Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., Mardziel, P., Hu, X.: Score-cam: Score-weighted visual explanations for convolutional neural net- works. In: Proceedings of the IEEE/CVF conference on computer vision and pat- tern recognition workshops. pp. 24–25 (2020)

  30. [30]

    In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13

    Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13. pp. 818–833. Springer (2014)

  31. [31]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2929 (2016)

  32. [32]

    Displays76, 102339 (2023)

    Zhou, X., Li, Y., Cao, G., Cao, W.: Master-cam: Multi-scale fusion guided by master map for high-quality class activation maps. Displays76, 102339 (2023)