pith. machine review for the scientific record.

arxiv: 2605.12816 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 2 Lean theorem links

AGOP as Explanation: From Feature Learning to Per-Sample Attribution in Image Classifiers

Raj Kiran Gupta Katakam


Pith reviewed 2026-05-14 20:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords attribution methods · explainable AI · AGOP · feature learning · image classification · gradient methods · saliency maps · XAI benchmarks

The pith

The Average Gradient Outer Product matrix from training data supplies a prior that improves per-sample attribution maps in image classifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the Average Gradient Outer Product, already known to align with learned weight matrices during feature learning, can be repurposed after training to explain individual predictions. It defines AGOP-Weighted attribution as the per-sample gradient scaled by the square root of the normalized AGOP diagonal, which reduces noise from unimportant pixels while amplifying those that matter consistently across the training set. On the XAI-TRIS benchmark, AGOP-Weighted yields 44 percent higher mean intersection-over-union than Integrated Gradients on linear tasks, while the zero-cost AGOP-Global variant, which uses the diagonal alone, yields seven times higher mIoU on multiplicative tasks where Integrated Gradients falls below random. The gains carry over to ResNet-18 on the photorealistic CLEVR-XAI benchmark, and the quality of the diagonal prior keeps rising even after classification accuracy plateaus.
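In symbols, reconstructed from the abstract's description (the paper's own notation may differ): with network output f and n training samples,

    % AGOP over the training set
    M = \frac{1}{n} \sum_{i=1}^{n} \nabla_x f(x_i)\, \nabla_x f(x_i)^{\top}

    % AGOP-Weighted and AGOP-Global attribution maps
    A_{\mathrm{weighted}}(x) = \sqrt{\operatorname{diag}(M) / \max \operatorname{diag}(M)} \odot \nabla_x f(x),
    \qquad A_{\mathrm{global}} = \operatorname{diag}(M)

so the prior rescales each pixel's gradient by how consistently that pixel carried gradient mass during training.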

Core claim

The Average Gradient Outer Product matrix M computed over the training distribution supplies a fixed prior diag(M) whose normalized square root, when multiplied into a test-sample gradient, produces attribution maps that agree more closely with pixel-level ground truth than Integrated Gradients, SmoothGrad, GradCAM, or VanillaGrad; the companion AGOP-Global map, diag(M) itself, requires only a disk lookup at inference time.

What carries the argument

The diagonal of the AGOP matrix M, used either as a weight on the per-sample gradient or directly as a saliency map, serves as a training-derived importance prior.
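A minimal sketch of how the two variants could be applied at test time, assuming a precomputed diagonal agop_diag stored during training; the names and the top-class-logit scoring convention are illustrative, not the paper's code:

    import torch

    def agop_weighted(model, x, agop_diag):
        # Per-sample gradient of the top-class logit w.r.t. the input,
        # scaled by sqrt(diag(M) / max(diag(M))).
        x = x.clone().requires_grad_(True)
        score = model(x).max(dim=1).values.sum()   # one scoring convention
        grad, = torch.autograd.grad(score, x)
        prior = torch.sqrt(agop_diag / agop_diag.max())
        return grad * prior.view_as(grad[0])       # broadcast over the batch

    def agop_global(agop_diag, input_shape):
        # diag(M) itself as a saliency map: identical for every test
        # sample, hence a disk lookup at inference time.
        return agop_diag.view(input_shape)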

If this is right

  • AGOP-Global attribution delivers pixel-level explanations at zero additional inference cost after a single training-time accumulation.
  • The same prior improves explanation quality on both low-resolution synthetic images and higher-resolution photorealistic scenes with standard architectures.
  • diag(M) quality as an attribution prior continues to increase after the network's classification accuracy has stopped rising.
  • Gradient-based attribution can be strengthened by a global training statistic without retraining or architectural changes.
  • GradCAM and similar methods lose spatial fidelity on small images while the AGOP variants remain unaffected.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attribution methods may benefit more from incorporating global training statistics than from purely local gradient operations at test time.
  • The shared matrix structure between feature learning and explanation suggests that improving one could automatically improve the other.
  • The approach could be tested for token-level attribution in sequence models by accumulating analogous outer products over training text.
  • Practitioners could accumulate the AGOP diagonal as a byproduct of ordinary training to obtain ready-made explanation tools, as sketched below.
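The last bullet is concrete enough to sketch. Assuming diag(M) is accumulated from input gradients of the top-class logit (the paper's exact accumulation target and hook mechanics are not spelled out here), a PyTorch-style version might look like:

    import torch

    class AGOPDiagAccumulator:
        # Accumulates diag(M), the per-pixel mean squared input gradient,
        # as a byproduct of ordinary training steps. Illustrative sketch.
        def __init__(self):
            self.diag_sum, self.count = None, 0

        def update(self, model, x):
            x = x.detach().clone().requires_grad_(True)
            score = model(x).max(dim=1).values.sum()
            grad, = torch.autograd.grad(score, x)
            sq = grad.pow(2).sum(dim=0)   # diagonal of the gradient outer product
            self.diag_sum = sq if self.diag_sum is None else self.diag_sum + sq
            self.count += x.shape[0]

        def diag(self):
            return self.diag_sum / self.count   # diag(M), ready to store to disk

Calling update once per batch and saving diag() at the end of training yields the prior that AGOP-Global later reads back with a single disk lookup.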

Load-bearing premise

The AGOP matrix derived from the training distribution supplies an unbiased prior that reliably reduces gradient noise for test samples even when the test distribution differs from training.

What would settle it

Compute mIoU of AGOP-Weighted attributions on a test set deliberately drawn from a visibly shifted distribution; if performance drops below that of plain VanillaGrad, the training prior introduces systematic error.
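For concreteness, one standard way to score an attribution map against a pixel-level mask is to binarize the map at the mask's own pixel count and take intersection over union; the benchmarks may specify a different thresholding protocol, so treat this as a sketch:

    import torch

    def attribution_iou(attr, mask):
        # Binarize |attr| at its top-k pixels, k = number of ground-truth
        # pixels, then compute IoU against the binary mask.
        k = int(mask.sum())
        thresh = attr.abs().flatten().topk(k).values.min()
        pred = (attr.abs() >= thresh).float()
        inter = (pred * mask).sum()
        union = ((pred + mask) > 0).float().sum()
        return (inter / union).item()

Running this over a deliberately shifted test set for both AGOP-Weighted and VanillaGrad would make the falsification test above directly computable.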

Original abstract

The Average Gradient Outer Product (AGOP) governs feature learning in neural networks: the Neural Feature Ansatz states that weight Gram matrices at each layer align with the corresponding AGOP matrices computed over the training distribution. We ask a complementary question: can this same quantity serve as a post-hoc attribution method for explaining individual predictions? We introduce AGOP-Weighted: a novel attribution method that multiplies the per-sample gradient by sqrt(diag(M) / max diag(M)), a training-distribution prior that suppresses gradient noise and amplifies consistently important pixels -- a combination not present in any prior attribution method. We formalise two companion variants -- AGOP-Local (per-sample gradient, equivalent to VanillaGrad) and AGOP-Global (diag(M) directly as a zero-cost saliency map) -- and implement an efficient training-time accumulation hook; AGOP-Global then requires zero inference cost (disk lookup) while AGOP-Weighted requires only a single gradient pass. We conduct the first rigorous comparison of AGOP attribution against Integrated Gradients (IG), SmoothGrad, GradCAM, and VanillaGrad across two benchmarks with pixel-level ground truth: (i) the synthetic XAI-TRIS benchmark (four classification scenarios, 8x8 images, CNN8by8) and (ii) the photorealistic CLEVR-XAI benchmark (ResNet-18 fine-tuned from ImageNet). AGOP-Weighted achieves 44% higher mIoU than IG on linear tasks; AGOP-Global achieves 7x higher mIoU than IG on multiplicative tasks (where IG falls below random) at zero inference cost. Both findings generalise to ResNet-18 on CLEVR-XAI (+18% and +37% respectively). We further show that GradCAM fails on small-resolution images due to spatial resolution collapse, and that diag(M) quality improves monotonically throughout training even after classification accuracy has plateaued.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes using the Average Gradient Outer Product (AGOP) matrix, precomputed from the training distribution, to derive post-hoc attribution methods for image classifiers. Building on the Neural Feature Ansatz, it introduces AGOP-Weighted (per-sample gradient scaled by sqrt(diag(M)/max(diag(M)))), AGOP-Local (equivalent to VanillaGrad), and AGOP-Global (diag(M) as a zero-cost saliency map). An efficient training-time accumulation hook is described. Evaluations on the XAI-TRIS (CNN8by8, four scenarios) and CLEVR-XAI (ResNet-18) benchmarks, both with pixel-level ground truth, report that AGOP-Weighted yields 44% higher mIoU than Integrated Gradients on linear tasks and AGOP-Global yields 7x higher mIoU on multiplicative tasks (where IG falls below random), with the gains generalizing to ResNet-18 on CLEVR-XAI (+18% and +37%, respectively). Additional results note GradCAM's failure on small-resolution images and monotonic improvement of diag(M) quality during training even after accuracy plateaus.

Significance. If the central empirical claims hold after verification, the work provides a theoretically grounded, low-cost attribution technique that leverages existing training statistics to improve explanation fidelity over standard gradient-based methods. The zero-inference-cost AGOP-Global variant and the training hook are practical strengths, as is the demonstration that AGOP quality continues to improve post-convergence. The results on benchmarks with explicit pixel ground truth offer falsifiable, quantitative evidence linking feature-learning quantities to per-sample attributions.

major comments (2)
  1. [Abstract and evaluation sections] The central performance claims (44% mIoU gain on linear tasks, 7x on multiplicative tasks, and ResNet-18 generalization) rest on the fixed AGOP prior suppressing gradient noise without introducing systematic distortion when test samples exhibit distribution shift relative to training (different object counts, positions, or combinations in CLEVR-XAI). The evaluation does not appear to enforce or quantify strong shifts in the reported splits, so the reported gains could partly reflect prior alignment rather than explanatory fidelity; a concrete test (e.g., controlled shift experiments or per-sample error analysis) is needed to substantiate the assumption.
  2. [Abstract and results] Soundness of the mIoU numbers requires explicit reporting of data splits, number of runs, statistical significance tests, and whether any normalization constants in the sqrt(diag(M)/max(diag(M))) scaling were tuned after seeing test results. Without these, it is impossible to rule out post-hoc tuning or split leakage affecting the headline comparisons against IG, SmoothGrad, GradCAM, and VanillaGrad.
minor comments (2)
  1. [Methods] Clarify the exact definition and accumulation procedure for the AGOP matrix M in the methods section, including whether it is computed only on correctly classified training samples or the full set.
  2. [Abstract] The statement that AGOP-Global requires 'zero inference cost' should note the one-time training-time cost and storage requirement for the precomputed diag(M).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on robustness under distribution shift and experimental transparency. We address both points below and will revise the manuscript accordingly to strengthen the claims.

Point-by-point responses
  1. Referee: [Abstract and evaluation sections] The central performance claims (44% mIoU gain on linear tasks, 7x on multiplicative tasks, and ResNet-18 generalization) rest on the fixed AGOP prior suppressing gradient noise without introducing systematic distortion when test samples exhibit distribution shift relative to training (different object counts, positions, or combinations in CLEVR-XAI). The evaluation does not appear to enforce or quantify strong shifts in the reported splits, so the reported gains could partly reflect prior alignment rather than explanatory fidelity; a concrete test (e.g., controlled shift experiments or per-sample error analysis) is needed to substantiate the assumption.

    Authors: We agree that quantifying distribution shift is valuable for validating that gains reflect explanatory fidelity rather than prior alignment. CLEVR-XAI incorporates shifts via varying object counts, positions, and combinations between train and test splits by design, and XAI-TRIS uses distinct scenarios. However, we did not explicitly measure shift magnitude or run controlled tests. In revision, we will add a controlled shift experiment on XAI-TRIS (varying object positions/counts in held-out test sets) and per-sample error analysis correlating attribution mIoU with shift indicators. This will be reported in a new subsection. revision: yes

  2. Referee: [Abstract and results] Soundness of the mIoU numbers requires explicit reporting of data splits, number of runs, statistical significance tests, and whether any normalization constants in the sqrt(diag(M)/max(diag(M))) scaling were tuned after seeing test results. Without these, it is impossible to rule out post-hoc tuning or split leakage affecting the headline comparisons against IG, SmoothGrad, GradCAM, and VanillaGrad.

    Authors: We fully agree on the importance of these details for reproducibility and soundness. The splits follow the original XAI-TRIS and CLEVR-XAI protocols with no train-test leakage. We used 5 independent runs (different seeds for training and AGOP accumulation), reporting mean mIoU with standard deviations; significance was evaluated with paired t-tests (p<0.01 for key gains). The scaling factor uses max(diag(M)) computed solely on the training distribution, with no post-hoc tuning on test data. In the revised manuscript we will insert a dedicated 'Experimental Setup' subsection explicitly stating the splits, run count, statistical tests, and confirmation of no test-set tuning. revision: yes
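The paired t-test the authors describe is straightforward to reproduce once per-seed mIoU means are published; this sketch uses illustrative numbers, not the paper's:

    from scipy import stats

    # mean mIoU per seed, AGOP-Weighted vs Integrated Gradients (illustrative)
    agop = [0.61, 0.59, 0.63, 0.60, 0.62]
    ig = [0.42, 0.44, 0.41, 0.43, 0.42]

    t, p = stats.ttest_rel(agop, ig)   # paired: same seeds, same splits
    print(f"t = {t:.2f}, p = {p:.4f}")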

Circularity Check

0 steps flagged

No significant circularity: the AGOP prior is a precomputed input, and the performance claims rest on independent empirical comparisons.

Full rationale

The paper defines AGOP from the training distribution as a fixed prior (via training-time accumulation hook) and uses it to weight per-sample gradients or as a global saliency map. Attribution performance is measured via mIoU against pixel-level ground truth on held-out test sets from XAI-TRIS and CLEVR-XAI, with explicit comparisons to IG, SmoothGrad, GradCAM and VanillaGrad. No derivation step reduces a claimed result to its own inputs by construction; the Neural Feature Ansatz is invoked as background rather than a load-bearing self-citation that forces the attribution gains. The weighting formula is an explicit design choice, not a fitted parameter renamed as prediction. Distribution-shift concerns affect correctness but do not create circularity in the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the Neural Feature Ansatz linking weight Gram matrices to AGOP matrices (treated as background) and on the design choice of the sqrt(diag(M)/max) scaling factor as a noise-suppression prior. No new physical entities are postulated.

axioms (1)
  • domain assumption · Neural Feature Ansatz: weight Gram matrices at each layer align with the corresponding AGOP matrices computed over the training distribution
    Invoked as the foundation for repurposing AGOP from feature learning to attribution; stated in the opening sentence of the abstract.
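Written out (a standard rendering of the ansatz; the paper's layer-wise notation may differ), with W_l the weights of layer l and h_l its input:

    W_l^{\top} W_l \;\propto\; \frac{1}{n} \sum_{i=1}^{n} \nabla_{h_l} f(x_i)\, \nabla_{h_l} f(x_i)^{\top}

For the first layer, h_1 = x, so the layer-1 AGOP is exactly the matrix M that the attribution methods reuse.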

pith-pipeline@v0.9.0 · 5655 in / 1492 out tokens · 64656 ms · 2026-05-14T20:02:47.287810+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: Proceedings of ICML 2017, 2017, pp. 3319–3328.
  2. [2] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of ICCV 2017, 2017, pp. 618–626.
  3. [3] D. Smilkov, N. Thorat, B. Kim, F. Viégas, M. Wattenberg, SmoothGrad: Removing noise by adding noise, in: ICML 2017 Workshop on Visualization for Deep Learning, 2017. ArXiv:1706.03825.
  4. [4] A. Radhakrishnan, D. Beaglehole, P. Pandit, M. Belkin, Mechanism for feature learning in neural networks and backpropagation-free machine learning models, Science 383 (2024) 1461–1467.
  5. [5] D. Beaglehole, A. Radhakrishnan, P. Pandit, M. Belkin, Mechanism of feature learning in convolutional neural networks, arXiv preprint arXiv:2309.00570 (2024).
  6. [6] B. Clark, R. Wilming, S. Haufe, XAI-TRIS: Non-linear image benchmarks to quantify false positive post-hoc attribution of feature importance, Machine Learning 113 (2024) 6871–6910.
  7. [7] L. Arras, A. Osman, W. Samek, CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations, Information Fusion 81 (2022) 14–40.
  8. [8] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, in: ICLR 2014 Workshop, 2014.
  9. [9] A. Chattopadhyay, A. Sarkar, P. Howlader, V. N. Balasubramanian, Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks, in: Proceedings of WACV 2018, 2018, pp. 839–847.
  10. [10] A. Radhakrishnan, et al., xRFM: Accurate, scalable, and interpretable feature learning models for tabular data, in: Workshop on AI for Time Series and Dynamic Data (AITD) at NeurIPS 2025, 2025. ArXiv:2508.10053.
  11. [11] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of ACM KDD 2016, 2016, pp. 1135–1144.
  12. [12] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 2017, pp. 4765–4774.
  13. [13] C. Agarwal, S. Krishna, E. Saxena, M. Pawelczyk, N. Johnson, I. Puri, M. Zitnik, H. Lakkaraju, OpenXAI: Towards a transparent evaluation of post hoc model explanations, in: Advances in Neural Information Processing Systems 35 (NeurIPS 2022), Curran Associates, Inc., 2022.
  14. [14] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, K.-R. Müller, Evaluating the visualization of what a deep neural network has learned, IEEE Transactions on Neural Networks and Learning Systems 28 (2017) 2660–2673.
  15. [15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of CVPR 2016, 2016, pp. 770–778.