pith. machine review for the scientific record.

arxiv: 2604.22580 · v1 · submitted 2026-04-24 · 📊 stat.ML · cs.LG

Recognition: unknown

Explanation of Dynamic Physical Field Predictions using WassersteinGrad: Application to Autoregressive Weather Forecasting

Laurent Risser, Laure Raynaud, Luciano Drozda, Younes Essafouri

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:52 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords Wasserstein barycenter · feature attribution · explainable AI · weather forecasting · autoregressive models · gradient explanations · dynamic physical fields

The pith

WassersteinGrad extracts a geometric consensus of perturbed attribution maps by computing their entropic Wasserstein barycenter to explain neural predictions on dynamic physical fields.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Gradient-based attribution methods such as SmoothGrad rely on pointwise averaging of maps from stochastically perturbed inputs, yet on dynamic physical fields like weather this produces blurred results because perturbations displace features spatially rather than adding stationary noise. The paper introduces WassersteinGrad to compute the entropic Wasserstein barycenter across those maps, thereby aligning and averaging them geometrically instead. Experiments on regional weather data with a meteorologist-validated neural model show clearer attributions than baselines in both single-step and autoregressive forecasting. A sympathetic reader would care because trustworthy spatial explanations matter when neural models guide decisions in safety-critical physical systems.
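
The baseline being criticized is easy to state precisely. A minimal SmoothGrad sketch in NumPy, where the `grad_fn` callable is a stand-in for the model's input gradient (an illustrative assumption, not the paper's code):

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples=50, sigma=0.1, seed=0):
    """SmoothGrad (Smilkov et al., 2017): pointwise average of attribution
    maps computed at Gaussian-perturbed copies of the input. `grad_fn` is
    any callable returning the gradient of the prediction w.r.t. its input."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(x + sigma * rng.standard_normal(x.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)  # the pointwise average the paper targets

# toy check: for f(x) = ||x||^2 the gradient is 2x, so the noise averages out
x = np.array([1.0, -2.0, 3.0])
g = smoothgrad(lambda z: 2.0 * z, x)
```

On images this averaging denoises; the paper's point is that on dynamic fields the perturbed maps are spatially displaced copies of each other, so the same `np.mean` blurs them instead.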

Core claim

On dynamic physical fields, stochastic input perturbations induce geometric displacements in attribution maps rather than stationary amplitude noise, so pointwise averaging blurs spatially misaligned features. WassersteinGrad extracts a geometric consensus by computing the entropic Wasserstein barycenter of the perturbed attribution maps. The authors demonstrate that this yields sharper and more informative explanations than standard gradient-based baselines on regional weather data for both single-step and autoregressive forecasting settings.

What carries the argument

The entropic Wasserstein barycenter of perturbed attribution maps, which aligns displaced features via optimal transport and produces a spatially coherent average.
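
The aggregation step is concrete enough to sketch. Below is a self-contained NumPy implementation of the entropic barycenter via iterative Bregman projections (Benamou et al., 2015), applied to two identical 1-D "attribution bumps" displaced from each other; the grid size, bump width, and `reg` value are illustrative choices, not the paper's settings:

```python
import numpy as np

def sinkhorn_barycenter(hists, cost, reg=4.0, n_iter=500):
    """Entropic Wasserstein barycenter via iterative Bregman projections --
    the aggregation WassersteinGrad uses in place of a pointwise mean.
    hists: (k, d) histograms on a common d-point grid; cost: (d, d) ground
    cost; reg: entropic regularization strength (a free parameter)."""
    K = np.exp(-cost / reg)                     # Gibbs kernel
    v = np.ones_like(hists)                     # per-input scaling vectors
    for _ in range(n_iter):
        u = hists / (v @ K.T)                   # u_k = a_k / (K v_k)
        b = np.exp(np.log(u @ K).mean(axis=0))  # geometric mean of K^T u_k
        v = b / (u @ K)                         # v_k = b / (K^T u_k)
    return b / b.sum()

# two identical bumps displaced to grid points 20 and 40: the pointwise
# mean stays bimodal (blurred), the barycenter concentrates near 30
grid = np.arange(60.0)
bump = lambda mu: np.exp(-0.5 * ((grid - mu) / 3.0) ** 2)
maps = np.stack([bump(20.0), bump(40.0)]) + 1e-9  # floor keeps support full
maps /= maps.sum(axis=1, keepdims=True)           # treat maps as histograms
cost = (grid[:, None] - grid[None, :]) ** 2
bary = sinkhorn_barycenter(maps, cost)
mean = maps.mean(axis=0)
```

In practice POT's `ot.bregman.barycenter` (or `ot.bregman.convolutional_barycenter2d` for 2-D fields) computes the same object far more efficiently; the point of the toy is that the mean keeps two half-height peaks at 20 and 40 while the barycenter forms a single peak at the geometric midpoint.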

Load-bearing premise

The displacements induced by input perturbations are primarily geometric and can be corrected by optimal transport without introducing artifacts or losing attribution fidelity.

What would settle it

A blind, side-by-side meteorologist review of visualizations on the same weather data. If WassersteinGrad attributions obscured key physical structures, or contained more artifacts than pointwise-averaged gradients, that would refute the claim of improved explainability.

Figures

Figures reproduced from arXiv: 2604.22580 by Laurent Risser, Laure Raynaud, Luciano Drozda, Younes Essafouri.

Figure 1: (a) Autoregressive use of a neural-based forecasting model f, producing a sequence of physical fields (here regional atmospheric states with C = 21 channels) at successive lead times t + 1, …, t + T. (b, from left to right) input zonal wind at 250 hPa at time t; predicted surface precipitation at time t + 5; explanation of the rain prediction in the Paris area using the gradient attribution method of [7]. (c…

Figure 2: Qualitative comparison of gradient attribution methods at one-step (…

Figure 3: Temporal resolution. The dataset provides analyses at a 1-hour time step. Our models are trained on lead times ∆t ∈ {+1h, …, +6h}, with a context window of one preceding analysis step as input. State variables. We define the atmospheric state at timestep t as a multi-channel spatial tensor x_t ∈ R^{H×W×C}, where C is the number of meteorological channels…

Figure 3: Geographic extent of the TITAN/AROME dataset. The…

Figure 4: Centroid and peak displacement of gradient attributions under input perturbations, at single-step (t + 1, top) and autoregressive (t + 5, bottom) lead times. Left: peak displacement (location of arg max |G|). Right: centroid displacement alongside relative prediction error (secondary axis). Peak displacement amplifies by 3–4× between t + 1 and t + 5, while the prediction error remains below 1.5% in both ca…
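
The two displacement diagnostics behind Figure 4 can be written down directly. A sketch, assuming |G| is the tracked quantity (the paper's exact normalization may differ):

```python
import numpy as np

def peak_displacement(g_ref, g_pert):
    """Euclidean distance between the arg-max locations of |G| in two maps."""
    p0 = np.unravel_index(np.argmax(np.abs(g_ref)), g_ref.shape)
    p1 = np.unravel_index(np.argmax(np.abs(g_pert)), g_pert.shape)
    return float(np.linalg.norm(np.subtract(p1, p0)))

def centroid_displacement(g_ref, g_pert):
    """Euclidean distance between the mass centroids of |G|."""
    def centroid(g):
        w = np.abs(g).ravel()
        w = w / w.sum()
        idx = np.indices(g.shape).reshape(g.ndim, -1)  # grid coordinates
        return idx @ w                                  # weighted mean position
    return float(np.linalg.norm(centroid(g_pert) - centroid(g_ref)))

# a unit attribution spike moved 4 cells to the right registers as a
# displacement of 4 under both measures
a = np.zeros((16, 16)); a[5, 5] = 1.0
b = np.zeros((16, 16)); b[5, 9] = 1.0
```

Applied across perturbation draws and lead times, these two scalars give exactly the kind of curves the figure reports.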
original abstract

As the demand to integrate Artificial Intelligence into high-stakes environments continues to grow, explaining the reasoning behind neural-network predictions has shifted from a theoretical curiosity to a strict operational requirement. Our work is motivated by the explanations of autoregressive neural predictions on dynamic physical fields, as in weather forecasting. Gradient-based feature attribution methods are widely used to explain the predictions on such data, in particular due to their scalability to high-dimensional inputs. It is also interesting to remark that gradient-based techniques such as SmoothGrad are now standard on images to robustify the explanations using pointwise averages of the attribution maps obtained from several noised inputs. Our goal is to efficiently adapt this aggregation strategy to dynamic physical fields. To do so, our first contribution is to identify a fundamental failure mode when averaging perturbed attribution maps on dynamic physical fields: stochastic input perturbations do not induce stationary amplitude noise in attribution maps, but instead cause a geometric displacement of the attributions. Consequently, pointwise averaging blurs these spatially misaligned features. To tackle this issue, we introduce WassersteinGrad, which extracts a geometric consensus of perturbed attribution maps by computing their entropic Wasserstein barycenter. The results, obtained on regional weather data and a meteorologist-validated neural model, demonstrate promising explainability properties of WassersteinGrad over gradient-based baselines across both single-step and autoregressive forecasting settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies a failure mode in standard gradient-based attribution methods (e.g., SmoothGrad-style pointwise averaging) for explaining neural predictions on dynamic physical fields such as weather data: stochastic input perturbations induce geometric displacements in attribution maps rather than stationary amplitude noise, causing blurring upon averaging. To address this, the authors propose WassersteinGrad, which aggregates perturbed attribution maps via their entropic Wasserstein barycenter to recover a geometric consensus. They evaluate the method on regional weather forecasting tasks using a meteorologist-validated neural model, reporting improved explainability properties relative to gradient baselines in both single-step and autoregressive settings.

Significance. If the core geometric-displacement hypothesis and the superiority of the Wasserstein aggregation hold under rigorous validation, the work could meaningfully advance explainable AI for high-dimensional spatiotemporal physical systems. It correctly identifies a limitation of pointwise averaging on non-stationary fields and repurposes established optimal-transport tools (entropic Wasserstein barycenters) in a new domain. The application to autoregressive weather models is timely given operational demands for trustworthy AI in meteorology. However, the absence of quantitative faithfulness metrics, sensitivity analyses, or blind meteorologist evaluations in the provided abstract limits the immediate impact assessment.

major comments (3)
  1. [Abstract / Motivation section] The central claim that input perturbations produce purely geometric displacements (rather than mixed amplitude or noise effects that would violate OT assumptions) is load-bearing for the motivation and for the choice of Wasserstein barycenter over simpler robust aggregators. The abstract states this as a 'fundamental failure mode' but provides no quantitative demonstration (e.g., displacement magnitude statistics, transport-cost histograms, or comparison of L2 vs. Wasserstein distances on example attribution maps). This must be shown explicitly, preferably with controlled synthetic fields before real weather data.
  2. [Results / Experiments section] Results are described only qualitatively as 'promising explainability properties' without any reported metrics (faithfulness scores, insertion/deletion AUC, meteorologist agreement rates, or statistical significance tests against baselines). Given that the entropic regularization parameter is a free hyperparameter, sensitivity to its value, to the ground metric, and to Sinkhorn approximation tolerance must be quantified; otherwise the claimed advantage over pointwise averaging cannot be assessed.
  3. [Autoregressive forecasting experiments] The paper applies the method to both single-step and autoregressive forecasting but does not address whether the geometric-consensus property persists under error accumulation in multi-step rollouts. If attribution maps become increasingly diffuse or multi-modal in autoregressive mode, the entropic barycenter may introduce its own smoothing artifacts; this interaction should be analyzed.
minor comments (2)
  1. [Methods] Notation for the entropic regularization parameter and the precise definition of the Wasserstein barycenter (including the ground cost on the spatial grid) should be stated explicitly in the methods section for reproducibility.
  2. [Abstract / Model description] The abstract mentions 'a meteorologist-validated neural model' but does not specify the validation protocol or the architecture; a brief description would help readers assess domain relevance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each of the three major comments point by point below, indicating the revisions we will incorporate to strengthen the work.

point-by-point responses
  1. Referee: [Abstract / Motivation section] The central claim that input perturbations produce purely geometric displacements (rather than mixed amplitude or noise effects that would violate OT assumptions) is load-bearing for the motivation and for the choice of Wasserstein barycenter over simpler robust aggregators. The abstract states this as a 'fundamental failure mode' but provides no quantitative demonstration (e.g., displacement magnitude statistics, transport-cost histograms, or comparison of L2 vs. Wasserstein distances on example attribution maps). This must be shown explicitly, preferably with controlled synthetic fields before real weather data.

    Authors: We agree that quantitative support for the geometric-displacement hypothesis is important to justify the use of Wasserstein barycenters. The current manuscript relies on visual examples from weather attribution maps to illustrate the displacement effect. In the revision we will add a dedicated subsection to the Motivation section containing controlled synthetic experiments. These will report displacement magnitude statistics, transport-cost histograms, and explicit L2 versus Wasserstein distance comparisons on synthetic fields with known geometric shifts, thereby providing the requested quantitative validation before presenting the real-data results. revision: yes
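
A minimal version of the promised L2-versus-Wasserstein comparison can be run on a synthetic 1-D field, where the Wasserstein-1 distance has a closed form via CDFs (the actual revision would presumably use 2-D maps and Sinkhorn distances; this sketch only illustrates the qualitative effect):

```python
import numpy as np

grid = np.arange(200.0)

def bump(mu, sigma=4.0):
    """Normalized Gaussian bump, standing in for a localized attribution."""
    g = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    return g / g.sum()

def w1(a, b):
    """Exact Wasserstein-1 distance on a regular unit-spaced 1-D grid:
    the L1 norm of the CDF difference."""
    return float(np.abs(np.cumsum(a) - np.cumsum(b)).sum())

ref = bump(50.0)
l2 = {s: float(np.linalg.norm(ref - bump(50.0 + s))) for s in (10, 40, 80)}
ws = {s: w1(ref, bump(50.0 + s)) for s in (10, 40, 80)}
# once the bumps no longer overlap, L2 saturates (l2[40] == l2[80] up to
# numerical precision) while W1 keeps tracking the displacement (~10, 40, 80)
```

This is the behavior that motivates a transport-based aggregator: the pointwise metric is blind to how far a feature moved once supports are disjoint.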

  2. Referee: [Results / Experiments section] Results are described only qualitatively as 'promising explainability properties' without any reported metrics (faithfulness scores, insertion/deletion AUC, meteorologist agreement rates, or statistical significance tests against baselines). Given that the entropic regularization parameter is a free hyperparameter, sensitivity to its value, to the ground metric, and to Sinkhorn approximation tolerance must be quantified; otherwise the claimed advantage over pointwise averaging cannot be assessed.

    Authors: We acknowledge that the current presentation is primarily qualitative. We will revise the Results section to include quantitative faithfulness metrics (insertion/deletion AUC), meteorologist agreement rates, and statistical significance tests against the gradient baselines. In addition, we will report a full sensitivity analysis with respect to the entropic regularization parameter, the choice of ground metric, and Sinkhorn approximation tolerance, thereby allowing readers to assess the robustness of the reported advantage. revision: yes

  3. Referee: [Autoregressive forecasting experiments] The paper applies the method to both single-step and autoregressive forecasting but does not address whether the geometric-consensus property persists under error accumulation in multi-step rollouts. If attribution maps become increasingly diffuse or multi-modal in autoregressive mode, the entropic barycenter may introduce its own smoothing artifacts; this interaction should be analyzed.

    Authors: We agree that the interaction between error accumulation and the entropic barycenter warrants explicit examination. We will extend the autoregressive experiments subsection to include a direct comparison of attribution-map properties (diffuseness, modality, and transport cost to the single-step reference) across increasing rollout horizons. This analysis will quantify whether the geometric-consensus property is preserved and will discuss any additional smoothing introduced by the barycenter under accumulated forecast error. revision: yes

Circularity Check

0 steps flagged

No circularity: method applies established Wasserstein barycenter to observed attribution displacements

full rationale

The paper identifies (via observation on weather fields) that input perturbations induce geometric shifts rather than stationary noise in gradient attributions, then aggregates via the standard entropic Wasserstein barycenter. No derivation step reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation; the core claim rests on applying an external optimal-transport primitive to a diagnosed failure mode of pointwise averaging. The approach remains self-contained against external benchmarks and does not rename or smuggle prior results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review prevents exhaustive identification; the method implicitly relies on treating attribution maps as distributions suitable for Wasserstein geometry and on the entropic regularization of the barycenter computation.

free parameters (1)
  • entropic regularization parameter
    Standard in entropic Wasserstein computations to ensure tractability, but no value or selection procedure is described.
axioms (1)
  • domain assumption: Attribution maps from perturbed inputs can be meaningfully represented and averaged as probability distributions under the Wasserstein metric.
    This underpins the shift from pointwise averaging to barycenter computation.
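
The axiom can be made concrete: a signed gradient map has to be projected onto the probability simplex before any Wasserstein machinery applies. One plausible projection (an illustrative choice, not a detail confirmed by the abstract) takes magnitudes and renormalizes, at the cost of discarding the attribution's sign:

```python
import numpy as np

def to_histogram(g, eps=1e-12):
    """Map a signed attribution map to a probability histogram by taking
    magnitudes and normalizing. The small floor keeps the support full,
    which Sinkhorn-style barycenter iterations require."""
    w = np.abs(np.asarray(g, dtype=float)) + eps
    return w / w.sum()

g = np.array([[0.5, -2.0], [0.0, 1.5]])  # signs and zeros both occur
h = to_histogram(g)
```

Whatever projection is used, information is lost at this step (sign, absolute scale), which is precisely why the axiom is load-bearing rather than free.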

pith-pipeline@v0.9.0 · 5552 in / 1337 out tokens · 63088 ms · 2026-05-08T09:52:23.408997+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 4 canonical work pages

  [1] Julius Adebayo, Justin Gilmer, Ian Goodfellow, and Been Kim. Local explanation methods for deep neural networks lack sensitivity to parameter values. In Proceedings of ICLR Workshop, 2018.
  [2] Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  [3] Ferran Alet, Ilan Price, Andrew El-Kadi, Dominic Masters, Stratis Markou, Tom R Andersson, Jacklynn Stott, Remi Lam, Matthew Willson, Alvaro Sanchez-Gonzalez, et al. Skillful joint probabilistic weather forecasting from marginals. arXiv preprint arXiv:2506.10772, 2025.
  [4] David Alvarez-Melis and Tommi S Jaakkola. On the robustness of interpretability methods. In Proceedings of ICML Workshop on Human Interpretability in Machine Learning, 2018.
  [5] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In Proceedings of International Conference on Learning Representations (ICLR), 2017.
  [6] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, 2020.
  [7] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. The Journal of Machine Learning Research, 11:1803–1831, 2010.
  [8] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In International Conference on Machine Learning, pages 342–350. PMLR, 2017.
  [9] J.M. Benitez, J.L. Castro, and I. Requena. Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8(5):1156–1164, 1997.
  [10] Philine Lou Bommer, Marlene Kretschmer, Anna Hedström, Dilyara Bareeva, and Marina M-C Höhne. Finding the right XAI method—a guide for the evaluation and ranking of explainable AI methods in climate science. Artificial Intelligence for the Earth Systems, 3(3):e230074, 2024.
  [11] Giuseppe Bruno, Federico Pasqualotto, and Andrea Agazzi. A multiscale analysis of mean-field transformers in the moderate interaction regime. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.
  [12] Kirill Bykov, Anna Hedström, Shinichi Nakajima, and Marina M-C Höhne. NoiseGrad—enhancing explanations by introducing stochasticity to model weights. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  [13] Prasad Chalasani, Jiefeng Chen, Amrita Roy Chowdhury, Xi Wu, and Somesh Jha. Concise explanations of neural networks using adversarial training. In International Conference on Machine Learning (ICML), pages 1383–1391, 2020.
  [14] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1310–1320, 2019.
  [15] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems (NeurIPS), 26, 2013.
  [16] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 685–693, Beijing, China, 2014.
  [17] Pierre Dognin, Igor Melnyk, Youssef Mroueh, Jerret Ross, Cicero Dos Santos, and Tom Sercu. Wasserstein barycenter model ensembling. In Proceedings of International Conference on Learning Representations (ICLR), 2019.
  [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021.
  [19] Elizabeth Ebert. Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteorological Applications, 15:51–64, 2008.
  [20] Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, …
  [21] Rémi Flamary, Cédric Vincent-Cuaz, Nicolas Courty, Alexandre Gramfort, Oleksii Kachaiev, Huy Quang Tran, Laurène David, Clément Bonet, Nathan Cassereau, Théo Gnassounou, Eloi Tanguy, Julie Delon, Antoine Collas, Sonia Mazelet, Laetitia Chapel, Tanguy Kerdoncuff, Xizheng Yu, Matthew Feickert, Paul Krzakala, Tianlin Liu, and Eduardo Fernandes Montesuma. POT: Python Optimal Transport (version 0.9.5), 2024.
  [22] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. Bulletin of the American Mathematical Society, 62(3):427–479, 2025.
  [23] Adrian Hill, Neal McKee, Johannes Maeß, Stefan Bluecher, and Klaus Robert Muller. Smoothed differentiation efficiently mitigates shattered gradients in explanations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.
  [24] Ross N. Hoffman, Zheng Liu, Jean-Francois Louis, and Christopher Grassoti. Distortion representation of forecast errors. Monthly Weather Review, 123(9):2758–2770, 1995.
  [25] Eugenia Kalnay. Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 2003.
  [26] Christian Keil and George C. Craig. A displacement-based error measure applied in a regional ensemble forecasting system. Monthly Weather Review, 135(9):3248–3259, 2007.
  [27] Hyunjik Kim, George Papamakarios, and Andriy Mnih. The Lipschitz constant of self-attention. In International Conference on Machine Learning, pages 5562–5571. PMLR, 2021.
  [28] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023.
  [29] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
  [30] Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.
  [31] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of International Conference on Learning Representations (ICLR), 2019.
  [32] Amir Mehrpanah, Erik Englesson, and Hossein Azizpour. On spectral properties of gradient-based explanation methods. In European Conference on Computer Vision, pages 282–299. Springer, 2024.
  [33] Gabriel Moldovan, Ewan Pinnington, Ana Prieto Nemesio, Simon Lang, Zied Ben Bouallègue, Jesper Dramsch, Mihai Alexe, Mario Santa Cruz, Sara Hahner, Harrison Cook, et al. An update to ECMWF's machine-learned weather forecast model AIFS. arXiv preprint arXiv:2509.18994, 2025.
  [34] Météo-France. Py4cast: Weather forecasting with deep learning. https://github.com/meteofrance/py4cast.
  [35] Météo-France. TITAN: Training inputs & targets from AROME for neural networks. https://huggingface.co/datasets/meteofrance/titan, 2024.
  [36] Philip Naumann, Jacob Kauffmann, and Grégoire Montavon. Wasserstein distances made explainable: Insights into dataset shifts and transport phenomena. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.
  [37] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318. PMLR, 2013.
  [38] Anthony Rhodes, Yali Bian, and Ilke Demir. Quantifying explainability with multi-scale Gaussian mixture models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 8223–8228, June 2024.
  [39] Philippe Rigollet. The mean-field dynamics of transformers. arXiv preprint arXiv:2512.01868, 2025.
  [40] Yao Rong, Tobias Leemann, Vadim Borisov, Gjergji Kasneci, and Enkelejda Kasneci. A consistent and efficient evaluation strategy for attribution methods. In Proceedings of International Conference on Machine Learning (ICML), 2022.
  [41] Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. UNETR++: Delving into efficient and accurate 3D medical image segmentation. IEEE Transactions on Medical Imaging, 43(9):3377–3390, 2024.
  [42] Anna Shalova and André Schlichting. Solutions of stationary McKean–Vlasov equation on a high-dimensional sphere and other Riemannian manifolds. Advances in Nonlinear Analysis, 15(1):20250141, 2026.
  [43] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 3145–3153, 2017.
  [44] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Workshop Track Proceedings of International Conference on Learning Representations (ICLR), 2013.
  [45] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. In ICML Workshop on Visualization for Deep Learning, 2017.
  [46] Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4), July 2015.
  [47] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
  [48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
  [49] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
  [50] Yongjie Wang, Tong Zhang, Xu Guo, and Zhiqi Shen. Gradient based feature attribution in explainable AI: A technical review. arXiv preprint arXiv:2403.10415, 2024.
  [51] Jianbo Ye, Panruo Wu, James Z Wang, and Jia Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317–2332, 2017.
  [52] Linjiang Zhou, Chao Ma, Zepeng Wang, Libing Wu, and Xiaochuan Shi. AdaptGrad: Adaptive sampling to reduce noise. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.
