pith. machine review for the scientific record.

arxiv: 2604.22580 · v1 · submitted 2026-04-24 · 📊 stat.ML · cs.LG

Recognition: unknown

Explanation of Dynamic Physical Field Predictions using WassersteinGrad: Application to Autoregressive Weather Forecasting

Laurent Risser, Laure Raynaud, Luciano Drozda, Younes Essafouri

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:52 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords Wasserstein barycenter · feature attribution · explainable AI · weather forecasting · autoregressive models · gradient explanations · dynamic physical fields

The pith

WassersteinGrad extracts a geometric consensus of perturbed attribution maps by computing their entropic Wasserstein barycenter to explain neural predictions on dynamic physical fields.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Gradient-based attribution methods such as SmoothGrad rely on pointwise averaging of maps from stochastically perturbed inputs, yet on dynamic physical fields like weather this produces blurred results because perturbations displace features spatially rather than adding stationary noise. The paper introduces WassersteinGrad to compute the entropic Wasserstein barycenter across those maps, thereby aligning and averaging them geometrically instead. Experiments on regional weather data with a meteorologist-validated neural model show clearer attributions than baselines in both single-step and autoregressive forecasting. A sympathetic reader would care because trustworthy spatial explanations matter when neural models guide decisions in safety-critical physical systems.
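
The baseline being criticized is easy to state precisely. A minimal SmoothGrad sketch in NumPy, where the `grad_fn` callable is a stand-in for the model's input gradient (an illustrative assumption, not the paper's code):

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples=50, sigma=0.1, seed=0):
    """SmoothGrad (Smilkov et al., 2017): pointwise average of attribution
    maps computed at Gaussian-perturbed copies of the input. `grad_fn` is
    any callable returning the gradient of the prediction w.r.t. its input."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(x + sigma * rng.standard_normal(x.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)  # the pointwise average the paper targets

# toy check: for f(x) = ||x||^2 the gradient is 2x, so the noise averages out
x = np.array([1.0, -2.0, 3.0])
g = smoothgrad(lambda z: 2.0 * z, x)
```

On images this averaging denoises; the paper's point is that on dynamic fields the perturbed maps are spatially displaced copies of each other, so the same `np.mean` blurs them instead.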

Core claim

On dynamic physical fields, stochastic input perturbations induce geometric displacements in attribution maps rather than stationary amplitude noise, so pointwise averaging blurs spatially misaligned features. WassersteinGrad extracts a geometric consensus by computing the entropic Wasserstein barycenter of the perturbed attribution maps. The authors demonstrate that this yields sharper and more informative explanations than standard gradient-based baselines on regional weather data for both single-step and autoregressive forecasting settings.

What carries the argument

The entropic Wasserstein barycenter of perturbed attribution maps, which aligns displaced features via optimal transport and produces a spatially coherent average.
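
The aggregation step is concrete enough to sketch. Below is a self-contained NumPy implementation of the entropic barycenter via iterative Bregman projections (Benamou et al., 2015), applied to two identical 1-D "attribution bumps" displaced from each other; the grid size, bump width, and `reg` value are illustrative choices, not the paper's settings:

```python
import numpy as np

def sinkhorn_barycenter(hists, cost, reg=4.0, n_iter=500):
    """Entropic Wasserstein barycenter via iterative Bregman projections --
    the aggregation WassersteinGrad uses in place of a pointwise mean.
    hists: (k, d) histograms on a common d-point grid; cost: (d, d) ground
    cost; reg: entropic regularization strength (a free parameter)."""
    K = np.exp(-cost / reg)                     # Gibbs kernel
    v = np.ones_like(hists)                     # per-input scaling vectors
    for _ in range(n_iter):
        u = hists / (v @ K.T)                   # u_k = a_k / (K v_k)
        b = np.exp(np.log(u @ K).mean(axis=0))  # geometric mean of K^T u_k
        v = b / (u @ K)                         # v_k = b / (K^T u_k)
    return b / b.sum()

# two identical bumps displaced to grid points 20 and 40: the pointwise
# mean stays bimodal (blurred), the barycenter concentrates near 30
grid = np.arange(60.0)
bump = lambda mu: np.exp(-0.5 * ((grid - mu) / 3.0) ** 2)
maps = np.stack([bump(20.0), bump(40.0)]) + 1e-9  # floor keeps support full
maps /= maps.sum(axis=1, keepdims=True)           # treat maps as histograms
cost = (grid[:, None] - grid[None, :]) ** 2
bary = sinkhorn_barycenter(maps, cost)
mean = maps.mean(axis=0)
```

In practice POT's `ot.bregman.barycenter` (or `ot.bregman.convolutional_barycenter2d` for 2-D fields) computes the same object far more efficiently; the point of the toy is that the mean keeps two half-height peaks at 20 and 40 while the barycenter forms a single peak at the geometric midpoint.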

Load-bearing premise

The displacements induced by input perturbations are primarily geometric and can be corrected by optimal transport without introducing artifacts or losing attribution fidelity.

What would settle it

A blind, side-by-side meteorologist review of visualizations on the same weather data. If WassersteinGrad attributions obscured key physical structures, or contained more artifacts than pointwise-averaged gradients, that would refute the claim of improved explainability.

Figures

Figures reproduced from arXiv: 2604.22580 by Laurent Risser, Laure Raynaud, Luciano Drozda, Younes Essafouri.

Figure 1: (a) Autoregressive use of a neural-based forecasting model f, producing a sequence of physical fields (here regional atmospheric states with C = 21 channels) at successive lead times t + 1, …, t + T. (b, from left to right) input zonal wind at 250 hPa at time t; predicted surface precipitation at time t + 5; explanation of the rain prediction in the Paris area using the gradient attribution method of [7]. (c…

Figure 2: Qualitative comparison of gradient attribution methods at one-step (…

Figure 3: Temporal resolution. The dataset provides analyses at a 1-hour time step. Our models are trained on lead times ∆t ∈ {+1h, …, +6h}, with a context window of one preceding analysis step as input. State variables. We define the atmospheric state at timestep t as a multi-channel spatial tensor x_t ∈ R^{H×W×C}, where C is the number of meteorological channels…

Figure 3: Geographic extent of the TITAN/AROME dataset. The…

Figure 4: Centroid and peak displacement of gradient attributions under input perturbations, at single-step (t + 1, top) and autoregressive (t + 5, bottom) lead times. Left: peak displacement (location of arg max |G|). Right: centroid displacement alongside relative prediction error (secondary axis). Peak displacement amplifies by 3–4× between t + 1 and t + 5, while the prediction error remains below 1.5% in both ca…
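
The two displacement diagnostics behind Figure 4 can be written down directly. A sketch, assuming |G| is the tracked quantity (the paper's exact normalization may differ):

```python
import numpy as np

def peak_displacement(g_ref, g_pert):
    """Euclidean distance between the arg-max locations of |G| in two maps."""
    p0 = np.unravel_index(np.argmax(np.abs(g_ref)), g_ref.shape)
    p1 = np.unravel_index(np.argmax(np.abs(g_pert)), g_pert.shape)
    return float(np.linalg.norm(np.subtract(p1, p0)))

def centroid_displacement(g_ref, g_pert):
    """Euclidean distance between the mass centroids of |G|."""
    def centroid(g):
        w = np.abs(g).ravel()
        w = w / w.sum()
        idx = np.indices(g.shape).reshape(g.ndim, -1)  # grid coordinates
        return idx @ w                                  # weighted mean position
    return float(np.linalg.norm(centroid(g_pert) - centroid(g_ref)))

# a unit attribution spike moved 4 cells to the right registers as a
# displacement of 4 under both measures
a = np.zeros((16, 16)); a[5, 5] = 1.0
b = np.zeros((16, 16)); b[5, 9] = 1.0
```

Applied across perturbation draws and lead times, these two scalars give exactly the kind of curves the figure reports.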
original abstract

As the demand to integrate Artificial Intelligence into high-stakes environments continues to grow, explaining the reasoning behind neural-network predictions has shifted from a theoretical curiosity to a strict operational requirement. Our work is motivated by the explanations of autoregressive neural predictions on dynamic physical fields, as in weather forecasting. Gradient-based feature attribution methods are widely used to explain the predictions on such data, in particular due to their scalability to high-dimensional inputs. It is also interesting to remark that gradient-based techniques such as SmoothGrad are now standard on images to robustify the explanations using pointwise averages of the attribution maps obtained from several noised inputs. Our goal is to efficiently adapt this aggregation strategy to dynamic physical fields. To do so, our first contribution is to identify a fundamental failure mode when averaging perturbed attribution maps on dynamic physical fields: stochastic input perturbations do not induce stationary amplitude noise in attribution maps, but instead cause a geometric displacement of the attributions. Consequently, pointwise averaging blurs these spatially misaligned features. To tackle this issue, we introduce WassersteinGrad, which extracts a geometric consensus of perturbed attribution maps by computing their entropic Wasserstein barycenter. The results, obtained on regional weather data and a meteorologist-validated neural model, demonstrate promising explainability properties of WassersteinGrad over gradient-based baselines across both single-step and autoregressive forecasting settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies a failure mode in standard gradient-based attribution methods (e.g., SmoothGrad-style pointwise averaging) for explaining neural predictions on dynamic physical fields such as weather data: stochastic input perturbations induce geometric displacements in attribution maps rather than stationary amplitude noise, causing blurring upon averaging. To address this, the authors propose WassersteinGrad, which aggregates perturbed attribution maps via their entropic Wasserstein barycenter to recover a geometric consensus. They evaluate the method on regional weather forecasting tasks using a meteorologist-validated neural model, reporting improved explainability properties relative to gradient baselines in both single-step and autoregressive settings.

Significance. If the core geometric-displacement hypothesis and the superiority of the Wasserstein aggregation hold under rigorous validation, the work could meaningfully advance explainable AI for high-dimensional spatiotemporal physical systems. It correctly identifies a limitation of pointwise averaging on non-stationary fields and repurposes established optimal-transport tools (entropic Wasserstein barycenters) in a new domain. The application to autoregressive weather models is timely given operational demands for trustworthy AI in meteorology. However, the absence of quantitative faithfulness metrics, sensitivity analyses, or blind meteorologist evaluations in the provided abstract limits the immediate impact assessment.

major comments (3)
  1. [Abstract / Motivation section] The central claim that input perturbations produce purely geometric displacements (rather than mixed amplitude or noise effects that would violate OT assumptions) is load-bearing for the motivation and for the choice of Wasserstein barycenter over simpler robust aggregators. The abstract states this as a 'fundamental failure mode' but provides no quantitative demonstration (e.g., displacement magnitude statistics, transport-cost histograms, or comparison of L2 vs. Wasserstein distances on example attribution maps). This must be shown explicitly, preferably with controlled synthetic fields before real weather data.
  2. [Results / Experiments section] Results are described only qualitatively as 'promising explainability properties' without any reported metrics (faithfulness scores, insertion/deletion AUC, meteorologist agreement rates, or statistical significance tests against baselines). Given that the entropic regularization parameter is a free hyperparameter, sensitivity to its value, to the ground metric, and to Sinkhorn approximation tolerance must be quantified; otherwise the claimed advantage over pointwise averaging cannot be assessed.
  3. [Autoregressive forecasting experiments] The paper applies the method to both single-step and autoregressive forecasting but does not address whether the geometric-consensus property persists under error accumulation in multi-step rollouts. If attribution maps become increasingly diffuse or multi-modal in autoregressive mode, the entropic barycenter may introduce its own smoothing artifacts; this interaction should be analyzed.
minor comments (2)
  1. [Methods] Notation for the entropic regularization parameter and the precise definition of the Wasserstein barycenter (including the ground cost on the spatial grid) should be stated explicitly in the methods section for reproducibility.
  2. [Abstract / Model description] The abstract mentions 'a meteorologist-validated neural model' but does not specify the validation protocol or the architecture; a brief description would help readers assess domain relevance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each of the three major comments point by point below, indicating the revisions we will incorporate to strengthen the work.

point-by-point responses
  1. Referee: [Abstract / Motivation section] The central claim that input perturbations produce purely geometric displacements (rather than mixed amplitude or noise effects that would violate OT assumptions) is load-bearing for the motivation and for the choice of Wasserstein barycenter over simpler robust aggregators. The abstract states this as a 'fundamental failure mode' but provides no quantitative demonstration (e.g., displacement magnitude statistics, transport-cost histograms, or comparison of L2 vs. Wasserstein distances on example attribution maps). This must be shown explicitly, preferably with controlled synthetic fields before real weather data.

    Authors: We agree that quantitative support for the geometric-displacement hypothesis is important to justify the use of Wasserstein barycenters. The current manuscript relies on visual examples from weather attribution maps to illustrate the displacement effect. In the revision we will add a dedicated subsection to the Motivation section containing controlled synthetic experiments. These will report displacement magnitude statistics, transport-cost histograms, and explicit L2 versus Wasserstein distance comparisons on synthetic fields with known geometric shifts, thereby providing the requested quantitative validation before presenting the real-data results. revision: yes
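
A minimal version of the promised L2-versus-Wasserstein comparison can be run on a synthetic 1-D field, where the Wasserstein-1 distance has a closed form via CDFs (the actual revision would presumably use 2-D maps and Sinkhorn distances; this sketch only illustrates the qualitative effect):

```python
import numpy as np

grid = np.arange(200.0)

def bump(mu, sigma=4.0):
    """Normalized Gaussian bump, standing in for a localized attribution."""
    g = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    return g / g.sum()

def w1(a, b):
    """Exact Wasserstein-1 distance on a regular unit-spaced 1-D grid:
    the L1 norm of the CDF difference."""
    return float(np.abs(np.cumsum(a) - np.cumsum(b)).sum())

ref = bump(50.0)
l2 = {s: float(np.linalg.norm(ref - bump(50.0 + s))) for s in (10, 40, 80)}
ws = {s: w1(ref, bump(50.0 + s)) for s in (10, 40, 80)}
# once the bumps no longer overlap, L2 saturates (l2[40] == l2[80] up to
# numerical precision) while W1 keeps tracking the displacement (~10, 40, 80)
```

This is the behavior that motivates a transport-based aggregator: the pointwise metric is blind to how far a feature moved once supports are disjoint.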

  2. Referee: [Results / Experiments section] Results are described only qualitatively as 'promising explainability properties' without any reported metrics (faithfulness scores, insertion/deletion AUC, meteorologist agreement rates, or statistical significance tests against baselines). Given that the entropic regularization parameter is a free hyperparameter, sensitivity to its value, to the ground metric, and to Sinkhorn approximation tolerance must be quantified; otherwise the claimed advantage over pointwise averaging cannot be assessed.

    Authors: We acknowledge that the current presentation is primarily qualitative. We will revise the Results section to include quantitative faithfulness metrics (insertion/deletion AUC), meteorologist agreement rates, and statistical significance tests against the gradient baselines. In addition, we will report a full sensitivity analysis with respect to the entropic regularization parameter, the choice of ground metric, and Sinkhorn approximation tolerance, thereby allowing readers to assess the robustness of the reported advantage. revision: yes

  3. Referee: [Autoregressive forecasting experiments] The paper applies the method to both single-step and autoregressive forecasting but does not address whether the geometric-consensus property persists under error accumulation in multi-step rollouts. If attribution maps become increasingly diffuse or multi-modal in autoregressive mode, the entropic barycenter may introduce its own smoothing artifacts; this interaction should be analyzed.

    Authors: We agree that the interaction between error accumulation and the entropic barycenter warrants explicit examination. We will extend the autoregressive experiments subsection to include a direct comparison of attribution-map properties (diffuseness, modality, and transport cost to the single-step reference) across increasing rollout horizons. This analysis will quantify whether the geometric-consensus property is preserved and will discuss any additional smoothing introduced by the barycenter under accumulated forecast error. revision: yes

Circularity Check

0 steps flagged

No circularity: method applies established Wasserstein barycenter to observed attribution displacements

full rationale

The paper identifies (via observation on weather fields) that input perturbations induce geometric shifts rather than stationary noise in gradient attributions, then aggregates via the standard entropic Wasserstein barycenter. No derivation step reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation; the core claim rests on applying an external optimal-transport primitive to a diagnosed failure mode of pointwise averaging. The approach remains self-contained against external benchmarks and does not rename or smuggle prior results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review prevents exhaustive identification; the method implicitly relies on treating attribution maps as distributions suitable for Wasserstein geometry and on the entropic regularization of the barycenter computation.

free parameters (1)
  • entropic regularization parameter
    Standard in entropic Wasserstein computations to ensure tractability, but no value or selection procedure is described.
axioms (1)
  • domain assumption: Attribution maps from perturbed inputs can be meaningfully represented and averaged as probability distributions under the Wasserstein metric.
    This underpins the shift from pointwise averaging to barycenter computation.
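
The axiom can be made concrete: a signed gradient map has to be projected onto the probability simplex before any Wasserstein machinery applies. One plausible projection (an illustrative choice, not a detail confirmed by the abstract) takes magnitudes and renormalizes, at the cost of discarding the attribution's sign:

```python
import numpy as np

def to_histogram(g, eps=1e-12):
    """Map a signed attribution map to a probability histogram by taking
    magnitudes and normalizing. The small floor keeps the support full,
    which Sinkhorn-style barycenter iterations require."""
    w = np.abs(np.asarray(g, dtype=float)) + eps
    return w / w.sum()

g = np.array([[0.5, -2.0], [0.0, 1.5]])  # signs and zeros both occur
h = to_histogram(g)
```

Whatever projection is used, information is lost at this step (sign, absolute scale), which is precisely why the axiom is load-bearing rather than free.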

pith-pipeline@v0.9.0 · 5552 in / 1337 out tokens · 63088 ms · 2026-05-08T09:52:23.408997+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 4 canonical work pages

  [1] Julius Adebayo, Justin Gilmer, Ian Goodfellow, and Been Kim. Local explanation methods for deep neural networks lack sensitivity to parameter values. In Proceedings of ICLR Workshop, 2018.
  [2] Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  [3] Ferran Alet, Ilan Price, Andrew El-Kadi, Dominic Masters, Stratis Markou, Tom R Andersson, Jacklynn Stott, Remi Lam, Matthew Willson, Alvaro Sanchez-Gonzalez, et al. Skillful joint probabilistic weather forecasting from marginals. arXiv preprint arXiv:2506.10772, 2025.
  [4] David Alvarez-Melis and Tommi S Jaakkola. On the robustness of interpretability methods. In Proceedings of ICML Workshop on Human Interpretability in Machine Learning, 2018.
  [5] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In Proceedings of International Conference on Learning Representations (ICLR), 2017.
  [6] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, 2020.
  [7] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. The Journal of Machine Learning Research, 11:1803–1831, 2010.
  [8] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In International Conference on Machine Learning, pages 342–350. PMLR, 2017.
  [9] J.M. Benitez, J.L. Castro, and I. Requena. Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8(5):1156–1164, 1997.
  [10] Philine Lou Bommer, Marlene Kretschmer, Anna Hedström, Dilyara Bareeva, and Marina M-C Höhne. Finding the right XAI method—a guide for the evaluation and ranking of explainable AI methods in climate science. Artificial Intelligence for the Earth Systems, 3(3):e230074, 2024.
  [11] Giuseppe Bruno, Federico Pasqualotto, and Andrea Agazzi. A multiscale analysis of mean-field transformers in the moderate interaction regime. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.
  [12] Kirill Bykov, Anna Hedström, Shinichi Nakajima, and Marina M-C Höhne. NoiseGrad—enhancing explanations by introducing stochasticity to model weights. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  [13] Prasad Chalasani, Jiefeng Chen, Amrita Roy Chowdhury, Xi Wu, and Somesh Jha. Concise explanations of neural networks using adversarial training. In International Conference on Machine Learning (ICML), pages 1383–1391, 2020.
  [14] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1310–1320, 2019.
  [15] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems (NeurIPS), 26, 2013.
  [16] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 685–693, Beijing, China, 2014.
  [17] Pierre Dognin, Igor Melnyk, Youssef Mroueh, Jerret Ross, Cicero Dos Santos, and Tom Sercu. Wasserstein barycenter model ensembling. In Proceedings of International Conference on Learning Representations (ICLR), 2019.
  [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021.
  [19] Elizabeth Ebert. Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteorological Applications, 15:51–64, 2008.
  [20] Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, …
  [21] Rémi Flamary, Cédric Vincent-Cuaz, Nicolas Courty, Alexandre Gramfort, Oleksii Kachaiev, Huy Quang Tran, Laurène David, Clément Bonet, Nathan Cassereau, Théo Gnassounou, Eloi Tanguy, Julie Delon, Antoine Collas, Sonia Mazelet, Laetitia Chapel, Tanguy Kerdoncuff, Xizheng Yu, Matthew Feickert, Paul Krzakala, Tianlin Liu, and Eduardo Fernandes Montesuma. POT: Python Optimal Transport (version 0.9.5), 2024.
  [22] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. Bulletin of the American Mathematical Society, 62(3):427–479, 2025.
  [23] Adrian Hill, Neal McKee, Johannes Maeß, Stefan Bluecher, and Klaus Robert Muller. Smoothed differentiation efficiently mitigates shattered gradients in explanations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.
  [24] Ross N. Hoffman, Zheng Liu, Jean-Francois Louis, and Christopher Grassoti. Distortion representation of forecast errors. Monthly Weather Review, 123(9):2758–2770, 1995.
  [25] Eugenia Kalnay. Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 2003.
  [26] Christian Keil and George C. Craig. A displacement-based error measure applied in a regional ensemble forecasting system. Monthly Weather Review, 135(9):3248–3259, 2007.
  [27] Hyunjik Kim, George Papamakarios, and Andriy Mnih. The Lipschitz constant of self-attention. In International Conference on Machine Learning, pages 5562–5571. PMLR, 2021.
  [28] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023.
  [29] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
  [30] Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.
  [31] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of International Conference on Learning Representations (ICLR), 2019.
  [32] Amir Mehrpanah, Erik Englesson, and Hossein Azizpour. On spectral properties of gradient-based explanation methods. In European Conference on Computer Vision, pages 282–299. Springer, 2024.
  [33] Gabriel Moldovan, Ewan Pinnington, Ana Prieto Nemesio, Simon Lang, Zied Ben Bouallègue, Jesper Dramsch, Mihai Alexe, Mario Santa Cruz, Sara Hahner, Harrison Cook, et al. An update to ECMWF's machine-learned weather forecast model AIFS. arXiv preprint arXiv:2509.18994, 2025.
  [34] Météo-France. Py4cast: Weather forecasting with deep learning. https://github.com/meteofrance/py4cast.
  [35] Météo-France. TITAN: Training inputs & targets from AROME for neural networks. https://huggingface.co/datasets/meteofrance/titan, 2024.
  [36] Philip Naumann, Jacob Kauffmann, and Grégoire Montavon. Wasserstein distances made explainable: Insights into dataset shifts and transport phenomena. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.
  [37] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318. PMLR, 2013.
  [38] Anthony Rhodes, Yali Bian, and Ilke Demir. Quantifying explainability with multi-scale Gaussian mixture models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 8223–8228, June 2024.
  [39] Philippe Rigollet. The mean-field dynamics of transformers. arXiv preprint arXiv:2512.01868, 2025.
  [40] Yao Rong, Tobias Leemann, Vadim Borisov, Gjergji Kasneci, and Enkelejda Kasneci. A consistent and efficient evaluation strategy for attribution methods. In Proceedings of International Conference on Machine Learning (ICML), 2022.
  [41] Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. UNETR++: Delving into efficient and accurate 3D medical image segmentation. IEEE Transactions on Medical Imaging, 43(9):3377–3390, 2024.
  [42] Anna Shalova and André Schlichting. Solutions of stationary McKean–Vlasov equation on a high-dimensional sphere and other Riemannian manifolds. Advances in Nonlinear Analysis, 15(1):20250141, 2026.
  [43] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 3145–3153, 2017.
  [44] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Workshop Track Proceedings of International Conference on Learning Representations (ICLR), 2013.
  [45] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. In ICML Workshop on Visualization for Deep Learning, 2017.
  [46] Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4), July 2015.
  [47] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
  [48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
  [49] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
  [50] Yongjie Wang, Tong Zhang, Xu Guo, and Zhiqi Shen. Gradient based feature attribution in explainable AI: A technical review. arXiv preprint arXiv:2403.10415, 2024.
  [51] Jianbo Ye, Panruo Wu, James Z Wang, and Jia Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317–2332, 2017.
  [52] Linjiang Zhou, Chao Ma, Zepeng Wang, Libing Wu, and Xiaochuan Shi. AdaptGrad: Adaptive sampling to reduce noise. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.
