Explanation of Dynamic Physical Field Predictions using WassersteinGrad: Application to Autoregressive Weather Forecasting
Pith reviewed 2026-05-08 09:52 UTC · model grok-4.3
The pith
WassersteinGrad extracts a geometric consensus of perturbed attribution maps by computing their entropic Wasserstein barycenter to explain neural predictions on dynamic physical fields.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On dynamic physical fields, stochastic input perturbations induce geometric displacements in attribution maps rather than stationary amplitude noise, so pointwise averaging blurs spatially misaligned features. WassersteinGrad instead extracts a geometric consensus by computing the entropic Wasserstein barycenter of the perturbed attribution maps. The authors report that this yields sharper and more informative explanations than standard gradient-based baselines on regional weather data, in both single-step and autoregressive forecasting settings.
What carries the argument
The entropic Wasserstein barycenter of perturbed attribution maps, which aligns displaced features via optimal transport and produces a spatially coherent average.
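To make the mechanism concrete, here is a minimal sketch of this aggregation step using the POT library's convolutional entropic barycenter (Solomon et al., 2015). Everything below is illustrative rather than the authors' published code: the noise scale, the regularization value, the use of absolute gradients, and the mass renormalization are our assumptions.

```python
# Minimal sketch of WassersteinGrad-style aggregation, assuming attribution
# maps can be normalized to probability distributions on the spatial grid.
# Illustrative only: grad_fn, sigma, and reg are placeholder choices.
import numpy as np
import ot  # POT: Python Optimal Transport

def perturbed_attributions(grad_fn, x, n_samples=16, sigma=0.1, seed=0):
    """Stack of |gradient| maps, shape (n_samples, H, W), from noised inputs."""
    rng = np.random.default_rng(seed)
    return np.stack([
        np.abs(grad_fn(x + sigma * rng.standard_normal(x.shape)))
        for _ in range(n_samples)
    ])

def wasserstein_grad(maps, reg=2e-3, eps=1e-12):
    """Entropic Wasserstein barycenter of nonnegative 2-D attribution maps."""
    masses = maps.sum(axis=(1, 2), keepdims=True)
    hists = maps + eps                 # avoid empty bins for Sinkhorn
    hists = hists / hists.sum(axis=(1, 2), keepdims=True)
    # Grid-aware entropic barycenter (Solomon et al., 2015). The pointwise
    # SmoothGrad baseline over the same stack would be maps.mean(axis=0),
    # which blurs spatially shifted features instead of aligning them.
    bary = ot.bregman.convolutional_barycenter2d(hists, reg)
    return bary * masses.mean()        # restore a comparable overall amplitude
```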
Load-bearing premise
The displacements induced by input perturbations are primarily geometric and can be corrected by optimal transport without introducing artifacts or losing attribution fidelity.
What would settle it
The claim of improved explainability would be refuted if expert meteorologists, reviewing side-by-side visualizations on the same weather data, found that WassersteinGrad attributions obscure key physical structures or contain more artifacts than pointwise-averaged gradients.
Original abstract
As the demand to integrate Artificial Intelligence into high-stakes environments continues to grow, explaining the reasoning behind neural-network predictions has shifted from a theoretical curiosity to a strict operational requirement. Our work is motivated by the explanations of autoregressive neural predictions on dynamic physical fields, as in weather forecasting. Gradient-based feature attribution methods are widely used to explain the predictions on such data, in particular due to their scalability to high-dimensional inputs. It is also interesting to remark that gradient-based techniques such as SmoothGrad are now standard on images to robustify the explanations using pointwise averages of the attribution maps obtained from several noised inputs. Our goal is to efficiently adapt this aggregation strategy to dynamic physical fields. To do so, our first contribution is to identify a fundamental failure mode when averaging perturbed attribution maps on dynamic physical fields: stochastic input perturbations do not induce stationary amplitude noise in attribution maps, but instead cause a geometric displacement of the attributions. Consequently, pointwise averaging blurs these spatially misaligned features. To tackle this issue, we introduce WassersteinGrad, which extracts a geometric consensus of perturbed attribution maps by computing their entropic Wasserstein barycenter. The results, obtained on regional weather data and a meteorologist-validated neural model, demonstrate promising explainability properties of WassersteinGrad over gradient-based baselines across both single-step and autoregressive forecasting settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a failure mode in standard gradient-based attribution methods (e.g., SmoothGrad-style pointwise averaging) for explaining neural predictions on dynamic physical fields such as weather data: stochastic input perturbations induce geometric displacements in attribution maps rather than stationary amplitude noise, causing blurring upon averaging. To address this, the authors propose WassersteinGrad, which aggregates perturbed attribution maps via their entropic Wasserstein barycenter to recover a geometric consensus. They evaluate the method on regional weather forecasting tasks using a meteorologist-validated neural model, reporting improved explainability properties relative to gradient baselines in both single-step and autoregressive settings.
Significance. If the core geometric-displacement hypothesis and the superiority of the Wasserstein aggregation hold under rigorous validation, the work could meaningfully advance explainable AI for high-dimensional spatiotemporal physical systems. It correctly identifies a limitation of pointwise averaging on non-stationary fields and repurposes established optimal-transport tools (entropic Wasserstein barycenters) in a new domain. The application to autoregressive weather models is timely given operational demands for trustworthy AI in meteorology. However, the absence of quantitative faithfulness metrics, sensitivity analyses, or blind meteorologist evaluations in the provided abstract limits the immediate impact assessment.
major comments (3)
- [Abstract / Motivation section] The central claim that input perturbations produce purely geometric displacements (rather than mixed amplitude or noise effects that would violate OT assumptions) is load-bearing for the motivation and for the choice of Wasserstein barycenter over simpler robust aggregators. The abstract states this as a 'fundamental failure mode' but provides no quantitative demonstration (e.g., displacement magnitude statistics, transport-cost histograms, or comparison of L2 vs. Wasserstein distances on example attribution maps). This must be shown explicitly, preferably with controlled synthetic fields before real weather data; a minimal version of such a synthetic check is sketched after this list.
- [Results / Experiments section] Results are described only qualitatively as 'promising explainability properties' without any reported metrics (faithfulness scores, insertion/deletion AUC, meteorologist agreement rates, or statistical significance tests against baselines). Given that the entropic regularization parameter is a free hyperparameter, sensitivity to its value, to the ground metric, and to Sinkhorn approximation tolerance must be quantified; otherwise the claimed advantage over pointwise averaging cannot be assessed.
- [Autoregressive forecasting experiments] The paper applies the method to both single-step and autoregressive forecasting but does not address whether the geometric-consensus property persists under error accumulation in multi-step rollouts. If attribution maps become increasingly diffuse or multi-modal in autoregressive mode, the entropic barycenter may introduce its own smoothing artifacts; this interaction should be analyzed.
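The synthetic check requested in the first major comment could look like the following (our construction, with illustrative grid size, blob width, and regularization): translating a Gaussian blob makes the L2 distance saturate once the supports stop overlapping, while the entropic transport cost keeps growing with the shift, which is the signature separating geometric displacement from amplitude noise.

```python
# Synthetic check (illustrative): L2 saturates under pure translation while
# the entropic OT cost keeps tracking the displacement.
import numpy as np
import ot

n = 32
X, Y = np.meshgrid(np.arange(n, dtype=float), np.arange(n, dtype=float), indexing="ij")
coords = np.stack([X.ravel(), Y.ravel()], axis=1)
M = ot.dist(coords, coords)              # squared-Euclidean ground cost

def blob(cx, cy, s=3.0):
    g = np.exp(-((X - cx) ** 2 + (Y - cy) ** 2) / (2 * s ** 2))
    return (g / g.sum()).ravel()

a = blob(10, 16)
for shift in (1, 4, 8, 12):
    b = blob(10 + shift, 16)
    l2 = np.linalg.norm(a - b)
    w2 = ot.sinkhorn2(a, b, M, reg=5.0, method="sinkhorn_log")
    print(f"shift={shift:2d}  L2={l2:.4f}  OT cost={w2:.1f}")
```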
minor comments (2)
- [Methods] Notation for the entropic regularization parameter and the precise definition of the Wasserstein barycenter (including the ground cost on the spatial grid) should be stated explicitly in the methods section for reproducibility; a standard formulation is recalled after this list.
- [Abstract / Model description] The abstract mentions 'a meteorologist-validated neural model' but does not specify the validation protocol or the architecture; a brief description would help readers assess domain relevance.
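For readers who want the formulation the first minor comment asks for, a standard statement (in our notation, following Agueh and Carlier, 2011, and Cuturi and Doucet, 2014; the paper's exact conventions may differ) is:

```latex
% Entropic Wasserstein barycenter of K attribution maps \mu_1,\dots,\mu_K,
% each normalized to a probability distribution on the spatial grid, with
% ground cost c(x,y) = \|x - y\|_2^2 and regularization \varepsilon > 0:
\mu^\star = \operatorname*{arg\,min}_{\mu} \frac{1}{K} \sum_{k=1}^{K}
    W_\varepsilon(\mu, \mu_k),
\quad\text{where}\quad
W_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)}
    \int c(x, y)\, \mathrm{d}\pi(x, y)
    + \varepsilon\, \mathrm{KL}\!\left(\pi \,\middle\|\, \mu \otimes \nu\right).
```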
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each of the three major comments point by point below, indicating the revisions we will incorporate to strengthen the work.
Point-by-point responses
- Referee (major comment 1, quoted in full above): the geometric-displacement claim requires quantitative demonstration, preferably on controlled synthetic fields.
  Authors: We agree that quantitative support for the geometric-displacement hypothesis is important to justify the use of Wasserstein barycenters. The current manuscript relies on visual examples from weather attribution maps to illustrate the displacement effect. In the revision we will add a dedicated subsection to the Motivation section containing controlled synthetic experiments. These will report displacement magnitude statistics, transport-cost histograms, and explicit L2 versus Wasserstein distance comparisons on synthetic fields with known geometric shifts, thereby providing the requested quantitative validation before presenting the real-data results. Revision planned: yes.
- Referee (major comment 2, quoted in full above): results are reported only qualitatively, and sensitivity to the entropic regularization, ground metric, and Sinkhorn tolerance is unquantified.
  Authors: We acknowledge that the current presentation is primarily qualitative. We will revise the Results section to include quantitative faithfulness metrics (insertion/deletion AUC; a sketch of the deletion variant follows these responses), meteorologist agreement rates, and statistical significance tests against the gradient baselines. In addition, we will report a full sensitivity analysis with respect to the entropic regularization parameter, the choice of ground metric, and the Sinkhorn approximation tolerance, thereby allowing readers to assess the robustness of the reported advantage. Revision planned: yes.
- Referee (major comment 3, quoted in full above): whether the geometric-consensus property survives error accumulation in multi-step rollouts is not addressed.
  Authors: We agree that the interaction between error accumulation and the entropic barycenter warrants explicit examination. We will extend the autoregressive experiments subsection to include a direct comparison of attribution-map properties (diffuseness, modality, and transport cost to the single-step reference) across increasing rollout horizons. This analysis will quantify whether the geometric-consensus property is preserved and will discuss any additional smoothing introduced by the barycenter under accumulated forecast error. Revision planned: yes.
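To ground the evaluation planned in response 2, here is a rough sketch of two of the quantities named above, under our own simplifying assumptions (a scalar-valued model_score callable, pixel deletion to a fixed baseline value, and Shannon entropy as the diffuseness proxy); none of this is the authors' protocol.

```python
# Illustrative implementations of (i) deletion AUC, a faithfulness metric
# where lower is better, and (ii) a spatial-entropy proxy for attribution
# diffuseness across autoregressive rollout steps.
import numpy as np

def deletion_auc(model_score, x, attr, baseline=0.0, steps=50):
    """Delete pixels in decreasing-attribution order; integrate the score decay."""
    order = np.argsort(attr.ravel())[::-1]        # most attributed first
    x_del = x.ravel().copy()
    scores = [model_score(x_del.reshape(x.shape))]
    for idx in np.array_split(order, steps):
        x_del[idx] = baseline                     # delete a chunk of pixels
        scores.append(model_score(x_del.reshape(x.shape)))
    return np.trapz(scores, dx=1.0 / steps)

def spatial_entropy(attr, eps=1e-12):
    """Shannon entropy of the normalized |attribution| map; higher = more diffuse."""
    p = np.abs(attr).ravel() + eps
    p /= p.sum()
    return float(-(p * np.log(p)).sum())
```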
Circularity Check
No circularity: method applies established Wasserstein barycenter to observed attribution displacements
Full rationale
The paper identifies, via observation on weather fields, that input perturbations induce geometric shifts rather than stationary noise in gradient attributions, then aggregates via the standard entropic Wasserstein barycenter. No derivation step reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation; the core claim rests on applying an external optimal-transport primitive to a diagnosed failure mode of pointwise averaging. The approach is evaluated against external baselines and does not rename or smuggle in prior results.
Axiom & Free-Parameter Ledger
free parameters (1)
- entropic regularization parameter
axioms (1)
- Domain assumption: Attribution maps from perturbed inputs can be meaningfully represented and averaged as probability distributions under the Wasserstein metric.
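This assumption hides a practical step: raw gradients are signed, so they must be mapped to nonnegative mass before any OT machinery applies. Two common options are sketched below (our illustration; the paper may use either, or another scheme entirely).

```python
# Turning a signed attribution map into OT-ready nonnegative histograms.
# "abs" keeps magnitude of influence; "split" keeps positive and negative
# evidence as two separate distributions. Illustrative, not the paper's recipe.
import numpy as np

def to_histograms(attr, mode="abs", eps=1e-12):
    if mode == "abs":
        m = np.abs(attr) + eps
        return m / m.sum()
    pos = np.clip(attr, 0.0, None) + eps
    neg = np.clip(-attr, 0.0, None) + eps
    return pos / pos.sum(), neg / neg.sum()
```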