pith. machine review for the scientific record.

arxiv: 2605.13181 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Stable Attention Response for Reliable Precipitation Nowcasting

Allen Benter, Kun Hu, Patrick Filippi, Penghui Wen, Sen Zhang, Thomas Bishop, Xiaogang Zhu, Zexin Hu, Zhiyong Wang

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 19:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords precipitation nowcasting · attention stability · HARECast · group-wise regularization · forecast reliability · SEVIR benchmark · MeteoNet dataset · self-attention mechanisms

The pith

Cross-sample instability in attention-response energy drives unreliable precipitation nowcasts; HARECast corrects this with group-wise regularization of head-wise energy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Precipitation nowcasting faces challenges from localized and rapidly changing atmospheric patterns. Attention-based models improve representation power but often produce unstable attention responses that vary sharply across different input samples. The paper shows this instability correlates with higher forecast errors and can propagate through self-attention layers to widen error bounds. HARECast addresses the issue by explicitly tracking head-wise attention-response energy and applying group-wise regularization to reduce cross-sample fluctuations while preserving predictive capacity.

Core claim

The paper establishes that cross-sample instability of attention-response energy is a key source of unreliability in attention-based precipitation nowcasting. It proposes HARECast, which models head-wise attention-response energy and stabilizes it through a group-wise regularization objective that reduces fluctuations across samples. The formulation applies to both unimodal and multimodal architectures and yields state-of-the-art results on the SEVIR and MeteoNet benchmarks when combined with reconstruction branches and a diffusion-based predictor.

What carries the argument

Group-wise regularization objective that explicitly models and reduces cross-sample variance in head-wise attention-response energy
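
The paper's exact definitions are not reproduced on this page, so here is a minimal sketch of the mechanism as described, assuming "attention-response energy" is the mean squared norm of each head's attention output, contiguous head groups, and a batch-variance penalty; the names are illustrative, not the authors' code.

```python
import torch

def hare_penalty(head_out: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Group-wise penalty on cross-sample attention-response energy.

    head_out: (batch, heads, tokens, dim) per-head attention outputs.
    Assumes energy = mean squared L2 norm of a head's response over
    tokens, and heads split into contiguous groups (heads divisible
    by num_groups); the paper may define both differently.
    """
    b, h, t, d = head_out.shape
    energy = head_out.pow(2).sum(-1).mean(-1)             # (batch, heads)
    group_energy = energy.view(b, num_groups, h // num_groups).mean(-1)
    # Penalize cross-sample (batch) fluctuation of each group's energy.
    return group_energy.var(dim=0, unbiased=False).mean()
```

Because the statistic is computed over the batch, its reliability depends on batch size, which is consistent with the paper's batch-size sensitivity analysis (Figure 8).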

If this is right

  • Lower attention-response energy variance across heads and layers is associated with reduced forecast error on the same inputs.
  • The regularization improves reliability in both unimodal and multimodal nowcasting pipelines.
  • HARECast reaches state-of-the-art performance on the SEVIR and MeteoNet benchmarks.
  • Stabilization occurs without sacrificing the model's representation learning for accurate precipitation prediction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same energy-stabilization approach could be tested on attention models for related tasks such as radar extrapolation or short-term wind forecasting.
  • Tracking attention energy variance might serve as a lightweight diagnostic for when to trigger model ensembles or human review in operational nowcasting systems (a sketch follows this list).
  • If the causal link holds, similar regularization terms could be added to other self-attention forecasting domains to improve consistency without extra data.
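
As a hypothetical sketch of the diagnostic idea above (not from the paper): a per-head coefficient of variation of response energy across a batch, with an illustrative threshold.

```python
import torch

def energy_instability(energies: torch.Tensor, threshold: float = 0.5) -> bool:
    """energies: (batch, heads) attention-response energies from one layer.

    Flags a batch whose mean per-head coefficient of variation is high,
    as a cheap trigger for ensembling or human review. The threshold of
    0.5 is illustrative, not calibrated.
    """
    cv = energies.std(dim=0) / (energies.mean(dim=0) + 1e-8)
    return bool(cv.mean() > threshold)
```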

Load-bearing premise

Reducing cross-sample attention energy variance will improve forecast accuracy without degrading the model's ability to learn useful precipitation representations, and the observed link between variance and error is causal.
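
The bound itself is not reproduced on this page; schematically, the premise asserts an increasing relationship of the following form, where E_h(x) is head h's attention-response energy on input x, and c and g are placeholders rather than quantities from the paper.

```latex
\mathbb{E}_x\big[\mathcal{L}(\hat{y}(x), y(x))\big]
  \;\ge\; c \, g\!\Big(\operatorname{Var}_x\big[E_h(x)\big]\Big),
\qquad c > 0,\ g \text{ increasing}.
```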

What would settle it

Train matched model pairs on the same data, identical except for whether the regularization term is enabled, and measure whether the run with higher attention-response energy variance also shows higher prediction error on held-out samples.
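
A hedged sketch of that experiment; build_model, train_one, evaluate, and measure_energy_variance are hypothetical placeholders, not the paper's code.

```python
import torch

# Matched-pair ablation: identical seed and data order, differing only
# in the regularization weight (0.0 disables the energy penalty).
for lam in (0.0, 0.1):
    torch.manual_seed(0)
    model = build_model()                                # hypothetical
    train_one(model, train_loader, hare_weight=lam)      # hypothetical
    err = evaluate(model, heldout_loader)                # e.g. MSE or CSI
    var = measure_energy_variance(model, heldout_loader)
    print(f"lambda={lam}: heldout_error={err:.4f}, energy_var={var:.4f}")
```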

Figures

Figures reproduced from arXiv: 2605.13181 by Allen Benter, Kun Hu, Patrick Filippi, Penghui Wen, Sen Zhang, Thomas Bishop, Xiaogang Zhu, Zexin Hu, Zhiyong Wang.

Figure 1: (a) Visualization of the variance of attention …
Figure 2: Overview of HARECast. Given historical radar observations and, when available, satellite inputs, the model first …
Figure 4: Qualitative comparison on a SEVIR event (uni…)
Figure 5: Qualitative comparison on a MeteoNet event (uni…)
Figure 8: Training dynamics of Lhare with different batch sizes.
Figure 7: Effect of HARE stabilization on head-wise attention …
Figure 9: Training dynamics of the loss terms. All compo…
Figure 10: Visualization of attention maps from different …
Figure 11: Prediction examples on the SEVIR dataset.
Figure 12: Prediction examples on the MeteoNet dataset.
read the original abstract

Precipitation nowcasting remains challenging due to the highly localized, rapidly evolving, and heterogeneous nature of atmospheric dynamics. Although recent methods increasingly adopt attention-based architectures in both unimodal and multimodal settings, they mainly emphasize stronger representation learning and prediction capacity, while paying less attention to the stability of attention responses across samples. In this work, we show that cross-sample instability of attention-response energy is an important and previously underexplored source of forecasting unreliability. Empirically, inaccurate forecasts are associated with larger attention-response energy variance across heads and layers. Theoretically, we show that cross-sample variability can propagate through self-attention, and enlarge a lower bound on prediction error. Based on this insight, we propose HARECast, a Head-wise Attention Response Energy-regulated framework for precipitation nowcasting. HARECast explicitly models head-wise attention-response energy and stabilizes it through a group-wise regularization objective that reduces cross-sample fluctuations. The proposed formulation is generic and applicable to both unimodal and multimodal nowcasting architectures. We instantiate HARECast in a standard forecasting pipeline with reconstruction branches and a diffusion-based predictor, and evaluate it on commonly used benchmarks, SEVIR and MeteoNet. Experimental results demonstrate that HARECast achieves state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that cross-sample instability of attention-response energy is an underexplored source of unreliability in attention-based precipitation nowcasting models. It proposes HARECast, which explicitly models head-wise attention-response energy and stabilizes it via a group-wise regularization objective. The method is instantiated in a pipeline including reconstruction branches and a diffusion predictor, and is shown to achieve state-of-the-art results on the SEVIR and MeteoNet benchmarks.

Significance. If the regularization causally reduces forecast error by stabilizing attention energy variance rather than through generic capacity control, the work would address a practically relevant gap in reliable spatiotemporal forecasting. The generic formulation for unimodal and multimodal settings and the reported SOTA gains on standard benchmarks indicate potential impact for operational nowcasting systems, provided the mechanism is isolated from confounding pipeline components.

major comments (2)
  1. [Experiments] No ablation studies isolate the contribution of the group-wise regularization from the reconstruction branches and the diffusion-based predictor. Without such controls, the central claim that attention-energy stabilization is the operative mechanism behind the SOTA gains on SEVIR and MeteoNet cannot be verified, since the performance lift could come from other pipeline elements.
  2. [Theory] The propagation bound on prediction error is presented as independent of the empirical fit, yet the manuscript provides no explicit derivation showing that the proposed regularization term directly reduces this bound rather than merely correlating with reduced variance.
minor comments (1)
  1. [Abstract and Method] The abstract and method description refer to 'group-wise regularization' without providing the precise mathematical formulation or a hyperparameter sensitivity analysis for the weight lambda, which would aid reproducibility (a schematic form is sketched below).
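
A plausible schematic of the missing formulation, with lambda the tunable weight, G the set of head groups, and B the training batch; the paper's precise objective may differ.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{forecast}}
  + \lambda \sum_{g \in G} \operatorname{Var}_{x \in B}\big[E_g(x)\big]
```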

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the manuscript. We address each major point below and will incorporate revisions to improve clarity and empirical support.

read point-by-point responses
  1. Referee: [Experiments] No ablation studies isolate the contribution of the group-wise regularization from the reconstruction branches and the diffusion-based predictor. Without such controls, the central claim that attention-energy stabilization is the operative mechanism behind the SOTA gains on SEVIR and MeteoNet cannot be verified, since the performance lift could come from other pipeline elements.

    Authors: We agree that the current experiments do not fully isolate the group-wise regularization. In the revised manuscript we will add targeted ablations that disable or scale the regularization term while freezing the reconstruction branches and diffusion predictor, reporting results on both SEVIR and MeteoNet. These controls will quantify the incremental contribution of attention-energy stabilization to the observed performance gains. revision: yes

  2. Referee: [Theory] The propagation bound on prediction error is presented as independent of the empirical fit, yet the manuscript provides no explicit derivation showing that the proposed regularization term directly reduces this bound rather than merely correlating with reduced variance.

    Authors: We acknowledge that the link between the regularization and the bound could be stated more rigorously. The revised theoretical section (with an expanded appendix derivation) will explicitly show how the group-wise penalty reduces the cross-sample variance term inside the propagation bound, thereby directly shrinking the lower bound on prediction error rather than only correlating with it. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical association between attention-response energy variance and forecast inaccuracy, followed by a theoretical argument that cross-sample variability propagates through self-attention to enlarge a lower bound on prediction error. It then defines a group-wise regularization objective directly on head-wise attention energy statistics to reduce fluctuations. None of these elements reduce by construction to a fitted parameter renamed as prediction, a self-definitional loop, or a load-bearing self-citation whose content is unverified. The regularization term is explicitly formulated on the observed energy quantities rather than being equivalent to them by definition, and the lower-bound derivation is presented as independent of the empirical fit. The SOTA claims rest on benchmark evaluation rather than on any circular renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework introduces a group-wise regularization term whose strength is a tunable hyperparameter. The theoretical propagation argument assumes standard properties of self-attention without additional ad-hoc assumptions beyond those in transformer literature.

free parameters (1)
  • regularization strength lambda
    Controls the weight of the attention energy stabilization loss; must be chosen or tuned on validation data.
axioms (1)
  • domain assumption: Self-attention layers propagate cross-sample variability in attention energy into prediction error
    Invoked in the theoretical section to link energy variance to a lower bound on forecast error.

pith-pipeline@v0.9.0 · 5542 in / 1381 out tokens · 25851 ms · 2026-05-14T19:16:31.151356+00:00 · methodology


Reference graph

Works this paper leans on

36 extracted references · 3 canonical work pages · 2 internal anchors

  [1] Oussama Boussif, Ghait Boukachab, Dan Assouline, Stefano Massaroli, Tianle Yuan, Loubna Benabbou, and Yoshua Bengio. 2023. Improving day-ahead solar irradiance time series forecasting by leveraging spatio-temporal context. Advances in Neural Information Processing Systems 36 (2023), 2342–2367.
  [2] Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. 2022. STRPM: A spatiotemporal residual predictive model for high-resolution video prediction. In Conference on Computer Vision and Pattern Recognition. 13946–13955.
  [3] Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Yan Ye, Xiang Xinguang, and Wen Gao. 2021. MAU: A motion-aware unit for video prediction and beyond. Advances in Neural Information Processing Systems 34 (2021), 26950–26962.
  [4–5] Yeji Choi, Keumgang Cha, Minyoung Back, Hyunguk Choi, and Taegyun Jeon. RAIN-F: A fusion dataset for rainfall prediction using convolutional neural network. In IEEE International Geoscience and Remote Sensing Symposium. IEEE, 7145–7148.
  [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition Conference. IEEE, 248–255.
  [7] Lasse Espeholt, Shreya Agrawal, Casper Sønderby, Manoj Kumar, Jonathan Heek, Carla Bromberg, Cenk Gazen, Rob Carver, Marcin Andrychowicz, Jason Hickey, et al. 2022. Deep learning for twelve hour precipitation forecasts. Nature Communications 13, 1 (2022), 5145.
  [8] Wenzhi Feng, Xutao Li, Zhe Wu, Kenghong Lin, Demin Yu, Yunming Ye, and Yaowei Wang. 2025. Perceptually Constrained Precipitation Nowcasting Model. In International Conference on Machine Learning.
  [9] Zhihan Gao, Xingjian Shi, Boran Han, Hao Wang, Xiaoyong Jin, Danielle Maddix, Yi Zhu, Mu Li, and Yuyang Bernie Wang. 2024. PreDiff: Precipitation nowcasting with latent diffusion models. Advances in Neural Information Processing Systems 36 (2024).
  [10] Zhihan Gao, Xingjian Shi, Hao Wang, Yi Zhu, Yuyang Bernie Wang, Mu Li, and Dit-Yan Yeung. 2022. Earthformer: Exploring space-time transformers for earth system forecasting. Advances in Neural Information Processing Systems 35 (2022), 25390–25403.
  [11] Zhangyang Gao, Cheng Tan, Lirong Wu, and Stan Z Li. 2022. SimVP: Simpler yet better video prediction. In Conference on Computer Vision and Pattern Recognition. 3170–3180.
  [12] Junchao Gong, Lei Bai, Peng Ye, Wanghan Xu, Na Liu, Jianhua Dai, Xiaokang Yang, and Wanli Ouyang. 2024. CasCast: Skillful high-resolution precipitation nowcasting via cascaded modelling. International Conference on Machine Learning (2024).
  [13] Vincent Le Guen and Nicolas Thome. 2020. Disentangling physical dynamics from unknown factors for unsupervised video prediction. In Conference on Computer Vision and Pattern Recognition. 11474–11484.
  [14] Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is not only a weight: Analyzing transformers with vector norms. arXiv preprint arXiv:2004.10102 (2020).
  [15] Gwennaëlle Larvor and Lea Berthomier. 2021. MeteoNet: An open reference weather dataset for AI by Météo-France. In American Meteorological Society Meeting Abstracts, Vol. 101. 1–ii.
  [16] Kenghong Lin, Baoquan Zhang, Demin Yu, Wenzhi Feng, Shidong Chen, Feifan Gao, Xutao Li, and Yunming Ye. 2025. AlphaPre: Amplitude-phase disentanglement model for precipitation nowcasting. In Computer Vision and Pattern Recognition Conference. 17841–17850.
  [17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.
  [18] Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? Advances in Neural Information Processing Systems 32 (2019).
  [19] Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. 2022. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214 (2022).
  [20] Suman Ravuri, Karel Lenc, Matthew Willson, Dmitry Kangin, Remi Lam, Piotr Mirowski, Megan Fitzsimons, Maria Athanassiadou, Sheleem Kashem, Sam Madge, et al. 2021. Skilful precipitation nowcasting using deep generative models of radar. Nature 597, 7878 (2021), 672–677.
  [21] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems 28 (2015).
  [22] Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. 2017. Deep learning for precipitation nowcasting: A benchmark and a new model. Advances in Neural Information Processing Systems 30 (2017).
  [23] Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
  [24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
  [25] Mark Veillette, Siddharth Samsi, and Chris Mattioli. 2020. SEVIR: A storm event imagery dataset for deep learning applications in radar and satellite meteorology. Advances in Neural Information Processing Systems 33 (2020), 22009–22019.
  [26] Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. 2022. MCVD: Masked conditional video diffusion for prediction, generation, and interpolation. Advances in Neural Information Processing Systems 35 (2022), 23371–23385.
  [27] Penghui Wen, Mengwei He, Patrick Filippi, Na Zhao, Feng Zhang, Thomas Francis Bishop, Zhiyong Wang, and Kun Hu. 2026. DuoCast: Duo-probabilistic diffusion for precipitation nowcasting. In AAAI Conference on Artificial Intelligence, Vol. 40. 39442–39450.
  [28] Hao Wu, Yuxuan Liang, Wei Xiong, Zhengyang Zhou, Wei Huang, Shilong Wang, and Kun Wang. 2024. EarthFarseer: Versatile spatio-temporal dynamical systems modeling in one model. In AAAI Conference on Artificial Intelligence, Vol. 38. 15906–15914.
  [29] Chiu-Wai Yan, Shi Quan Foo, Van Hoan Trinh, Dit-Yan Yeung, Ka-Hing Wong, and Wai-Kin Wong. 2024. Fourier amplitude and correlation loss: Beyond using L2 loss for skillful precipitation nowcasting. In Advances in Neural Information Processing Systems.
  [30] Demin Yu, Wenchuan Du, Kenghong Lin, Xutao Li, Yunming Ye, Chuyao Luo, and Xunlai Chen. 2025. PiMMNet: Introducing multi-modal precipitation nowcasting via a physics-informed perspective. In ACM International Conference on Multimedia. 11522–11531.
  [31] Demin Yu, Wenzhi Feng, Kenghong Lin, Xutao Li, Yunming Ye, Chuyao Luo, and Wenchuan Du. 2025. Integrating multi-source data for long sequence precipitation forecasting. In AAAI Conference on Artificial Intelligence, Vol. 39. 28539–28547.
  [32] Demin Yu, Xutao Li, Yunming Ye, Baoquan Zhang, Chuyao Luo, Kuai Dai, Rui Wang, and Xunlai Chen. 2024. DiffCast: A unified framework via residual diffusion for precipitation nowcasting. In Conference on Computer Vision and Pattern Recognition. 27758–27767.
  [33] Lu Yu and Wei Xiang. 2023. X-Pruner: Explainable pruning for vision transformers. In Conference on Computer Vision and Pattern Recognition. 24355–24363.
  [34] Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Joshua M Susskind. 2023. Stabilizing transformer training by preventing attention entropy collapse. In International Conference on Machine Learning. PMLR, 40770–40803.
  [35] Yuchen Zhang, Mingsheng Long, Kaiyuan Chen, Lanxiang Xing, Ronghua Jin, Michael I Jordan, and Jianmin Wang. 2023. Skilful nowcasting of extreme precipitation with NowcastNet. Nature 619, 7970 (2023), 526–532.
  [36] Kun Zheng, Long He, Huihua Ruan, Shuo Yang, Jinbiao Zhang, Cong Luo, Siyu Tang, Jiaolong Zhang, Yugang Tian, and Jianmei Cheng. 2024. A cross-modal spatiotemporal joint predictive network for rainfall nowcasting. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1–23.