pith. machine review for the scientific record.

arxiv: 2604.16590 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.AI


Global Attention with Linear Complexity for Exascale Generative Data Assimilation in Earth System Prediction


Pith reviewed 2026-05-10 08:46 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords earth · prediction · system · assimilation · attention · global · accurate · data

The pith

STORM achieves linear-complexity global attention in a generative DA framework, scaling to 20 billion tokens and 1.6 ExaFLOPs on 32k GPUs for km-scale Earth modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Data assimilation estimates the current state of the atmosphere and ocean by blending observations with simulation models. Conventional approaches alternate between forecasting the next state and updating it with new data, which can be computationally expensive at large scales. This work replaces the cycle with a single generative step that samples directly from the probability distribution of possible states given the observations. The core technical advance is a transformer architecture whose attention mechanism scales linearly rather than quadratically with the number of data points, allowing it to process billions of spatiotemporal tokens. The authors demonstrate the approach on the Frontier supercomputer, reporting 63 percent strong-scaling efficiency and sustained exascale performance while modeling at kilometer resolution over many time steps.
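The one-stage idea, sampling states directly from the posterior given observations, can be illustrated with a toy far simpler than the paper's diffusion transformer. A minimal sketch, assuming a 1-D Gaussian state and a direct noisy observation; Langevin dynamics stands in for the generative sampler, and nothing here is from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the Earth-system state: prior x ~ N(mu0, s0^2),
# observation y = x + noise with noise ~ N(0, r^2).
mu0, s0, r = 0.0, 2.0, 0.5
x_true = rng.normal(mu0, s0)
y = x_true + rng.normal(0.0, r)

# Closed-form Gaussian posterior, used only to check the sampler.
s_post2 = 1.0 / (1.0 / s0**2 + 1.0 / r**2)
mu_post = s_post2 * (mu0 / s0**2 + y / r**2)

def posterior_score(x):
    """Gradient of log p(x | y): prior score plus likelihood score."""
    return (mu0 - x) / s0**2 + (y - x) / r**2

# One-stage sampling: run Langevin dynamics on the posterior directly,
# with no forecast-update cycle. Each chain member is one ensemble sample.
eps = 0.01
ens = rng.normal(size=5000)
for _ in range(2000):
    ens = ens + eps * posterior_score(ens) + np.sqrt(2 * eps) * rng.normal(size=ens.shape)

print(ens.mean(), mu_post)  # ensemble mean approaches the analytic posterior mean
```

In this toy the posterior is available in closed form, so the sampler can be checked directly; in the paper's setting the score comes from a learned diffusion model and the state has billions of tokens, but the sampling principle is the same.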

Core claim

On 32,768 GPUs of the Frontier supercomputer, our method achieves 63% strong scaling efficiency and 1.6 ExaFLOP sustained performance. We further scale to 20 billion spatiotemporal tokens, enabling km-scale global modeling over 177k temporal frames.
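As a reminder of how the headline number is defined: strong scaling fixes the problem size while growing the GPU count, and efficiency is measured speedup divided by ideal speedup. A minimal sketch; the baseline GPU count and timings below are invented for illustration, since the abstract reports only the end-to-end 63% figure:

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Measured speedup over ideal speedup for a fixed problem size."""
    ideal = n / n_base
    measured = t_base / t_n
    return measured / ideal

# Hypothetical numbers: 100 s per step on 4,096 GPUs, 19.8 s on 32,768 GPUs.
eff = strong_scaling_efficiency(100.0, 4096, 19.8, 32768)
print(f"{eff:.0%}")
```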

Load-bearing premise

The generative one-stage framework accurately samples the Bayesian posterior for Earth-system states without the conventional forecast-update cycle, and the linear-complexity attention preserves the fidelity needed for reliable assimilation at scale.

Figures

Figures reproduced from arXiv: 2604.16590 by Ashwin M. Aji, Dan Lu, Feng Bao, Guannan Zhang, Hong-Jun Yoon, Hristo G. Chipilski, Isaac Lyngaas, Janet Wang, Jong-Youl Choi, Peter Jan van Leeuwen, Siming Liang, Xiao Wang, Zezhong Zhang.

Figure 1
Figure 1: Overview of the TimeSFormer operations with O(K²N + KN²) complexity.
Figure 2
Figure 2: (a) Overview of the one-stage DA workflow. (b) Overview of the STORM architecture. Historical states are compressed into a global temporal representation, while the current state remains at full resolution. Noise-gated spatial and temporal attention decouple space–time interactions, reducing complexity from O(K²N + KN²) to O(N²) while preserving global correlations.
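The complexity bookkeeping in the caption can be sanity-checked with toy token counts, K temporal frames of N spatial tokens each (the sizes below are illustrative, not the paper's):

```python
# K frames, N spatial tokens per frame (illustrative sizes).
K, N = 64, 4096

joint      = (K * N) ** 2           # full spatiotemporal attention, O((KN)^2)
factorized = K**2 * N + K * N**2    # TimeSFormer-style split, O(K^2 N + K N^2)
storm      = N**2                   # caption's O(N^2): history compressed,
                                    # only the current frame at full resolution

print(joint // storm, factorized // storm)   # 4096, 65
```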
Figure 3
Figure 3: (a) Tiling achieves linear complexity but limits interactions to local regions, while halo overlap improves continuity but remains local. STORM averages denoised outputs (gradients) in overlapping regions and propagates them iteratively across tiles, enabling global context with linear complexity. Hanning weighting stabilizes boundary interactions. (b) Hierarchical mapping of parallelism strategies to supe…
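The Hanning-weighted overlap averaging in panel (a) can be sketched in one dimension. A minimal, hypothetical `blend_tiles` helper; the tile and overlap sizes are made up, and the paper's iterative propagation of denoised gradients across tiles is not reproduced, only the weighted stitching:

```python
import numpy as np

def blend_tiles(signal, tile=64, overlap=16, process=lambda t: t):
    """Process overlapping tiles and blend them with Hanning weights."""
    step = tile - overlap
    out = np.zeros(len(signal))
    wsum = np.zeros(len(signal))
    window = np.hanning(tile)
    for start in range(0, len(signal) - overlap, step):
        end = min(start + tile, len(signal))
        w = window[: end - start]
        out[start:end] += process(signal[start:end]) * w
        wsum[start:end] += w
    # Normalize by the accumulated window weight at each position.
    return out / np.maximum(wsum, 1e-12)

x = np.random.default_rng(1).normal(size=256)
y = blend_tiles(x)   # identity `process`: interior samples reconstruct exactly
print(np.allclose(y[1:-1], x[1:-1]))   # endpoints carry zero Hanning weight
```

In STORM this averaging is applied to denoised outputs in the halo regions and iterated, so information propagates from tile to tile and global context emerges at linear cost.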
Figure 5
Figure 5: Strong scaling efficiencies at 10M and 10B model parameters, scaling to 32,768 GPUs with 61% to 64% strong scaling efficiency at 1.6 ExaFLOP sustained computing throughput.
Figure 6
Figure 6: Hurricane track skill: ensemble trajectories of four hurricanes. Black: STORM forecast-only ensembles (S_prior); blue: one-stage DA ensembles (S_posterior) with 20% observations; red: ground-truth reference [29]. DA reduces ensemble spread and improves agreement with the reference trajectories, demonstrating improved uncertainty quantification and trajectory accuracy.
Figure 7
Figure 7: Hurricane intensity skill: each row corresponds to one of the four hurricane cases.
Figure 8
Figure 8: Regional high-resolution temperature simulations: (a) Ground-truth 2-m temperature at 72-hour lead time. (b) Forecast-only ensemble mean (S_prior), showing large regional errors. (c) One-stage DA posterior mean (S_posterior) with 50% observations. (e, f) Corresponding absolute errors, where DA significantly reduces errors, particularly in regions with strong diurnal heating (e.g., the US–Mexico border). (d) RMSE …
Original abstract

Accurate weather and climate prediction relies on data assimilation (DA), which estimates the Earth system state by integrating observations with models. While exascale computing has significantly advanced earth simulation, scalable and accurate inference of the Earth system state remains a fundamental bottleneck, limiting uncertainty quantification and prediction of extreme events. We introduce a unified one-stage generative DA framework that reformulates assimilation as Bayesian posterior sampling, replacing the conventional forecast-update cycle with compute-dense, GPU-efficient inference. At the core is STORM, a novel spatiotemporal transformer with a global attention linear-complexity scaling algorithm that breaks the quadratic attention barrier. On 32,768 GPUs of the Frontier supercomputer, our method achieves 63% strong scaling efficiency and 1.6 ExaFLOP sustained performance. We further scale to 20 billion spatiotemporal tokens, enabling km-scale global modeling over 177k temporal frames, regimes previously unreachable, establishing a new paradigm for Earth system prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Circularity Check

0 steps flagged

No circularity: performance claims are empirical measurements, not derived by construction

full rationale

The abstract presents the 63% strong-scaling efficiency, 1.6 ExaFLOP sustained performance, and scaling to 20 billion spatiotemporal tokens as direct measurements obtained from runs on 32,768 GPUs of the Frontier supercomputer. The STORM linear-complexity global-attention algorithm is introduced as a novel component that enables these regimes, but no equations, derivations, or self-citations are shown that reduce the reported performance numbers to fitted parameters, renamed inputs, or tautological definitions. The one-stage generative DA reformulation is a methodological choice whose fidelity is asserted as an assumption rather than established by appeal to prior self-referential results. The reported numbers therefore stand as empirical measurements against external hardware benchmarks and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Review performed from abstract only; full paper would be required to enumerate all free parameters, axioms, and invented entities. The abstract implies reliance on standard transformer assumptions and the validity of the generative posterior-sampling reformulation.

axioms (2)
  • domain assumption — Global attention can be reformulated to achieve linear complexity while retaining sufficient expressivity for spatiotemporal Earth-system data.
    Central to the STORM architecture described in the abstract.
  • domain assumption — One-stage generative sampling can replace the iterative forecast-update cycle without loss of assimilation accuracy.
    Foundational premise of the unified framework.
invented entities (1)
  • STORM spatiotemporal transformer — no independent evidence
    purpose: To implement global attention with linear complexity for large-scale generative data assimilation.
    New architecture introduced by the paper.

pith-pipeline@v0.9.0 · 5512 in / 1415 out tokens · 38849 ms · 2026-05-10T08:46:29.006929+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 1 canonical work page · 1 internal anchor

  1. [1] K. Zhang et al., “The E3SM-MMF case study: A case study in global cloud-resolving multi-scale modeling at exascale,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC23), ACM, 2023.

  2. [2] W. Lin et al., “Pushing the frontier: Global cloud-resolving climate simulations at 1 km resolution on the Frontier exascale system,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC25), ACM, 2025.

  3. [3] T. Kurth et al., “FourCastNet-V2: Multi-scale global data-driven weather forecasting at 0.1 degree resolution on exascale systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC24), ACM, 2024.

  4. [4] M. Taylor et al., “The simple cloud-resolving E3SM atmosphere model running on the Frontier exascale system,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–11.

  5. [5] D. Klocke et al., “Computing the full earth system at 1 km resolution,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025, pp. 125–136.

  6. [6] R. Lam et al., “Learning skillful medium-range global weather forecasting,” Science, vol. 382, no. 6677, pp. 1416–1424, 2023.

  7. [7] K. Bi, L. Xie, H. Zhang, X. Chen, L. Gu, and Q. Tian, “Accurate medium-range global weather forecasting with 3D neural networks,” Nature, vol. 619, no. 7970, pp. 533–538, 2023.

  8. [8] J. Pathak et al., “FourCastNet: A global data-driven high-resolution weather forecasting model using adaptive Fourier neural operators,” arXiv preprint arXiv:2202.11214, 2022.

  9. [9] E. N. Lorenz, “Deterministic nonperiodic flow,” Journal of the Atmospheric Sciences, vol. 20, no. 2, pp. 130–141, 1963.

  10. [10] R. Rotunno and C. Snyder, “A generalization of Lorenz’s model for the predictability of flows with many scales of motion,” Journal of the Atmospheric Sciences, vol. 65, no. 3, pp. 1063–1076, 2008.

  11. [11] R. Rotunno, C. Snyder, and F. Judt, “Upscale versus ‘up-amplitude’ growth of forecast-error spectra,” Journal of the Atmospheric Sciences, vol. 80, no. 1, pp. 63–72, 2023.

  12. [12] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 813–824.

  13. [13] T. N. Palmer, “The ECMWF ensemble prediction system: Looking back (more than) 25 years and projecting forward 25 years,” Quarterly Journal of the Royal Meteorological Society, vol. 145, no. S1, pp. 12–24, 2018.

  14. [14] L. Huang et al., “DiffDA: A diffusion model for weather-scale data assimilation,” in Proceedings of the 41st International Conference on Machine Learning (ICML ’24), 2024.

  15. [15] D. Hodyss and M. Morzfeld, “Using diffusion models to do data assimilation,” Monthly Weather Review, vol. 153, no. 6, pp. 1245–1262, 2025.

  16. [16] P. Manshausen et al., “Generative data assimilation of sparse weather station observations at kilometer scales,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2154–2163.

  17. [17] F. Bao, Z. Zhang, and G. Zhang, “A score-based filter for nonlinear data assimilation,” Journal of Computational Physics, vol. 514, p. 113207, 2024.

  18. [18] F. Bao, Z. Zhang, and G. Zhang, “An ensemble score filter for tracking high-dimensional nonlinear dynamical systems,” Computer Methods in Applied Mechanics and Engineering, vol. 432, part B, p. 117447, 2024.

  19. [19] F. Bao, H. Chipilski, S. Liang, G. Zhang, and J. Whitaker, “Nonlinear ensemble filtering with diffusion models: Application to the surface quasi-geostrophic dynamics,” Monthly Weather Review, vol. 153, no. 7, pp. 1155–1169, 2025.

  20. [20] I. Price et al., “GenCast: Diffusion-based ensemble forecasting for medium-range weather,” Nature, 2024.

  21. [21] Z. Liu et al., “Swin Transformer V2: Scaling up capacity and resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 12009–12019.

  22. [22] Z. Tu et al., “MaxViT: Multi-axis vision transformer,” ECCV, 2022.

  23. [23] X. Wang et al., “ORBIT-2: Scaling exascale vision foundation models for weather and climate downscaling,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’25), ACM, 2025, pp. 86–98.

  24. [24] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” in Proc. NeurIPS, 2022.

  25. [25] T. Dao, “FlashAttention-2: Faster attention with better parallelism and work partitioning,” 2023.

  26. [26] X. Wang et al., “ORBIT: Oak Ridge Base foundation model for earth system predictability,” SC ’24, 2024.

  27. [27] H. Hersbach et al., “The ERA5 global reanalysis,” Quarterly Journal of the Royal Meteorological Society, vol. 146, no. 730, pp. 1999–2049, 2020.

  28. [28] D. C. Dowell et al., “The High-Resolution Rapid Refresh (HRRR): An hourly updating convection-allowing forecast model. Part I: Motivation and system description,” Weather and Forecasting, vol. 37, no. 8, pp. 1371–1395, 2022.

  29. [29] National Weather Service, “KMZ event files for Hurricanes Laura, Delta, Michael, and Teddy,” weather event KMZ files, 2018 and 2020; individual storm-specific KMZ files obtained from weather.gov.