pith. machine review for the scientific record.

arxiv: 2604.16590 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.AI


Global Attention with Linear Complexity for Exascale Generative Data Assimilation in Earth System Prediction


Pith reviewed 2026-05-10 08:46 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords earth · prediction · system · assimilation · attention · global · accurate · data

The pith

STORM achieves linear-complexity global attention in a generative DA framework, scaling to 20 billion tokens and 1.6 ExaFLOPs on 32k GPUs for km-scale Earth modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Data assimilation estimates the current state of the atmosphere and ocean by blending observations with simulation models. Conventional approaches alternate between forecasting the next state and updating it with new data, which can be computationally expensive at large scales. This work replaces the cycle with a single generative step that samples directly from the probability distribution of possible states given the observations. The core technical advance is a transformer architecture whose attention mechanism scales linearly rather than quadratically with the number of data points, allowing it to process billions of spatiotemporal tokens. The authors demonstrate the approach on the Frontier supercomputer, reporting 63 percent strong-scaling efficiency and sustained exascale performance while modeling at kilometer resolution over many time steps.
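The one-stage idea, sampling states directly from the posterior given observations, can be illustrated with a toy far simpler than the paper's diffusion transformer. A minimal sketch, assuming a 1-D Gaussian state and a direct noisy observation; Langevin dynamics stands in for the generative sampler, and nothing here is from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the Earth-system state: prior x ~ N(mu0, s0^2),
# observation y = x + noise with noise ~ N(0, r^2).
mu0, s0, r = 0.0, 2.0, 0.5
x_true = rng.normal(mu0, s0)
y = x_true + rng.normal(0.0, r)

# Closed-form Gaussian posterior, used only to check the sampler.
s_post2 = 1.0 / (1.0 / s0**2 + 1.0 / r**2)
mu_post = s_post2 * (mu0 / s0**2 + y / r**2)

def posterior_score(x):
    """Gradient of log p(x | y): prior score plus likelihood score."""
    return (mu0 - x) / s0**2 + (y - x) / r**2

# One-stage sampling: run Langevin dynamics on the posterior directly,
# with no forecast-update cycle. Each chain member is one ensemble sample.
eps = 0.01
ens = rng.normal(size=5000)
for _ in range(2000):
    ens = ens + eps * posterior_score(ens) + np.sqrt(2 * eps) * rng.normal(size=ens.shape)

print(ens.mean(), mu_post)  # ensemble mean approaches the analytic posterior mean
```

In this toy the posterior is available in closed form, so the sampler can be checked directly; in the paper's setting the score comes from a learned diffusion model and the state has billions of tokens, but the sampling principle is the same.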

Core claim

On 32,768 GPUs of the Frontier supercomputer, our method achieves 63% strong scaling efficiency and 1.6 ExaFLOP sustained performance. We further scale to 20 billion spatiotemporal tokens, enabling km-scale global modeling over 177k temporal frames.
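As a reminder of how the headline number is defined: strong scaling fixes the problem size while growing the GPU count, and efficiency is measured speedup divided by ideal speedup. A minimal sketch; the baseline GPU count and timings below are invented for illustration, since the abstract reports only the end-to-end 63% figure:

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Measured speedup over ideal speedup for a fixed problem size."""
    ideal = n / n_base
    measured = t_base / t_n
    return measured / ideal

# Hypothetical numbers: 100 s per step on 4,096 GPUs, 19.8 s on 32,768 GPUs.
eff = strong_scaling_efficiency(100.0, 4096, 19.8, 32768)
print(f"{eff:.0%}")
```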

Load-bearing premise

The generative one-stage framework accurately samples the Bayesian posterior for Earth-system states without the conventional forecast-update cycle, and the linear-complexity attention preserves the fidelity needed for reliable assimilation at scale.

Figures

Figures reproduced from arXiv: 2604.16590 by Ashwin M. Aji, Dan Lu, Feng Bao, Guannan Zhang, Hong-Jun Yoon, Hristo G. Chipilski, Isaac Lyngaas, Janet Wang, Jong-Youl Choi, Peter Jan van Leeuwen, Siming Liang, Xiao Wang, Zezhong Zhang.

Figure 1
Figure 1: Overview of the TimeSFormer operations with O(K²N + KN²) complexity.
Figure 2
Figure 2: (a) Overview of the one-stage DA workflow. (b) Overview of the STORM architecture. Historical states are compressed into a global temporal representation, while the current state remains at full resolution. Noise-gated spatial and temporal attention decouple space–time interactions, reducing complexity from O(K²N + KN²) to O(N²) while preserving global correlations.
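The complexity bookkeeping in the caption can be sanity-checked with toy token counts, K temporal frames of N spatial tokens each (the sizes below are illustrative, not the paper's):

```python
# K frames, N spatial tokens per frame (illustrative sizes).
K, N = 64, 4096

joint      = (K * N) ** 2           # full spatiotemporal attention, O((KN)^2)
factorized = K**2 * N + K * N**2    # TimeSFormer-style split, O(K^2 N + K N^2)
storm      = N**2                   # caption's O(N^2): history compressed,
                                    # only the current frame at full resolution

print(joint // storm, factorized // storm)   # 4096, 65
```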
Figure 3
Figure 3: (a) Tiling achieves linear complexity but limits interactions to local regions, while halo overlap improves continuity but remains local. STORM averages denoised outputs (gradients) in overlapping regions and propagates them iteratively across tiles, enabling global context with linear complexity. Hanning weighting stabilizes boundary interactions. (b) Hierarchical mapping of parallelism strategies to supe…
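The Hanning-weighted overlap averaging in panel (a) can be sketched in one dimension. A minimal, hypothetical `blend_tiles` helper; the tile and overlap sizes are made up, and the paper's iterative propagation of denoised gradients across tiles is not reproduced, only the weighted stitching:

```python
import numpy as np

def blend_tiles(signal, tile=64, overlap=16, process=lambda t: t):
    """Process overlapping tiles and blend them with Hanning weights."""
    step = tile - overlap
    out = np.zeros(len(signal))
    wsum = np.zeros(len(signal))
    window = np.hanning(tile)
    for start in range(0, len(signal) - overlap, step):
        end = min(start + tile, len(signal))
        w = window[: end - start]
        out[start:end] += process(signal[start:end]) * w
        wsum[start:end] += w
    # Normalize by the accumulated window weight at each position.
    return out / np.maximum(wsum, 1e-12)

x = np.random.default_rng(1).normal(size=256)
y = blend_tiles(x)   # identity `process`: interior samples reconstruct exactly
print(np.allclose(y[1:-1], x[1:-1]))   # endpoints carry zero Hanning weight
```

In STORM this averaging is applied to denoised outputs in the halo regions and iterated, so information propagates from tile to tile and global context emerges at linear cost.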
Figure 5
Figure 5: Strong scaling efficiencies at 10M and 10B model parameters, scaling to 32,768 GPUs with 61% to 64% strong scaling efficiency at 1.6 ExaFLOP sustained computing throughput.
Figure 6
Figure 6: Hurricane track skill: ensemble trajectories of four hurricanes. Black: STORM forecast-only ensembles (S_prior); blue: one-stage DA ensembles (S_posterior) with 20% observations; red: ground-truth reference [29]. DA reduces ensemble spread and improves agreement with the reference trajectories, demonstrating improved uncertainty quantification and trajectory accuracy.
Figure 7
Figure 7: Hurricane intensity skill: each row corresponds to one of the four hurricane cases.
Figure 8
Figure 8: Regional high-resolution temperature simulations: (a) Ground-truth 2-m temperature at 72-hour lead time. (b) Forecast-only ensemble mean (S_prior), showing large regional errors. (c) One-stage DA posterior mean (S_posterior) with 50% observations. (e, f) Corresponding absolute errors, where DA significantly reduces errors, particularly in regions with strong diurnal heating (e.g., the US–Mexico border). (d) RMSE …
Original abstract

Accurate weather and climate prediction relies on data assimilation (DA), which estimates the Earth system state by integrating observations with models. While exascale computing has significantly advanced earth simulation, scalable and accurate inference of the Earth system state remains a fundamental bottleneck, limiting uncertainty quantification and prediction of extreme events. We introduce a unified one-stage generative DA framework that reformulates assimilation as Bayesian posterior sampling, replacing the conventional forecast-update cycle with compute-dense, GPU-efficient inference. At the core is STORM, a novel spatiotemporal transformer with a global attention linear-complexity scaling algorithm that breaks the quadratic attention barrier. On 32,768 GPUs of the Frontier supercomputer, our method achieves 63% strong scaling efficiency and 1.6 ExaFLOP sustained performance. We further scale to 20 billion spatiotemporal tokens, enabling km-scale global modeling over 177k temporal frames, regimes previously unreachable, establishing a new paradigm for Earth system prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Circularity Check

0 steps flagged

No circularity: performance claims are empirical measurements, not derived by construction

full rationale

The abstract presents the 63% strong-scaling efficiency, 1.6 ExaFLOP sustained performance, and scaling to 20 billion spatiotemporal tokens as direct measurements obtained from runs on 32,768 GPUs of the Frontier supercomputer. The STORM linear-complexity global-attention algorithm is introduced as a novel component that enables these regimes, but no equations, derivations, or self-citations are shown that reduce the reported performance numbers to fitted parameters, renamed inputs, or tautological definitions. The one-stage generative DA reformulation is a methodological choice whose fidelity is asserted as an assumption rather than established by appeal to prior self-referential results. The reported numbers therefore stand as empirical measurements against external hardware benchmarks and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Review performed from abstract only; full paper would be required to enumerate all free parameters, axioms, and invented entities. The abstract implies reliance on standard transformer assumptions and the validity of the generative posterior-sampling reformulation.

axioms (2)
  • domain assumption — Global attention can be reformulated to achieve linear complexity while retaining sufficient expressivity for spatiotemporal Earth-system data.
    Central to the STORM architecture described in the abstract.
  • domain assumption — One-stage generative sampling can replace the iterative forecast-update cycle without loss of assimilation accuracy.
    Foundational premise of the unified framework.
invented entities (1)
  • STORM spatiotemporal transformer — no independent evidence
    purpose: To implement global attention with linear complexity for large-scale generative data assimilation.
    New architecture introduced by the paper.

pith-pipeline@v0.9.0 · 5512 in / 1415 out tokens · 38849 ms · 2026-05-10T08:46:29.006929+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 1 canonical work page · 1 internal anchor

  1. [1] K. Zhang et al., “The E3SM-MMF case study: A case study in global cloud-resolving multi-scale modeling at exascale,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC23), ACM, 2023.

  2. [2] W. Lin et al., “Pushing the frontier: Global cloud-resolving climate simulations at 1 km resolution on the Frontier exascale system,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC25), ACM, 2025.

  3. [3] T. Kurth et al., “FourCastNet-V2: Multi-scale global data-driven weather forecasting at 0.1 degree resolution on exascale systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC24), ACM, 2024.

  4. [4] M. Taylor et al., “The simple cloud-resolving E3SM atmosphere model running on the Frontier exascale system,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–11.

  5. [5] D. Klocke et al., “Computing the full earth system at 1 km resolution,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025, pp. 125–136.

  6. [6] R. Lam et al., “Learning skillful medium-range global weather forecasting,” Science, vol. 382, no. 6677, pp. 1416–1424, 2023.

  7. [7] K. Bi, L. Xie, H. Zhang, X. Chen, L. Gu, and Q. Tian, “Accurate medium-range global weather forecasting with 3D neural networks,” Nature, vol. 619, no. 7970, pp. 533–538, 2023.

  8. [8] J. Pathak et al., “FourCastNet: A global data-driven high-resolution weather forecasting model using adaptive Fourier neural operators,” arXiv preprint arXiv:2202.11214, 2022.

  9. [9] E. N. Lorenz, “Deterministic nonperiodic flow,” Journal of the Atmospheric Sciences, vol. 20, no. 2, pp. 130–141, 1963.

  10. [10] R. Rotunno and C. Snyder, “A generalization of Lorenz’s model for the predictability of flows with many scales of motion,” Journal of the Atmospheric Sciences, vol. 65, no. 3, pp. 1063–1076, 2008.

  11. [11] R. Rotunno, C. Snyder, and F. Judt, “Upscale versus ‘up-amplitude’ growth of forecast-error spectra,” Journal of the Atmospheric Sciences, vol. 80, no. 1, pp. 63–72, 2023.

  12. [12] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 813–824.

  13. [13] T. N. Palmer, “The ECMWF ensemble prediction system: Looking back (more than) 25 years and projecting forward 25 years,” Quarterly Journal of the Royal Meteorological Society, vol. 145, no. S1, pp. 12–24, 2018.

  14. [14] L. Huang et al., “DiffDA: A diffusion model for weather-scale data assimilation,” in Proceedings of the 41st International Conference on Machine Learning (ICML ’24), 2024.

  15. [15] D. Hodyss and M. Morzfeld, “Using diffusion models to do data assimilation,” Monthly Weather Review, vol. 153, no. 6, pp. 1245–1262, 2025.

  16. [16] P. Manshausen et al., “Generative data assimilation of sparse weather station observations at kilometer scales,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2154–2163.

  17. [17] F. Bao, Z. Zhang, and G. Zhang, “A score-based filter for nonlinear data assimilation,” Journal of Computational Physics, vol. 514, p. 113207, 2024.

  18. [18] F. Bao, Z. Zhang, and G. Zhang, “An ensemble score filter for tracking high-dimensional nonlinear dynamical systems,” Computer Methods in Applied Mechanics and Engineering, vol. 432, part B, p. 117447, 2024.

  19. [19] F. Bao, H. Chipilski, S. Liang, G. Zhang, and J. Whitaker, “Nonlinear ensemble filtering with diffusion models: Application to the surface quasi-geostrophic dynamics,” Monthly Weather Review, vol. 153, no. 7, pp. 1155–1169, 2025.

  20. [20] I. Price et al., “GenCast: Diffusion-based ensemble forecasting for medium-range weather,” Nature, 2024.

  21. [21] Z. Liu et al., “Swin Transformer V2: Scaling up capacity and resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 12009–12019.

  22. [22] Z. Tu et al., “MaxViT: Multi-axis vision transformer,” ECCV, 2022.

  23. [23] X. Wang et al., “ORBIT-2: Scaling exascale vision foundation models for weather and climate downscaling,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’25), ACM, 2025, pp. 86–98.

  24. [24] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” in Proc. NeurIPS, 2022.

  25. [25] T. Dao, “FlashAttention-2: Faster attention with better parallelism and work partitioning,” 2023.

  26. [26] X. Wang et al., “ORBIT: Oak Ridge Base foundation model for earth system predictability,” SC ’24, 2024.

  27. [27] H. Hersbach et al., “The ERA5 global reanalysis,” Quarterly Journal of the Royal Meteorological Society, vol. 146, no. 730, pp. 1999–2049, 2020.

  28. [28] D. C. Dowell et al., “The High-Resolution Rapid Refresh (HRRR): An hourly updating convection-allowing forecast model. Part I: Motivation and system description,” Weather and Forecasting, vol. 37, no. 8, pp. 1371–1395, 2022.

  29. [29] National Weather Service, “KMZ event files for Hurricanes Laura, Delta, Michael, and Teddy,” weather event KMZ files, 2018 and 2020; individual storm-specific KMZ files obtained from weather.gov.