pith. machine review for the scientific record.

arxiv: 2604.05068 · v1 · submitted 2026-04-06 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Towards Scaling Law Analysis For Spatiotemporal Weather Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords scaling laws, weather forecasting, autoregressive models, spatiotemporal data, neural scaling, error compounding, multi-channel prediction

The pith

Scaling laws for autoregressive weather models show that global pooling masks per-channel degradation at long horizons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends neural scaling analysis from single-step training loss to long autoregressive rollouts and per-channel metrics in weather forecasting. It measures how prediction error distributes across physical channels, how growth rates change with forecast horizon, and whether power-law scaling holds under parameter, data, and compute axes when errors are pooled globally. A reader cares because weather outputs couple variables with very different scales and predictability, and autoregressive error compounding can make short-horizon training misleading for long-lead performance. The central result is strong cross-channel and cross-horizon heterogeneity: pooled scaling often looks favorable while many channels degrade at late leads.

Core claim

We extend neural scaling analysis for autoregressive weather forecasting from single-step training loss to long rollouts and per-channel metrics. We quantify (1) how prediction error is distributed across channels and how its growth rate evolves with forecast horizon, (2) whether power-law scaling holds for test error relative to rollout length when error is pooled globally, and (3) how that fit varies jointly with horizon and channel for parameter, data, and compute-based scaling axes. We find strong cross-channel and cross-horizon heterogeneity: pooled scaling can look favorable while many channels degrade at late leads.

What carries the argument

Joint scaling of test error across forecast horizons and physical channels under autoregressive rollouts, using globally pooled versus per-channel decompositions.
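
The pooled-versus-per-channel contrast is easy to see in a small sketch. The snippet below is illustrative, not the paper's code: it assumes a WeatherBench-style cos-latitude area weighting and invents synthetic error fields in which one of three channels has degraded badly at a late lead, then shows how the pooled number stays moderate.

```python
import numpy as np

def area_weighted_rmse(pred, target, lats):
    """Cos-latitude area-weighted RMSE per channel (WeatherBench-style).

    pred, target: arrays of shape (channel, lat, lon).
    lats: latitude centers in degrees, shape (lat,).
    """
    w = np.cos(np.deg2rad(lats))
    w = w / w.mean()                       # weights average to 1 over the grid
    sq_err = (pred - target) ** 2
    return np.sqrt((sq_err * w[None, :, None]).mean(axis=(1, 2)))

rng = np.random.default_rng(0)
lats = np.linspace(-89.0, 89.0, 90)
target = rng.normal(size=(3, 90, 180))

# Two well-behaved channels and one that has degraded badly at a late lead.
noise_scale = np.array([0.1, 0.1, 2.0])
pred = target + noise_scale[:, None, None] * rng.normal(size=(3, 90, 180))

per_channel = area_weighted_rmse(pred, target, lats)
# With equal channel sizes, pooled MSE is the mean of per-channel MSEs.
pooled = np.sqrt((per_channel ** 2).mean())
```

With these synthetic scales the two stable channels sit near 0.1 while the degraded one sits near 2.0, yet the pooled RMSE lands in between: exactly how a global average can look acceptable while one channel is failing.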

If this is right

  • Weighted loss functions are needed to balance channels whose error grows differently with horizon.
  • Horizon-aware training curricula can reduce late-lead degradation without extra compute.
  • Resource allocation should prioritize channels that actually improve under scaling rather than relying on global averages.
  • Model selection and hyperparameter search based solely on pooled metrics can select suboptimal configurations for operational long-range forecasts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar heterogeneity may appear in any autoregressive physical simulation task where outputs have unequal predictability.
  • Evaluation protocols for climate or fluid models may need routine per-variable, per-lead decomposition to avoid over-optimism from aggregates.
  • If the pattern holds across datasets, scaling studies in spatiotemporal domains should report both pooled and disaggregated curves as standard practice.

Load-bearing premise

That globally pooled test metrics and per-channel late-lead behavior can be meaningfully compared under the same scaling axes without confounding from autoregressive error compounding or channel-specific normalization.

What would settle it

Direct measurement of whether individual channels continue to follow power-law error reduction at long horizons when model size or data volume increases, or instead saturate or worsen.
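
That measurement reduces to fitting a per-channel exponent and checking its sign. A hedged sketch with invented numbers (the parameter counts and error values below are hypothetical, not the paper's data): regress log error on log model size for each channel and classify it as improving, saturating, or worsening.

```python
import numpy as np

# Hypothetical late-lead test errors for three channels at four model sizes.
params = np.array([1e7, 1e8, 1e9, 1e10])
errors = np.array([
    [0.80, 0.50, 0.32, 0.20],  # keeps improving roughly as a power law
    [0.60, 0.55, 0.54, 0.53],  # saturates: scaling buys almost nothing
    [0.40, 0.45, 0.55, 0.70],  # worsens with scale at this lead
])

def scaling_exponent(n, e):
    """Least-squares slope of log e vs log n, i.e. the fitted exponent b in
    e ~ a * n**b. Negative b means the channel still benefits from scaling."""
    slope, _intercept = np.polyfit(np.log(n), np.log(e), 1)
    return slope

exponents = np.array([scaling_exponent(params, row) for row in errors])
```

A pooled fit over all three rows would average these slopes together and report net improvement, even though only the first channel actually follows the favorable power law.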

Figures

Figures reproduced from arXiv: 2604.05068 by Alexander Kiefer, Prasanna Balaprakash, Xiao Wang.

Figure 1. Swin weather forecaster: patch embedding of the multi-channel field, stacked shifted-window Swin blocks, and …
Figure 2. Distributed communication in hybrid DP–SP.
Figure 3. Per-channel area-weighted RMSE at six hours.
Figure 4. Average time derivative of area-weighted RMSE, …
Figure 5. Global scaling-law diagnostics for log–log fits.
Original abstract

Compute-optimal scaling laws are relatively well studied for NLP and CV, where objectives are typically single-step and targets are comparatively homogeneous. Weather forecasting is harder to characterize in the same framework: autoregressive rollouts compound errors over long horizons, outputs couple many physical channels with disparate scales and predictability, and globally pooled test metrics can disagree sharply with per-channel, late-lead behavior implied by short-horizon training. We extend neural scaling analysis for autoregressive weather forecasting from single-step training loss to long rollouts and per-channel metrics. We quantify (1) how prediction error is distributed across channels and how its growth rate evolves with forecast horizon, (2) whether power-law scaling holds for test error relative to rollout length when error is pooled globally, and (3) how that fit varies jointly with horizon and channel for parameter, data, and compute-based scaling axes. We find strong cross-channel and cross-horizon heterogeneity: pooled scaling can look favorable while many channels degrade at late leads. We discuss implications for weighted objectives, horizon-aware curricula, and resource allocation across outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper extends neural scaling analysis from single-step training loss to long autoregressive rollouts and per-channel metrics in spatiotemporal weather forecasting. It quantifies error distribution across channels and horizons, tests whether power-law scaling holds for globally pooled test error as a function of rollout length, and examines how scaling fits vary jointly with horizon and channel across parameter, data, and compute axes. The central empirical finding is strong cross-channel and cross-horizon heterogeneity: globally pooled scaling can appear favorable while many individual channels degrade at late leads.

Significance. If the reported heterogeneity is robust to the noted confounds, the result would be significant for scaling-law research in scientific ML. It demonstrates that standard pooled metrics can mask per-output degradation in heterogeneous, autoregressive settings, with direct implications for objective weighting, curriculum design, and compute allocation in weather and climate modeling. The work correctly identifies a gap between NLP/CV scaling practices and the demands of multi-channel spatiotemporal forecasting.

major comments (1)
  1. [Abstract / Methods] The central claim of favorable pooled scaling coexisting with per-channel degradation at late leads (abstract) requires that globally pooled test metrics and per-channel late-lead errors are comparable on the same scaling axes. However, the manuscript provides no description of channel-wise normalization, loss weighting, or correction for differential autoregressive error compounding rates across variables (e.g., temperature vs. precipitation). Without these controls, the heterogeneity could be an artifact of metric construction rather than a genuine scaling phenomenon.
minor comments (1)
  1. [Abstract] The abstract states that power-law scaling is tested 'relative to rollout length when error is pooled globally,' but does not specify the functional form, fitting procedure, or range of rollout lengths used.
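
As a reference point for the referee's question: the most common procedure in scaling studies is an ordinary least-squares fit in log–log space, i.e. fitting E(t) = a·t^b by regressing log E on log t. Whether the paper uses this form is exactly what the minor comment asks it to state. A minimal sketch on synthetic data:

```python
import numpy as np

t = np.arange(1, 41)               # hypothetical rollout lengths (lead steps)
a_true, b_true = 0.05, 0.7
pooled_err = a_true * t ** b_true  # synthetic pooled error obeying an exact power law

# Fit log E = log a + b log t; polyfit returns (slope, intercept).
b_fit, log_a_fit = np.polyfit(np.log(t), np.log(pooled_err), 1)
a_fit = np.exp(log_a_fit)

# Residuals in log space are near zero only if the power-law form actually
# holds, so reporting them (not just the fit) is what settles the question.
resid = np.log(pooled_err) - (log_a_fit + b_fit * np.log(t))
```

On real rollout errors the residuals, the fitted range of t, and any excluded early leads would all need to be reported for the fit to be reproducible.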

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the need to explicitly document metric construction details to support the central claim. We address the concern point-by-point below and will revise the manuscript to include the requested clarifications.

Point-by-point responses
  1. Referee: [Abstract / Methods] The central claim of favorable pooled scaling coexisting with per-channel degradation at late leads (abstract) requires that globally pooled test metrics and per-channel late-lead errors are comparable on the same scaling axes. However, the manuscript provides no description of channel-wise normalization, loss weighting, or correction for differential autoregressive error compounding rates across variables (e.g., temperature vs. precipitation). Without these controls, the heterogeneity could be an artifact of metric construction rather than a genuine scaling phenomenon.

    Authors: We agree that explicit documentation of normalization and evaluation choices is necessary for the claim to be robust. In the reported experiments, every channel was normalized independently to zero mean and unit variance using training-set statistics prior to both training and error computation; this is the standard preprocessing in WeatherBench-style benchmarks and ensures that RMSE values are on comparable scales across variables with different physical units. No per-channel loss weighting was used at training or test time, precisely so that the natural differences in predictability and error growth would remain visible. Autoregressive compounding rates are not corrected for; instead, the analysis deliberately measures raw per-channel error growth over increasing rollout lengths to quantify the heterogeneity. We will add a dedicated paragraph in the Methods section (and a short note in the abstract) describing the per-channel normalization, confirming the absence of weighting or post-hoc corrections, and stating that all pooled and per-channel metrics are computed on the same normalized fields. This revision should eliminate the possibility that the reported heterogeneity is an artifact of metric construction. Revision: yes.
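
The preprocessing the rebuttal describes, independent per-channel z-scoring with statistics frozen on the training split, can be sketched as follows; the channel scales here are invented stand-ins for variables with different physical units.

```python
import numpy as np

rng = np.random.default_rng(1)
# Fields of shape (time, channel, lat, lon); the two channels sit on very
# different scales, standing in for e.g. temperature (K) vs wind (m/s).
train = np.empty((200, 2, 16, 32))
train[:, 0] = rng.normal(280.0, 15.0, size=(200, 16, 32))
train[:, 1] = rng.normal(5.0, 8.0, size=(200, 16, 32))

# Statistics come from the training split only and are frozen for evaluation,
# so training and test errors land on the same per-channel scale.
mu = train.mean(axis=(0, 2, 3), keepdims=True)
sigma = train.std(axis=(0, 2, 3), keepdims=True)

def normalize(x):
    """Z-score each channel with the frozen training-set statistics."""
    return (x - mu) / sigma

train_n = normalize(train)
```

After this step a per-channel RMSE of, say, 0.5 means the same thing for every variable, which is the comparability the major comment is probing.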

Circularity Check

0 steps flagged

No significant circularity; empirical scaling observations are self-contained.

full rationale

The paper reports empirical measurements of error growth in autoregressive weather models across channels, horizons, and scaling axes (parameters, data, compute). It quantifies heterogeneity between globally pooled metrics and per-channel late-lead behavior but does not derive any result from prior assumptions that presuppose the heterogeneity. No equations, self-citations, or fitted parameters are presented as predictions in the abstract or described claims. The central statements are direct experimental findings rather than reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the work is framed as empirical quantification rather than derivation from first principles.

pith-pipeline@v0.9.0 · 5484 in / 998 out tokens · 71930 ms · 2026-05-10T18:55:04.202339+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 17 canonical work pages · 2 internal anchors
