Recognition: 2 theorem links
Towards Scaling Law Analysis For Spatiotemporal Weather Data
Pith reviewed 2026-05-10 18:55 UTC · model grok-4.3
The pith
Scaling laws for autoregressive weather models show global pooling masks per-channel degradation at long horizons.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We extend neural scaling analysis for autoregressive weather forecasting from single-step training loss to long rollouts and per-channel metrics. We quantify (1) how prediction error is distributed across channels and how its growth rate evolves with forecast horizon, (2) whether power-law scaling holds for test error relative to rollout length when error is pooled globally, and (3) how that fit varies jointly with horizon and channel along parameter-, data-, and compute-based scaling axes. We find strong cross-channel and cross-horizon heterogeneity: pooled scaling can look favorable while many channels degrade at late leads.
What carries the argument
Joint scaling of test error across forecast horizons and physical channels under autoregressive rollouts, using globally pooled versus per-channel decompositions.
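The pooled-versus-per-channel contrast can be made concrete with a toy calculation. This is a minimal sketch with made-up exponents, not the paper's measurements: two channels follow favorable power laws in a scale axis s while a third hypothetical channel worsens, yet the globally pooled curve still improves.

```python
import numpy as np

# Synthetic illustration (not the paper's data): per-channel RMSE at a
# fixed late lead as a function of a scale axis s (e.g. parameters).
# All amplitudes and exponents below are invented for the demo.
s = np.array([1.0, 2.0, 4.0, 8.0])
ch_rmse = np.stack([
    1.0 * s ** -0.3,   # channel 0: improves with scale
    0.8 * s ** -0.3,   # channel 1: improves with scale
    0.2 * s ** +0.1,   # channel 2: degrades with scale
])

# Global pooling: RMSE over all channels jointly (mean of per-channel MSEs).
pooled_rmse = np.sqrt((ch_rmse ** 2).mean(axis=0))

assert np.all(np.diff(pooled_rmse) < 0)  # pooled curve looks favorable...
assert np.all(np.diff(ch_rmse[2]) > 0)   # ...while one channel degrades
```

The pooled curve is dominated by the large, improving channels, which is exactly how aggregation can mask per-channel degradation.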
If this is right
- Weighted loss functions are needed to balance channels whose error grows differently with horizon.
- Horizon-aware training curricula can reduce late-lead degradation without extra compute.
- Resource allocation should prioritize channels that actually improve under scaling rather than relying on global averages.
- Model selection and hyperparameter search based solely on pooled metrics can select suboptimal configurations for operational long-range forecasts.
Where Pith is reading between the lines
- Similar heterogeneity may appear in any autoregressive physical simulation task where outputs have unequal predictability.
- Evaluation protocols for climate or fluid models may need routine per-variable, per-lead decomposition to avoid over-optimism from aggregates.
- If the pattern holds across datasets, scaling studies in spatiotemporal domains should report both pooled and disaggregated curves as standard practice.
Load-bearing premise
That globally pooled test metrics and per-channel late-lead behavior can be meaningfully compared under the same scaling axes without confounding from autoregressive error compounding or channel-specific normalization.
What would settle it
Direct measurement of whether individual channels continue to follow power-law error reduction at long horizons when model size or data volume increases, or instead saturate or worsen.
Original abstract
Compute-optimal scaling laws are relatively well studied for NLP and CV, where objectives are typically single-step and targets are comparatively homogeneous. Weather forecasting is harder to characterize in the same framework: autoregressive rollouts compound errors over long horizons, outputs couple many physical channels with disparate scales and predictability, and globally pooled test metrics can disagree sharply with per-channel, late-lead behavior implied by short-horizon training. We extend neural scaling analysis for autoregressive weather forecasting from single-step training loss to long rollouts and per-channel metrics. We quantify (1) how prediction error is distributed across channels and how its growth rate evolves with forecast horizon, (2) whether power-law scaling holds for test error relative to rollout length when error is pooled globally, and (3) how that fit varies jointly with horizon and channel along parameter-, data-, and compute-based scaling axes. We find strong cross-channel and cross-horizon heterogeneity: pooled scaling can look favorable while many channels degrade at late leads. We discuss implications for weighted objectives, horizon-aware curricula, and resource allocation across outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends neural scaling analysis from single-step training loss to long autoregressive rollouts and per-channel metrics in spatiotemporal weather forecasting. It quantifies error distribution across channels and horizons, tests whether power-law scaling holds for globally pooled test error as a function of rollout length, and examines how scaling fits vary jointly with horizon and channel across parameter, data, and compute axes. The central empirical finding is strong cross-channel and cross-horizon heterogeneity: globally pooled scaling can appear favorable while many individual channels degrade at late leads.
Significance. If the reported heterogeneity is robust to the noted confounds, the result would be significant for scaling-law research in scientific ML. It demonstrates that standard pooled metrics can mask per-output degradation in heterogeneous, autoregressive settings, with direct implications for objective weighting, curriculum design, and compute allocation in weather and climate modeling. The work correctly identifies a gap between NLP/CV scaling practices and the demands of multi-channel spatiotemporal forecasting.
Major comments (1)
- [Abstract / Methods] The central claim of favorable pooled scaling coexisting with per-channel degradation at late leads (abstract) requires that globally pooled test metrics and per-channel late-lead errors are comparable on the same scaling axes. However, the manuscript provides no description of channel-wise normalization, loss weighting, or correction for differential autoregressive error compounding rates across variables (e.g., temperature vs. precipitation). Without these controls, the heterogeneity could be an artifact of metric construction rather than a genuine scaling phenomenon.
Minor comments (1)
- [Abstract] The abstract states that power-law scaling is tested 'relative to rollout length when error is pooled globally,' but does not specify the functional form, fitting procedure, or range of rollout lengths used.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the need to explicitly document metric construction details to support the central claim. We address the concern point-by-point below and will revise the manuscript to include the requested clarifications.
Point-by-point responses
Referee: [Abstract / Methods] The central claim of favorable pooled scaling coexisting with per-channel degradation at late leads (abstract) requires that globally pooled test metrics and per-channel late-lead errors are comparable on the same scaling axes. However, the manuscript provides no description of channel-wise normalization, loss weighting, or correction for differential autoregressive error compounding rates across variables (e.g., temperature vs. precipitation). Without these controls, the heterogeneity could be an artifact of metric construction rather than a genuine scaling phenomenon.
Authors: We agree that explicit documentation of normalization and evaluation choices is necessary for the claim to be robust. In the reported experiments, every channel was normalized independently to zero mean and unit variance using training-set statistics prior to both training and error computation; this is the standard preprocessing in WeatherBench-style benchmarks and ensures that RMSE values are on comparable scales across variables with different physical units. No per-channel loss weighting was used at training or test time, precisely so that the natural differences in predictability and error growth would remain visible. Autoregressive compounding rates are not corrected for; instead, the analysis deliberately measures raw per-channel error growth over increasing rollout lengths to quantify the heterogeneity. We will add a dedicated paragraph in the Methods section (and a short note in the abstract) describing the per-channel normalization, confirming the absence of weighting or post-hoc corrections, and stating that all pooled and per-channel metrics are computed on the same normalized fields. This revision should eliminate the possibility that the reported heterogeneity is an artifact of metric construction.
Revision: yes
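The normalization scheme the authors describe can be sketched as follows. The helper names, array shapes, and the 280 K / 0.001 m channel scales are illustrative assumptions, not the paper's code; the point is that per-channel statistics come from the training split only and are reused at test time, so pooled and per-channel RMSE are computed on the same normalized fields.

```python
import numpy as np

def fit_channel_stats(train):  # train: (samples, channels, H, W)
    # Per-channel mean/std over samples and space, from the training split only.
    mean = train.mean(axis=(0, 2, 3), keepdims=True)
    std = train.std(axis=(0, 2, 3), keepdims=True)
    return mean, std

def normalize(fields, mean, std):
    return (fields - mean) / std

rng = np.random.default_rng(1)
# Two channels with very different physical units (illustrative values):
scale = np.array([15.0, 0.002])[None, :, None, None]
loc = np.array([280.0, 0.001])[None, :, None, None]
train = rng.normal(size=(64, 2, 8, 8)) * scale + loc
test = rng.normal(size=(16, 2, 8, 8)) * scale + loc

mean, std = fit_channel_stats(train)
test_n = normalize(test, mean, std)  # training stats reused at test time
print(test_n.std(axis=(0, 2, 3)))    # both channels now on comparable scales
```

With this preprocessing, per-channel RMSE values are directly comparable even between variables like temperature and precipitation.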
Circularity Check
No significant circularity; empirical scaling observations are self-contained.
full rationale
The paper reports empirical measurements of error growth in autoregressive weather models across channels, horizons, and scaling axes (parameters, data, compute). It quantifies heterogeneity between globally pooled metrics and per-channel late-lead behavior but does not derive any result from prior assumptions that presuppose the heterogeneity. No equations, self-citations, or fitted parameters are presented as predictions in the abstract or described claims. The central statements are direct experimental findings rather than reductions to inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel) · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "We fit log ε_{h,c} = a_{h,c} + b_{h,c} log s ... R²_{h,c} = 1 − Σ(log ε − log ε̂)² / Σ(log ε − mean(log ε))²"
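The quoted fit can be reproduced in a few lines: for each horizon h and channel c, regress log error on log scale and report R². The scale values and the −0.25 exponent below are synthetic stand-ins, not fitted values from the paper.

```python
import numpy as np

def fit_power_law(s, eps):
    """Fit log eps = a + b log s by least squares; return a, b, R^2."""
    x, y = np.log(s), np.log(eps)
    b, a = np.polyfit(x, y, 1)          # slope b, intercept a
    y_hat = a + b * x
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return a, b, 1.0 - ss_res / ss_tot

s = np.array([1e6, 4e6, 1.6e7, 6.4e7])  # hypothetical parameter counts
eps = 3.0 * s ** -0.25                   # exact power law for the demo
a, b, r2 = fit_power_law(s, eps)
print(b, r2)                             # b ≈ -0.25, R^2 ≈ 1.0
```

In the paper's setting this fit would be repeated per (horizon, channel) pair, and a low R²_{h,c} flags channels whose error no longer follows a power law.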
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction) · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "pooled scaling can look favorable while many channels degrade at late leads"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.