arxiv: 2605.24945 · v1 · pith:ZT2E3EQDnew · submitted 2026-05-24 · 💻 cs.LG · cs.AI · physics.ao-ph

RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges

Pith reviewed 2026-06-30 11:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · physics.ao-ph
keywords AI weather forecasting · benchmark evaluation · operational conditions · extreme events · reanalysis mismatch · in-situ observations · data-driven models · tropical cyclones

The pith

RealBench shows that reanalysis-based benchmarks for AI weather models produce systematically different results from real operational data, especially on extremes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RealBench as a new evaluation framework for data-driven weather forecasting models. It argues that prior benchmarks rely on reanalysis products like ERA5, which use delayed data assimilation and therefore fail to match the constraints of actual real-time forecasting. RealBench instead uses a 2025 test period designed to be strictly out-of-distribution, combines low-latency operational analysis with direct in-situ measurements from more than 10,000 stations, and adds event-specific metrics for heatwaves, cold surges, and tropical cyclones. Evaluation with this setup produces substantially different performance numbers than reanalysis-based scoring, with the largest differences appearing during extreme events. A sympathetic reader would care because deployment decisions for AI forecasting systems depend on benchmarks that actually reflect the conditions under which those systems will be used.

Core claim

RealBench is a benchmark that evaluates AI weather models against low-latency operational analysis and a global network of in-situ station observations on a 2025 test set chosen to avoid data leakage. It demonstrates that metrics computed from reanalysis products diverge from metrics computed from these real measurements, with the divergence especially pronounced when models are scored on extreme events using event-specific criteria rather than standard global averages.
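One way to put a number on the divergence described above is a paired bootstrap over forecast cases. The sketch below is an illustration only, not the paper's method: the function names, the choice of RMSE, and the 95% percentile interval are all assumptions made here.

```python
import math
import random

def rmse(forecast, truth):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(forecast))

def bootstrap_rmse_gap(forecast, reanalysis, observed, n_boot=1000, seed=0):
    """Return (gap, lo, hi), where gap is RMSE against observations minus
    RMSE against reanalysis, and (lo, hi) is a 95% percentile interval
    obtained by resampling cases with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(forecast)
    gap = rmse(forecast, observed) - rmse(forecast, reanalysis)
    gaps = []
    for _ in range(n_boot):
        # Resample the same case indices for both targets (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        f = [forecast[i] for i in idx]
        obs_err = rmse(f, [observed[i] for i in idx])
        rea_err = rmse(f, [reanalysis[i] for i in idx])
        gaps.append(obs_err - rea_err)
    gaps.sort()
    return gap, gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot) - 1]
```

An interval that excludes zero would suggest the two ground truths score the same forecasts differently. Spatially or temporally correlated cases would call for block resampling, which this sketch does not attempt.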

What carries the argument

The RealBench benchmark itself. It supplies an out-of-distribution 2025 test set, multi-source operational ground truth, and extreme-event metrics, so that model performance is measured directly against observations instead of reanalysis.

If this is right

  • Models that rank highly under reanalysis evaluation can rank differently when scored against operational data and in-situ observations.
  • Standard global metrics alone are insufficient for assessing value on high-impact extremes; event-specific scores are required.
  • Training and selection of data-driven forecasting systems should incorporate operational constraints to reduce the observed performance gap.
  • A benchmark built on real-time data sources provides a more relevant testbed for next-generation model development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of this style of benchmark could shift model development toward objectives that explicitly account for latency and observational sparsity.
  • The framework might be extended to include direct head-to-head comparisons between AI models and traditional physics-based numerical weather prediction systems under identical operational conditions.
  • If the discrepancies remain large, training pipelines may need to incorporate more recent or lower-latency data sources to close the gap.

Load-bearing premise

The 2025 test period contains no data leakage from training sets and the combination of low-latency analysis plus station observations forms a sufficiently complete and unbiased ground truth.

What would settle it

If model rankings and absolute scores computed on reanalysis data for the same 2025 period turn out to be nearly identical to those computed on the RealBench operational and station data, the claimed systematic mismatch would be falsified.
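The ranking half of this test can be sketched with a Spearman rank correlation between per-model scores under the two ground truths. This is a minimal sketch under stated assumptions: scores are errors (lower is better), there are no ties, and the dictionary layout and function names are hypothetical rather than taken from the benchmark code.

```python
def ranks(scores):
    """Rank models by score, 1 = lowest error. Assumes no tied scores."""
    order = sorted(scores, key=scores.get)
    return {name: i + 1 for i, name in enumerate(order)}

def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score dicts keyed by the same
    model names (at least two models, no ties)."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def max_abs_gap(scores_a, scores_b):
    """Largest absolute score difference across models."""
    return max(abs(scores_a[m] - scores_b[m]) for m in scores_a)
```

A correlation near 1 together with a small maximum gap would point toward the "nearly identical" outcome that falsifies the claim; a low correlation would support it.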

Original abstract

Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real-time operational forecasting, thereby resulting in a systematic mismatch between benchmark performance and real-world forecasting. In this work, we introduce RealBench, a next-generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out-of-distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low-latency operational analysis and a large-scale global in-situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high-impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event-specific metrics that better reflect real-world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis-based metrics and real-world performance, particularly concerning extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next-generation AI weather forecasting systems. The benchmark implementation is available at: https://github.com/lixruize-del/NWP-Benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces RealBench, a benchmark for data-driven numerical weather forecasting that prioritizes operational realism over reanalysis-based evaluation. It defines a strictly out-of-distribution 2025 test window, combines low-latency operational analysis with >10k in-situ station observations as ground truth, and supplies event-specific metrics for extremes (heatwaves, cold surges, tropical cyclones). The central claim is that reanalysis-based benchmarks produce systematic mismatches with real-world performance, with RealBench revealing substantial discrepancies especially on extremes; the implementation is released on GitHub.

Significance. If the reported discrepancies are reproducible and the ground-truth protocol is shown to be unbiased, the work would meaningfully shift evaluation standards in AI weather forecasting away from ERA5-style reanalysis toward operationally relevant targets. This could reduce over-optimism in published scores and encourage models that generalize under real-time data constraints. The public benchmark release is a concrete strength that enables community follow-up.

major comments (3)
  1. [Abstract / Evaluation section] The abstract states that evaluation results reveal substantial discrepancies, yet the manuscript provides no quantitative tables, error bars, or statistical tests comparing reanalysis vs. RealBench metrics (e.g., no RMSE or event-specific scores for the 2025 period). This absence prevents verification of the central claim.
  2. [Data sources and evaluation framework] No protocol is described for aligning >10k in-situ station observations with model grid points or for characterizing station measurement error and representativeness (e.g., in the data-sources section). Without these details, the assumption that the combined ground truth is complete and unbiased cannot be assessed.
  3. [Test-set construction] The claim that the 2025 test set is strictly OOD requires explicit documentation of all model training data cutoffs and any reanalysis overlap; this information is not supplied, leaving the no-leakage guarantee unverified.
minor comments (1)
  1. [Availability statement] The GitHub link is provided but the repository contents (data loaders, exact metric implementations) are not referenced in the text; adding a pointer to specific scripts would improve reproducibility.
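To make major comment 2 concrete, the sketch below shows one common way to align a gridded forecast with station observations: bilinear interpolation to each station location, followed by an RMSE over stations. The paper does not describe its alignment protocol, so this is an assumed stand-in, and every name here is hypothetical. It ignores longitude wrap-around, elevation differences, and representativeness error.

```python
import math
from bisect import bisect_right

def bilinear(lats, lons, field, lat, lon):
    """Interpolate field[i][j], defined at (lats[i], lons[j]) with both
    axes ascending, to a point within the grid bounds."""
    # Index of the grid cell's south-west corner, clamped to a valid cell.
    i = min(max(bisect_right(lats, lat) - 1, 0), len(lats) - 2)
    j = min(max(bisect_right(lons, lon) - 1, 0), len(lons) - 2)
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    south = field[i][j] * (1 - tx) + field[i][j + 1] * tx
    north = field[i + 1][j] * (1 - tx) + field[i + 1][j + 1] * tx
    return south * (1 - ty) + north * ty

def station_rmse(lats, lons, field, stations):
    """RMSE of the interpolated field against (lat, lon, observed) tuples."""
    sq = [(bilinear(lats, lons, field, la, lo) - obs) ** 2
          for la, lo, obs in stations]
    return math.sqrt(sum(sq) / len(sq))
```

Nearest-neighbor matching is an equally plausible choice; whichever method is used, the referee's point is that it should be stated so the station scores can be reproduced.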

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight areas where additional detail will strengthen the manuscript's verifiability. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Abstract / Evaluation section] The abstract states that evaluation results reveal substantial discrepancies, yet the manuscript provides no quantitative tables, error bars, or statistical tests comparing reanalysis vs. RealBench metrics (e.g., no RMSE or event-specific scores for the 2025 period). This absence prevents verification of the central claim.

    Authors: We agree that the absence of explicit quantitative comparisons limits immediate verification of the central claim. In the revised manuscript we will expand the evaluation section with tables that directly compare reanalysis-based and RealBench metrics on the 2025 period, including RMSE, event-specific scores, error bars, and statistical significance tests. These additions will be cross-referenced from the abstract.
    revision: yes

  2. Referee: [Data sources and evaluation framework] No protocol is described for aligning >10k in-situ station observations with model grid points or for characterizing station measurement error and representativeness (e.g., in the data-sources section). Without these details, the assumption that the combined ground truth is complete and unbiased cannot be assessed.

    Authors: We acknowledge that the current description of the ground-truth protocol is insufficient. The revised data-sources section will specify the alignment procedure (including interpolation method and grid-matching criteria) and will characterize station representativeness and measurement uncertainty by citing established protocols and, where feasible, providing quantitative bounds on error sources.
    revision: yes

  3. Referee: [Test-set construction] The claim that the 2025 test set is strictly OOD requires explicit documentation of all model training data cutoffs and any reanalysis overlap; this information is not supplied, leaving the no-leakage guarantee unverified.

    Authors: We will add a dedicated subsection under test-set construction that tabulates the training-data cutoffs for every model included in the benchmark and explicitly states the absence of overlap with the 2025 window or with any reanalysis products used during training. This documentation will be supported by references to the original model papers and the released benchmark code.
    revision: yes
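The cutoff table promised in the third response lends itself to a mechanical check. The sketch below assumes the test window starts on 2025-01-01 (the source says only that the test set spans 2025) and that each model's last training date is known. The function name and input layout are hypothetical.

```python
from datetime import date

# Assumed start of the 2025 test window; the exact date is not given in the source.
TEST_START = date(2025, 1, 1)

def leakage_report(cutoffs, test_start=TEST_START):
    """Map each model name to True when its last training date falls
    strictly before the test window, False otherwise."""
    return {model: last_day < test_start for model, last_day in cutoffs.items()}

def leaking_models(cutoffs, test_start=TEST_START):
    """Sorted names of models whose training data reaches into the test window."""
    report = leakage_report(cutoffs, test_start)
    return sorted(model for model, clean in report.items() if not clean)
```

A check like this covers only the declared training period. Fine-tuning data or reanalysis products that were themselves updated with 2025 observations would need separate documentation.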

Circularity Check

0 steps flagged

No significant circularity

The paper introduces RealBench as a benchmark definition and data protocol for evaluating weather forecasting models. It contains no derivations, equations, fitted parameters, predictions, or self-referential claims that reduce to inputs by construction. The central contribution is the specification of an out-of-distribution test set, data sources, and evaluation metrics, all presented as design choices rather than derived results. No self-citation chains or ansatzes are invoked to support any load-bearing step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on domain assumptions about data sources and test-set independence rather than new mathematical entities or fitted parameters.

axioms (2)
  • domain assumption: The 2025 period is strictly out-of-distribution and captures recent atmospheric regimes without leakage from any training data.
    Explicitly stated in the abstract as the basis for the test set.
  • domain assumption: Low-latency operational analysis and in-situ station observations provide a more faithful ground truth than reanalysis for model evaluation.
    Central premise motivating the benchmark design.

Reference graph

Works this paper leans on

64 extracted references · 17 canonical work pages · 2 internal anchors

  1. [1]

    Electricity load forecasting for urban area using weather forecast information

    Vasudev Dehalwar, Akhtar Kalam, Mohan Lal Kolhe, and Aladin Zayegh. Electricity load forecasting for urban area using weather forecast information. In2016 IEEE International Conference on Power and Renewable Energy (ICPRE), pages 355–359. IEEE, 2016

  2. [2]

    A review of high impact weather for aviation meteorology.Pure and applied geophysics, 176(5):1869–1921, 2019

    Ismail Gultepe, R Sharman, Paul D Williams, Binbin Zhou, G Ellrod, P Minnis, S Trier, S Griffin, Seong S Yum, B Gharabaghi, et al. A review of high impact weather for aviation meteorology.Pure and applied geophysics, 176(5):1869–1921, 2019

  3. [3]

    Precision agriculture: Weather forecasting for future farming

    Kingsley Eghonghon Ukhurebor, Charles Oluwaseun Adetunji, Olaniyan T Olugbemi, W Nwankwo, Akinola Samson Olayinka, C Umezuruike, and Daniel Ingo Hefft. Precision agriculture: Weather forecasting for future farming. InAi, edge and iot-based smart agriculture, pages 101–121. Elsevier, 2022

  4. [4]

    Learningskillfulmedium-rangeglobalweatherforecasting

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri,TimoEwalds,ZachEaton-Rosen,WeihuaHu,etal. Learningskillfulmedium-rangeglobalweatherforecasting. Science, 382(6677):1416–1421, 2023

  5. [5]

    Gencast: Diffusion-based ensemble forecasting for medium-range weather.arXiv preprint arXiv:2312.15796, 2023

    Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. Gencast: Diffusion-based ensemble forecasting for medium-range weather.arXiv preprint arXiv:2312.15796, 2023

  6. [6]

    Va-moe: Variables-adaptive mixture of experts for incremental weather forecasting

    Hao Chen, Han Tao, Guo Song, Jie Zhang, Yunlong Yu, Yonghan Dong, and Lei Bai. Va-moe: Variables-adaptive mixture of experts for incremental weather forecasting. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  7. [7]

    Stcast: Adaptive boundary alignment for global and regional weather forecasting

    Hao Chen, Tao Han, Jie Zhang, Song Guo, and Lei Bai. Stcast: Adaptive boundary alignment for global and regional weather forecasting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  8. [8]

    Accuratemedium-rangeglobalweather forecasting with 3d neural networks.Nature, 619(7970):533–538, 2023

    KaifengBi,LingxiXie,HenghengZhang,XinChen,XiaotaoGu,andQiTian. Accuratemedium-rangeglobalweather forecasting with 3d neural networks.Nature, 619(7970):533–538, 2023

  9. [9]

    Weatherbench: a benchmark data set for data-driven weather forecasting.Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002203, 2020

    Stephan Rasp, Peter D Dueben, Sebastian Scher, Jonathan A Weyn, Soukayna Mouatadid, and Nils Thuerey. Weatherbench: a benchmark data set for data-driven weather forecasting.Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002203, 2020

  10. [10]

    Weatherbench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

    Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez- Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, et al. Weatherbench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

  11. [11]

    Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, ThorstenKurth,DavidHall,ZongyiLi,KamyarAzizzadenesheli,etal.Fourcastnet: Aglobaldata-drivenhigh-resolution weather model using adaptive fourier neural operators.arXiv preprint arXiv:2202.11214, 2022

  12. [12]

    Scaling transformer neural networks for skillful and reliable medium-range weather forecasting.Advances in Neural Information Processing Systems, 37:68740–68771, 2024

    Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster, Sandeep Madireddy, and Aditya Grover. Scaling transformer neural networks for skillful and reliable medium-range weather forecasting.Advances in Neural Information Processing Systems, 37:68740–68771, 2024

  13. [13]

    Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

    Lei Chen, Xiaohui Zhong, Feng Zhang, Yuan Cheng, Yinghui Xu, Yuan Qi, and Hao Li. Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

  14. [14]

    The era5 global reanalysis.Quarterly journal of the royal meteorological society, 146(730):1999–2049, 2020

    Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. The era5 global reanalysis.Quarterly journal of the royal meteorological society, 146(730):1999–2049, 2020. 11 RealBench arXiv Preprint

  15. [15]

    Weatherreal: a benchmark based on in-situ observations for evaluating weather models

    Weixin Jin, Jonathan Weyn, Pengcheng Zhao, Siqi Xiang, Jiang Bian, Zuliang Fang, Haiyu Dong, Hongyu Sun, Kit Thambiratnam, and Qi Zhang. Weatherreal: a benchmark based on in-situ observations for evaluating weather models. arXiv preprint arXiv:2409.09371, 2024

  16. [16]

    Benchmarking Physics-Informed Time-Series Models for Operational Global Station Weather Forecasting

    Tao Han, Song Guo, Zhenghao Chen, Wanghan Xu, and Lei Bai. Weather-5k: A large-scale global station weather dataset towards comprehensive time-series forecasting benchmark.arXiv preprint arXiv:2406.14399, 6(2), 2024

  17. [17]

    Weatherbench probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models.arXiv preprint arXiv:2205.00865, 2022

    Sagar Garg, Stephan Rasp, and Nils Thuerey. Weatherbench probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models.arXiv preprint arXiv:2205.00865, 2022

  18. [18]

    Learned benchmarks for subseasonal forecasting

    Soukayna Mouatadid, Paulo Orenstein, Genevieve Flaspohler, Miruna Oprescu, Judah Cohen, Franklyn Wang, Sean Knight, Maria Geogdzhayeva, Sam Levang, Ernest Fraenkel, et al. Learned benchmarks for subseasonal forecasting. arXiv preprint arXiv:2109.10399, 2021

  19. [19]

    Rainbench: Towards data-driven global precipitation forecasting fromsatelliteimagery

    Christian Schroeder de Witt, Catherine Tong, Valentina Zantedeschi, Daniele De Martini, Alfredo Kalaitzis, Matthew Chantry, Duncan Watson-Parris, and Piotr Bilinski. Rainbench: Towards data-driven global precipitation forecasting fromsatelliteimagery. InProceedingsoftheAAAIconferenceonartificialintelligence,volume35,pages14902–14910, 2021

  20. [20]

    Iowarain: A statewide rain event dataset based on weather radars and quantitative precipitation estimation,

    Muhammed Sit, Bong-Chul Seo, and Ibrahim Demir. Iowarain: A statewide rain event dataset based on weather radars and quantitative precipitation estimation.arXiv preprint arXiv:2107.03432, 2021

  21. [21]

    Evan Racah, Christopher Beckham, Tegan Maharaj, Samira Ebrahimi Kahou, Mr Prabhat, and Chris Pal. Ex- tremeweather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events.Advances in neural information processing systems, 30, 2017

  22. [22]

    Floodnet: A high resolution aerial imagery dataset for post flood scene understanding.IEEE Access, 9:89644–89654, 2021

    Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, and Robin Roberson Murphy. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding.IEEE Access, 9:89644–89654, 2021

  23. [23]

    Earthnet2021: A large-scale dataset and challenge for earth surface forecasting as a guided video prediction task

    Christian Requena-Mesa, Vitus Benson, Markus Reichstein, Jakob Runge, and Joachim Denzler. Earthnet2021: A large-scale dataset and challenge for earth surface forecasting as a guided video prediction task. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1132–1142, 2021

  24. [24]

    Droughted: A dataset and methodology for drought forecasting spanning multiple climate zones

    Christoph Minixhofer, Mark Swan, Calum McMeekin, and Pavlos Andreadis. Droughted: A dataset and methodology for drought forecasting spanning multiple climate zones. InTackling Climate Change with Machine Learning: Workshop at ICML 2021, 2021

  25. [25]

    Prabhat,KarthikKashinath,MayurMudigonda,SolKim,LukasKapp-Schwoerer,AndreGraubner,EgeKaraismailoglu, Leo von Kleist, Thorsten Kurth, Annette Greiner, et al. Climatenet: an expert-labelled open dataset and deep learning architecture for enabling high-precision analyses of extreme weather.Geoscientific Model Development Discussions, 2020:1–28, 2020

  26. [26]

    Climatelearn: Benchmarking machine learning for weather and climate modeling.Advances in Neural Information Processing Systems, 36:75009–75025, 2023

    Tung Nguyen, Jason Jewik, Hritik Bansal, Prakhar Sharma, and Aditya Grover. Climatelearn: Benchmarking machine learning for weather and climate modeling.Advances in Neural Information Processing Systems, 36:75009–75025, 2023

  27. [27]

    Climart: A benchmark dataset for emulating atmospheric ra- diative transfer in weather and climate models,

    Salva Rühling Cachay, Venkatesh Ramesh, Jason NS Cole, Howard Barker, and David Rolnick. Climart: A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models.arXiv preprint arXiv:2111.14671, 2021

  28. [28]

    Climatebench v1

    Duncan Watson-Parris, Yuhan Rao, Dirk Olivié, Øyvind Seland, Peer Nowack, Gustau Camps-Valls, Philip Stier, Shahine Bouabid, Maura Dewey, Emilie Fons, et al. Climatebench v1. 0: A benchmark for data-driven climate projections.Journal of Advances in Modeling Earth Systems, 14(10):e2021MS002954, 2022

  29. [29]

    Overview of the coupled model intercomparison project phase 6 (cmip6) experimental design and organization

    Veronika Eyring, Sandrine Bony, Gerald A Meehl, Catherine A Senior, Bjorn Stevens, Ronald J Stouffer, and Karl E Taylor. Overview of the coupled model intercomparison project phase 6 (cmip6) experimental design and organization. Geoscientific Model Development, 9(5):1937–1958, 2016

  30. [30]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  31. [31]

    Self-prompting perceptual edge learning for dense prediction.IEEE Transactions on Circuits and Systems for Video Technology, 34(6):4528–4541, 2023

    Hao Chen, Yonghan Dong, Zhe-Ming Lu, Yunlong Yu, and Jungong Han. Self-prompting perceptual edge learning for dense prediction.IEEE Transactions on Circuits and Systems for Video Technology, 34(6):4528–4541, 2023

  32. [32]

    Ewmoe: An effective model for global weather forecasting with mixture-of-experts

    Lihao Gan, Xin Man, Chenghong Zhang, and Jie Shao. Ewmoe: An effective model for global weather forecasting with mixture-of-experts. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 210–218, 2025

  33. [33]

    Oneforecast: A universal framework for global and regional weather forecasting

    Yuan Gao, Hao Wu, Ruiqi Shu, Huanshuo Dong, Fan Xu, Rui Chen, Yibo Yan, Qingsong Wen, Xuming Hu, Kun Wang, et al. Oneforecast: A universal framework for global and regional weather forecasting. InProceedings of the 42th International Conference on Machine Learning, 2025

  34. [34]

    Pixel matching network for cross-domain few-shot segmentation

    Hao Chen, Yonghan Dong, Zheming Lu, Yunlong Yu, and Jungong Han. Pixel matching network for cross-domain few-shot segmentation. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 978–987, 2024. 12 RealBench arXiv Preprint

  35. [35]

    Diffusion-based decoupled deterministic and uncertain framework for probabilistic multivariate time series forecasting

    Qi Li, Zhenyu Zhang, Lei Yao, Zhaoxia Li, Tianyi Zhong, and Yong Zhang. Diffusion-based decoupled deterministic and uncertain framework for probabilistic multivariate time series forecasting. InThe Thirteenth International Conference on Learning Representations, 2025

  36. [36]

    Continuous ensemble weather forecasting with diffusion models

    Martin Andrae, Tomas Landelius, Joel Oskarsson, and Fredrik Lindsten. Continuous ensemble weather forecasting with diffusion models. InThe Thirteenth International Conference on Learning Representations, 2025

  37. [37]

    Multi-content interaction network for few-shot segmentation.ACM Transactions on Multimedia Computing, Communications and Applications, 20(6):1–20, 2024

    Hao Chen, Yunlong Yu, Yonghan Dong, Zheming Lu, Yingming Li, and Zhongfei Zhang. Multi-content interaction network for few-shot segmentation.ACM Transactions on Multimedia Computing, Communications and Applications, 20(6):1–20, 2024

  38. [38]

    Spherical fourier neural operators: learning stable dynamics on the sphere

    Boris Bonev, Thorsten Kurth, Christian Hundt, Jaideep Pathak, Maximilian Baust, Karthik Kashinath, and Anima Anandkumar. Spherical fourier neural operators: learning stable dynamics on the sphere. InProceedings of the 40th International Conference on Machine Learning, 2023

  39. [39]

    Koopmanlab: machine learning for solving complex physics equations.APL Machine Learning, 1(3), 2023

    Wei Xiong, Muyuan Ma, Xiaomeng Huang, Ziyang Zhang, Pei Sun, and Yang Tian. Koopmanlab: machine learning for solving complex physics equations.APL Machine Learning, 1(3), 2023

  40. [40]

    Fourier neural operator for parametric partial differential equations

    Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. InInternational Conference on Learning Representations, 2021

  41. [41]

    Emformer: Efficient multi-scale transformer for accumulative context weather forecasting

    Hao Chen, Tao Han, Jie Zhang, Song Guo, Fenghua Ling, and Lei Bai. Emformer: Efficient multi-scale transformer for accumulative context weather forecasting. InInternational Conference on Machine Learning, 2026

  42. [42]

    Dyffusion: A dynamics-informed diffusion model for spatiotemporal forecasting.Advances in neural information processing systems, 36:45259–45287, 2023

    Salva Rühling Cachay, Bo Zhao, Hailey Joren, and Rose Yu. Dyffusion: A dynamics-informed diffusion model for spatiotemporal forecasting.Advances in neural information processing systems, 36:45259–45287, 2023

  43. [43]

    Seeds: Emulation of weather forecast ensembles with diffusion models.Science Advances, 10:eadk4489, 2024

    Lizao Li, Robert Carver, Ignacio Lopez-Gomez, Fei Sha, and John Anderson. Seeds: Emulation of weather forecast ensembles with diffusion models.Science Advances, 10:eadk4489, 2024

  44. [44]

    Aifs-crps: ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score.npj Artificial Intelligence, 2(1):18, 2026

    Simon Lang, Mihai Alexe, Mariana CA Clare, Christopher Roberts, Rilwan Adewoyin, Zied Ben Bouallègue, Matthew Chantry, Jesper Dramsch, Peter D Dueben, Sara Hahner, et al. Aifs-crps: ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score.npj Artificial Intelligence, 2(1):18, 2026

  45. [45]

    Kilometer-scaleconvection-allowingmodelemulationusinggenerative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

    Jaideep Pathak, Yair Cohen, Piyush Garg, Peter Harrington, Noah Brenowitz, Dale Durran, Morteza Mardani, Arash Vahdat,ShaomingXu,KarthikKashinath,etal. Kilometer-scaleconvection-allowingmodelemulationusinggenerative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

  46. [46]

    Diffusion-lam: probabilistic limited area weather forecasting with diffusion.arXiv preprint arXiv:2502.07532, 2025

    Erik Larsson, Joel Oskarsson, Tomas Landelius, and Fredrik Lindsten. Diffusion-lam: probabilistic limited area weather forecasting with diffusion.arXiv preprint arXiv:2502.07532, 2025

  47. [47]

    Omnicast: A masked latent diffusion model for weather forecasting across time scales

    Tung Nguyen, Tuan Pham, Troy Arcomano, Rao Kotamarthi, Ian Foster, Sandeep Madireddy, and Aditya Grover. Omnicast: A masked latent diffusion model for weather forecasting across time scales. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  48. [48]

    Probabilistic weather forecasting with deterministic guidance-based diffusion model

    Donggeun Yoon, Minseok Seo, Doyi Kim, Yeji Choi, and Donghyeon Cho. Probabilistic weather forecasting with deterministic guidance-based diffusion model. InComputer Vision – ECCV 2024, pages 108–124, 2025

  49. [49]

    Weather prediction with diffusion guided by realistic forecast processes.arXiv preprint arXiv:2402.06666, 2024

    Zhanxiang Hua, Yutong He, Chengqian Ma, and Alexandra Anderson-Frey. Weather prediction with diffusion guided by realistic forecast processes.arXiv preprint arXiv:2402.06666, 2024

  50. [50]

    Probablisticemulation of a global climate model with spherical DYffusion

    SalvaRühlingCachay,BrianHenn,OliverWatt-Meyer,ChristopherS.Bretherton,andRoseYu. Probablisticemulation of a global climate model with spherical DYffusion. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  51. [51]

    Codicast: Conditional diffusion model for global weather forecasting with uncertainty quantification

    Jimeng Shi, Bowen Jin, Jiawei Han, Sundararaman Gopalakrishnan, and Giri Narasimhan. Codicast: Conditional diffusion model for global weather forecasting with uncertainty quantification. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 9853–9861, 8 2025

  52. [52]

    Fengwu-4dvar: Couplingthedata-drivenweather forecasting model with 4d variational assimilation.arXiv preprint arXiv:2312.12455, 2023

    YiXiao,LeiBai,WeiXue,KangChen,TaoHan,andWanliOuyang. Fengwu-4dvar: Couplingthedata-drivenweather forecasting model with 4d variational assimilation.arXiv preprint arXiv:2312.12455, 2023

  53. [53]

    Towards an end-to-end artificial intelligence driven global weather forecasting system

    Kun Chen, Lei Bai, Fenghua Ling, Peng Ye, Tao Chen, Kang Chen, Tao Han, and Wanli Ouyang. Towards an end-to-end artificial intelligence driven global weather forecasting system. arXiv preprint arXiv:2312.12462, 2023

  54. [54]

    Fengwu-ghr: Learning the kilometer-scale medium-range global weather forecasting

    Tao Han, Song Guo, Fenghua Ling, Kang Chen, Junchao Gong, Jingjia Luo, Junxia Gu, Kan Dai, Wanli Ouyang, and Lei Bai. Fengwu-ghr: Learning the kilometer-scale medium-range global weather forecasting. arXiv preprint arXiv:2402.00059, 2024

  55. [55]

    Extremecast: Boosting extreme value prediction for global weather forecast

    Wanghan Xu, Kang Chen, Tao Han, Hao Chen, Wanli Ouyang, and Lei Bai. Extremecast: Boosting extreme value prediction for global weather forecast. arXiv preprint arXiv:2402.01295, 2024

  56. [56]

    Neural general circulation models for weather and climate

    Dmitrii Kochkov, Janni Yuval, Ian Langmore, Peter Norgaard, Jamie Smith, Griffin Mooers, Milan Klöwer, James Lottes, Stephan Rasp, Peter Düben, et al. Neural general circulation models for weather and climate. Nature, 632(8027):1060–1066, 2024

  57. [57]

    A foundation model for the earth system

    Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A Weyn, Haiyu Dong, et al. A foundation model for the earth system. Nature, 641(8065):1180–1187, 2025

  58. [58]

    Prithvi wxc: Foundation model for weather and climate

    Johannes Schmude, Sujit Roy, Will Trojak, Johannes Jakubik, Daniel Salles Civitarese, Shraddha Singh, Julian Kuehnert, Kumar Ankur, Aman Gupta, Christopher E Phillips, et al. Prithvi wxc: Foundation model for weather and climate. arXiv preprint arXiv:2409.13598, 2024

  59. [59]

    AIFS – ECMWF’s data-driven forecasting system

    Simon Lang, Mihai Alexe, Matthew Chantry, Jesper Dramsch, Florian Pinault, Baudouin Raoult, Mariana CA Clare, Christian Lessig, Michael Maier-Gerber, Linus Magnusson, et al. Aifs–ecmwf’s data-driven forecasting system. arXiv preprint arXiv:2406.01465, 2024

  60. [60]

    Aifs-crps: Ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score

    Simon Lang, Mihai Alexe, Mariana CA Clare, Christopher Roberts, Rilwan Adewoyin, Zied Ben Bouallègue, Matthew Chantry, Jesper Dramsch, Peter D Dueben, Sara Hahner, et al. Aifs-crps: Ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score. arXiv preprint arXiv:2412.15832, 2024

  61. [61]

    The operational medium-range deterministic weather forecasting can be extended beyond a 10-day lead time

    Kang Chen, Tao Han, Fenghua Ling, Junchao Gong, Lei Bai, Xinyu Wang, Jing-Jia Luo, Ben Fei, Wenlong Zhang, Xi Chen, et al. The operational medium-range deterministic weather forecasting can be extended beyond a 10-day lead time. Communications Earth & Environment, 6(1):518, 2025

  62. [62]

    On the measurement of heat waves

    Sarah E Perkins and Lisa V Alexander. On the measurement of heat waves. Journal of climate, 26(13):4500–4517, 2013

  63. [63]

    Statistical methods in the atmospheric sciences

    Daniel S Wilks. Statistical methods in the atmospheric sciences, volume 100. Academic press, 2011

A Limitations

Although RealBench provides a more realistic benchmark for AI weather forecasting, several limitations remain. First, WEATHER-10K is based on globally distributed in-situ stations, but the station network is geographi...

We denote the processed station observation by $y^{v}_{i,t}$ and the corresponding station-interpolated ERA5 reference by $\hat{y}^{v}_{i,t}$. For temperature, dew-point temperature, station-level pressure, sea-level pressure, and wind speed, we flag an observation as anomalous if

$$\left| y^{v}_{i,t} - \hat{y}^{v}_{i,t} \right| > r^{v}, \tag{5}$$

where $r^{v}$ is a variable-specific threshold. We use $r^{v} = 6$ for temperature...
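
The flagging rule in Eq. (5) can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes the test is an absolute difference between the station observation and the ERA5 reference, the names `flag_anomalous` and `THRESHOLDS` are hypothetical, and only the temperature threshold of 6 comes from the text (its units and the thresholds for the other variables are not given in this excerpt).

```python
import numpy as np

# Hypothetical threshold table. Only the temperature value (6) is stated in
# the text; thresholds for other variables are not shown in this excerpt.
THRESHOLDS = {"temperature": 6.0}


def flag_anomalous(obs, ref, threshold):
    """Return a boolean mask that is True where |obs - ref| > threshold.

    obs: processed station observations for one variable
    ref: station-interpolated reference values at the same stations and times
    """
    obs = np.asarray(obs, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # Comparisons involving NaN evaluate to False, so missing values are
    # never flagged by this check (an assumption of this sketch).
    with np.errstate(invalid="ignore"):
        return np.abs(obs - ref) > threshold


# Illustrative values only, not taken from the benchmark.
obs = [21.0, 35.5, 10.0, np.nan]
ref = [20.0, 28.0, 16.0, 15.0]
mask = flag_anomalous(obs, ref, THRESHOLDS["temperature"])
print(mask.tolist())  # [False, True, False, False]
```

The strict inequality means a difference exactly equal to the threshold (the third value above) is not flagged, matching the `>` in Eq. (5).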