RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges

Fenghua Ling; Hao Chen; Lei Bai; Ruize Li; Song Guo; Tao Han; Wei Zhang; Zhibin Wen

arXiv: 2605.24945 · v1 · pith:ZT2E3EQDnew · submitted 2026-05-24 · 💻 cs.LG · cs.AI · physics.ao-ph

Ruize Li, Zhibin Wen, Tao Han, Hao Chen, Fenghua Ling, Wei Zhang, Song Guo, Lei Bai

Pith reviewed 2026-06-30 11:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · physics.ao-ph

keywords AI weather forecasting · benchmark evaluation · operational conditions · extreme events · reanalysis mismatch · in-situ observations · data-driven models · tropical cyclones

The pith

RealBench shows that reanalysis-based benchmarks for AI weather models produce systematically different results from real operational data, especially on extremes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RealBench as a new evaluation framework for data-driven weather forecasting models. It argues that prior benchmarks rely on reanalysis products like ERA5, which use delayed data assimilation and therefore fail to match the constraints of actual real-time forecasting. RealBench instead uses a 2025 test period designed to be strictly out-of-distribution, combines low-latency operational analysis with direct in-situ measurements from more than 10,000 stations, and adds event-specific metrics for heatwaves, cold surges, and tropical cyclones. Evaluation with this setup produces substantially different performance numbers than reanalysis-based scoring, with the largest differences appearing during extreme events. A sympathetic reader would care because deployment decisions for AI forecasting systems depend on benchmarks that actually reflect the conditions under which those systems will be used.

Core claim

RealBench is a benchmark that evaluates AI weather models against low-latency operational analysis and a global network of in-situ station observations on a 2025 test set chosen to avoid data leakage. It demonstrates that metrics computed from reanalysis products diverge from metrics computed from these real measurements, with the divergence especially pronounced when models are scored on extreme events using event-specific criteria rather than standard global averages.

What carries the argument

RealBench benchmark, which supplies an out-of-distribution 2025 test set, multi-source operational ground truth, and extreme-event metrics to measure model performance directly against observations instead of reanalysis.

If this is right

Models that rank highly under reanalysis evaluation can rank differently when scored against operational data and in-situ observations.
Standard global metrics alone are insufficient for assessing value on high-impact extremes; event-specific scores are required.
Training and selection of data-driven forecasting systems should incorporate operational constraints to reduce the observed performance gap.
A benchmark built on real-time data sources provides a more relevant testbed for next-generation model development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adoption of this style of benchmark could shift model development toward objectives that explicitly account for latency and observational sparsity.
The framework might be extended to include direct head-to-head comparisons between AI models and traditional physics-based numerical weather prediction systems under identical operational conditions.
If the discrepancies remain large, training pipelines may need to incorporate more recent or lower-latency data sources to close the gap.

Load-bearing premise

The 2025 test period contains no data leakage from training sets and the combination of low-latency analysis plus station observations forms a sufficiently complete and unbiased ground truth.

What would settle it

If model rankings and absolute scores computed on reanalysis data for the same 2025 period turn out to be nearly identical to those computed on the RealBench operational and station data, the claimed systematic mismatch would be falsified.
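
The test described above can be sketched in a few lines. This is a minimal illustration and not the paper's evaluation code: the function names (`lat_weighted_rmse`, `spearman_rho`, `compare_rankings`), the choice of cosine-latitude weighting, and the use of Spearman rank correlation are all assumptions made here for the example. Each model's forecast is scored against two candidate ground truths on the same grid, and the two resulting rankings are compared.

```python
import numpy as np

def lat_weighted_rmse(forecast, truth, lats_deg):
    """RMSE over a (lat, lon) grid using cos(latitude) area weights."""
    weights = np.cos(np.deg2rad(lats_deg))
    weights = weights / weights.mean()        # normalise so weights average to 1
    sq_err = (forecast - truth) ** 2          # shape (lat, lon)
    return float(np.sqrt((sq_err * weights[:, None]).mean()))

def spearman_rho(a, b):
    """Spearman rank correlation for score vectors without ties."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def compare_rankings(forecasts, reanalysis, operational, lats_deg):
    """Score every model against both ground truths and correlate the rankings.

    forecasts: dict mapping a (hypothetical) model name to a (lat, lon) array.
    """
    names = sorted(forecasts)
    s_re = [lat_weighted_rmse(forecasts[n], reanalysis, lats_deg) for n in names]
    s_op = [lat_weighted_rmse(forecasts[n], operational, lats_deg) for n in names]
    return names, s_re, s_op, spearman_rho(s_re, s_op)
```

A rank correlation near 1 together with near-identical absolute scores would count against the claimed mismatch; a low or negative correlation would support it. Ties between scores are not handled in this sketch.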

Original abstract

Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real-time operational forecasting, thereby resulting in a systematic mismatch between benchmark performance and real-world forecasting. In this work, we introduce RealBench, a next-generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out-of-distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low-latency operational analysis and a large-scale global in-situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high-impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event-specific metrics that better reflect real-world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis-based metrics and real-world performance, particularly concerning extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next-generation AI weather forecasting systems. The benchmark implementation is available at: https://github.com/lixruize-del/NWP-Benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RealBench pushes evaluation toward operational analysis and station data on a 2025 OOD window, but the abstract leaves the size of the claimed discrepancies and the alignment protocols unclear.

RealBench is the main point: it builds a benchmark that swaps reanalysis for low-latency operational analysis plus over 10,000 in-situ stations and uses a 2025 test window to stay out of distribution. The idea is to expose how standard benchmarks can overstate performance, especially on extremes.

What is actually new is the specific package: a 2025 split, operational ground truth, a large station set, and event-specific metrics for heatwaves, cold surges, and cyclones. The paper does a clean job of spelling out why reanalysis creates a systematic mismatch with real forecasting constraints and why that matters for operational use. The code release on GitHub helps with checking the implementation.

The soft spots sit in the results and methods. The abstract states substantial discrepancies without numbers, error analysis, or details on grid alignment and leakage controls, so the practical size of the effect is hard to gauge. The central assumption that the chosen ground truth is complete and unbiased needs explicit checks in the full text; if those are missing or weak, the discrepancies become harder to trust. Minor issues like station representativeness or latency effects could also matter but are secondary.

This is for people who select or develop AI weather models for real deployment and care about extreme-event skill. A reader working on operational systems or extremes would get the most from it.

Send it for peer review. The motivation is practical and the benchmark design is concrete enough to be worth referee time, even if the quantitative claims will need close scrutiny.

Referee Report

3 major / 1 minor

Summary. The paper introduces RealBench, a benchmark for data-driven numerical weather forecasting that prioritizes operational realism over reanalysis-based evaluation. It defines a strictly out-of-distribution 2025 test window, combines low-latency operational analysis with >10k in-situ station observations as ground truth, and supplies event-specific metrics for extremes (heatwaves, cold surges, tropical cyclones). The central claim is that reanalysis-based benchmarks produce systematic mismatches with real-world performance, with RealBench revealing substantial discrepancies especially on extremes; the implementation is released on GitHub.

Significance. If the reported discrepancies are reproducible and the ground-truth protocol is shown to be unbiased, the work would meaningfully shift evaluation standards in AI weather forecasting away from ERA5-style reanalysis toward operationally relevant targets. This could reduce over-optimism in published scores and encourage models that generalize under real-time data constraints. The public benchmark release is a concrete strength that enables community follow-up.

major comments (3)

[Abstract / Evaluation section] The abstract states that evaluation results reveal substantial discrepancies, yet the manuscript provides no quantitative tables, error bars, or statistical tests comparing reanalysis vs. RealBench metrics (e.g., no RMSE or event-specific scores for the 2025 period). This absence prevents verification of the central claim.
[Data sources and evaluation framework] No protocol is described for aligning >10k in-situ station observations with model grid points or for characterizing station measurement error and representativeness (e.g., § on data sources). Without these details the assumption that the combined ground truth is complete and unbiased cannot be assessed.
[Test-set construction] The claim that the 2025 test set is strictly OOD requires explicit documentation of all model training data cutoffs and any reanalysis overlap; this information is not supplied, leaving the no-leakage guarantee unverified.
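
To make the second major comment concrete, one common way to align station observations with gridded model output is bilinear interpolation to each station's coordinates. The sketch below is illustrative only: the manuscript does not describe its alignment protocol, so the function names (`bilinear_to_stations`, `station_rmse`), the regular latitude-longitude grid, and the handling of missing reports are assumptions introduced here.

```python
import numpy as np

def bilinear_to_stations(field, grid_lats, grid_lons, st_lats, st_lons):
    """Interpolate a (lat, lon) field to station coordinates.

    Assumes grid_lats and grid_lons are strictly increasing and that every
    station lies inside the grid (no longitude wrap-around handling).
    """
    st_lats = np.asarray(st_lats, dtype=float)
    st_lons = np.asarray(st_lons, dtype=float)
    # Index of the lower-left corner of the grid cell containing each station.
    i = np.clip(np.searchsorted(grid_lats, st_lats) - 1, 0, len(grid_lats) - 2)
    j = np.clip(np.searchsorted(grid_lons, st_lons) - 1, 0, len(grid_lons) - 2)
    # Fractional position of the station inside that cell, in [0, 1].
    ty = (st_lats - grid_lats[i]) / (grid_lats[i + 1] - grid_lats[i])
    tx = (st_lons - grid_lons[j]) / (grid_lons[j + 1] - grid_lons[j])
    # Interpolate along longitude on both bounding rows, then along latitude.
    row_lo = field[i, j] * (1 - tx) + field[i, j + 1] * tx
    row_hi = field[i + 1, j] * (1 - tx) + field[i + 1, j + 1] * tx
    return row_lo * (1 - ty) + row_hi * ty

def station_rmse(forecast_at_stations, observations):
    """RMSE over stations, skipping missing observations encoded as NaN."""
    valid = ~np.isnan(observations)
    diff = forecast_at_stations[valid] - observations[valid]
    return float(np.sqrt(np.mean(diff ** 2)))
```

This sketch ignores elevation differences between a station and the model grid cell, which is one of the representativeness issues the comment asks the authors to characterize.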

minor comments (1)

[Availability statement] The GitHub link is provided but the repository contents (data loaders, exact metric implementations) are not referenced in the text; adding a pointer to specific scripts would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight areas where additional detail will strengthen the manuscript's verifiability. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract / Evaluation section] The abstract states that evaluation results reveal substantial discrepancies, yet the manuscript provides no quantitative tables, error bars, or statistical tests comparing reanalysis vs. RealBench metrics (e.g., no RMSE or event-specific scores for the 2025 period). This absence prevents verification of the central claim.

Authors: We agree that the absence of explicit quantitative comparisons limits immediate verification of the central claim. In the revised manuscript we will expand the evaluation section with tables that directly compare reanalysis-based and RealBench metrics on the 2025 period, including RMSE, event-specific scores, error bars, and statistical significance tests. These additions will be cross-referenced from the abstract.

revision: yes

Referee: [Data sources and evaluation framework] No protocol is described for aligning >10k in-situ station observations with model grid points or for characterizing station measurement error and representativeness (e.g., § on data sources). Without these details the assumption that the combined ground truth is complete and unbiased cannot be assessed.

Authors: We acknowledge that the current description of the ground-truth protocol is insufficient. The revised data-sources section will specify the alignment procedure (including interpolation method and grid-matching criteria) and will characterize station representativeness and measurement uncertainty by citing established protocols and, where feasible, providing quantitative bounds on error sources.

revision: yes

Referee: [Test-set construction] The claim that the 2025 test set is strictly OOD requires explicit documentation of all model training data cutoffs and any reanalysis overlap; this information is not supplied, leaving the no-leakage guarantee unverified.

Authors: We will add a dedicated subsection under test-set construction that tabulates the training-data cutoffs for every model included in the benchmark and explicitly states the absence of overlap with the 2025 window or with any reanalysis products used during training. This documentation will be supported by references to the original model papers and the released benchmark code.

revision: yes
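
The tabulated cutoffs promised in the last response lend themselves to a mechanical check. The following is a minimal sketch under stated assumptions: the source only says the test set spans 2025, so the exact window bounds, the function names (`leaks_into_test`, `audit`), and any model names passed in are hypothetical and supplied purely for illustration.

```python
from datetime import date

# Assumed bounds: the source states the test set spans 2025 but gives no exact dates.
TEST_START = date(2025, 1, 1)
TEST_END = date(2025, 12, 31)

def leaks_into_test(train_start, train_end,
                    test_start=TEST_START, test_end=TEST_END):
    """Return True if the closed training interval overlaps the closed test window."""
    return train_start <= test_end and train_end >= test_start

def audit(training_periods):
    """Map each model name to whether its training period overlaps the test window.

    training_periods: dict of name -> (train_start, train_end) as date objects.
    """
    return {name: leaks_into_test(start, end)
            for name, (start, end) in training_periods.items()}
```

A date-overlap check of this kind only covers direct temporal leakage; it says nothing about fine-tuning data or about products derived from observations inside the test window.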

Circularity Check

0 steps flagged

No significant circularity

The paper introduces RealBench as a benchmark definition and data protocol for evaluating weather forecasting models. It contains no derivations, equations, fitted parameters, predictions, or self-referential claims that reduce to inputs by construction. The central contribution is the specification of an out-of-distribution test set, data sources, and evaluation metrics, all presented as design choices rather than derived results. No self-citation chains or ansatzes are invoked to support any load-bearing step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on domain assumptions about data sources and test-set independence rather than new mathematical entities or fitted parameters.

axioms (2)

domain assumption: The 2025 period is strictly out-of-distribution and captures recent atmospheric regimes without leakage from any training data.
Explicitly stated in the abstract as the basis for the test set.
domain assumption: Low-latency operational analysis and in-situ station observations provide a more faithful ground truth than reanalysis for model evaluation.
Central premise motivating the benchmark design.

pith-pipeline@v0.9.1-grok · 5797 in / 1354 out tokens · 35839 ms · 2026-06-30T11:51:19.815884+00:00 · methodology

Reference graph

Works this paper leans on

64 extracted references · 17 canonical work pages · 2 internal anchors

[1]

Electricity load forecasting for urban area using weather forecast information

Vasudev Dehalwar, Akhtar Kalam, Mohan Lal Kolhe, and Aladin Zayegh. Electricity load forecasting for urban area using weather forecast information. In2016 IEEE International Conference on Power and Renewable Energy (ICPRE), pages 355–359. IEEE, 2016

2016
[2]

A review of high impact weather for aviation meteorology.Pure and applied geophysics, 176(5):1869–1921, 2019

Ismail Gultepe, R Sharman, Paul D Williams, Binbin Zhou, G Ellrod, P Minnis, S Trier, S Griffin, Seong S Yum, B Gharabaghi, et al. A review of high impact weather for aviation meteorology.Pure and applied geophysics, 176(5):1869–1921, 2019

1921
[3]

Precision agriculture: Weather forecasting for future farming

Kingsley Eghonghon Ukhurebor, Charles Oluwaseun Adetunji, Olaniyan T Olugbemi, W Nwankwo, Akinola Samson Olayinka, C Umezuruike, and Daniel Ingo Hefft. Precision agriculture: Weather forecasting for future farming. InAi, edge and iot-based smart agriculture, pages 101–121. Elsevier, 2022

2022
[4]

Learningskillfulmedium-rangeglobalweatherforecasting

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri,TimoEwalds,ZachEaton-Rosen,WeihuaHu,etal. Learningskillfulmedium-rangeglobalweatherforecasting. Science, 382(6677):1416–1421, 2023

2023
[5]

Gencast: Diffusion-based ensemble forecasting for medium-range weather.arXiv preprint arXiv:2312.15796, 2023

Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. Gencast: Diffusion-based ensemble forecasting for medium-range weather.arXiv preprint arXiv:2312.15796, 2023

work page arXiv 2023
[6]

Va-moe: Variables-adaptive mixture of experts for incremental weather forecasting

Hao Chen, Han Tao, Guo Song, Jie Zhang, Yunlong Yu, Yonghan Dong, and Lei Bai. Va-moe: Variables-adaptive mixture of experts for incremental weather forecasting. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[7]

Stcast: Adaptive boundary alignment for global and regional weather forecasting

Hao Chen, Tao Han, Jie Zhang, Song Guo, and Lei Bai. Stcast: Adaptive boundary alignment for global and regional weather forecasting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

2026
[8]

Accuratemedium-rangeglobalweather forecasting with 3d neural networks.Nature, 619(7970):533–538, 2023

KaifengBi,LingxiXie,HenghengZhang,XinChen,XiaotaoGu,andQiTian. Accuratemedium-rangeglobalweather forecasting with 3d neural networks.Nature, 619(7970):533–538, 2023

2023
[9]

Weatherbench: a benchmark data set for data-driven weather forecasting.Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002203, 2020

Stephan Rasp, Peter D Dueben, Sebastian Scher, Jonathan A Weyn, Soukayna Mouatadid, and Nils Thuerey. Weatherbench: a benchmark data set for data-driven weather forecasting.Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002203, 2020

2020
[10]

Weatherbench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez- Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, et al. Weatherbench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

2024
[11]

Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, ThorstenKurth,DavidHall,ZongyiLi,KamyarAzizzadenesheli,etal.Fourcastnet: Aglobaldata-drivenhigh-resolution weather model using adaptive fourier neural operators.arXiv preprint arXiv:2202.11214, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Scaling transformer neural networks for skillful and reliable medium-range weather forecasting.Advances in Neural Information Processing Systems, 37:68740–68771, 2024

Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster, Sandeep Madireddy, and Aditya Grover. Scaling transformer neural networks for skillful and reliable medium-range weather forecasting.Advances in Neural Information Processing Systems, 37:68740–68771, 2024

2024
[13]

Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

Lei Chen, Xiaohui Zhong, Feng Zhang, Yuan Cheng, Yinghui Xu, Yuan Qi, and Hao Li. Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

2023
[14]

The era5 global reanalysis.Quarterly journal of the royal meteorological society, 146(730):1999–2049, 2020

Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. The era5 global reanalysis.Quarterly journal of the royal meteorological society, 146(730):1999–2049, 2020. 11 RealBench arXiv Preprint

1999
[15]

Weatherreal: a benchmark based on in-situ observations for evaluating weather models

Weixin Jin, Jonathan Weyn, Pengcheng Zhao, Siqi Xiang, Jiang Bian, Zuliang Fang, Haiyu Dong, Hongyu Sun, Kit Thambiratnam, and Qi Zhang. Weatherreal: a benchmark based on in-situ observations for evaluating weather models. arXiv preprint arXiv:2409.09371, 2024

work page arXiv 2024
[16]

Benchmarking Physics-Informed Time-Series Models for Operational Global Station Weather Forecasting

Tao Han, Song Guo, Zhenghao Chen, Wanghan Xu, and Lei Bai. Weather-5k: A large-scale global station weather dataset towards comprehensive time-series forecasting benchmark.arXiv preprint arXiv:2406.14399, 6(2), 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Weatherbench probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models.arXiv preprint arXiv:2205.00865, 2022

Sagar Garg, Stephan Rasp, and Nils Thuerey. Weatherbench probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models.arXiv preprint arXiv:2205.00865, 2022

work page arXiv 2022
[18]

Learned benchmarks for subseasonal forecasting

Soukayna Mouatadid, Paulo Orenstein, Genevieve Flaspohler, Miruna Oprescu, Judah Cohen, Franklyn Wang, Sean Knight, Maria Geogdzhayeva, Sam Levang, Ernest Fraenkel, et al. Learned benchmarks for subseasonal forecasting. arXiv preprint arXiv:2109.10399, 2021

work page arXiv 2021
[19]

Rainbench: Towards data-driven global precipitation forecasting fromsatelliteimagery

Christian Schroeder de Witt, Catherine Tong, Valentina Zantedeschi, Daniele De Martini, Alfredo Kalaitzis, Matthew Chantry, Duncan Watson-Parris, and Piotr Bilinski. Rainbench: Towards data-driven global precipitation forecasting fromsatelliteimagery. InProceedingsoftheAAAIconferenceonartificialintelligence,volume35,pages14902–14910, 2021

2021
[20]

Iowarain: A statewide rain event dataset based on weather radars and quantitative precipitation estimation,

Muhammed Sit, Bong-Chul Seo, and Ibrahim Demir. Iowarain: A statewide rain event dataset based on weather radars and quantitative precipitation estimation.arXiv preprint arXiv:2107.03432, 2021

work page arXiv 2021
[21]

Evan Racah, Christopher Beckham, Tegan Maharaj, Samira Ebrahimi Kahou, Mr Prabhat, and Chris Pal. Ex- tremeweather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events.Advances in neural information processing systems, 30, 2017

2017
[22]

Floodnet: A high resolution aerial imagery dataset for post flood scene understanding.IEEE Access, 9:89644–89654, 2021

Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, and Robin Roberson Murphy. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding.IEEE Access, 9:89644–89654, 2021

2021
[23]

Earthnet2021: A large-scale dataset and challenge for earth surface forecasting as a guided video prediction task

Christian Requena-Mesa, Vitus Benson, Markus Reichstein, Jakob Runge, and Joachim Denzler. Earthnet2021: A large-scale dataset and challenge for earth surface forecasting as a guided video prediction task. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1132–1142, 2021

2021
[24]

Droughted: A dataset and methodology for drought forecasting spanning multiple climate zones

Christoph Minixhofer, Mark Swan, Calum McMeekin, and Pavlos Andreadis. Droughted: A dataset and methodology for drought forecasting spanning multiple climate zones. InTackling Climate Change with Machine Learning: Workshop at ICML 2021, 2021

2021
[25]

Prabhat,KarthikKashinath,MayurMudigonda,SolKim,LukasKapp-Schwoerer,AndreGraubner,EgeKaraismailoglu, Leo von Kleist, Thorsten Kurth, Annette Greiner, et al. Climatenet: an expert-labelled open dataset and deep learning architecture for enabling high-precision analyses of extreme weather.Geoscientific Model Development Discussions, 2020:1–28, 2020

2020
[26]

Climatelearn: Benchmarking machine learning for weather and climate modeling.Advances in Neural Information Processing Systems, 36:75009–75025, 2023

Tung Nguyen, Jason Jewik, Hritik Bansal, Prakhar Sharma, and Aditya Grover. Climatelearn: Benchmarking machine learning for weather and climate modeling.Advances in Neural Information Processing Systems, 36:75009–75025, 2023

2023
[27]

Climart: A benchmark dataset for emulating atmospheric ra- diative transfer in weather and climate models,

Salva Rühling Cachay, Venkatesh Ramesh, Jason NS Cole, Howard Barker, and David Rolnick. Climart: A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models.arXiv preprint arXiv:2111.14671, 2021

work page arXiv 2021
[28]

Climatebench v1

Duncan Watson-Parris, Yuhan Rao, Dirk Olivié, Øyvind Seland, Peer Nowack, Gustau Camps-Valls, Philip Stier, Shahine Bouabid, Maura Dewey, Emilie Fons, et al. Climatebench v1. 0: A benchmark for data-driven climate projections.Journal of Advances in Modeling Earth Systems, 14(10):e2021MS002954, 2022

2022
[29]

Overview of the coupled model intercomparison project phase 6 (cmip6) experimental design and organization

Veronika Eyring, Sandrine Bony, Gerald A Meehl, Catherine A Senior, Bjorn Stevens, Ronald J Stouffer, and Karl E Taylor. Overview of the coupled model intercomparison project phase 6 (cmip6) experimental design and organization. Geoscientific Model Development, 9(5):1937–1958, 2016

1937
[30]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016
[31]

Self-prompting perceptual edge learning for dense prediction.IEEE Transactions on Circuits and Systems for Video Technology, 34(6):4528–4541, 2023

Hao Chen, Yonghan Dong, Zhe-Ming Lu, Yunlong Yu, and Jungong Han. Self-prompting perceptual edge learning for dense prediction.IEEE Transactions on Circuits and Systems for Video Technology, 34(6):4528–4541, 2023

2023
[32]

Ewmoe: An effective model for global weather forecasting with mixture-of-experts

Lihao Gan, Xin Man, Chenghong Zhang, and Jie Shao. Ewmoe: An effective model for global weather forecasting with mixture-of-experts. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 210–218, 2025

2025
[33]

Oneforecast: A universal framework for global and regional weather forecasting

Yuan Gao, Hao Wu, Ruiqi Shu, Huanshuo Dong, Fan Xu, Rui Chen, Yibo Yan, Qingsong Wen, Xuming Hu, Kun Wang, et al. Oneforecast: A universal framework for global and regional weather forecasting. InProceedings of the 42th International Conference on Machine Learning, 2025

2025
[34]

Pixel matching network for cross-domain few-shot segmentation

Hao Chen, Yonghan Dong, Zheming Lu, Yunlong Yu, and Jungong Han. Pixel matching network for cross-domain few-shot segmentation. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 978–987, 2024. 12 RealBench arXiv Preprint

2024
[35]

Diffusion-based decoupled deterministic and uncertain framework for probabilistic multivariate time series forecasting

Qi Li, Zhenyu Zhang, Lei Yao, Zhaoxia Li, Tianyi Zhong, and Yong Zhang. Diffusion-based decoupled deterministic and uncertain framework for probabilistic multivariate time series forecasting. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[36]

Continuous ensemble weather forecasting with diffusion models

Martin Andrae, Tomas Landelius, Joel Oskarsson, and Fredrik Lindsten. Continuous ensemble weather forecasting with diffusion models. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[37]

Multi-content interaction network for few-shot segmentation.ACM Transactions on Multimedia Computing, Communications and Applications, 20(6):1–20, 2024

Hao Chen, Yunlong Yu, Yonghan Dong, Zheming Lu, Yingming Li, and Zhongfei Zhang. Multi-content interaction network for few-shot segmentation.ACM Transactions on Multimedia Computing, Communications and Applications, 20(6):1–20, 2024

2024
[38]

Spherical fourier neural operators: learning stable dynamics on the sphere

Boris Bonev, Thorsten Kurth, Christian Hundt, Jaideep Pathak, Maximilian Baust, Karthik Kashinath, and Anima Anandkumar. Spherical fourier neural operators: learning stable dynamics on the sphere. InProceedings of the 40th International Conference on Machine Learning, 2023

2023
[39]

Koopmanlab: machine learning for solving complex physics equations.APL Machine Learning, 1(3), 2023

Wei Xiong, Muyuan Ma, Xiaomeng Huang, Ziyang Zhang, Pei Sun, and Yang Tian. Koopmanlab: machine learning for solving complex physics equations.APL Machine Learning, 1(3), 2023

2023
[40]

Fourier neural operator for parametric partial differential equations

Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. InInternational Conference on Learning Representations, 2021

2021
[41]

Emformer: Efficient multi-scale transformer for accumulative context weather forecasting

Hao Chen, Tao Han, Jie Zhang, Song Guo, Fenghua Ling, and Lei Bai. Emformer: Efficient multi-scale transformer for accumulative context weather forecasting. InInternational Conference on Machine Learning, 2026

2026
[42]

Dyffusion: A dynamics-informed diffusion model for spatiotemporal forecasting.Advances in neural information processing systems, 36:45259–45287, 2023

Salva Rühling Cachay, Bo Zhao, Hailey Joren, and Rose Yu. Dyffusion: A dynamics-informed diffusion model for spatiotemporal forecasting.Advances in neural information processing systems, 36:45259–45287, 2023

2023
[43]

Seeds: Emulation of weather forecast ensembles with diffusion models.Science Advances, 10:eadk4489, 2024

Lizao Li, Robert Carver, Ignacio Lopez-Gomez, Fei Sha, and John Anderson. Seeds: Emulation of weather forecast ensembles with diffusion models.Science Advances, 10:eadk4489, 2024

2024
[44]

Aifs-crps: ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score.npj Artificial Intelligence, 2(1):18, 2026

Simon Lang, Mihai Alexe, Mariana CA Clare, Christopher Roberts, Rilwan Adewoyin, Zied Ben Bouallègue, Matthew Chantry, Jesper Dramsch, Peter D Dueben, Sara Hahner, et al. Aifs-crps: ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score.npj Artificial Intelligence, 2(1):18, 2026

2026
[45]

Kilometer-scaleconvection-allowingmodelemulationusinggenerative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

Jaideep Pathak, Yair Cohen, Piyush Garg, Peter Harrington, Noah Brenowitz, Dale Durran, Morteza Mardani, Arash Vahdat,ShaomingXu,KarthikKashinath,etal. Kilometer-scaleconvection-allowingmodelemulationusinggenerative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

2026
[46]

Diffusion-lam: probabilistic limited area weather forecasting with diffusion.arXiv preprint arXiv:2502.07532, 2025

Erik Larsson, Joel Oskarsson, Tomas Landelius, and Fredrik Lindsten. Diffusion-lam: probabilistic limited area weather forecasting with diffusion.arXiv preprint arXiv:2502.07532, 2025

work page arXiv 2025
[47]

Omnicast: A masked latent diffusion model for weather forecasting across time scales

Tung Nguyen, Tuan Pham, Troy Arcomano, Rao Kotamarthi, Ian Foster, Sandeep Madireddy, and Aditya Grover. Omnicast: A masked latent diffusion model for weather forecasting across time scales. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[48]

Probabilistic weather forecasting with deterministic guidance-based diffusion model

Donggeun Yoon, Minseok Seo, Doyi Kim, Yeji Choi, and Donghyeon Cho. Probabilistic weather forecasting with deterministic guidance-based diffusion model. In Computer Vision – ECCV 2024, pages 108–124, 2025

2024
[49]

Weather prediction with diffusion guided by realistic forecast processes. arXiv preprint arXiv:2402.06666, 2024

Zhanxiang Hua, Yutong He, Chengqian Ma, and Alexandra Anderson-Frey. Weather prediction with diffusion guided by realistic forecast processes. arXiv preprint arXiv:2402.06666, 2024

2024
[50]

Probabilistic emulation of a global climate model with spherical DYffusion

Salva Rühling Cachay, Brian Henn, Oliver Watt-Meyer, Christopher S. Bretherton, and Rose Yu. Probabilistic emulation of a global climate model with spherical DYffusion. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024
[51]

Codicast: Conditional diffusion model for global weather forecasting with uncertainty quantification

Jimeng Shi, Bowen Jin, Jiawei Han, Sundararaman Gopalakrishnan, and Giri Narasimhan. Codicast: Conditional diffusion model for global weather forecasting with uncertainty quantification. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 9853–9861, 8 2025

2025
[52]

Fengwu-4dvar: Coupling the data-driven weather forecasting model with 4d variational assimilation. arXiv preprint arXiv:2312.12455, 2023

Yi Xiao, Lei Bai, Wei Xue, Kang Chen, Tao Han, and Wanli Ouyang. Fengwu-4dvar: Coupling the data-driven weather forecasting model with 4d variational assimilation. arXiv preprint arXiv:2312.12455, 2023

2023
[53]

Towards an end-to-end artificial intelligence driven global weather forecasting system. arXiv preprint arXiv:2312.12462, 2023

Kun Chen, Lei Bai, Fenghua Ling, Peng Ye, Tao Chen, Kang Chen, Tao Han, and Wanli Ouyang. Towards an end-to-end artificial intelligence driven global weather forecasting system. arXiv preprint arXiv:2312.12462, 2023

2023
[54]

Fengwu-ghr: Learning the kilometer-scale medium-range global weather forecasting. arXiv preprint arXiv:2402.00059, 2024

Tao Han, Song Guo, Fenghua Ling, Kang Chen, Junchao Gong, Jingjia Luo, Junxia Gu, Kan Dai, Wanli Ouyang, and Lei Bai. Fengwu-ghr: Learning the kilometer-scale medium-range global weather forecasting. arXiv preprint arXiv:2402.00059, 2024

2024
[55]

Extremecast: Boosting extreme value prediction for global weather forecast. arXiv preprint arXiv:2402.01295, 2024

Wanghan Xu, Kang Chen, Tao Han, Hao Chen, Wanli Ouyang, and Lei Bai. Extremecast: Boosting extreme value prediction for global weather forecast. arXiv preprint arXiv:2402.01295, 2024

2024
[56]

Neural general circulation models for weather and climate. Nature, 632(8027):1060–1066, 2024

Dmitrii Kochkov, Janni Yuval, Ian Langmore, Peter Norgaard, Jamie Smith, Griffin Mooers, Milan Klöwer, James Lottes, Stephan Rasp, Peter Düben, et al. Neural general circulation models for weather and climate. Nature, 632(8027):1060–1066, 2024

2024
[57]

A foundation model for the earth system. Nature, 641(8065):1180–1187, 2025

Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A Weyn, Haiyu Dong, et al. A foundation model for the earth system. Nature, 641(8065):1180–1187, 2025

2025
[58]

Prithvi wxc: Foundation model for weather and climate

Johannes Schmude, Sujit Roy, Will Trojak, Johannes Jakubik, Daniel Salles Civitarese, Shraddha Singh, Julian Kuehnert, Kumar Ankur, Aman Gupta, Christopher E Phillips, et al. Prithvi wxc: Foundation model for weather and climate. arXiv preprint arXiv:2409.13598, 2024

2024
[59]

AIFS – ECMWF’s data-driven forecasting system

Simon Lang, Mihai Alexe, Matthew Chantry, Jesper Dramsch, Florian Pinault, Baudouin Raoult, Mariana CA Clare, Christian Lessig, Michael Maier-Gerber, Linus Magnusson, et al. Aifs–ecmwf’s data-driven forecasting system. arXiv preprint arXiv:2406.01465, 2024

2024
[60]

Aifs-crps: ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score. arXiv preprint arXiv:2412.15832, 2024

Simon Lang, Mihai Alexe, Mariana CA Clare, Christopher Roberts, Rilwan Adewoyin, Zied Ben Bouallègue, Matthew Chantry, Jesper Dramsch, Peter D Dueben, Sara Hahner, et al. Aifs-crps: Ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score. arXiv preprint arXiv:2412.15832, 2024

2024
[61]

The operational medium-range deterministic weather forecasting can be extended beyond a 10-day lead time. Communications Earth & Environment, 6(1):518, 2025

Kang Chen, Tao Han, Fenghua Ling, Junchao Gong, Lei Bai, Xinyu Wang, Jing-Jia Luo, Ben Fei, Wenlong Zhang, Xi Chen, et al. The operational medium-range deterministic weather forecasting can be extended beyond a 10-day lead time. Communications Earth & Environment, 6(1):518, 2025

2025
[62]

On the measurement of heat waves. Journal of climate, 26(13):4500–4517, 2013

Sarah E Perkins and Lisa V Alexander. On the measurement of heat waves. Journal of climate, 26(13):4500–4517, 2013

2013
[63]

Statistical methods in the atmospheric sciences

Daniel S Wilks. Statistical methods in the atmospheric sciences, volume 100. Academic press, 2011.

A Limitations

Although RealBench provides a more realistic benchmark for AI weather forecasting, several limitations remain. First, WEATHER-10K is based on globally distributed in-situ stations, but the station network is geographi...

2011
[64]

We denote the processed station observation by $y^{v}_{i,t}$ and the corresponding station-interpolated ERA5 reference by $\hat{y}^{v}_{i,t}$. For temperature, dew-point temperature, station-level pressure, sea-level pressure, and wind speed, we flag an observation as anomalous if

$$\left| y^{v}_{i,t} - \hat{y}^{v}_{i,t} \right| > r_{v}, \tag{5}$$

where $r_{v}$ is a variable-specific threshold. We use $r_{v} = 6$ for temperature...
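The flagging rule described above can be illustrated with a minimal sketch. It assumes the garbled inequality is an absolute-difference test between the station observation and the ERA5 reference; the function name `flag_anomalous` and the sample values are hypothetical and are not taken from the paper. The threshold of 6 for temperature comes from the text, whose excerpt does not state the unit.

```python
from typing import List, Sequence


def flag_anomalous(obs: Sequence[float], ref: Sequence[float], threshold: float) -> List[bool]:
    """Flag each observation whose absolute deviation from the reference exceeds the threshold.

    Assumption: the rule is |y - y_hat| > r_v with a strict inequality.
    """
    if len(obs) != len(ref):
        raise ValueError("obs and ref must have the same length")
    return [abs(y - y_hat) > threshold for y, y_hat in zip(obs, ref)]


# Hypothetical temperature series for one station (illustrative values only).
station_temps = [20.0, 31.5, 15.0]
era5_temps = [21.0, 24.0, 15.5]

# r_v = 6 for temperature, as stated in the text.
flags = flag_anomalous(station_temps, era5_temps, threshold=6.0)
```

Only the second value deviates from its reference by more than 6, so it is the only one flagged. Because the comparison is strict, a deviation exactly equal to the threshold is not flagged.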

2025

[1] [1]

Electricity load forecasting for urban area using weather forecast information

Vasudev Dehalwar, Akhtar Kalam, Mohan Lal Kolhe, and Aladin Zayegh. Electricity load forecasting for urban area using weather forecast information. In 2016 IEEE International Conference on Power and Renewable Energy (ICPRE), pages 355–359. IEEE, 2016

2016

[2] [2]

A review of high impact weather for aviation meteorology. Pure and applied geophysics, 176(5):1869–1921, 2019

Ismail Gultepe, R Sharman, Paul D Williams, Binbin Zhou, G Ellrod, P Minnis, S Trier, S Griffin, Seong S Yum, B Gharabaghi, et al. A review of high impact weather for aviation meteorology. Pure and applied geophysics, 176(5):1869–1921, 2019

2019

[3] [3]

Precision agriculture: Weather forecasting for future farming

Kingsley Eghonghon Ukhurebor, Charles Oluwaseun Adetunji, Olaniyan T Olugbemi, W Nwankwo, Akinola Samson Olayinka, C Umezuruike, and Daniel Ingo Hefft. Precision agriculture: Weather forecasting for future farming. In Ai, edge and iot-based smart agriculture, pages 101–121. Elsevier, 2022

2022

[4] [4]

Learning skillful medium-range global weather forecasting

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023

2023

[5] [5]

Gencast: Diffusion-based ensemble forecasting for medium-range weather. arXiv preprint arXiv:2312.15796, 2023

Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. Gencast: Diffusion-based ensemble forecasting for medium-range weather. arXiv preprint arXiv:2312.15796, 2023

2023

[6] [6]

Va-moe: Variables-adaptive mixture of experts for incremental weather forecasting

Hao Chen, Han Tao, Guo Song, Jie Zhang, Yunlong Yu, Yonghan Dong, and Lei Bai. Va-moe: Variables-adaptive mixture of experts for incremental weather forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[7] [7]

Stcast: Adaptive boundary alignment for global and regional weather forecasting

Hao Chen, Tao Han, Jie Zhang, Song Guo, and Lei Bai. Stcast: Adaptive boundary alignment for global and regional weather forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

2026

[8] [8]

Accurate medium-range global weather forecasting with 3d neural networks. Nature, 619(7970):533–538, 2023

Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-range global weather forecasting with 3d neural networks. Nature, 619(7970):533–538, 2023

2023

[9] [9]

Weatherbench: a benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002203, 2020

Stephan Rasp, Peter D Dueben, Sebastian Scher, Jonathan A Weyn, Soukayna Mouatadid, and Nils Thuerey. Weatherbench: a benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002203, 2020

2020

[10] [10]

Weatherbench 2: A benchmark for the next generation of data-driven global weather models. Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, et al. Weatherbench 2: A benchmark for the next generation of data-driven global weather models. Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

2024

[11] [11]

Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators. arXiv preprint arXiv:2202.11214, 2022

2022

[12] [12]

Scaling transformer neural networks for skillful and reliable medium-range weather forecasting. Advances in Neural Information Processing Systems, 37:68740–68771, 2024

Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster, Sandeep Madireddy, and Aditya Grover. Scaling transformer neural networks for skillful and reliable medium-range weather forecasting. Advances in Neural Information Processing Systems, 37:68740–68771, 2024

2024

[13] [13]

Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast. npj climate and atmospheric science, 6(1):190, 2023

Lei Chen, Xiaohui Zhong, Feng Zhang, Yuan Cheng, Yinghui Xu, Yuan Qi, and Hao Li. Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast. npj climate and atmospheric science, 6(1):190, 2023

2023

[14] [14]

The era5 global reanalysis. Quarterly journal of the royal meteorological society, 146(730):1999–2049, 2020

Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. The era5 global reanalysis. Quarterly journal of the royal meteorological society, 146(730):1999–2049, 2020

2020

[15] [15]

Weatherreal: a benchmark based on in-situ observations for evaluating weather models

Weixin Jin, Jonathan Weyn, Pengcheng Zhao, Siqi Xiang, Jiang Bian, Zuliang Fang, Haiyu Dong, Hongyu Sun, Kit Thambiratnam, and Qi Zhang. Weatherreal: a benchmark based on in-situ observations for evaluating weather models. arXiv preprint arXiv:2409.09371, 2024

2024

[16] [16]

Benchmarking Physics-Informed Time-Series Models for Operational Global Station Weather Forecasting

Tao Han, Song Guo, Zhenghao Chen, Wanghan Xu, and Lei Bai. Weather-5k: A large-scale global station weather dataset towards comprehensive time-series forecasting benchmark. arXiv preprint arXiv:2406.14399, 6(2), 2024

2024

[17] [17]

Weatherbench probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models. arXiv preprint arXiv:2205.00865, 2022

Sagar Garg, Stephan Rasp, and Nils Thuerey. Weatherbench probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models. arXiv preprint arXiv:2205.00865, 2022

2022

[18] [18]

Learned benchmarks for subseasonal forecasting

Soukayna Mouatadid, Paulo Orenstein, Genevieve Flaspohler, Miruna Oprescu, Judah Cohen, Franklyn Wang, Sean Knight, Maria Geogdzhayeva, Sam Levang, Ernest Fraenkel, et al. Learned benchmarks for subseasonal forecasting. arXiv preprint arXiv:2109.10399, 2021

2021

[19] [19]

Rainbench: Towards data-driven global precipitation forecasting from satellite imagery

Christian Schroeder de Witt, Catherine Tong, Valentina Zantedeschi, Daniele De Martini, Alfredo Kalaitzis, Matthew Chantry, Duncan Watson-Parris, and Piotr Bilinski. Rainbench: Towards data-driven global precipitation forecasting from satellite imagery. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14902–14910, 2021

2021

[20] [20]

Iowarain: A statewide rain event dataset based on weather radars and quantitative precipitation estimation

Muhammed Sit, Bong-Chul Seo, and Ibrahim Demir. Iowarain: A statewide rain event dataset based on weather radars and quantitative precipitation estimation. arXiv preprint arXiv:2107.03432, 2021

2021

[21] [21]

Evan Racah, Christopher Beckham, Tegan Maharaj, Samira Ebrahimi Kahou, Mr Prabhat, and Chris Pal. Extremeweather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. Advances in neural information processing systems, 30, 2017

2017

[22] [22]

Floodnet: A high resolution aerial imagery dataset for post flood scene understanding. IEEE Access, 9:89644–89654, 2021

Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, and Robin Roberson Murphy. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding. IEEE Access, 9:89644–89654, 2021

2021

[23] [23]

Earthnet2021: A large-scale dataset and challenge for earth surface forecasting as a guided video prediction task

Christian Requena-Mesa, Vitus Benson, Markus Reichstein, Jakob Runge, and Joachim Denzler. Earthnet2021: A large-scale dataset and challenge for earth surface forecasting as a guided video prediction task. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1132–1142, 2021

2021

[24] [24]

Droughted: A dataset and methodology for drought forecasting spanning multiple climate zones

Christoph Minixhofer, Mark Swan, Calum McMeekin, and Pavlos Andreadis. Droughted: A dataset and methodology for drought forecasting spanning multiple climate zones. In Tackling Climate Change with Machine Learning: Workshop at ICML 2021, 2021

2021

[25] [25]

Prabhat, Karthik Kashinath, Mayur Mudigonda, Sol Kim, Lukas Kapp-Schwoerer, Andre Graubner, Ege Karaismailoglu, Leo von Kleist, Thorsten Kurth, Annette Greiner, et al. Climatenet: an expert-labelled open dataset and deep learning architecture for enabling high-precision analyses of extreme weather. Geoscientific Model Development Discussions, 2020:1–28, 2020

2020

[26] [26]

Climatelearn: Benchmarking machine learning for weather and climate modeling. Advances in Neural Information Processing Systems, 36:75009–75025, 2023

Tung Nguyen, Jason Jewik, Hritik Bansal, Prakhar Sharma, and Aditya Grover. Climatelearn: Benchmarking machine learning for weather and climate modeling. Advances in Neural Information Processing Systems, 36:75009–75025, 2023

2023

[27] [27]

Climart: A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models

Salva Rühling Cachay, Venkatesh Ramesh, Jason NS Cole, Howard Barker, and David Rolnick. Climart: A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models. arXiv preprint arXiv:2111.14671, 2021

2021

[28] [28]

Climatebench v1.0: A benchmark for data-driven climate projections

Duncan Watson-Parris, Yuhan Rao, Dirk Olivié, Øyvind Seland, Peer Nowack, Gustau Camps-Valls, Philip Stier, Shahine Bouabid, Maura Dewey, Emilie Fons, et al. Climatebench v1.0: A benchmark for data-driven climate projections. Journal of Advances in Modeling Earth Systems, 14(10):e2021MS002954, 2022

2022

[29] [29]

Overview of the coupled model intercomparison project phase 6 (cmip6) experimental design and organization

Veronika Eyring, Sandrine Bony, Gerald A Meehl, Catherine A Senior, Bjorn Stevens, Ronald J Stouffer, and Karl E Taylor. Overview of the coupled model intercomparison project phase 6 (cmip6) experimental design and organization. Geoscientific Model Development, 9(5):1937–1958, 2016

2016

[30] [30]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016

[31] [31]

Self-prompting perceptual edge learning for dense prediction. IEEE Transactions on Circuits and Systems for Video Technology, 34(6):4528–4541, 2023

Hao Chen, Yonghan Dong, Zhe-Ming Lu, Yunlong Yu, and Jungong Han. Self-prompting perceptual edge learning for dense prediction. IEEE Transactions on Circuits and Systems for Video Technology, 34(6):4528–4541, 2023

2023

[32] [32]

Ewmoe: An effective model for global weather forecasting with mixture-of-experts

Lihao Gan, Xin Man, Chenghong Zhang, and Jie Shao. Ewmoe: An effective model for global weather forecasting with mixture-of-experts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 210–218, 2025

2025

[33] [33]

Oneforecast: A universal framework for global and regional weather forecasting

Yuan Gao, Hao Wu, Ruiqi Shu, Huanshuo Dong, Fan Xu, Rui Chen, Yibo Yan, Qingsong Wen, Xuming Hu, Kun Wang, et al. Oneforecast: A universal framework for global and regional weather forecasting. In Proceedings of the 42nd International Conference on Machine Learning, 2025

2025

[34] [34]

Pixel matching network for cross-domain few-shot segmentation

Hao Chen, Yonghan Dong, Zheming Lu, Yunlong Yu, and Jungong Han. Pixel matching network for cross-domain few-shot segmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 978–987, 2024

2024

[35] [35]

Diffusion-based decoupled deterministic and uncertain framework for probabilistic multivariate time series forecasting

Qi Li, Zhenyu Zhang, Lei Yao, Zhaoxia Li, Tianyi Zhong, and Yong Zhang. Diffusion-based decoupled deterministic and uncertain framework for probabilistic multivariate time series forecasting. In The Thirteenth International Conference on Learning Representations, 2025

2025

[36] [36]

Continuous ensemble weather forecasting with diffusion models

Martin Andrae, Tomas Landelius, Joel Oskarsson, and Fredrik Lindsten. Continuous ensemble weather forecasting with diffusion models. In The Thirteenth International Conference on Learning Representations, 2025

2025

[37] [37]

Multi-content interaction network for few-shot segmentation. ACM Transactions on Multimedia Computing, Communications and Applications, 20(6):1–20, 2024

Hao Chen, Yunlong Yu, Yonghan Dong, Zheming Lu, Yingming Li, and Zhongfei Zhang. Multi-content interaction network for few-shot segmentation. ACM Transactions on Multimedia Computing, Communications and Applications, 20(6):1–20, 2024

2024

[38] [38]

Spherical fourier neural operators: learning stable dynamics on the sphere

Boris Bonev, Thorsten Kurth, Christian Hundt, Jaideep Pathak, Maximilian Baust, Karthik Kashinath, and Anima Anandkumar. Spherical fourier neural operators: learning stable dynamics on the sphere. In Proceedings of the 40th International Conference on Machine Learning, 2023

2023

[39] [39]

Koopmanlab: machine learning for solving complex physics equations. APL Machine Learning, 1(3), 2023

Wei Xiong, Muyuan Ma, Xiaomeng Huang, Ziyang Zhang, Pei Sun, and Yang Tian. Koopmanlab: machine learning for solving complex physics equations. APL Machine Learning, 1(3), 2023

2023

[40] [40]

Fourier neural operator for parametric partial differential equations

Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021

2021

[41] [41]

Emformer: Efficient multi-scale transformer for accumulative context weather forecasting

Hao Chen, Tao Han, Jie Zhang, Song Guo, Fenghua Ling, and Lei Bai. Emformer: Efficient multi-scale transformer for accumulative context weather forecasting. In International Conference on Machine Learning, 2026

2026

[42] [42]

Dyffusion: A dynamics-informed diffusion model for spatiotemporal forecasting. Advances in neural information processing systems, 36:45259–45287, 2023

Salva Rühling Cachay, Bo Zhao, Hailey Joren, and Rose Yu. Dyffusion: A dynamics-informed diffusion model for spatiotemporal forecasting. Advances in neural information processing systems, 36:45259–45287, 2023

2023
