pith. machine review for the scientific record

arxiv: 2605.01126 · v1 · submitted 2026-05-01 · 💻 cs.LG


Extreme Weather Bench: A framework and benchmark for evaluation of high-impact weather


Pith reviewed 2026-05-09 19:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords extreme weather · benchmark suite · AI weather models · model verification · high-impact weather · numerical weather prediction · case studies · impact metrics

The pith

Extreme Weather Bench supplies standardized case studies, data, and impact metrics to evaluate AI and NWP models on high-impact weather.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Extreme Weather Bench as an open-source benchmark suite designed to standardize evaluation of weather models on events that matter to the public. It supplies a fixed collection of case studies spanning different scales and weather types, along with observational data, impact-focused metrics, and code for running the tests. Current AI weather model assessments often rely on global averages or researcher-chosen examples, which makes fair comparisons difficult. A shared benchmark lets developers verify models against the same high-impact phenomena and track progress toward better real-world performance. The suite is intended to grow through community input and apply to both AI and traditional numerical weather prediction systems.

Core claim

Extreme Weather Bench is a community-driven framework that supplies a standard set of high-impact weather case studies across multiple spatial and temporal scales, paired with observational data, impact-based metrics, and open-source evaluation code. This setup allows direct verification of AI and NWP models on phenomena that affect people, enabling consistent comparisons across models and focusing development on events that carry real consequences rather than on global-scale statistics alone.

What carries the argument

The Extreme Weather Bench suite itself, which organizes case studies, data, metrics, and code into a reusable evaluation system.
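The abstract describes the suite only at the level of case studies, data, metrics, and code, so the following is a minimal sketch of how an evaluation system of that shape might be organized. Every name here (`CaseStudy`, `hit_rate`, the toy heat-wave numbers) is hypothetical, not EWB's actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class CaseStudy:
    """One benchmark event: what to evaluate and against which impact threshold.

    Field names are illustrative assumptions, not EWB's actual interface.
    """
    name: str
    variable: str          # e.g. 2 m temperature
    threshold: float       # value defining the "high-impact" exceedance
    obs: list[float]       # observed values over the event window
    fcst: list[float]      # model forecast at the same times/points

def mean_abs_error(case: CaseStudy) -> float:
    """Traditional accuracy score over the event window."""
    return sum(abs(f - o) for f, o in zip(case.fcst, case.obs)) / len(case.obs)

def hit_rate(case: CaseStudy) -> float:
    """Impact-oriented score: fraction of observed exceedances the model also forecast."""
    hits = sum(1 for f, o in zip(case.fcst, case.obs)
               if o >= case.threshold and f >= case.threshold)
    events = sum(1 for o in case.obs if o >= case.threshold)
    return hits / events if events else float("nan")

# A toy heat-wave case: observations cross a 40 °C threshold on days 2-4.
heatwave = CaseStudy(
    name="toy-heatwave", variable="t2m", threshold=40.0,
    obs=[38.0, 41.0, 42.5, 41.5, 37.0],
    fcst=[37.5, 40.5, 41.0, 39.0, 36.5],
)

print(round(mean_abs_error(heatwave), 2))  # average error in °C
print(round(hit_rate(heatwave), 2))        # share of threshold exceedances caught
```

The point of the sketch is the separation of concerns the paper claims as its contribution: the case study fixes the event, the data, and the threshold, so any model scored through the same functions is directly comparable.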

If this is right

  • Models can be tested and ranked on identical high-impact events instead of researcher-selected examples.
  • Direct head-to-head comparisons become possible between AI weather models and traditional NWP systems.
  • Evaluation shifts from global error statistics toward metrics that reflect public impacts such as damage or disruption.
  • Open code and data lower the barrier for new groups to participate in model verification.
  • Ongoing community additions will expand coverage to more phenomena and regions over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of EWB could create a shared reference point similar to standard datasets in other AI domains, making it easier to measure genuine progress in weather prediction.
  • The benchmark may surface systematic weaknesses in current models for specific hazard types that global metrics currently obscure.
  • Integration with real-time forecast systems could turn the benchmark into a continuous monitoring tool rather than a one-time test set.
  • If the metrics prove predictive of operational value, they could influence how funding agencies and operational centers prioritize model development.

Load-bearing premise

The selected case studies and impact-based metrics represent the full range of high-impact weather events around the world and will produce model improvements that transfer to operational forecasting.

What would settle it

A set of models that perform well on the EWB cases but show no corresponding gains in real-world forecasts of high-impact events outside those cases, or a finding that the chosen cases miss major types of hazards experienced globally.

read the original abstract

Forecasting the wide variety of high-impact weather events experienced globally is a challenge for both Artificial Intelligence (AI) and Numerical Weather Prediction (NWP) models and it is critical that such models be properly verified before deployment. Although AI weather models are rapidly evolving, much of their evaluation is currently done either with a global-scale evaluation or by hand-picking a small number of case studies or a region. A widely-used open-source benchmark suite focusing on high-impact weather will help to drive the science forward for all scales of weather models, as it has for other AI fields. Here we introduce Extreme Weather Bench (EWB), a new community-driven benchmark suite that facilitates model validation and verification on a variety of high-impact hazards that matter to people around the globe. EWB provides a standard set of case studies (spanning across multiple spatial and temporal scales and different parts of the weather spectrum), observational data, impact-based metrics, and open-source code for users to evaluate their models. Verifying that a model works against a standard set of case studies, especially events that are high-impact for the general public, is a key piece of improving the trustworthiness of AI models. EWB will help to drive the science forward for all weather models, enabling true comparisons across models and evaluating models on specific high-impact phenomena through the use of case studies. EWB is a free open-source community-driven system and will continue to evolve to include additional phenomena, test cases and metrics in collaboration with the worldwide weather and forecast verification community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Extreme Weather Bench (EWB), a community-driven open-source benchmark suite for evaluating AI and NWP models on high-impact weather. It supplies a standardized collection of case studies spanning multiple spatial and temporal scales, associated observational data, impact-based metrics, and code to support verification and enable comparisons across models.

Significance. If widely adopted, EWB could establish a common reference for assessing model performance on societally relevant hazards, addressing the current reliance on global aggregates or ad-hoc case selection. The extensible, community-oriented design and emphasis on impact-based evaluation are strengths that align with successful benchmark efforts in other AI domains.

major comments (1)
  1. Abstract: the positioning of EWB as critical for 'improving the trustworthiness of AI models' and 'driving the science forward' rests on the untested assumption that the chosen cases and impact-based metrics are representative and will produce meaningful improvements; the manuscript provides no quantitative validation results, example model evaluations, or comparisons to demonstrate this utility.
minor comments (2)
  1. Abstract: the term 'different parts of the weather spectrum' is introduced without definition or elaboration, which reduces clarity for readers outside the immediate domain.
  2. The manuscript should include explicit links or DOIs to the open-source code repository, data sources, and case-study files in the main text (rather than only in supplementary material) to support immediate reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and recommendation of minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: Abstract: the positioning of EWB as critical for 'improving the trustworthiness of AI models' and 'driving the science forward' rests on the untested assumption that the chosen cases and impact-based metrics are representative and will produce meaningful improvements; the manuscript provides no quantitative validation results, example model evaluations, or comparisons to demonstrate this utility.

    Authors: We agree that the abstract makes forward-looking claims without accompanying quantitative demonstrations or model comparisons in the current manuscript. The paper's core contribution is the introduction of the standardized benchmark (case studies, data, metrics, and code) rather than its application to specific models. To address this directly, we will revise the abstract to moderate the language and add a short section with example evaluations of baseline AI and NWP models on a subset of cases, thereby illustrating the benchmark's intended use. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces Extreme Weather Bench (EWB) as a community-driven benchmark suite supplying case studies, observational data, impact-based metrics, and open-source code for model evaluation on high-impact weather. No derivation chain, equations, fitted parameters, or predictive claims exist that could reduce to inputs by construction. The contribution is the factual assembly and release of these artifacts, with representativeness framed as an assumption about downstream utility rather than a precondition or self-referential step. No self-citations are load-bearing for any central claim, and the work is self-contained as a framework release without internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about what constitutes high-impact weather and how to measure model skill on it; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption A fixed set of case studies can serve as a representative and stable benchmark for global high-impact weather evaluation.
    Invoked when stating that the provided cases span scales and phenomena sufficiently for community use.
  • domain assumption Impact-based metrics provide a more relevant assessment of model performance than traditional verification scores alone.
    Stated as a key feature of the benchmark design.
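The second axiom contrasts impact-based metrics with traditional verification scores. As one concrete illustration, the classical 2×2 contingency-table scores from the forecast-verification literature reward catching threshold exceedances rather than minimizing average error; whether EWB adopts exactly these scores is an assumption of this sketch, not a claim about the paper.

```python
def contingency_scores(fcst, obs, threshold):
    """Standard 2x2 contingency-table scores for an exceedance event.

    POD, FAR, and CSI are classical verification metrics; this assumes at
    least one event was observed or forecast (no zero-division guard).
    """
    hits = misses = false_alarms = 0
    for f, o in zip(fcst, obs):
        f_yes, o_yes = f >= threshold, o >= threshold
        if f_yes and o_yes:
            hits += 1
        elif o_yes:
            misses += 1
        elif f_yes:
            false_alarms += 1
    pod = hits / (hits + misses)                 # probability of detection
    far = false_alarms / (hits + false_alarms)   # false-alarm ratio
    csi = hits / (hits + misses + false_alarms)  # critical success index
    return pod, far, csi

# Toy wind-gust forecasts (m/s) against a 25 m/s damage threshold.
fcst = [26, 24, 27, 30, 20, 26]
obs  = [27, 26, 23, 29, 19, 25]
pod, far, csi = contingency_scores(fcst, obs, threshold=25)
print(round(pod, 2), round(far, 2), round(csi, 2))
```

A model can have a small average wind-speed error and still score poorly here, which is exactly the gap between global error statistics and impact-relevant skill that the axiom presumes.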

pith-pipeline@v0.9.0 · 5598 in / 1255 out tokens · 27757 ms · 2026-05-09T19:17:29.013797+00:00 · methodology


Reference graph

Works this paper leans on

69 extracted references · 41 canonical work pages

  1. [1] Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., Ravuri, S., et al.: Learning skillful medium-range global weather forecasting. Science 382(6669), 1416–1421 (2023) https://doi.org/10.1126/science.adi2336
  2. [2] Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T.R., El-Kadi, A., Masters, D., Ewalds, T., et al.: Probabilistic weather forecasting with machine learning. Nature 637(8044), 84–90 (2025) https://doi.org/10.1038/s41586-024-08252-9
  3. [3] Alet, F., Price, I., El-Kadi, A., Masters, D., Markou, S., Andersson, T.R., Stott, J., et al.: Skillful joint probabilistic weather forecasting from marginals. arXiv preprint arXiv:2506.10772 (2025) https://doi.org/10.48550/arXiv.2506.10772
  4. [4] Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., Tian, Q.: Accurate medium-range global weather forecasting with 3D neural networks. Nature (2023) https://doi.org/10.1038/s41586-023-06185-3
  5. [5] Rasp, S., Dueben, P.D., Scher, S., Weyn, J.A., Mouatadid, S., Thuerey, N.: WeatherBench: A benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems 12(11) (2020) https://doi.org/10.1029/2020ms002203
  6. [6] Rasp, S., Hoyer, S., Merose, A., Langmore, I., Battaglia, P., Russell, T., Sanchez-Gonzalez, A., et al.: WeatherBench 2: A benchmark for the next generation of data-driven global weather models. Journal of Advances in Modeling Earth Systems 16(6) (2024) https://doi.org/10.1029/2023ms004019
  7. [7] Feldmann, M., Beucler, T., Gomez, M., Martius, O.: Lightning-fast convective outlooks: Predicting severe convective environments with global AI-based weather models. Geophysical Research Letters 51(22) (2024) https://doi.org/10.1029/2024gl110960
  8. [8] Liu, C.-C., Hsu, K., Peng, M.S., Chen, D.-S., Chang, P.-L., Hsiao, L.-F., Fong, C.-T., et al.: Evaluation of five global AI models for predicting weather in Eastern Asia and Western Pacific. npj Climate and Atmospheric Science 7(1), 1–12 (2024) https://doi.org/10.1038/s41612-024-00769-0
  9. [9] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: IEEE Computer Vision and Pattern Recognition (CVPR) (2009)
  10. [10] McGovern, A., Demuth, J., Bostrom, A., Wirz, C.D., Tissot, P.E., Cains, M.G., Musgrave, K.D.: The value of convergence research for developing trustworthy AI for weather, climate, and ocean hazards. npj Natural Hazards 1(1), 1–6 (2024) https://doi.org/10.1038/s44304-024-00014-x
  11. [11] Cains, M.G., Wirz, C.D., Demuth, J.L., Bostrom, A., Gagne, D.J., McGovern, A., Sobash, R.A., Madlambayan, D.: Exploring NWS forecasters' assessment of AI guidance trustworthiness. Weather and Forecasting 39, 1219–1241 (2024) https://doi.org/10.1175/WAF-D-23-0180.1
  12. [12] Wirz, C.D., Demuth, J.L., Cains, M.G., White, M., Radford, J., Bostrom, A.: National Weather Service (NWS) forecasters' perceptions of AI/ML and its use in operational forecasting. Bulletin of the American Meteorological Society 105, 2194–2215 (2024) https://doi.org/10.1175/BAMS-D-24-0044.1
  13. [13] Bostrom, A., Demuth, J.L., Wirz, C.D., Cains, M.G., Schumacher, A., Madlambayan, D., Bansal, A.S., Bearth, A., Chase, R., Crosman, K.M., Ebert-Uphoff, I., Gagne, D.J., Guikema, S., Hoffman, R., Johnson, B.B., Kumler-Bonfanti, C., Lee, J.D., Lowe, A., McGovern, A., Williams, J.K.: Trust and trustworthy artificial intelligence: A research agenda for AI ... Risk Analysis 44, 1498–1513 (2024) https://doi.org/10.1111/risa.14245
  14. [14] Murphy, A.H.: What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting 8(2), 281–293 (1993) https://doi.org/10.1175/1520-0434(1993)008⟨0281:WIAGFA⟩2.0.CO;2
  15. [15] Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378 (2007)
  16. [16] Gneiting, T.: Making and evaluating point forecasts. Journal of the American Statistical Association 106(494), 746–762 (2011)
  17. [17] Murphy, A.H.: The Finley affair: A signal event in the history of forecast verification. Weather and Forecasting 11(1), 3–20 (1996) https://doi.org/10.1175/1520-0434(1996)011⟨0003:TFAASE⟩2.0.CO;2
  18. [18] Brooks, H.E., Flora, M.L., Baldwin, M.E.: A rose by any other name: On basic scores from the 2×2 table and the plethora of names attached to them. Artificial Intelligence for the Earth Systems 3, 230104 (2024) https://doi.org/10.1175/AIES-D-23-0104.1
  19. [19] Hakim, G.J., Masanam, S.: Dynamical tests of a deep-learning weather prediction model. Artificial Intelligence for the Earth Systems (2024) https://doi.org/10.1175/AIES-D-23-0090.1
  20. [20] Demuth, J.L., Morss, R.E., Jankov, I., Alcott, T.I., Alexander, C.R., Nietfeld, D., Jensen, T.L., Novak, D.R., Benjamin, S.G.: Recommendations for developing useful and usable convection-allowing model ensemble information for NWS forecasters. Weather and Forecasting 35(4), 1381–1406 (2020) https://doi.org/10.1175/WAF-D-19-0108.1
  21. [21] Ripberger, J., Bell, A., Fox, A., Forney, A., Livingston, W., Gaddie, C., Silva, C., Jenkins-Smith, H.: Communicating probability information in weather forecasts: Findings and recommendations from a living systematic review of the research literature. Weather, Climate, and Society 14(2), 481–498 (2022) https://doi.org/10.1175/WCAS-D-21-0034.1
  22. [22] Richardson, D.S.: Skill and relative economic value of the ECMWF ensemble prediction system. Quarterly Journal of the Royal Meteorological Society 126(563), 649–667 (2000) https://doi.org/10.1002/qj.49712656313
  23. [23] Finley, J.P.: Tornado predictions. American Meteorological Journal 1, 19–24 (1884)
  24. [24] Jolliffe, I.T., Stephenson, D.B.: Forecast Verification: A Practitioner's Guide in Atmospheric Science, p. 274. Wiley-Blackwell, Oxford, U.K. (2012)
  25. [25] Olivetti, L., Messori, G.: Do data-driven models beat numerical models in forecasting weather extremes? A comparison of IFS HRES, Pangu-Weather, and GraphCast. Geoscientific Model Development 17(21), 7915–7962 (2024) https://doi.org/10.5194/gmd-17-7915-2024
  26. [26] Leeuwenburg, T., Loveday, N., Ebert, E.E., Cook, H., Khanarmuei, M., Taggart, R.J., Ramanathan, N., et al.: Scores: A Python package for verifying and evaluating models and predictions with xarray. Journal of Open Source Software 9(99), 6889 (2024) https://doi.org/10.21105/joss.06889
  27. [27] Xiang, A., Andrews, J.T.A., Bourke, R.L., Thong, W., LaChance, J.M., Georgievski, T., Modas, A., et al.: Fair human-centric image dataset for ethical AI benchmarking. Nature 648(8092), 97–108 (2025) https://doi.org/10.1038/s41586-025-09716-2
  28. [28] Kryshtafovych, A., Moult, J., Albrecht, R., Chang, G.A., Chao, K., Fraser, A., Greenfield, J., Hartmann, M.D., Herzberg, O., Josts, I., Leiman, P.G., Linden, S.B., Lupas, A.N., Nelson, D.C., Rees, S.D., Shang, X., Sokolova, M.L., Tidow, H., AlphaFold2 team: Computational models in the service of X-ray and cryo-electron microscopy structure determination.... Proteins 89(12), 1633–1646 (2021) https://doi.org/10.1002/prot.26223
  29. [29] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: A multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the International Conference on Learning Representations (ICLR), pp. 353–355 (2019) https://doi.org/10.18653/v1/W18-5446
  30. [30] Lin, N., Emanuel, K.: Grey swan tropical cyclones. Nature Climate Change 6(1), 106–111 (2016) https://doi.org/10.1038/nclimate2777
  31. [31] Nath, S., Palmer, T.: Can AI models reliably forecast extreme weather events? Nature 651(8106), 583–584 (2026) https://doi.org/10.1038/d41586-026-00842-z
  32. [32] Lerch, S., Thorarinsdottir, T.L., Ravazzolo, F., Gneiting, T.: Forecaster's dilemma: Extreme events and forecast evaluation. Statistical Science 32(1), 106–127 (2017) https://doi.org/10.1214/16-sts588
  33. [33] Magnusson, L., Ackerley, D., Bouteloup, Y., Chen, J.-H., Doyle, J., Earnshaw, P., Kwon, Y.C., et al.: Skill of medium-range forecast models using the same initial conditions. Bulletin of the American Meteorological Society 103(9), 2050–2068 (2022) https://doi.org/10.1175/bams-d-21-0234.1
  34. [34] Integrated Research on Disaster Risk: Peril classification and hazard glossary. Data Project Report 1, Integrated Research on Disaster Risk (2014). https://council.science/wp-content/uploads/2019/12/Peril-Classification-and-Hazard-Glossary-1.pdf
  35. [36] Jin, W., Weyn, J., Zhao, P., Xiang, S., Bian, J., Fang, Z., Dong, H., Sun, H., Thambiratnam, K., Zhang, Q.: WeatherReal: A benchmark based on in-situ observations for evaluating weather models. arXiv [Physics.Ao-Ph] (2024)
  36. [37] Lavers, D.A., Simmons, A., Vamborg, F., Rodwell, M.J.: An evaluation of ERA5 precipitation for climate monitoring. Quarterly Journal of the Royal Meteorological Society 148(748), 3124–3137 (2022) https://doi.org/10.1002/qj.4351
  37. [38] Buschow, S.: Tropical convection in ERA5 has partly shifted from parameterized to resolved. Quarterly Journal of the Royal Meteorological Society 150(758), 436–446 (2024) https://doi.org/10.1002/qj.4604
  38. [39] World Meteorological Organization: Guide to the Global Observing System. WMO, vol. 488. WMO, Geneva (2010)
  39. [40] Dunn, R.J.H., Willett, K.M., Parker, D.E., Mitchell, L.: GHCNh. Geoscientific Instrumentation, Methods and Data Systems Discussions (2016) https://doi.org/10.5194/gi-2016-9
  40. [41] Brimelow, J.C., Kopp, G.A., Sills, D.M.: The Northern Hail Project: A renaissance in hail research in Canada. In: ECSS2023 (2023). Copernicus Meetings
  41. [42] Sills, D.M., Kopp, G.A., Elliott, L., Jaffe, A.L., Sutherland, L., Miller, C.S., Kunkel, J.M., Hong, E., Stevenson, S., Wang, W.: The Northern Tornadoes Project: Uncovering Canada's true tornado climatology. Bulletin of the American Meteorological Society 101(12), 2113–2132 (2020) https://doi.org/10.1175/BAMS-D-20-0012.1
  42. [43] Knapp, K.R., Kruk, M.C., Levinson, D.H., Gibney, E.J.: Archive compiles new resource for global tropical cyclone research. Eos, Transactions, AGU 90(46) (2009) https://doi.org/10.1029/2009EO060002
  43. [44] Mo, R.: EDARA: An ERA5-based dataset for atmospheric river analysis. Scientific Data 11, 900 (2024) https://doi.org/10.1038/s41597-024-03679-1
  44. [45] Gilleland, E., Roux, G.: A new approach to testing forecast predictive accuracy. Meteorological Applications 22(3), 534–543 (2015) https://doi.org/10.1002/met.1485
  45. [46] Anderson, B., Bell, M.: Weather-related mortality: how heat, cold, and heat waves affect mortality in the United States. Epidemiology 20, 205–213 (2009) https://doi.org/10.1097/EDE.0b013e318190ee08
  46. [47] Lang, S., Alexe, M., Chantry, M., Dramsch, J., Pinault, F., Raoult, B., Clare, M.C.A., et al.: AIFS - ECMWF's data-driven forecasting system. arXiv [Physics.Ao-Ph] (2024)
  47. [48] Radford, J.T., Ebert-Uphoff, I., Stewart, J.Q., Musgrave, K.D., DeMaria, R., Tourville, N., Hilburn, K.: Accelerating community-wide evaluation of AI models for global weather prediction by facilitating access to model output. Bulletin of the American Meteorological Society 106(1), 68–76 (2025) https://doi.org/10.1175/bams-d-24-0057.1
  48. [49] Kurth, T., Subramanian, S., Harrington, P., Pathak, J., Mardani, M., Hall, D., Miele, A., Kashinath, K., Anandkumar, A.: FourCastNet: Accelerating global high-resolution weather forecasting using adaptive Fourier neural operators. In: Proceedings of the Platform for Advanced Scientific Computing Conference, pp. 1–11. ACM, New York, NY, USA (2023). https:/...
  49. [50] Bonev, B., Kurth, T., Hundt, C., Pathak, J., Baust, M., Kashinath, K., Anandkumar, A.: Spherical Fourier neural operators: Learning stable dynamics on the sphere. arXiv [Cs.LG] (2023)
  50. [51] Blanchonnet, H.: IFS Documentation. https://www.ecmwf.int/en/publications/ifs-documentation. Last accessed: 22 April 2026 (2022)
  51. [52] Christidis, N., Jones, G., Stott, P.: Dramatically increasing chance of extremely hot summers since the 2003 European heatwave. Nature Climate Change 5, 46–50 (2015) https://doi.org/10.1038/nclimate2468
  52. [53] Meehl, G.A., Tebaldi, C.: More intense, more frequent, and longer lasting heat waves in the 21st century. Science 305, 994–997 (2004) https://doi.org/10.1126/science.1098704
  53. [54] Callahan, C.W., Trok, J., Wilson, A.J., Gould, C.F., Heft-Neal, S., Diffenbaugh, N.S., Burke, M.: Increasing risk of mass human heat mortality if historical weather patterns recur. Nature Climate Change 16(1), 26–32 (2026) https://doi.org/10.1038/s41558-025-02480-1
  54. [55] Henderson, S.B., McLean, K.E., Lee, M.J., Kosatsky, T.: Analysis of community deaths during the catastrophic 2021 heat dome: Early evidence to inform the public health response during subsequent events in greater Vancouver, Canada. Environmental Epidemiology 6(1), 189 (2022) https://doi.org/10.1097/EE9.0000000000000189
  55. [56] White, E., Hill, A.J.: Severe weather forecasts from artificial intelligence weather prediction models. Artificial Intelligence for the Earth Systems 5(1) (2026) https://doi.org/10.1175/aies-d-25-0065.1
  56. [57] Hua, Z., Sobash, R.A., Gagne, D.J., Sha, Y., Anderson-Frey, A.: Improving medium-range severe weather prediction through transformer postprocessing of AI weather forecasts. Artificial Intelligence for the Earth Systems 5(1) (2026) https://doi.org/10.1175/aies-d-25-0045.1
  57. [58] Gensini, V.A., Haberlie, A.M., Marsh, P.T.: Practically perfect hindcasts of severe convective storms. Bulletin of the American Meteorological Society (2020) https://doi.org/10.1175/BAMS-D-19-0321.1
  58. [59] Craven, J.P., Brooks, H.E.: Baseline climatology of sounding derived parameters associated with deep moist convection. National Weather Digest 28, 13–24 (2004)
  59. [60] Dettinger, M.D., Ralph, F.M., Das, T., Neiman, P.J., Cayan, D.: Atmospheric rivers, floods, and the water resources of California. Water 3, 455–478 (2011) https://doi.org/10.3390/w3020445
  60. [61] Guan, B., Waliser, D.E.: A regionally refined quarter-degree global atmospheric rivers database based on ERA5. Scientific Data 11(1), 440 (2024) https://doi.org/10.1038/s41597-024-03258-4
  61. [62] Newell, R.E., Newell, N.E., Zhu, Y., Scott, C.: Tropospheric rivers?—A pilot study. Geophysical Research Letters 19, 2401–2404 (1992) https://doi.org/10.1029/92GL02916
  62. [63] Habeeb, D., Vargo, J., Stone, B.: Rising heat wave trends in large US cities. Natural Hazards 76(3), 1651–1665 (2015)
  63. [64] United States Environmental Protection Agency: Technical documentation: Heat waves. Technical report (2021). https://www.epa.gov/sites/default/files/2021-04/documents/heat-waves td.pdf
  64. [65] Allen, J.T., Allen, E.R., Richter, H., Lepore, C.: Australian tornadoes in 2013: Implications for climatology and forecasting. Monthly Weather Review 149(5), 1211–1232 (2021) https://doi.org/10.1175/mwr-d-20-0248.1
  65. [66] May, R.M., Goebbert, K.H., Thielen, J.E., Leeman, J.R., Camron, M.D., Bruick, Z., Bruning, E.C., Manser, R.P., Arms, S.C., Marsh, P.T.: MetPy: A meteorological Python library for data analysis and visualization. Bulletin of the American Meteorological Society 103(10), 2273–2284 (2022) https://doi.org/10.1175/BAMS-D-21-0125.1
  66. [67] Ullrich, P.A., Zarzycki, C.M., McClenny, E.E., Pinheiro, M.C., Stansfield, A.M., Reed, K.A.: TempestExtremes v2.1: a community framework for feature detection, tracking, and analysis in large datasets. Geoscientific Model Development 14, 5023–5048 (2021) https://doi.org/10.5194/gmd-14-5023-2021
  67. [68] Walsh, K.J.E., et al.: Hurricanes and climate: The U.S. CLIVAR working group on hurricanes. Bulletin of the American Meteorological Society 96, 997–1017 (2015) https://doi.org/10.1175/BAMS-D-13-00242.1
  68. [69] Walsh, K.J.E., Fiorino, M., Landsea, C.W., McInnes, K.L.: Objectively determined resolution-dependent threshold criteria for the detection of tropical cyclones in climate models and reanalyses. Journal of Climate 20, 2307–2314 (2007) https://doi.org/10.1175/JCLI4074.1
  69. [70] Trabing, B.C., Musgrave, K.D., DeMaria, M., Zachry, B.C., Brennan, M.J., Rappaport, E.N.: The development and evaluation of a tropical cyclone probabilistic landfall forecast product. Weather and Forecasting 38, 1363–1374 (2023) https://doi.org/10.1175/WAF-D-22-0199.1