pith. machine review for the scientific record.

arxiv: 2604.25559 · v1 · submitted 2026-04-28 · ⚛️ physics.ao-ph

Recognition: unknown

Representing the Surface Ocean in ECMWF's data-driven forecasting system AIFS

Pith reviewed 2026-05-07 13:57 UTC · model grok-4.3

classification ⚛️ physics.ao-ph
keywords machine learning forecasting · coupled atmosphere ocean · medium range forecasts · ocean waves · sea ice · data driven modeling · surface ocean

The pith

A single machine-learning model that combines atmosphere and surface ocean predictions gains about one day of skill for marine forecasts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a machine-learning approach that models the atmosphere and the surface ocean, including waves and sea ice, together in one system. Traditional numerical models use separate components for these parts, but here the model learns their interactions directly from data. The goal is to improve medium-range forecasts and enable better representation of coupled processes like how the ocean affects weather. Evaluation shows gains in skill for ocean variables, and the model behaves consistently even with unusual starting points. This suggests data-driven methods can handle complex Earth system interactions effectively.

Core claim

The paper reports that adding the surface ocean to a data-driven weather model improves forecast skill for nearly all evaluated marine variables by approximately one day at medium-range lead times, relative to physics-based models. The model learns correlations across the atmosphere-ocean interface in a single network, and the authors report that its responses remain physically consistent.
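
A "one day" gain is usually read off error-versus-lead-time curves: for each baseline lead time, find the lead time at which the new model reaches the same error. The sketch below illustrates that reading with linear interpolation. It is not the paper's verification code; the function name `lead_time_gain` and the toy error curves are assumptions made for illustration.

```python
import numpy as np

def lead_time_gain(lead_days, err_baseline, err_model):
    """Days of lead time gained by the model over the baseline.

    For each baseline lead time, find the lead time at which the model
    reaches the same error level and return the difference in days.
    Assumes the model error grows strictly with lead time.
    """
    lead_days = np.asarray(lead_days, dtype=float)
    err_baseline = np.asarray(err_baseline, dtype=float)
    err_model = np.asarray(err_model, dtype=float)
    if np.any(np.diff(err_model) <= 0):
        raise ValueError("model error must increase strictly with lead time")
    # Invert the model curve: error level -> lead time. Error levels the
    # model never reaches within the forecast range are returned as NaN.
    matched = np.interp(err_baseline, err_model, lead_days,
                        left=np.nan, right=np.nan)
    return matched - lead_days

# Toy curves (hypothetical numbers, not taken from the paper).
days = [1, 2, 3, 4, 5]
baseline_rmse = [2.0, 3.0, 4.0, 5.0, 6.0]
model_rmse = [1.0, 2.0, 3.0, 4.0, 5.0]
gain = lead_time_gain(days, baseline_rmse, model_rmse)
```

The last entry of `gain` is NaN here because the toy model never reaches the baseline's day-5 error within the forecast range; a real evaluation would need longer forecasts or would report the gain only where both curves overlap.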

What carries the argument

A single neural network jointly represents atmosphere and ocean variables. It is trained on tailored datasets, with loss scaling to guide learning across multi-scale dynamics and with dedicated handling of missing values over land.
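
To make the two ingredients concrete, the sketch below shows one generic way to combine per-variable loss weights with a mask for points where an ocean variable is undefined (for example SST over land). This is an illustration under stated assumptions, not the AIFS training loss: the function name `masked_weighted_mse`, the use of NaN to mark missing targets, and the weight values are all hypothetical.

```python
import numpy as np

def masked_weighted_mse(pred, target, var_weights):
    """Weighted mean squared error that ignores undefined target points.

    pred, target : arrays of shape (n_vars, n_points); target is NaN where
                   the variable is undefined (assumption: NaN marks land).
    var_weights  : array of shape (n_vars,), one loss weight per variable.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    weights = np.asarray(var_weights, dtype=float)

    valid = ~np.isnan(target)
    # Zero the error at masked points so predictions there cannot
    # influence the loss.
    diff = np.where(valid, pred - target, 0.0)
    n_valid = valid.sum(axis=1)
    per_var_mse = (diff ** 2).sum(axis=1) / np.maximum(n_valid, 1)

    # Variables with no valid points contribute nothing.
    weights = np.where(n_valid > 0, weights, 0.0)
    if weights.sum() == 0:
        raise ValueError("no valid target points in any variable")
    return float((weights * per_var_mse).sum() / weights.sum())

# Two variables on three grid points; NaN marks land for each variable.
pred = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
target = np.array([[1.0, 2.0, np.nan], [1.0, np.nan, np.nan]])
loss = masked_weighted_mse(pred, target, var_weights=[1.0, 3.0])
```

Because the error is zeroed wherever the target is undefined, whatever the network outputs over land has no effect on the loss, which is the property a masking scheme of this kind is meant to provide.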

If this is right

  • Improved predictions of wave swell and sea surface temperature changes from storms.
  • More accurate medium-range forecasts for marine applications without needing multiple models.
  • Potential to expand to full Earth system modeling with added components.
  • Robust performance on initial conditions not seen during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This unified approach could simplify the development of forecasting systems by eliminating the need for interface codes between models.
  • Extending the model to include deeper ocean layers might enhance seasonal predictions.
  • The method could be tested on other coupled systems such as atmosphere-land interactions.

Load-bearing premise

The patterns in historical data are sufficient for the model to accurately predict future states and maintain physical consistency even without explicit physical laws built into the system.

What would settle it

Compare the model's sea surface temperature forecasts against observations during an extreme weather event that occurred after the training period, and check whether the skill improvement holds.

Figures

Figures reproduced from arXiv: 2604.25559 by Ana Prieto Nemesio, Baudouin Raoult, Charles Pelletier, Christian Lessig, Florian Pinault, Gabriel Moldovan, Gert Mertes, Hao Zuo, Harrison Cook, Irina Sandu, Jakob Schloer, Jean-Raymond Bidlot, Josh Kousal, Kristian Mogensen, Lorenzo Zampieri, Mariana C. A. Clare, Mario Santa Cruz, Matthew Chantry, Peter Dueben, Philip Browne, Rachel Furner, Sara Hahner, Sarah Keeley, Simon Lang, Steffen Tietsche.

Figure 1: Comparison of AIFS Marine and IFS forecasts of Hurricanes Idalia and Franklin over the Gulf of Mexico
Figure 2: Illustration of the handling of missing values for prognostic sea surface temperature (SST). The input field
Figure 3: Histogram of 6h sea ice concentration predictions by the model AIFS Ocean before applying the bounding
Figure 4: Comparison of AIFS Waves and the physics-based wave model in root mean square error (RMSE) of
Figure 5: Comparison of two variations of a joint model to AIFS Waves in RMSE of significant wave height (SWH)
Figure 6: (a,b) Integrated Ice Edge Error (IIEE) for the Arctic and Antarctic as a function of lead time, verified against ORAS6 for 15 June–15 December 2023. (c–f) Spatial maps of the Mean Absolute Error difference (∆MAE) in sea ice concentration between AIFS Ocean and IFS, averaged over forecast days 8–10, for two sub-periods: 15 June–15 September 2023 and 15 September–15 December 2023.
Figure 7: Forecast verification of sea surface temperature (top two rows) and sea surface height (bottom two rows)
Figure 8: Anomaly correlation skill scores for geopotential at 500 hPa (left) and temperature at 850 hPa (right) in the
Figure 9: Root mean squared error (RMSE) scores for 2-metre temperature (left) and 10-metre wind speed (right)
Figure 10: Normalised change in RMSE for 2 m temperature relative to the AIFS Atmosphere for the period 15 June–
Figure 11: AIFS Marine forecast fields for the Bellingshausen and Weddell Seas, initialised on 26 August 2023 at 00
Figure 12: Sea surface temperature (SST) anomaly over the Gulf of Mexico and the western North Atlantic for
Figure 13: Initialisation of the AIFS Waves model with synthetic large-period waves at isolated locations, while the
Figure 14: Sensitivity experiment in which sea ice is removed from the initial conditions. Time series of (top)
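
Figure 6 reports the Integrated Ice Edge Error (IIEE). In its usual definition, the IIEE is the total area where forecast and verification disagree on whether sea ice concentration exceeds a threshold, split into an overestimated and an underestimated part. The sketch below follows that definition on a flat list of grid cells. The 15% threshold, the function name `iiee`, and the toy cell areas are assumptions; the paper's exact configuration is not stated in the caption.

```python
import numpy as np

def iiee(sic_forecast, sic_truth, cell_area, threshold=0.15):
    """Integrated Ice Edge Error on a set of grid cells.

    sic_forecast, sic_truth : sea ice concentration as fractions in [0, 1].
    cell_area               : area of each grid cell (any consistent unit).
    Returns (total, overestimate, underestimate) in the unit of cell_area.
    NaN concentrations (e.g. land) compare as "no ice" and add no error
    as long as they are NaN in both fields.
    """
    forecast_ice = np.asarray(sic_forecast, dtype=float) >= threshold
    truth_ice = np.asarray(sic_truth, dtype=float) >= threshold
    area = np.asarray(cell_area, dtype=float)

    over = float(np.sum(area * (forecast_ice & ~truth_ice)))   # ice forecast, none verified
    under = float(np.sum(area * (~forecast_ice & truth_ice)))  # ice verified, none forecast
    return over + under, over, under

# Four toy cells with different areas (hypothetical values).
total, over, under = iiee(
    sic_forecast=[0.0, 0.2, 0.9, 0.1],
    sic_truth=[0.2, 0.2, 0.0, 0.0],
    cell_area=[1.0, 2.0, 3.0, 4.0],
)
```

Reporting the two components alongside the total helps separate a forecast that places the ice edge too far out from one that retreats it too quickly.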
Original abstract

Machine-learning (ML) models, such as the AIFS at the ECMWF, have revolutionised weather forecasting in recent years. We present an extension of the AIFS that jointly models the atmosphere and surface ocean, including ocean waves and sea ice. The primary objective of this extension is to enhance machine-learning medium-range forecasting and enable new use cases by expanding the weather state to better capture coupled surface processes. Our approach departs from traditional numerical models by not having two separate models for the atmosphere and marine components. The joint model instead learns correlations across the entire atmosphere-ocean interface in a component-agnostic way, and can exploit the expressive capacity of ML architectures to learn cross-component relationships directly from the data. We leverage tailored and targeted datasets and solve model design challenges such as missing values over land, multi-scale temporal dynamics, and physical realism of forecast fields and demonstrate the utility of loss scaling in guiding the learning process. We evaluate how representing the surface ocean affects medium-range weather forecasts. We also assess the model's ability to predict surface-ocean fields, including wave swell and tropical-cyclone cold wakes. For nearly all evaluated marine variables, we observe an improvement of approximately one day in forecast skill at medium-range lead times compared to physics-based models. Furthermore, we demonstrate that the model is robust to idealised initial conditions outside the training distribution and responds to them in a physically consistent way. Overall, our findings suggest that the joint AIFS modelling approach offers significant potential for combined atmosphere-ocean forecasting. Our work provides a solid foundation for future development of data-driven coupled Earth system models with greater flexibility and physical fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents an extension of ECMWF's AIFS machine-learning weather forecasting model to jointly represent the atmosphere and surface ocean (including waves and sea ice) within a single component-agnostic architecture. It claims that this joint approach yields an approximately one-day improvement in forecast skill for nearly all evaluated marine variables at medium-range lead times relative to physics-based models, while also demonstrating physical consistency in responses to idealized initial conditions outside the training distribution. The work addresses practical challenges such as missing land values, multi-scale temporal dynamics, and physical realism via tailored datasets and loss scaling.

Significance. If the reported skill gains prove robust, this work would be significant as one of the first demonstrations of a unified data-driven model handling coupled atmosphere-ocean processes without separate numerical components. It highlights the potential for ML architectures to learn cross-interface correlations directly from reanalysis data and provides a foundation for more flexible Earth-system forecasting systems. The emphasis on loss scaling and idealized robustness tests adds value for guiding future coupled ML model development.

major comments (3)
  1. [Abstract and Results] The central claim of an approximately one-day skill improvement for marine variables comes without error bars, statistical significance tests, or explicit details on the physics-based baseline models and verification protocols (including any rules for excluding training-distribution overlap). Without these, it is difficult to determine whether the reported gain is robust or could be explained by differences in training data or evaluation setup.
  2. [Robustness tests] The idealized initial-condition probes, while useful, are narrow and do not address secular trends (e.g., warming SST baselines), rare compound events, or error accumulation over 5–10 day leads. These omissions directly affect the generalizability assumption underlying the skill claims versus coupled physics models.
  3. [Model design and loss scaling] The paper states that loss scaling guides physical realism, but provides insufficient quantitative detail on the scaling factors, their derivation, or ablation results showing their necessity for cross-component consistency. This weakens the ability to assess how the joint model avoids unphysical drift.
minor comments (3)
  1. [Abstract] The abstract refers to 'tailored and targeted datasets' without specifying their construction or differences from standard reanalysis products used in prior AIFS work.
  2. [Figures] Figure captions and axis labels for skill-score plots should explicitly state the exact lead times, variables, and baseline models to improve clarity for readers comparing to physics-based systems.
  3. [Discussion] A brief discussion of how the single-model architecture scales computationally relative to traditional coupled atmosphere-ocean models would help contextualize the practical advantages.
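
The first major comment asks for error bars and significance tests on the skill curves. Because consecutive forecast cases are correlated in time, one common choice is a moving-block bootstrap on the paired score differences between model and baseline. The sketch below shows that idea only; it is not the procedure used in the paper, and the function name `block_bootstrap_ci`, the block length, and the synthetic scores are assumptions.

```python
import numpy as np

def block_bootstrap_ci(diff, block_len, n_boot=2000, alpha=0.05, seed=0):
    """Confidence interval for the mean of time-ordered paired differences.

    diff      : 1-D array, score(model) - score(baseline) per forecast case.
    block_len : number of consecutive cases resampled together, chosen to
                exceed the decorrelation time of the scores.
    Returns (mean, lower, upper) for a (1 - alpha) interval.
    """
    diff = np.asarray(diff, dtype=float)
    n = diff.size
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(diff)")

    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block_len))
    # Random start index for every block of every bootstrap replicate.
    starts = rng.integers(0, n - block_len + 1, size=(n_boot, n_blocks))
    idx = starts[..., None] + np.arange(block_len)
    # Concatenate the blocks and trim each replicate to the original length.
    samples = diff[idx].reshape(n_boot, -1)[:, :n]
    means = samples.mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lower), float(upper)

# Synthetic RMSE differences for 60 forecast cases (hypothetical data).
rng = np.random.default_rng(1)
rmse_diff = rng.normal(loc=-1.0, scale=0.1, size=60)
mean_diff, ci_low, ci_high = block_bootstrap_ci(rmse_diff, block_len=6, n_boot=500)
```

If the whole interval lies below zero, as it does for this synthetic input, the model's error is lower than the baseline's at the chosen confidence level. Applying the same test at each lead time would give the error bars the referee requests.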

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and positive assessment of our manuscript. We address each major comment point by point below and have revised the manuscript accordingly where the suggestions strengthen the presentation of our results.

point-by-point responses
  1. Referee: [Abstract and Results] The central claim of an approximately one-day skill improvement for marine variables comes without error bars, statistical significance tests, or explicit details on the physics-based baseline models and verification protocols (including any rules for excluding training-distribution overlap). Without these, it is difficult to determine whether the reported gain is robust or could be explained by differences in training data or evaluation setup.

    Authors: We agree that additional statistical detail will strengthen the central claim. In the revised manuscript we will add error bars (derived from ensemble spread and temporal bootstrapping) to all skill-score curves, include formal significance tests (paired t-tests and block bootstrapping over independent forecast cases), and expand the Methods section with explicit descriptions of the physics-based baselines (IFS atmospheric forecasts coupled to the operational ocean-wave and sea-ice components), the verification protocol, and the temporal separation rules used to avoid training-distribution overlap. These additions will allow readers to assess the robustness of the reported one-day gain. revision: yes

  2. Referee: [Robustness tests] The idealized initial-condition probes, while useful, are narrow and do not address secular trends (e.g., warming SST baselines), rare compound events, or error accumulation over 5–10 day leads. These omissions directly affect the generalizability assumption underlying the skill claims versus coupled physics models.

    Authors: We acknowledge the limited scope of the current idealized probes. The revised manuscript will expand the discussion to explicitly note these limitations and will add (i) a short sensitivity test using perturbed SST baselines consistent with observed warming trends and (ii) quantitative assessment of error growth out to 10-day leads for the marine variables. Comprehensive evaluation of rare compound events remains outside the present scope and will be flagged as future work; however, the existing physical-consistency tests still provide useful evidence that the model does not produce obviously unphysical responses outside the training distribution. revision: partial

  3. Referee: [Model design and loss scaling] The paper states that loss scaling guides physical realism, but provides insufficient quantitative detail on the scaling factors, their derivation, or ablation results showing their necessity for cross-component consistency. This weakens the ability to assess how the joint model avoids unphysical drift.

    Authors: We agree that more quantitative information is needed. The revised manuscript will include a dedicated subsection (or appendix) that reports the exact scaling factors applied to each variable group, describes their derivation from climatological standard deviations and physical units, and presents ablation experiments comparing the full loss-scaled model against versions without scaling. These results will quantify the reduction in cross-component drift and unphysical artifacts, directly addressing the concern about physical realism. revision: yes
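
The third response mentions deriving scaling factors from climatological standard deviations and physical units. One simple reading of that idea is inverse-variance scaling, which makes each variable's contribution to the loss dimensionless. The sketch below illustrates that reading only; the actual factors and their derivation are not given in the source, and the names `inverse_variance_scaling` and `scaled_mse` and the synthetic climatology are hypothetical.

```python
import numpy as np

def inverse_variance_scaling(climatology, group_weight=1.0):
    """Per-variable loss scaling from climatological variability.

    climatology  : array (n_times, n_vars, n_points), NaN where undefined.
    group_weight : scalar or (n_vars,) array of extra emphasis per variable
                   (assumption: chosen by hand, e.g. per component).
    """
    std = np.nanstd(climatology, axis=(0, 2))
    return group_weight / std ** 2

def scaled_mse(pred, target, scaling):
    """Mean over variables of scaling * per-variable MSE (NaN-aware)."""
    per_var_mse = np.nanmean((pred - target) ** 2, axis=-1)
    return float(np.mean(scaling * per_var_mse))

# Synthetic climatology for two variables with very different magnitudes.
rng = np.random.default_rng(0)
clim = rng.normal(size=(50, 2, 30)) * np.array([1.0, 10.0])[None, :, None]
pred = rng.normal(size=(2, 30))
target = rng.normal(size=(2, 30))

loss_scaled = scaled_mse(pred, target, inverse_variance_scaling(clim))
# Ablation-style comparison: the same errors without any scaling.
loss_unscaled = scaled_mse(pred, target, np.ones(2))
```

With this choice, re-expressing a variable in different units (say metres to centimetres) changes its climatological variance and its squared error by the same factor, so the scaled loss is unchanged. An ablation of the kind the authors promise would compare forecasts trained with and without such factors.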

Circularity Check

0 steps flagged

No circularity: empirical skill gains from trained ML model vs physics baselines

The paper trains a neural network on historical reanalysis data to jointly forecast atmosphere and surface ocean variables, then reports forecast skill improvements (approximately one day at medium range) via direct comparison of model outputs against physics-based models. No derivation chain, equations, or uniqueness theorems are presented that reduce by construction to fitted inputs or self-citations; the central results are empirical evaluations on held-out or out-of-distribution cases, with robustness checks described as separate tests rather than tautological fits. Evaluation shares data sources with training in the usual ML sense but does not force the reported skill deltas.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on a trained neural network whose weights are fitted to reanalysis data, plus assumptions that loss scaling and data handling produce physically consistent outputs without explicit conservation laws.

free parameters (1)
  • loss scaling factors
    Used to balance learning across atmosphere, ocean, waves, and ice components during training.
axioms (1)
  • domain assumption: Historical reanalysis data sufficiently samples the coupled atmosphere-ocean state space for generalization.
    Invoked when claiming robustness to idealised initial conditions outside training distribution.

pith-pipeline@v0.9.0 · 5696 in / 1056 out tokens · 47319 ms · 2026-05-07T13:57:19.850707+00:00 · methodology

