Representing the Surface Ocean in ECMWF's data-driven forecasting system AIFS
Pith reviewed 2026-05-07 13:57 UTC · model grok-4.3
The pith
A single machine-learning model that combines atmosphere and surface ocean predictions gains about one day of skill for marine forecasts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that incorporating the surface ocean into a data-driven weather model produces an improvement of approximately one day in forecast skill for nearly all marine variables at medium-range lead times relative to physics-based models. The model learns correlations across the atmosphere-ocean interface in a unified way and preserves physical realism in its predictions.
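A gain of "about one day" is usually read as a lead-time gain: the data-driven model at day N+1 is roughly as accurate as the physics-based baseline at day N. The sketch below shows one way to compute that quantity from two error curves. The function name `lead_time_gain`, the choice of RMSE as the score, and the linear interpolation are illustrative assumptions, not the paper's verification code.

```python
import numpy as np


def lead_time_gain(lead_days, rmse_model, rmse_baseline, target_day):
    """Lead-time gain (in days) of the model over the baseline at `target_day`.

    Hypothetical helper: finds the lead time at which the model's error
    reaches the error the baseline already has at `target_day`, and returns
    the difference between the two lead times.
    """
    lead_days = np.asarray(lead_days, dtype=float)
    rmse_model = np.asarray(rmse_model, dtype=float)
    rmse_baseline = np.asarray(rmse_baseline, dtype=float)

    # Inverting the model curve with np.interp requires strictly increasing error.
    if np.any(np.diff(rmse_model) <= 0):
        raise ValueError("model error must increase strictly with lead time")

    # Baseline error at the lead time of interest.
    baseline_err = np.interp(target_day, lead_days, rmse_baseline)

    # Lead time at which the model reaches that same error level.
    # Note: np.interp clamps to the first/last lead time outside the curve's range.
    equivalent_lead = np.interp(baseline_err, rmse_model, lead_days)
    return float(equivalent_lead - target_day)
```

A positive return value means the model holds the baseline's day-N accuracy until a later lead time; a value near 1.0 would correspond to the one-day gain described above.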
What carries the argument
A unified neural network jointly represents atmosphere and ocean variables. It is trained with tailored datasets and loss scaling to handle multi-scale temporal dynamics and missing values over land.
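To make the two training devices concrete, here is a minimal NumPy sketch of a loss that ignores grid points where an ocean variable is undefined (for example over land) and applies a per-variable scaling weight. The function `masked_scaled_mse`, the array layout, and the normalisation by the sum of weights are assumptions made for illustration; the paper's actual loss is not specified in this summary.

```python
import numpy as np


def masked_scaled_mse(pred, target, valid_mask, var_weights):
    """Weighted mean-squared error that skips invalid grid points.

    Hypothetical layout (an assumption, not taken from the paper):
      pred, target : arrays of shape (n_vars, n_points)
      valid_mask   : boolean array, True where the target is defined
      var_weights  : array of shape (n_vars,), the loss scaling factors
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    valid_mask = np.asarray(valid_mask, dtype=bool)
    var_weights = np.asarray(var_weights, dtype=float)

    # Zero out the error wherever the target is missing (e.g. SST over land).
    sq_err = np.where(valid_mask, (pred - target) ** 2, 0.0)

    # Average each variable over its valid points only.
    n_valid = np.maximum(valid_mask.sum(axis=1), 1)
    per_var_mse = sq_err.sum(axis=1) / n_valid

    # Combine variables with the scaling factors.
    return float(np.sum(var_weights * per_var_mse) / np.sum(var_weights))
```

Raising the weight of a variable group makes its errors count more in the total, which is the general sense in which loss scaling can steer what the network prioritises.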
If this is right
- Improved predictions of wave swell and sea surface temperature changes from storms.
- More accurate medium-range forecasts for marine applications without needing multiple models.
- Potential to expand to full Earth system modeling with added components.
- Robust performance on initial conditions not seen during training.
Where Pith is reading between the lines
- This unified approach could simplify the development of forecasting systems by eliminating the need for interface codes between models.
- Extending the model to include deeper ocean layers might enhance seasonal predictions.
- The method could be tested on other coupled systems such as atmosphere-land interactions.
Load-bearing premise
The patterns in historical data are sufficient for the model to accurately predict future states and maintain physical consistency even without explicit physical laws built into the system.
What would settle it
A comparison of the model's sea surface temperature forecasts against observations during an extreme weather event that occurs after the training period would show whether the skill improvement holds.
Original abstract
Machine-learning (ML) models, such as the AIFS at the ECMWF, have revolutionised weather forecasting in recent years. We present an extension of the AIFS that jointly models the atmosphere and surface ocean, including ocean waves and sea ice. The primary objective of this extension is to enhance machine-learning medium-range forecasting and enable new use cases by expanding the weather state to better capture coupled surface processes. Our approach departs from traditional numerical models by not having two separate models for the atmosphere and marine components. The joint model instead learns correlations across the entire atmosphere-ocean interface in a component-agnostic way, and can exploit the expressive capacity of ML architectures to learn cross-component relationships directly from the data. We leverage tailored and targeted datasets and solve model design challenges such as missing values over land, multi-scale temporal dynamics, and physical realism of forecast fields and demonstrate the utility of loss scaling in guiding the learning process. We evaluate how representing the surface ocean affects medium-range weather forecasts. We also assess the model's ability to predict surface-ocean fields, including wave swell and tropical-cyclone cold wakes. For nearly all evaluated marine variables, we observe an improvement of approximately one day in forecast skill at medium-range lead times compared to physics-based models. Furthermore, we demonstrate that the model is robust to idealised initial conditions outside the training distribution and responds to them in a physically consistent way. Overall, our findings suggest that the joint AIFS modelling approach offers significant potential for combined atmosphere-ocean forecasting. Our work provides a solid foundation for future development of data-driven coupled Earth system models with greater flexibility and physical fidelity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an extension of ECMWF's AIFS machine-learning weather forecasting model to jointly represent the atmosphere and surface ocean (including waves and sea ice) within a single component-agnostic architecture. It claims that this joint approach yields an approximately one-day improvement in forecast skill for nearly all evaluated marine variables at medium-range lead times relative to physics-based models, while also demonstrating physical consistency in responses to idealized initial conditions outside the training distribution. The work addresses practical challenges such as missing land values, multi-scale temporal dynamics, and physical realism via tailored datasets and loss scaling.
Significance. If the reported skill gains prove robust, this work would be significant as one of the first demonstrations of a unified data-driven model handling coupled atmosphere-ocean processes without separate numerical components. It highlights the potential for ML architectures to learn cross-interface correlations directly from reanalysis data and provides a foundation for more flexible Earth-system forecasting systems. The emphasis on loss scaling and idealized robustness tests adds value for guiding future coupled ML model development.
major comments (3)
- [Abstract and Results] Abstract and Results section: The central claim of an approximately one-day skill improvement for marine variables lacks error bars, statistical significance tests, or explicit details on the physics-based baseline models and verification protocols (including any rules for excluding training-distribution overlap). Without these, it is difficult to determine whether the reported gain is load-bearing or could be explained by differences in training data or evaluation setup.
- [Robustness tests] Robustness tests subsection: The idealized initial-condition probes, while useful, are narrow and do not address secular trends (e.g., warming SST baselines), rare compound events, or error accumulation over 5–10 day leads. These omissions directly affect the generalizability assumption underlying the skill claims versus coupled physics models.
- [Model design] Model design and loss scaling description: The paper states that loss scaling guides physical realism, but provides insufficient quantitative detail on the scaling factors, their derivation, or ablation results showing their necessity for cross-component consistency. This weakens the ability to assess how the joint model avoids unphysical drift.
minor comments (3)
- [Abstract] The abstract refers to 'tailored and targeted datasets' without specifying their construction or differences from standard reanalysis products used in prior AIFS work.
- [Figures] Figure captions and axis labels for skill-score plots should explicitly state the exact lead times, variables, and baseline models to improve clarity for readers comparing to physics-based systems.
- [Discussion] A brief discussion of how the single-model architecture scales computationally relative to traditional coupled atmosphere-ocean models would help contextualize the practical advantages.
Simulated Author's Rebuttal
We thank the referee for their constructive and positive assessment of our manuscript. We address each major comment point by point below and have revised the manuscript accordingly where the suggestions strengthen the presentation of our results.
Point-by-point responses
- Referee: [Abstract and Results] Abstract and Results section: The central claim of an approximately one-day skill improvement for marine variables lacks error bars, statistical significance tests, or explicit details on the physics-based baseline models and verification protocols (including any rules for excluding training-distribution overlap). Without these, it is difficult to determine whether the reported gain is load-bearing or could be explained by differences in training data or evaluation setup.
  Authors: We agree that additional statistical detail will strengthen the central claim. In the revised manuscript we will add error bars (derived from ensemble spread and temporal bootstrapping) to all skill-score curves, include formal significance tests (paired t-tests and block bootstrapping over independent forecast cases), and expand the Methods section with explicit descriptions of the physics-based baselines (IFS atmospheric forecasts coupled to the operational ocean-wave and sea-ice components), the verification protocol, and the temporal separation rules used to avoid training-distribution overlap. These additions will allow readers to assess the robustness of the reported one-day gain.
  Revision: yes
- Referee: [Robustness tests] Robustness tests subsection: The idealized initial-condition probes, while useful, are narrow and do not address secular trends (e.g., warming SST baselines), rare compound events, or error accumulation over 5–10 day leads. These omissions directly affect the generalizability assumption underlying the skill claims versus coupled physics models.
  Authors: We acknowledge the limited scope of the current idealized probes. The revised manuscript will expand the discussion to explicitly note these limitations and will add (i) a short sensitivity test using perturbed SST baselines consistent with observed warming trends and (ii) quantitative assessment of error growth out to 10-day leads for the marine variables. Comprehensive evaluation of rare compound events remains outside the present scope and will be flagged as future work; however, the existing physical-consistency tests still provide useful evidence that the model does not produce obviously unphysical responses outside the training distribution.
  Revision: partial
- Referee: [Model design] Model design and loss scaling description: The paper states that loss scaling guides physical realism, but provides insufficient quantitative detail on the scaling factors, their derivation, or ablation results showing their necessity for cross-component consistency. This weakens the ability to assess how the joint model avoids unphysical drift.
  Authors: We agree that more quantitative information is needed. The revised manuscript will include a dedicated subsection (or appendix) that reports the exact scaling factors applied to each variable group, describes their derivation from climatological standard deviations and physical units, and presents ablation experiments comparing the full loss-scaled model against versions without scaling. These results will quantify the reduction in cross-component drift and unphysical artifacts, directly addressing the concern about physical realism.
  Revision: yes
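The block bootstrapping mentioned in the first response can be sketched as follows. This is a generic moving-block bootstrap on paired score differences (baseline error minus model error, one value per forecast case), written under stated assumptions: the function name `block_bootstrap_ci`, the block length, and the percentile interval are illustrative choices, and nothing here is taken from the authors' verification code.

```python
import numpy as np


def block_bootstrap_ci(score_diff, block_len, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean paired score difference.

    Hypothetical helper. `score_diff` holds one value per forecast case in
    time order; resampling whole blocks of consecutive cases preserves the
    serial correlation between neighbouring forecasts.
    """
    d = np.asarray(score_diff, dtype=float)
    n = d.size
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and the number of cases")

    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block_len))
    offsets = np.arange(block_len)
    means = np.empty(n_resamples)

    for i in range(n_resamples):
        # Draw random block start positions, then expand each into a full block.
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = (starts[:, None] + offsets[None, :]).ravel()[:n]
        means[i] = d[idx].mean()

    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(d.mean()), float(lower), float(upper)
```

If the resulting interval excludes zero, the mean difference between the two systems is unlikely to be a sampling artefact at the chosen level. The block length should be at least as long as the decorrelation time of the forecast errors, which has to be estimated from the data.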
Circularity Check
No circularity: empirical skill gains from trained ML model vs physics baselines
The paper trains a neural network on historical reanalysis data to jointly forecast atmosphere and surface ocean variables, then reports forecast skill improvements (approximately one day at medium range) via direct comparison of model outputs against physics-based models. No derivation chain, equations, or uniqueness theorems are presented that reduce by construction to fitted inputs or self-citations; the central results are empirical evaluations on held-out or out-of-distribution cases, with robustness checks described as separate tests rather than tautological fits. Evaluation shares data sources with training in the usual ML sense but does not force the reported skill deltas.
Axiom & Free-Parameter Ledger
free parameters (1)
- loss scaling factors
axioms (1)
- domain assumption: Historical reanalysis data sufficiently samples the coupled atmosphere-ocean state space for generalization.