arXiv: 2604.07861 · v1 · submitted 2026-04-09 · ⚛️ physics.ao-ph

Comparing Ocean Forecasts Driven with Machine Learning-based and Physics-based Atmospheric Forcings

Xiaobing Zhou, Frank Colberg, Debra Hudson, Yonghong Yin, Griffith Young, Christopher Bladwell, Catherine Deburgh-Day

Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

classification ⚛️ physics.ao-ph

keywords ocean forecasting · machine learning · atmospheric forcing · NEMO · AIFS · forecast verification · numerical weather prediction

The pith

Ocean forecasts forced by machine learning atmospheric data show comparable or enhanced skill versus physics-based forcing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares 10-day ocean forecasts produced by the NEMO model (GOSI9 configuration) when driven by two different atmospheric datasets. One set uses ECMWF's machine learning AIFS model for winds, temperature and other fields; the other uses the Australian Bureau of Meteorology's traditional ACCESS-G3 physics-based forecasts. The same ocean initial conditions are used for both, and forecasts are started on the first day of each month from 2023 to 2024. Atmospheric forcing quality is first checked against ERA5 and ACCESS-G3 analyses, where AIFS outperforms ACCESS-G3 either from the initial forecast time or after the first few days. Ocean results are then scored against the GOSI9 reanalysis and observations for sea surface temperature, salinity, sea level and currents. The machine learning forcing yields ocean predictions that match or exceed the skill of the physics-based runs.

Core claim

The ocean forecasts forced with AIFS atmospheric data exhibit comparable or enhanced predictive skill compared to those forced with ACCESS-G3 data.

What carries the argument

Side-by-side evaluation of NEMO ocean model skill under AIFS versus ACCESS-G3 atmospheric forcing, using identical initial conditions and assessing surface variables against reanalysis and observations.

Load-bearing premise

That differences in ocean forecast skill are caused only by the atmospheric forcing and not by other model biases or the limited two-year initialization window.

What would settle it

Repeating the experiment over a five-year period or with an independent ocean model and finding that AIFS-forced runs show systematically higher errors than ACCESS-G3 runs would falsify the central claim.

Original abstract

Operational ocean forecasting systems conventionally employ dynamical ocean models driven by atmospheric forcing derived from numerical weather prediction (NWP) models. Recent advancements in artificial intelligence and machine learning (ML) have led to the development of ML-based atmospheric weather models, which have competitive, if not better, medium-range forecast accuracy compared to traditional NWP systems. This study evaluates the impact of ML-based atmospheric forcing on ocean forecast skill through two sets of 10-day forecasts using the UK Met Office GOSI9 configuration of the NEMO dynamical ocean model. Both experiments share identical ocean initial conditions but differ in atmospheric forcing: one uses ECMWF's ML-based AIFS model, while the other uses the Australian Bureau of Meteorology's physics-based NWP model, ACCESS-G3. Forecasts were initialized on the first day of each month over the period 2023-2024. The quality of the atmospheric forcing was assessed by comparing AIFS and ACCESS-G3 forecast skill against both ECMWF reanalysis v5 (ERA5) and ACCESS-G3 analyses. Results indicate that AIFS consistently outperforms ACCESS-G3, either from the initial forecast time or after the first few days. Oceanic forecast skill was evaluated against both the GOSI9 reanalysis and observations, focusing on key surface variables including sea surface temperature, salinity, sea level, and ocean currents. The ocean forecasts forced with AIFS atmospheric data exhibit comparable or enhanced predictive skill compared to those forced with ACCESS-G3 data. These findings underscore the potential of ML-based atmospheric models to replace traditional NWP forcing in operational ocean forecasting systems, offering improved accuracy and computational efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AIFS atmospheric forcing produces ocean forecasts that match or slightly beat ACCESS-G3 over 2023-2024, but the two-year monthly sample leaves the result too fragile to generalize.

The paper runs a straightforward comparison: same NEMO/GOSI9 ocean model and initial conditions, but one run forced by ECMWF's AIFS machine-learning atmosphere and the other by the Bureau's ACCESS-G3 physics model. Over ten-day forecasts started on the first of each month in 2023-2024, the AIFS-driven runs show comparable or better skill on sea surface temperature, salinity, sea level, and currents when checked against both the GOSI9 reanalysis and observations. The authors also verify the forcing itself against ERA5 and ACCESS-G3 analyses, where AIFS beats ACCESS-G3. Holding the ocean model and initial conditions fixed is the cleanest part of the design, because it isolates the forcing difference.

Referee Report

2 major / 1 minor

Summary. The paper compares 10-day NEMO/GOSI9 ocean forecasts initialized monthly in 2023-2024 with identical ocean initial conditions but differing atmospheric forcings: ECMWF's ML-based AIFS versus the physics-based ACCESS-G3 NWP model. Atmospheric forcing quality is assessed against ERA5 and ACCESS-G3 analyses, while ocean forecast skill for SST, salinity, sea level, and currents is evaluated against GOSI9 reanalysis and observations. The central claim is that AIFS-driven forecasts exhibit comparable or enhanced predictive skill relative to ACCESS-G3-driven forecasts.

Significance. If the attribution holds, the work provides evidence that ML-based atmospheric models can serve as effective replacements for traditional NWP forcings in operational ocean forecasting, with potential gains in accuracy and efficiency. The design using shared ocean initial conditions and multi-variable evaluation against both reanalysis and observations is a strength that isolates the forcing impact at a high level.

major comments (2)

[Abstract and Results (oceanic forecast skill evaluation)] The evaluation relies on at most 24 forecast cases initialized monthly over 2023-2024 only (as stated in the abstract). Ocean variables exhibit strong seasonal and interannual variability, and without multi-year baselines, cross-validation across periods, or statistical significance tests on skill deltas, it is not possible to rule out that any AIFS advantage is an artifact of the sampled period rather than a general property of the ML forcing. This directly affects the central attribution claim.
[Methods and Results sections] The manuscript does not report full details on the error metrics, statistical tests applied to skill differences, or controls for potential model-specific biases and confounding factors in the NEMO/GOSI9 setup. This omission leaves the robustness of the 'comparable or enhanced' skill conclusion under-supported given the small sample.
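
The significance testing these comments ask for can be sketched as a paired bootstrap over forecast cases. This is a minimal illustration, not the paper's method: the function name `paired_bootstrap_ci`, its parameters, and the idea of feeding it one error value per start date and forcing are assumptions made for the example. It also treats the start dates as exchangeable, which seasonal variability may violate.

```python
import random


def paired_bootstrap_ci(errors_a, errors_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference errors_a - errors_b.

    Each pair is one forecast case (same start date and lead time) scored
    under forcing A and forcing B. Hypothetical helper, for illustration only.
    Returns (mean difference, lower bound, upper bound).
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("inputs must be paired, one value per forecast case")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Resample the paired differences with replacement and record each mean.
    resampled_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )

    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the inputs are errors such as RMSE, an interval lying entirely below zero would suggest forcing A has lower error than forcing B over the sampled cases. With only about 24 cases the interval is likely to be wide.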

minor comments (1)

[Abstract] The abstract could explicitly state the exact number of forecasts performed and the precise initialization dates to clarify the sample size.
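
The sample size this comment asks about can be enumerated from the abstract's description. A minimal sketch, assuming every month in 2023 and 2024 was actually run (the abstract does not confirm this) and that a 10-day forecast spans the ten days following its start date; `monthly_start_dates` and `FORECAST_DAYS` are names chosen here for illustration.

```python
from datetime import date, timedelta

FORECAST_DAYS = 10  # forecast length stated in the abstract


def monthly_start_dates(first_year=2023, last_year=2024):
    """First day of every month in the inclusive year range (assumed complete)."""
    return [
        date(year, month, 1)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


starts = monthly_start_dates()
# Pair each start date with the date its 10-day forecast would end.
windows = [(start, start + timedelta(days=FORECAST_DAYS)) for start in starts]

print(len(starts))            # upper bound on the number of forecast cases
print(starts[0], starts[-1])  # first and last initialization dates
```

Under these assumptions the list holds 24 start dates, which matches the referee's "at most 24" figure.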

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We have addressed each major comment point by point below, providing clarifications and indicating where revisions will be made to strengthen the paper. Our responses aim to enhance the robustness and transparency of the analysis without overstating the current results.

Referee: [Abstract and Results (oceanic forecast skill evaluation)] The evaluation relies on at most 24 forecast cases initialized monthly over 2023-2024 only (as stated in the abstract). Ocean variables exhibit strong seasonal and interannual variability, and without multi-year baselines, cross-validation across periods, or statistical significance tests on skill deltas, it is not possible to rule out that any AIFS advantage is an artifact of the sampled period rather than a general property of the ML forcing. This directly affects the central attribution claim.

Authors: We agree that the sample of 24 monthly-initialized forecasts over 2023-2024 is limited and does not capture full interannual variability, which is a genuine constraint on generalizability. This is a fair point regarding the central attribution. To mitigate this, we will add statistical significance tests on the skill differences (e.g., paired Wilcoxon signed-rank tests or bootstrap resampling with confidence intervals) in the revised Results section. We will also expand the Discussion to explicitly note this temporal limitation and recommend longer-term evaluations in future work. The experimental design, with identical ocean initial conditions, helps isolate the forcing impact, and the consistency of AIFS advantages across multiple variables and against both reanalysis and independent observations provides supporting evidence within the sampled period. We do not claim universality but demonstrate potential applicability.
revision: partial

Referee: [Methods and Results sections] The manuscript does not report full details on the error metrics, statistical tests applied to skill differences, or controls for potential model-specific biases and confounding factors in the NEMO/GOSI9 setup. This omission leaves the robustness of the 'comparable or enhanced' skill conclusion under-supported given the small sample.

Authors: We thank the referee for highlighting this omission, which we agree weakens the support for our conclusions. In the revised manuscript, we will expand the Methods section to provide: complete definitions and mathematical formulations of all error metrics (RMSE, bias, anomaly correlation coefficient, etc.); descriptions of statistical tests for skill differences (including those we will newly apply); and explicit details on experimental controls, such as the use of identical ocean initial conditions to isolate atmospheric forcing effects, along with any bias-handling procedures or configuration choices in the NEMO/GOSI9 model that address potential confounding factors. These additions will be cross-referenced in the Results to better substantiate the 'comparable or enhanced' skill findings.
revision: yes
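
The three error metrics named in the response above have common textbook forms, sketched here for reference. These are assumed definitions, not the ones the paper uses: the sketch is unweighted (no grid-cell area weighting), uses the uncentered variant of the anomaly correlation coefficient, and operates on flat lists of matched forecast, reference, and climatology values. All function names are illustrative.

```python
from math import sqrt


def _check_lengths(*series):
    """Raise if the input series are empty or differ in length."""
    lengths = {len(s) for s in series}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all inputs must be non-empty and the same length")


def bias(forecast, reference):
    """Mean error: positive when the forecast is too high on average."""
    _check_lengths(forecast, reference)
    return sum(f - r for f, r in zip(forecast, reference)) / len(forecast)


def rmse(forecast, reference):
    """Root-mean-square error between forecast and reference."""
    _check_lengths(forecast, reference)
    squared = sum((f - r) ** 2 for f, r in zip(forecast, reference))
    return sqrt(squared / len(forecast))


def anomaly_correlation(forecast, reference, climatology):
    """Uncentered anomaly correlation coefficient (assumed variant).

    Anomalies are departures from the supplied climatology at each point.
    """
    _check_lengths(forecast, reference, climatology)
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    r_anom = [r - c for r, c in zip(reference, climatology)]
    denominator = sqrt(sum(x * x for x in f_anom) * sum(y * y for y in r_anom))
    if denominator == 0:
        raise ValueError("anomalies are all zero; correlation is undefined")
    return sum(x * y for x, y in zip(f_anom, r_anom)) / denominator
```

Computing each metric per start date and lead time for both forcings would give the paired values that a significance test on skill differences needs.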

standing simulated objections · not resolved

The two-year period (2023-2024) inherently limits our ability to perform multi-year baselines or cross-validation across independent periods without substantial additional data and computational resources.

Circularity Check

0 steps flagged

No significant circularity; independent empirical comparison

The paper performs a straightforward side-by-side evaluation of ocean forecast skill under two external atmospheric forcing datasets (ECMWF AIFS and BoM ACCESS-G3) using identical NEMO/GOSI9 initial conditions and the same ocean model configuration. Skill metrics are computed against independent references (GOSI9 reanalysis, in-situ observations, ERA5). No derivation chain, fitted parameters, self-citations, or ansatzes are invoked to support the central claim; the result is a direct data-driven comparison rather than a reduction of outputs to inputs by construction. The short 2023-2024 sample is a methodological limitation but does not constitute circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No new free parameters or invented entities are introduced; the work relies on standard assumptions in numerical ocean modeling and forecast verification.

axioms (1)

domain assumption: Dynamical ocean models like NEMO produce reliable forecasts when provided with accurate atmospheric forcing.
The entire comparison rests on the assumption that the GOSI9 configuration is a valid testbed for evaluating forcing impacts.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear


The ocean forecasts forced with AIFS atmospheric data exhibit comparable or enhanced predictive skill compared to those forced with ACCESS-G3 data.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
