Njord: A Probabilistic Graph Neural Network for Ensemble Ocean Forecasting

Daniel Holmberg; Erik Wikingsson; Fredrik Lindsten; Joel Oskarsson; Teemu Roos

arxiv: 2605.15470 · v2 · pith:JCS2PZMKnew · submitted 2026-05-14 · 💻 cs.LG · physics.ao-ph

Pith reviewed 2026-05-19 14:39 UTC · model grok-4.3

classification 💻 cs.LG physics.ao-ph

keywords ocean forecasting · graph neural networks · probabilistic models · ensemble prediction · machine learning · uncertainty estimation · ocean dynamics

The pith

A probabilistic graph neural network for ocean forecasting achieves the lowest errors on a global benchmark while providing uncertainty estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Njord, a model that combines deep latent variables with graph neural networks to generate probabilistic ensemble forecasts of ocean dynamics in both global and regional settings. Each forecast step is sampled in a single forward pass, so an ensemble is built by drawing repeated samples, unlike deterministic machine learning models, which produce one trajectory and cannot represent the uncertainty of a chaotic ocean. To scale to large irregular grids, the model uses K-means cluster meshes that adapt to sea surface geometry, at 0.25° resolution globally and 2 km resolution regionally in the Baltic Sea. On the OceanBench benchmark, evaluated against real-world observations, Njord records the lowest average errors across upper-ocean variables, with the largest gains in surface temperature prediction.

Core claim

Njord integrates a deep latent variable framework with a graph neural network architecture on K-means cluster meshes, so that each forecast step can be sampled in a single forward pass. The resulting ensemble forecasts achieve the lowest average errors across upper-ocean variables on OceanBench while supplying uncertainty estimates from the ensemble.
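The sampling idea in this claim can be illustrated with a toy sketch. It assumes a state-dependent Gaussian latent variable and a residual update, and it replaces the learned graph neural networks with random linear maps; `W_prior`, `W_dec_x`, `W_dec_z`, `sample_step` and `rollout_ensemble` are hypothetical names, not the paper's code or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_feat, n_latent = 50, 3, 4

# Hypothetical stand-ins for learned networks (random linear maps, no message passing).
W_prior = rng.normal(scale=0.1, size=(n_feat, 2 * n_latent))
W_dec_x = rng.normal(scale=0.1, size=(n_feat, n_feat))
W_dec_z = rng.normal(scale=0.1, size=(n_latent, n_feat))

def sample_step(x, rng):
    """One stochastic step: draw z from a state-dependent Gaussian, decode a residual."""
    stats = x @ W_prior
    mu, log_sigma = stats[:, :n_latent], stats[:, n_latent:]
    z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)  # single latent draw
    residual = np.tanh(x @ W_dec_x + z @ W_dec_z)           # single decoder pass
    return x + residual

def rollout_ensemble(x0, n_members, n_steps, rng):
    """Each member is an independent rollout with fresh latent draws at every step."""
    members = []
    for _ in range(n_members):
        x = x0
        for _ in range(n_steps):
            x = sample_step(x, rng)
        members.append(x)
    return np.stack(members)  # shape: (members, nodes, features)

x0 = rng.normal(size=(n_nodes, n_feat))
ens = rollout_ensemble(x0, n_members=8, n_steps=5, rng=rng)
```

The point of the sketch is only the cost structure: one latent draw and one decoder evaluation per step per member, with ensemble spread arising from the different draws.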

What carries the argument

K-means cluster meshes adapted to irregular sea surface geometry, combined with a deep latent variable model that supports efficient probabilistic sampling within the graph neural network.
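A minimal sketch of how mesh nodes could be placed by spherical K-means on the 3D Cartesian coordinates of ocean grid points is given here. The toy grid, the rectangular "land" block, the cluster count and the function names are illustrative assumptions, and the paper's actual procedure may differ in initialisation, iteration count and post-processing.

```python
import numpy as np

def latlon_to_unit_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3D unit vectors."""
    lat = np.deg2rad(lat_deg)
    lon = np.deg2rad(lon_deg)
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)

def spherical_kmeans(points, k, n_iter=20, seed=0):
    """Cluster unit vectors by cosine similarity; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.argmax(points @ centroids.T, axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) == 0:
                continue  # keep the old centroid for an empty cluster
            mean = members.sum(axis=0)
            norm = np.linalg.norm(mean)
            if norm > 0:
                centroids[j] = mean / norm  # project the mean back onto the sphere
    labels = np.argmax(points @ centroids.T, axis=1)
    return centroids, labels

# Hypothetical toy domain: a coarse lat/lon grid with a rectangular "land" block.
lat, lon = np.meshgrid(np.arange(-60, 61, 5.0), np.arange(0, 360, 5.0), indexing="ij")
sea_mask = ~((np.abs(lat) < 20) & (lon > 100) & (lon < 200))

# Only sea points enter the clustering, so node density follows ocean coverage.
ocean_xyz = latlon_to_unit_xyz(lat[sea_mask], lon[sea_mask])
mesh_nodes, assignment = spherical_kmeans(ocean_xyz, k=32)
```

Because land points never enter the clustering, centroids concentrate where ocean grid points are dense. In this sketch a centroid is a normalised cluster mean and is not guaranteed to coincide with a sea grid point.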

Load-bearing premise

K-means cluster meshes adapt sufficiently well to irregular sea-surface geometry to allow accurate and efficient scaling of the graph neural network to global 0.25-degree and regional 2 km grids.

What would settle it

Demonstrating that a competing model produces lower average errors than Njord across upper-ocean variables on the OceanBench benchmark when validated against real-world observations would undermine the performance advantage.

Figures

Figures reproduced from arXiv: 2605.15470 by Daniel Holmberg, Erik Wikingsson, Fredrik Lindsten, Joel Oskarsson, Teemu Roos.

**Figure 1.** Njord. … at global short-range (1–10 days) timescales. These models are, however, deterministic: they produce a single trajectory and are typically trained with mean squared error, which encourages predictions toward the conditional mean of the future state rather than capturing the full predictive distribution. Consequently, they tend to smooth over fine-scale variance and offer limited insight into the pro…

**Figure 2.** One-step prediction in the Njord model. Residuals are predicted at time …

**Figure 3.** Example of graph node placement in the Red Sea. 4.1 A graph adapted to ocean geometry. Graph-based global weather forecasting models use icosahedral meshes [30, 9, 31] for constructing the spatial graph that the model operates over. These meshes are constructed by iteratively subdividing an icosahedron, with each subdivision quadrupling the number of nodes and edges [30]. As the size of the graph heavily …

**Figure 4.** RMSE for Sea Surface Temperature (SST), Sea Surface Height (SSH), Sea Surface Salin…

**Figure 5.** SSR averaged over all global ocean variables. The Spread-Skill Ratio (SSR) in …

**Figure 6.** Global SST at a 10 d lead, initialized on 2024-01-30. Ground truth is GLO12 analysis.

**Figure 7.** Arctic SIT at 10 d lead time, initialized 2024-01-30. Ground truth is GLO12 analysis.

**Figure 8.** Global SST predictions evaluated on satellite measurements. To further evaluate SST forecasts outside of OceanBench, we compare the predicted potential temperature of the uppermost ocean layer against a global ocean bias-adjusted SST product [42], based on multi-sensor satellite observations …

**Figure 10.** RMSE for Temperature (T), Salinity (S), Zonal Current (U) at 47 m depth, as well as Sea …

**Figure 11.** Baltic Sea SST at 10 d lead time, initialized 2024-03-05. Ground truth is NEMO analysis.

**Figure 9.** SSR averaged over Baltic Sea variables. Across variables, Njord-Baltic achieves RMSE values comparable to SeaCast while providing probabilistic forecasts. In this regional setting, GLO12 exhibits a relatively flat error curve, similar to a climatological baseline. Both Njord-Baltic and SeaCast clearly outperform persistence. Njord-Baltic matches SeaCast in deterministic accuracy while additionally provi…

**Figure 12.** One-step prediction in the Njord-Baltic model. Residuals are predicted at time …

**Figure 13.** Global graphs used by Njord, with grid nodes in blue, encoding/decoding edges in black, …

**Figure 14.** Regional graphs used by Njord, with grid nodes in blue, M2G and G2M edges in black, …

**Figure 15.** Example of mesh node placement in the Gulf of California (latitude …

**Figure 16.** Example of mesh node placement in the northern Red Sea and Suez Canal (latitude …

**Figure 17.** Example of mesh node placement in the Bråviken bay and Östergötland Archipelago, on …

**Figure 18.** Example of mesh node placement in the Turku Archipelago in south-western Finland.

**Figure 19.** Ensemble mean CRPS scorecards. The heatmaps display the relative difference between …

**Figure 20.** The heatmaps display the relative difference in RMSE and CRPS between Njord trained …

**Figure 21.** Spatial evaluation of SIC at a 30-day lead time. The panels compare the ground truth …

**Figure 22.** Log-scaled scatter density heatmaps evaluating predicted versus observed SIC and SIT at …

**Figure 23.** Ensemble mean CRPS scorecards. The heatmaps display the relative difference between …

**Figure 24.** Global RMSE of SST by forecast lead time, where Njord has the lowest error compared to satellite measurements. The dataset merges multi-sensor satellite observations into a Level-3 global grid …

**Figure 25.** Spatial distribution of normalized RMSE difference for SST between Njord ensemble …

**Figure 26.** Surface variables: SSH, SIC, and SIT. Columns from left to right show RMSE, CRPS, …

**Figure 27.** Temperature at six different depths. Columns from left to right show RMSE, CRPS, and …

**Figure 28.** Salinity at six different depths. Columns from left to right show RMSE, CRPS, and SSR.

**Figure 29.** Zonal current at six different depths. Columns from left to right show RMSE, CRPS, and …

**Figure 30.** Normalized RMSE difference for various variables and depth levels, comparing ensemble …

**Figure 31.** Sea ice concentration at lead time 10 d, init 2024-12-24.

**Figure 32.** Sea ice thickness at lead time 10 d, init 2024-12-24.

**Figure 33.** Temperature at the surface, lead time 10 d, init 2024-12-24.

**Figure 34.** Salinity at the surface, lead time 10 d, init 2024-12-24.

**Figure 35.** Zonal current at the surface, lead time 10 d, init 2024-12-24.

**Figure 36.** Meridional current at the surface, lead time 10 d, init 2024-12-24.

**Figure 37.** Sea surface height at lead time 10 d, init 2024-12-24.

**Figure 38.** Surface variables: SLA, SIC and SIT. Reanalysis variants are shown dashed and analysis …

**Figure 39.** Temperature at 1, 9, 28, 47 and 91 m depth.

**Figure 40.** Salinity at 1, 9, 28, 47 and 91 m depth.

**Figure 41.** Meridional current at 1, 9, 28, 47 and 91 m depth.

**Figure 42.** Sea ice concentration at lead time 10 d, init 2024-02-20.

**Figure 43.** Sea ice thickness at lead time 10 d, init 2024-02-20.

**Figure 44.** Temperature at the surface, lead time 10 d, init 2024-02-20.

**Figure 45.** Salinity at the surface, lead time 10 d, init 2024-02-20.

**Figure 46.** Zonal current at the surface, lead time 10 d, init 2024-02-20.

**Figure 47.** Meridional current at the surface, lead time 10 d, init 2024-02-20.

**Figure 48.** Sea level anomaly at lead time 10 d, init 2024-02-20.

Original abstract

Ocean dynamics are inherently chaotic, yet existing machine learning ocean models produce only deterministic forecasts. We introduce Njord, a probabilistic data-driven model for ocean forecasting, applicable to both global and regional domains. Njord combines a deep latent variable framework with a graph neural network architecture, enabling sampling each forecast step in a single forward pass. We apply Njord globally at 0.25° resolution and regionally to the Baltic Sea at 2 km resolution. To scale to these large ocean grids we introduce K-means cluster meshes that adapt to irregular sea surface geometry. Experiments demonstrate strong performance on both domains compared to deterministic machine learning baselines, while also providing uncertainty estimates from the sampled ensemble forecasts. On the global OceanBench benchmark, Njord achieves the lowest errors on average across upper-ocean variables when evaluated against real-world observations, with the largest improvements in surface temperature prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Njord, a probabilistic graph neural network for ensemble ocean forecasting that combines a deep latent variable model with GNN message passing, so that each forecast step is sampled in a single forward pass. It scales the approach to a global 0.25° grid and a regional 2 km Baltic Sea grid by introducing K-means cluster meshes that adapt to irregular sea-surface geometry. The central empirical claim is that Njord attains the lowest average errors across upper-ocean variables on the OceanBench benchmark when evaluated against real-world observations, with the largest gains in surface temperature, while also supplying uncertainty estimates from the ensemble.

Significance. If the performance and scaling claims are substantiated, the work would be significant for demonstrating that probabilistic GNNs can deliver calibrated ensemble forecasts for chaotic ocean dynamics at both global and high-resolution regional scales. The provision of uncertainty estimates alongside competitive point forecasts against real observations addresses a practical gap in existing deterministic ML ocean models. The adaptive mesh construction, if shown to respect physical boundaries, could serve as a reusable technique for applying graph-based methods to masked geophysical domains.
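The ensemble metrics named in the figure captions (RMSE, CRPS and SSR) can be sketched as follows. These are common textbook forms, assumed here for illustration. The paper's exact definitions, for example any finite-ensemble correction in the SSR or area weighting over the grid, are not reproduced, and the synthetic data is purely hypothetical.

```python
import numpy as np

def ensemble_rmse(ens, obs):
    """RMSE of the ensemble mean; ens has shape (members, points)."""
    return float(np.sqrt(np.mean((ens.mean(axis=0) - obs) ** 2)))

def spread_skill_ratio(ens, obs):
    """Ensemble spread divided by ensemble-mean RMSE (uncorrected variant)."""
    spread = np.sqrt(np.mean(ens.var(axis=0, ddof=1)))
    return float(spread / ensemble_rmse(ens, obs))

def crps_ensemble(ens, obs):
    """Empirical CRPS: E|X - y| - 0.5 * E|X - X'|, averaged over points."""
    term1 = np.mean(np.abs(ens - obs[None, :]), axis=0)
    term2 = np.mean(np.abs(ens[:, None, :] - ens[None, :, :]), axis=(0, 1))
    return float(np.mean(term1 - 0.5 * term2))

# Synthetic example: observation and members are drawn from the same distribution,
# so the ensemble is calibrated by construction.
rng = np.random.default_rng(0)
signal = rng.normal(size=2000)
obs = signal + rng.normal(size=2000)
ens = signal[None, :] + rng.normal(size=(20, 2000))

rmse = ensemble_rmse(ens, obs)
ssr = spread_skill_ratio(ens, obs)
crps = crps_ensemble(ens, obs)
```

An SSR near 1 indicates that ensemble spread matches forecast error, while values well below 1 indicate an overconfident (under-dispersed) ensemble. With a single member the CRPS above reduces to the mean absolute error, which is why it is a natural score for comparing ensembles with deterministic baselines.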

major comments (1)

[Abstract] Abstract and mesh-construction section: the claim that K-means cluster meshes 'adapt to irregular sea surface geometry' is load-bearing for the scaling argument to 0.25° global and 2 km regional grids, yet no description is given of how land-sea masks are enforced, whether invalid cross-land edges are removed, or what mesh-quality metrics (e.g., connectivity, boundary fidelity) are satisfied. Standard K-means on latitude-longitude coordinates does not inherently respect masks; without explicit post-processing or boundary-aware clustering, message passing can produce unphysical connections, undermining the applicability claim.

minor comments (1)

[Abstract] Abstract: quantitative error values, baseline definitions, and training details are omitted even though the headline performance claim is stated; adding at least the key RMSE or MAE numbers and the names of the deterministic ML baselines would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. The concern about insufficient description of the K-means mesh construction and mask handling is well-taken. We address this point below and will revise the manuscript to provide the requested technical details.

Referee: [Abstract] Abstract and mesh-construction section: the claim that K-means cluster meshes 'adapt to irregular sea surface geometry' is load-bearing for the scaling argument to 0.25° global and 2 km regional grids, yet no description is given of how land-sea masks are enforced, whether invalid cross-land edges are removed, or what mesh-quality metrics (e.g., connectivity, boundary fidelity) are satisfied. Standard K-means on latitude-longitude coordinates does not inherently respect masks; without explicit post-processing or boundary-aware clustering, message passing can produce unphysical connections, undermining the applicability claim.

Authors: We agree that the manuscript currently provides insufficient detail on how the K-means meshes enforce land-sea boundaries. In the revised version we will expand the mesh-construction section with the following additions: (i) clustering is performed exclusively on sea-grid points identified by the land-sea mask; (ii) after clustering, any graph edges connecting nodes separated by land are explicitly removed by a post-processing step that checks line-of-sight connectivity within the masked domain; (iii) we will report quantitative mesh-quality metrics including average node degree, fraction of boundary nodes, and verification that no cross-land edges remain. These clarifications will substantiate the adaptation claim and rule out unphysical message passing. We believe the revised description will fully address the referee’s concern.

Revision: yes
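One way the line-of-sight check in point (ii) could work is sketched below, assuming a regular 2D land-sea mask and nodes given as (row, column) grid indices. The sampling density, the straight-segment test in index space and all function names are assumptions for illustration. The simulated rebuttal does not specify an algorithm, and a global grid would also need to handle longitude wrap-around.

```python
import numpy as np

def edge_crosses_land(sea_mask, a, b, samples_per_cell=4):
    """True if the straight segment a -> b (row, col indices) touches a land cell."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sample densely enough that no grid cell along the segment is skipped.
    n = int(np.ceil(np.abs(b - a).max() * samples_per_cell)) + 1
    t = np.linspace(0.0, 1.0, n)[:, None]
    pts = np.rint(a + t * (b - a)).astype(int)
    return bool(np.any(~sea_mask[pts[:, 0], pts[:, 1]]))

def drop_cross_land_edges(sea_mask, nodes, edges):
    """Keep only edges whose connecting segment stays over sea cells."""
    return [(i, j) for i, j in edges
            if not edge_crosses_land(sea_mask, nodes[i], nodes[j])]

# Hypothetical 5 x 7 domain with a strip of land in column 3, rows 1 to 3.
sea_mask = np.ones((5, 7), dtype=bool)
sea_mask[1:4, 3] = False

nodes = [(2, 1), (2, 5), (0, 1), (0, 5)]
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
kept = drop_cross_land_edges(sea_mask, nodes, edges)
```

In this toy case the edge between nodes 0 and 1 runs straight across the land strip and is dropped, while the edge between nodes 2 and 3 passes through the open row above it and is kept.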

Circularity Check

0 steps flagged

No circularity; derivation and claims are self-contained with external validation

The paper presents Njord as a novel probabilistic latent-variable GNN for ensemble ocean forecasting, with K-means cluster meshes introduced to handle irregular sea-surface geometry at global 0.25° and regional 2 km scales. The central performance claim rests on evaluation against real-world observations on the public OceanBench benchmark, which is independent of the model's fitted parameters or internal definitions. No equations, predictions, or uniqueness arguments in the abstract or described content reduce by construction to inputs, self-citations, or ansatzes; the architecture and mesh adaptation are positioned as original contributions whose validity is tested externally rather than assumed via prior self-referential results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard neural-network training assumptions and the domain premise that graph representations plus clustering suffice for ocean geometry; no new physical axioms or invented entities are introduced.

free parameters (1)

Neural network hyperparameters (depth, width, learning rate, latent dimension)
Chosen or tuned during training; typical for any deep learning model and not derived from first principles.

axioms (1)

Domain assumption: Ocean dynamics on irregular domains can be faithfully represented by graph neural networks on K-means-derived meshes
Invoked to justify scaling to global and regional grids; stated in the abstract description of the architecture.

pith-pipeline@v0.9.0 · 5684 in / 1251 out tokens · 73474 ms · 2026-05-19T14:39:20.562050+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

"To construct a graph better adapted to the geometry of the global ocean we instead place the graph nodes based on the density of ocean grid points. We apply spherical K-means clustering of the ocean grid point 3D Cartesian coordinates..."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

"Njord combines a deep latent variable framework with a graph neural network architecture, enabling sampling each forecast step in a single forward pass."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.