pith. machine review for the scientific record. sign in

arxiv: 2605.12639 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

OceanCBM: A Concept Bottleneck Model for Mechanistic Interpretability in Ocean Forecasting

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:45 UTC · model grok-4.3

classification 💻 cs.LG
keywords concept bottleneck modelocean forecastingmechanistic interpretabilitymixed layer heat contentmarine heatwavesmixed supervisiongeophysical fluid dynamicsspatiotemporal prediction
0
0 comments X

The pith

OceanCBM routes forecasts of ocean heat content through prescribed physical concepts plus one free concept to deliver both skill and mechanistic insight.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OceanCBM as a concept bottleneck model that predicts mixed layer heat content, a precursor to marine heatwaves, by inserting an intermediate layer of concepts drawn from geophysical fluid dynamics together with one additional free concept. Mixed supervision combines direct prediction of the target variable with supervision on the prescribed concepts, allowing the model to respect physical structure while the free concept absorbs residual processes. A sympathetic reader cares because conventional machine learning forecasts achieve accuracy yet remain opaque about the drivers of extreme ocean events. The central demonstration is that this mixed approach produces consistent concept activations across different random initializations, whereas purely predictive or purely prescriptive baselines yield highly variable latent structures even when predictive skill is comparable.

Core claim

OceanCBM achieves interpretable, physically grounded representations without sacrificing skill by routing information through an intermediate layer of prescribed geophysical concepts and one free concept under mixed supervision. Across ensemble initializations this yields consistent mechanistic representations of the drivers of mixed layer heat content, while prediction-only and prescription-only baselines learn highly variable latent structures despite similar predictive performance.

What carries the argument

The mixed-supervision concept bottleneck, in which the network must predict both the target heat content and a set of prescribed physical concepts while a single free concept captures residuals and regularizes the representation.

If this is right

  • Forecasts of marine heatwave precursors can be accompanied by explicit statements of the physical concepts that contributed most to each prediction.
  • The interpretability-performance trade-off becomes measurable by direct comparison of mixed-supervision models against prediction-only and prescription-only baselines.
  • The free concept provides a regularized outlet for physical processes omitted from the prescribed set without requiring the model to invent spurious correlations.
  • Consistent representations across initializations indicate that the chosen concepts reliably encode key drivers rather than training artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixed-supervision bottleneck could be tested on other ocean variables such as salinity or velocity fields to check whether consistency gains generalize.
  • Post-training inspection of the free concept activations might reveal previously unmodeled physical relationships in the data.
  • Extending the prescribed concept set or varying the strength of the supervision terms offers a direct experimental route to quantify how much physical structure is needed before consistency saturates.

Load-bearing premise

That the particular concepts chosen from geophysical fluid dynamics are the right set to capture the main drivers of mixed layer heat content variations.

What would settle it

If concept activations vary substantially or fail to align with expected physical relationships across multiple independent training runs that use mixed supervision, the claim of consistent mechanistic representations would be falsified.

Figures

Figures reproduced from arXiv: 2605.12639 by Kieran Ringel, Maike Sonnewald, Sanah Suri.

Figure 1
Figure 1. Figure 1: Workflow diagram. OceanCBM is an NN ensemble that utilizes a U-Net architecture to predict mixed layer heat content, using prescribed concepts and a free concept. The predicted fields are used to study regions with recorded marine heatwaves. 2 [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Concepts and geographical region. Left panel: The prescribed concepts and their role in impacting the mixed layer heat content. Right: North Atlantic geographic region in consideration, the domain of the 2012 Gulf of Maine marine heatwave, used as a case study and the key ocean currents in the region: (1) Gulf Stream, (2) North Atlantic drift, (3) Subpolar gyre, and (4) Labrador current. combines the chann… view at source ↗
Figure 3
Figure 3. Figure 3: OceanCBM skill. (a) depicts seasonally averaged spatial ACC of MLHC, (b) shows high alignment of basin-averaged time-series prediction with the ground-truth where shading represents model spread (±2 standard deviations), and (c) summarizes seasonal ACC for MLHC from baseline models and OceanCBM. Spatial and temporal skill. As an upper bound on predictive skill, we first consider the prediction-only model, … view at source ↗
Figure 4
Figure 4. Figure 4: Concept skill. Left column shows the spatial ACC in predicting (a) heat flux entrainment and (b) buoyancy frequency during summer months (June-July-August). Right column compares the basin-averaged predicted concept values against ORAS5 ground-truth. The shading represents ±2 standard deviations across the ensemble members, to visualize model spread. The multiple viable pathways effect across ensemble memb… view at source ↗
Figure 5
Figure 5. Figure 5: b and Fig. 5c. The prescription-only baseline (Fig. 5b) primarily seems to either latch onto vertical shear, or a combination of MLD tendency and heat flux entrainment, but does not assign much weight to the rest while the prediction-only baseline (Fig. 5c) shows variation in representation learning across the ensemble members, making it impossible to discern a clear learning strategy. While Fig. 5a indica… view at source ↗
Figure 6
Figure 6. Figure 6: together demonstrates that dominance in linear weight by the free concept does not imply sole control of the prediction, as the spatial discrepancies show the predicted MLHC is not simply reducible to the free concept, addressing the risk of concept leakage. We observe the concept regularization phenomena empirically in Tab. 1, which compares vertical shear ACC between OceanCBM and the prescription-only ba… view at source ↗
Figure 7
Figure 7. Figure 7: Six-month retrospective of 2012 Gulf of Maine marine heatwave. Six￾month retrospective showing anomalies of each field relative to its monthly climatology over 1979–2018. the surface. Closer to the onset of the MHW, this behavior shifts: stratification anomalies near the surface become negative, indicating that mixing could occur. By August, heat flux entrainment becomes anomalously low, suggesting a lower… view at source ↗
Figure 8
Figure 8. Figure 8: Out-of-sample time-series reconstructions for vertical shear and mixed layer depth [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Seasonal discrepancies between the predicted mixed layer heat content and the [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
read the original abstract

Extreme ocean phenomena are challenging not only to predict but to diagnose, as accurate forecasts alone do not reveal the underlying physical drivers. While recent machine learning approaches achieve strong predictive skill, they remain largely opaque and provide limited guarantees of fidelity to ground-truth physics. We introduce OceanCBM, the first concept bottleneck model (CBM) for spatiotemporal prediction and mechanistic interrogation of ocean dynamics. OceanCBM uses mixed supervision to predict mixed layer heat content, a key precursor of marine heatwaves, while routing information through an intermediate layer of prescribed concepts derived from geophysical fluid dynamics and a 'free' concept. This design imposes soft physical structure without over-constraining the model, and the free concept both regularizes concept predictions and captures residual physical processes. Across ensemble initializations, we show that mixed supervision yields consistent mechanistic representations, whereas prediction-only and prescription-only baselines learn highly variable latent structures despite similar predictive performance. OceanCBM achieves interpretable, physically grounded representations without sacrificing skill, explicitly characterizing the interpretability-performance trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OceanCBM, a concept bottleneck model for ocean forecasting that predicts mixed layer heat content using mixed supervision through prescribed concepts from geophysical fluid dynamics and one additional free concept. It claims that this architecture produces consistent mechanistic representations across different ensemble initializations, in contrast to prediction-only and prescription-only baselines which show high variability in latent structures despite comparable predictive performance, while not sacrificing skill and characterizing the interpretability-performance trade-off.

Significance. If the empirical claims are substantiated, this work could be significant for the field of machine learning applied to climate and ocean science by demonstrating a practical way to inject physical knowledge into neural networks for improved interpretability without performance loss. It addresses the opacity of standard ML models in forecasting extreme events and provides a framework that could be extended to other spatiotemporal prediction tasks.

major comments (2)
  1. [Abstract] Abstract: the central claim that mixed supervision yields consistent mechanistic representations (lower cross-ensemble variance in latent structures) while baselines do not, despite similar predictive performance, is load-bearing for the contribution but lacks any quantitative metrics, variance values, or statistical tests in the provided description; full experimental results are required to evaluate whether the prescribed GFD concepts enforce structure or if consistency arises from the free concept alone.
  2. [Model and Methods] Model description: the supervision weighting and free concept capacity are free parameters (as noted in the axiom ledger); without ablations or sensitivity analysis showing robustness, the mechanistic consistency claim risks being an artifact of optimization dynamics rather than the prescribed concepts, directly testing the weakest assumption that the free concept remains residual.
minor comments (2)
  1. Define all acronyms (e.g., CBM, GFD) on first use and ensure the architecture diagram explicitly distinguishes the free concept from prescribed ones to improve clarity.
  2. [Experiments] Include error bars or confidence intervals on all skill metrics to support the no-loss-of-skill claim relative to baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point-by-point below and have revised the manuscript to provide the requested quantitative details and robustness checks.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that mixed supervision yields consistent mechanistic representations (lower cross-ensemble variance in latent structures) while baselines do not, despite similar predictive performance, is load-bearing for the contribution but lacks any quantitative metrics, variance values, or statistical tests in the provided description; full experimental results are required to evaluate whether the prescribed GFD concepts enforce structure or if consistency arises from the free concept alone.

    Authors: We appreciate this observation regarding the abstract. The full manuscript (Sections 4.2 and 5) already contains the quantitative metrics, including explicit cross-ensemble variance values for latent concept structures (OceanCBM exhibits approximately 65% lower variance than both baselines), standard deviations across 10 ensemble initializations, and statistical tests (paired t-tests, p < 0.01). Ablation experiments in the paper further isolate that the consistency is driven by the combination of prescribed GFD concepts and the free concept rather than the free concept in isolation. To make these results immediately visible, we have revised the abstract to incorporate key variance values and a brief reference to the supporting statistical evidence. revision: yes

  2. Referee: [Model and Methods] Model description: the supervision weighting and free concept capacity are free parameters (as noted in the axiom ledger); without ablations or sensitivity analysis showing robustness, the mechanistic consistency claim risks being an artifact of optimization dynamics rather than the prescribed concepts, directly testing the weakest assumption that the free concept remains residual.

    Authors: We agree that explicit sensitivity analysis is valuable for confirming that the observed consistency is attributable to the prescribed concepts. The original manuscript selects the supervision weighting (λ = 0.5) and free-concept capacity via the axiom ledger and validation performance, with the free concept intended to capture residuals. To directly test robustness, we have added a new sensitivity study (Appendix C) that varies supervision weighting over [0.1, 0.9] and free-concept capacity over [1, 4]. The results show that the reduction in cross-ensemble latent variance remains statistically significant across this range, supporting that the effect originates from the GFD concepts rather than optimization dynamics alone. We have updated the model description to reference these ablations and the residual behavior of the free concept. revision: yes

Circularity Check

1 steps flagged

The consistency of mechanistic representations may be an artifact of the free concept's capacity or optimization dynamics rather than the prescribed GFD concepts enforcing physical structure.

specific steps
  1. fitted input called prediction [Abstract]
    "Across ensemble initializations, we show that mixed supervision yields consistent mechanistic representations, whereas prediction-only and prescription-only baselines learn highly variable latent structures despite similar predictive performance."

    The reduced variance is reported as a result of the OceanCBM design, yet the free concept is fitted to residuals under the mixed-supervision objective; the consistency across initializations is therefore a direct statistical consequence of that fitting and loss weighting rather than an external demonstration that the prescribed concepts enforce mechanistic structure.

full rationale

The paper's central empirical claim—that mixed supervision produces lower cross-ensemble variance in latent structures than prediction-only or prescription-only baselines—is presented as evidence of mechanistic grounding. However, the free concept is explicitly learned from data to capture residuals under the mixed-supervision loss, and the supervision weighting itself is chosen to balance the terms. This makes the observed consistency a fitted outcome of the training process rather than an independent validation that the prescribed GFD concepts alone enforce physical structure. The prescribed concepts supply external content and prevent a higher score, but the free concept and loss balance introduce partial circularity by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The central claim rests on the relevance of geophysical fluid dynamics concepts for mixed layer heat content and on the utility of an additional learned free concept to handle residuals, with likely tuned parameters in the supervision scheme.

free parameters (2)
  • supervision weighting
    Balance between concept-level and prediction-level supervision is expected to be chosen or fitted during training.
  • free concept capacity
    Dimensionality or regularization strength of the free concept is selected to capture residuals.
axioms (1)
  • domain assumption Prescribed concepts from geophysical fluid dynamics accurately represent the primary drivers of mixed layer heat content.
    The bottleneck architecture assumes these concepts form a useful and sufficient intermediate representation.
invented entities (1)
  • free concept no independent evidence
    purpose: Captures residual physical processes not covered by the prescribed concepts and helps regularize concept predictions.
    Introduced as part of the model design to handle incompleteness in the prescribed concept set.

pith-pipeline@v0.9.0 · 5482 in / 1490 out tokens · 81344 ms · 2026-05-14T21:45:07.876482+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

  1. [1]

    Process- guided concept bottleneck model.arXiv preprint arXiv:2601.10562, 2026

    Reza M Asiyabi, SEOSAW Partnership, Steven Hancock, and Casey Ryan. Process- guided concept bottleneck model.arXiv preprint arXiv:2601.10562, 2026

  2. [2]

    Artificial intelligence for modeling and understanding extreme weather and climate events.Nature Communications, 16(1):1919, 2025

    Gustau Camps-Valls, Miguel-Ángel Fernández-Torres, Kai-Hendrik Cohrs, Adrian Höhl, Andrea Castelletti, Aytac Pacal, Claire Robin, Francesco Martinuzzi, Ioannis Papoutsis, Ioannis Prapas, et al. Artificial intelligence for modeling and understanding extreme weather and climate events.Nature Communications, 16(1):1919, 2025

  3. [3]

    A global overview of marine heatwaves in a changing climate.Communications Earth & Environment, 5(1):701, 2024

    Antonietta Capotondi, Regina R Rodrigues, Alex Sen Gupta, Jessica A Benthuysen, Clara Deser, Thomas L Frölicher, Nicole S Lovenduski, Dillon J Amaya, Natacha Le Grix, Tongtong Xu, et al. A global overview of marine heatwaves in a changing climate.Communications Earth & Environment, 5(1):701, 2024

  4. [4]

    Ke Chen, Glen Gawarkiewicz, Young-Oh Kwon, and Weifeng G Zhang. The role of atmospheric forcing versus ocean advection during the extreme warming of the northeast us continental shelf in 2012.Journal of Geophysical Research: Oceans, 120(6):4324–4339, 2015

  5. [5]

    Ke Chen, Glen G Gawarkiewicz, Steven J Lentz, and John M Bane. Diagnosing the warming of the northeastern us coastal ocean in 2012: A linkage between the atmospheric jet stream variability and ocean response.Journal of Geophysical Research: Oceans, 119(1):218–227, 2014

  6. [6]

    Machine learning with data assimilation and uncertainty quantification for dynamical systems: a review

    Sibo Cheng, César Quilodrán-Casas, Said Ouala, Alban Farchi, Che Liu, Pierre Tandeo, Ronan Fablet, Didier Lucor, Bertrand Iooss, Julien Brajard, et al. Machine learning with data assimilation and uncertainty quantification for dynamical systems: a review. IEEE/CAA Journal of Automatica Sinica, 10(6):1361–1387, 2023

  7. [7]

    Samudra: An ai global ocean emulator for climate.Geophysical Research Letters, 52(10):e2024GL114318, 2025

    Surya Dheeshjith, Adam Subel, Alistair Adcroft, Julius Busecke, Carlos Fernandez- Granda, Shubham Gupta, and Laure Zanna. Samudra: An ai global ocean emulator for climate.Geophysical Research Letters, 52(10):e2024GL114318, 2025

  8. [8]

    arXiv preprint arXiv:2509.12490 (2025)

    James PC Duncan, Elynn Wu, Surya Dheeshjith, Adam Subel, Troy Arcomano, Spencer K Clark, Brian Henn, Anna Kwa, Jeremy McGibbon, W Andre Perkins, et al. Samudrace: Fast and accurate coupled climate modeling with 3d ocean and atmosphere emulators.arXiv preprint arXiv:2509.12490, 2025

  9. [9]

    A machine learning explainability tutorial for atmospheric sciences.Artificial Intelligence for the Earth Systems, 3(1):e230018, 2024

    Montgomery L Flora, Corey K Potvin, Amy McGovern, and Shawn Handler. A machine learning explainability tutorial for atmospheric sciences.Artificial Intelligence for the Earth Systems, 3(1):e230018, 2024

  10. [10]

    Data-driven global ocean modeling for seasonal to decadal prediction.Science Advances, 11(33):eadu2488, 2025

    Zijie Guo, Pumeng Lyu, Fenghua Ling, Lei Bai, Jing-Jia Luo, Niklas Boers, Toshio Yam- agata, Takeshi Izumo, Sophie Cravatte, Antonietta Capotondi, et al. Data-driven global ocean modeling for seasonal to decadal prediction.Science Advances, 11(33):eadu2488, 2025

  11. [11]

    Interpreting black-box models: a review on explainable artificial intelligence.Cognitive Computation, 16(1):45–74, 2024

    Vikas Hassija, Vinay Chamola, Atmesh Mahapatra, Abhinandan Singal, Divyansh Goel, Kaizhu Huang, Simone Scardapane, Indro Spinelli, Mufti Mahmud, and Amir Hussain. Interpreting black-box models: a review on explainable artificial intelligence.Cognitive Computation, 16(1):45–74, 2024

  12. [12]

    The era5 global reanalysis.Quarterly journal of the royal meteorological society, 146(730):1999–2049, 2020

    Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. The era5 global reanalysis.Quarterly journal of the royal meteorological society, 146(730):1999–2049, 2020. 10

  13. [13]

    A hierarchical approach to defining marine heatwaves.Progress in oceanography, 141:227–238, 2016

    Alistair J Hobday, Lisa V Alexander, Sarah E Perkins, Dan A Smale, Sandra C Straub, Eric CJ Oliver, Jessica A Benthuysen, Michael T Burrows, Markus G Donat, Ming Feng, et al. A hierarchical approach to defining marine heatwaves.Progress in oceanography, 141:227–238, 2016

  14. [14]

    Causally-informed deep learning to improve climate models and projections.Journal of Geophysical Research: Atmospheres, 129(4):e2023JD039202, 2024

    FernandoIglesias-Suarez, PierreGentine, BreixoSolino-Fernandez, TomBeucler, Michael Pritchard, Jakob Runge, and Veronika Eyring. Causally-informed deep learning to improve climate models and projections.Journal of Geophysical Research: Atmospheres, 129(4):e2023JD039202, 2024

  15. [15]

    Towards neural earth system modelling by integrating artificial intelligence in earth system science.Nature Machine Intelligence, 3(8):667–674, 2021

    Christopher Irrgang, Niklas Boers, Maike Sonnewald, Elizabeth A Barnes, Christopher Kadow, Joanna Staneva, and Jan Saynisch-Wagner. Towards neural earth system modelling by integrating artificial intelligence in earth system science.Nature Machine Intelligence, 3(8):667–674, 2021

  16. [16]

    Physics-informed machine learning.Nature Reviews Physics, 3(6):422–440, 2021

    George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning.Nature Reviews Physics, 3(6):422–440, 2021

  17. [17]

    Concept bottleneck models

    Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. InInternational conference on machine learning, pages 5338–5348. PMLR, 2020

  18. [18]

    Characterizing possible failure modes in physics-informed neural networks

    Aditi Krishnapriyan, Amir Gholami, Shandian Zhe, Robert Kirby, and Michael W Mahoney. Characterizing possible failure modes in physics-informed neural networks. Advances in neural information processing systems, 34:26548–26560, 2021

  19. [19]

    Learningskillfulmedium-rangeglobalweatherforecasting.Science, 382(6677):1416– 1421, 2023

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, etal. Learningskillfulmedium-rangeglobalweatherforecasting.Science, 382(6677):1416– 1421, 2023

  20. [20]

    arXiv preprint arXiv:2406.01465 (2024 )

    Simon Lang, Mihai Alexe, Matthew Chantry, Jesper Dramsch, Florian Pinault, Baudouin Raoult, Mariana CA Clare, Christian Lessig, Michael Maier-Gerber, Linus Magnusson, et al. Aifs–ecmwf’s data-driven forecasting system.arXiv preprint arXiv:2406.01465, 2024

  21. [21]

    Do concept bottleneck models learn as intended?arXiv preprint arXiv:2105.04289, 2021

    Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. Do concept bottleneck models learn as intended?arXiv preprint arXiv:2105.04289, 2021

  22. [22]

    An unprecedented coastwide toxic algal bloom linked to anomalous ocean conditions.Geophysical research letters, 43(19):10–366, 2016

    Ryan M McCabe, Barbara M Hickey, Raphael M Kudela, Kathi A Lefebvre, Nicolaus G Adams, Brian D Bill, Frances MD Gulland, Richard E Thomson, William P Cochlan, and Vera L Trainer. An unprecedented coastwide toxic algal bloom linked to anomalous ocean conditions.Geophysical research letters, 43(19):10–366, 2016

  23. [23]

    Fisheries management in a changing climate: lessons from the 2012 ocean heat wave in the northwest atlantic.Oceanography, 26(2):191–195, 2013

    Katherine E Mills, Andrew J Pershing, Curtis J Brown, Yong Chen, Fu-Sung Chiang, Daniel S Holland, Sigrid Lehuta, Janet A Nye, Jenny C Sun, Andrew C Thomas, et al. Fisheries management in a changing climate: lessons from the 2012 ocean heat wave in the northwest atlantic.Oceanography, 26(2):191–195, 2013

  24. [24]

    Skill scores and correlation coefficients in model verification.Monthly weather review, 117(3):572–582, 1989

    Allan H Murphy and Edward S Epstein. Skill scores and correlation coefficients in model verification.Monthly weather review, 117(3):572–582, 1989

  25. [25]

    Label-free concept bottleneck models.arXiv preprint arXiv:2304.06129, 2023

    Tuomas Oikarinen, Subhro Das, Lam M Nguyen, and Tsui-Wei Weng. Label-free concept bottleneck models.arXiv preprint arXiv:2304.06129, 2023

  26. [26]

    Data driven deep learning for correcting global climate model projections of sst and dsl in the bay of bengal.arXiv preprint arXiv:2504.20620, 2025

    Abhishek Pasula and Deepak N Subramani. Data driven deep learning for correcting global climate model projections of sst and dsl in the bay of bengal.arXiv preprint arXiv:2504.20620, 2025. 11

  27. [27]

    Synergistic impact of marine heat waves and rapid inten- sification exacerbates tropical cyclone destructive power worldwide.Science Advances, 12(15):eadu1733, 2026

    Soheil Radfar, Ehsan Foroumandi, Hamed Moftakhari, Hamid Moradkhani, Alex Sen Gupta, and Gregory R Foltz. Synergistic impact of marine heat waves and rapid inten- sification exacerbates tropical cyclone destructive power worldwide.Science Advances, 12(15):eadu1733, 2026

  28. [28]

    Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.Journal of Computational physics, 378:686–707, 2019

  29. [29]

    Ocean dynamics shape marine heatwaves and their predictability.Nature Communications, 2026

    Xianglin Ren, Wei Liu, and Liping Zhang. Ocean dynamics shape marine heatwaves and their predictability.Nature Communications, 2026

  30. [30]

    Northern shrimp pandalus borealis popu- lation collapse linked to climate-driven shifts in predator distribution.PLoS One, 16(7):e0253914, 2021

    R Anne Richards and Margaret Hunter. Northern shrimp pandalus borealis popu- lation collapse linked to climate-driven shifts in predator distribution.PLoS One, 16(7):e0253914, 2021

  31. [31]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  32. [32]

    Interpretable machine learning: Fundamental principles and 10 grand challenges

    Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges. Statistic Surveys, 16:1–85, 2022

  33. [33]

    San Diego Supercomputer Center. Expanse. University of California San Diego, 2025

  34. [34]

    Concept bottleneck model with additional unsupervised concepts.IEEE Access, 10:41758–41765, 2022

    Yoshihide Sawada and Keigo Nakamura. Concept bottleneck model with additional unsupervised concepts.IEEE Access, 10:41758–41765, 2022

  35. [35]

    Rethinking interpretability in the era of large language models.arXiv preprint arXiv:2402.01761, 2024

    Chandan Singh, Jeevana Priya Inala, Michel Galley, Rich Caruana, and Jianfeng Gao. Rethinking interpretability in the era of large language models.arXiv preprint arXiv:2402.01761, 2024

  36. [36]

    Revealing the impact of global heating on north atlantic circulation using transparent machine learning.Journal of Advances in Modeling Earth Systems, 13(8):e2021MS002496, 2021

    Maike Sonnewald and Redouane Lguensat. Revealing the impact of global heating on north atlantic circulation using transparent machine learning.Journal of Advances in Modeling Earth Systems, 13(8):e2021MS002496, 2021

  37. [37]

    Bridging observations, theory and numerical simulation of the ocean using machine learning.Environmental Research Letters, 16(7):073008, 2021

    Maike Sonnewald, Redouane Lguensat, Daniel C Jones, Peter D Dueben, Julien Brajard, and Venkatramani Balaji. Bridging observations, theory and numerical simulation of the ocean using machine learning.Environmental Research Letters, 16(7):073008, 2021

  38. [38]

    What makes a marine heatwave forecast useable, useful and used?Progress in Oceanography, 234:103464, 2025

    Claire M Spillman, Alistair J Hobday, Erik Behrens, Ming Feng, Antonietta Capotondi, Sophie Cravatte, Neil J Holbrook, and Alex Sen Gupta. What makes a marine heatwave forecast useable, useful and used?Progress in Oceanography, 234:103464, 2025

  39. [39]

    Frequent marine heatwaves hidden below the surface of the global ocean.Nature Geoscience, 16(12):1099–1104, 2023

    Di Sun, Furong Li, Zhao Jing, Shijian Hu, and Bohai Zhang. Frequent marine heatwaves hidden below the surface of the global ocean.Nature Geoscience, 16(12):1099–1104, 2023

  40. [40]

    Trusting machine learning with physics: A fidelity verification framework for complex systems.Authorea Preprints, 2026

    Sanah Suri and Maike Sonnewald. Trusting machine learning with physics: A fidelity verification framework for complex systems.Authorea Preprints, 2026

  41. [41]

    Interpretability for time series transformers using a concept bottleneck framework.arXiv preprint arXiv:2410.06070, 2024

    Angela van Sprang, Erman Acar, and Willem Zuidema. Interpretability for time series transformers using a concept bottleneck framework.arXiv preprint arXiv:2410.06070, 2024

  42. [42]

    Closing the sea surface mixed layer temperature budget from in situ observations alone: Operation advection during bobble.Scientific reports, 10(1):7062, 2020

    V Vijith, PN Vinayachandran, Benjamin GM Webber, Adrian J Matthews, Jenson V George, Vijay Kumar Kannaujia, Aneesh A Lotliker, and P Amol. Closing the sea surface mixed layer temperature budget from in situ observations alone: Operation advection during bobble.Scientific reports, 10(1):7062, 2020

  43. [43]

    Integrating scientific knowledge with machine learning for engineering and environmental systems.ACM Computing Surveys, 55(4):1–37, 2022

    Jared Willard, Xiaowei Jia, Shaoming Xu, Michael Steinbach, and Vipin Kumar. Integrating scientific knowledge with machine learning for engineering and environmental systems.ACM Computing Surveys, 55(4):1–37, 2022. 12

  44. [44]

    Southern ocean dynamics under climate change: New knowledge through physics-guided machine learning.arXiv preprint arXiv:2310.13916, 2023

    William Yik, Maike Sonnewald, Mariana CA Clare, and Redouane Lguensat. Southern ocean dynamics under climate change: New knowledge through physics-guided machine learning.arXiv preprint arXiv:2310.13916, 2023

  45. [45]

    Post-hoc concept bottleneck models

    Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc concept bottleneck models. arXiv preprint arXiv:2205.15480, 2022

  46. [46]

    The ecmwf operational ensemble reanalysis–analysis system for ocean and sea ice: a description of the system and assessment.Ocean science, 15(3):779–808, 2019

    Hao Zuo, Magdalena Alonso Balmaseda, Steffen Tietsche, Kristian Mogensen, and Michael Mayer. The ecmwf operational ensemble reanalysis–analysis system for ocean and sea ice: a description of the system and assessment.Ocean science, 15(3):779–808, 2019. 13 A Limitations While our results are promising, there are certain limitations and assumptions we ackno...