pith. machine review for the scientific record.

arxiv: 2605.03802 · v1 · submitted 2026-05-05 · ⚛️ physics.ao-ph · cs.LG

Recognition: unknown

Towards accurate extreme event likelihoods from diffusion model climate emulators


Pith reviewed 2026-05-07 12:35 UTC · model grok-4.3

keywords diffusion models · climate emulators · tropical cyclones · extreme events · probability density · importance sampling · atmospheric modeling · guidance

The pith

Diffusion model climate emulators quantify extreme event likelihoods by comparing guided and unguided probability densities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion-based climate emulators approximate the probability density of atmospheric states and can therefore support calculation of likelihoods for rare events. Guiding the model toward tropical cyclones at chosen locations produces density values that, when compared to the unguided case, yield odds ratios measuring the boost from guidance. These ratios then drive importance sampling, which estimates the probability of such events with lower statistical error than simple Monte Carlo sampling from the emulator. The approach matters for cost-efficient scenario planning because it lets users probe the tails of the distribution without generating vast numbers of random samples. The authors also examine how the same densities might be applied in extreme-event attribution experiments.

Core claim

Diffusion models such as cBottle approximate the probability density of their training data. When the model is guided toward states that include tropical cyclones, the ratio of the guided to the unguided probability density quantifies how much more likely guidance has made the cyclone. These odds ratios then enable importance sampling from the TC distribution, which reduces the standard error of the probability estimate relative to ordinary Monte Carlo sampling.

What carries the argument

The odds ratio between guided and unguided model probability densities, which reweights samples to importance-sample the distribution of extreme events.
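
In symbols, the reweighting can be sketched with the standard importance-sampling identity. The notation is an assumption made for illustration and is not taken from the paper: $p_g$ and $p_u$ denote the guided and unguided model densities, $o(x) = p_g(x)/p_u(x)$ the odds ratio, and $A$ the set of states that contain the event.

```latex
\begin{aligned}
P_u(A) &= \int \mathbf{1}_A(x)\, p_u(x)\, dx
        = \int \mathbf{1}_A(x)\, \frac{p_u(x)}{p_g(x)}\, p_g(x)\, dx
        = \mathbb{E}_{x \sim p_g}\!\left[ \frac{\mathbf{1}_A(x)}{o(x)} \right], \\
\hat{P}_N &= \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbf{1}_A(x_i)}{o(x_i)},
\qquad x_i \sim p_g .
\end{aligned}
```

The identity requires $p_g(x) > 0$ wherever $\mathbf{1}_A(x)\, p_u(x) > 0$. The estimator is unbiased only to the extent that both densities are evaluated accurately, which is the premise the rest of this page examines.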

If this is right

  • Fewer emulator runs are needed to obtain reliable probability estimates for rare events.
  • Guidance can target specific locations or boundary conditions while still delivering corrected likelihoods.
  • Model densities open a route to attribution-style calculations that compare event likelihoods under different forcings.
  • Emulators shift from pure generation tools to sources of quantitative probabilistic information.
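
The first point can be illustrated with a toy problem. The sketch below is not the paper's setup: it assumes a one-dimensional standard normal as a stand-in for the unguided emulator, a unit-variance normal shifted to the threshold as a stand-in for the guided one, and an exceedance `x > 3` as the "rare event". All function names are hypothetical.

```python
import math
import numpy as np

def log_normal_pdf(x, mean, std=1.0):
    """Log density of N(mean, std^2), evaluated elementwise."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def mc_estimate(rng, n, threshold):
    """Plain Monte Carlo: sample the unguided density and count exceedances."""
    x = rng.normal(0.0, 1.0, size=n)
    hits = (x > threshold).astype(float)
    return hits.mean(), hits.std(ddof=1) / math.sqrt(n)

def is_estimate(rng, n, threshold, shift):
    """Importance sampling: sample the shifted ("guided") density and
    divide each exceedance by the odds ratio p_guided / p_unguided."""
    x = rng.normal(shift, 1.0, size=n)
    log_odds = log_normal_pdf(x, shift) - log_normal_pdf(x, 0.0)
    terms = (x > threshold) * np.exp(-log_odds)
    return terms.mean(), terms.std(ddof=1) / math.sqrt(n)

rng = np.random.default_rng(0)
threshold, n = 3.0, 20_000
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))  # analytic tail probability
mc_p, mc_se = mc_estimate(rng, n, threshold)
is_p, is_se = is_estimate(rng, n, threshold, shift=threshold)
print(f"exact={exact:.3e}  MC={mc_p:.3e} +/- {mc_se:.1e}  IS={is_p:.3e} +/- {is_se:.1e}")
```

With the same number of samples, the importance-sampled standard error in this toy case is much smaller than the plain Monte Carlo one, because nearly every guided sample lands near the event. Whether the same holds for an emulator depends on how accurately the odds ratios can be evaluated.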

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same density-ratio technique could be tested on other extremes such as heat waves or extreme precipitation.
  • Combining the approach with observational constraints might improve calibration of the underlying density estimate.
  • If the approximation remains reliable, the method could lower the cost of tail-risk assessment in long climate projections.

Load-bearing premise

The diffusion model accurately approximates the true probability density of atmospheric states and the guidance mechanism does not distort that density in ways that invalidate likelihood comparisons for extremes.

What would settle it

Importance-sampled estimates of tropical cyclone occurrence rates that systematically diverge from rates obtained by a much larger set of unguided Monte Carlo runs or from direct observational records would falsify the accuracy claim.
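
One way such a comparison could be scored is sketched below. It assumes that each importance-sampled estimate and each reference estimate comes with a standard error and that the estimates are independent; neither assumption is confirmed by the paper. The function name is hypothetical and the numbers are illustrative placeholders, not results.

```python
import numpy as np

def pooled_z_score(p_is, se_is, p_ref, se_ref):
    """Combine per-location z-scores into one statistic.

    Under the null hypothesis of no systematic divergence (and assuming
    independent, roughly Gaussian errors) the result is approximately N(0, 1).
    """
    p_is, se_is, p_ref, se_ref = (
        np.asarray(a, dtype=float) for a in (p_is, se_is, p_ref, se_ref)
    )
    z = (p_is - p_ref) / np.sqrt(se_is**2 + se_ref**2)
    return z.sum() / np.sqrt(z.size)

# Illustrative placeholder values for three hypothetical locations.
p_ref = np.array([1.35e-3, 2.00e-3, 0.85e-3])
se_ref = np.array([0.10e-3, 0.12e-3, 0.09e-3])
se_is = np.array([0.05e-3, 0.08e-3, 0.04e-3])

consistent = pooled_z_score([1.30e-3, 2.10e-3, 0.80e-3], se_is, p_ref, se_ref)
biased = pooled_z_score(2.0 * p_ref, se_is, p_ref, se_ref)
print(f"consistent case: {consistent:.2f}, biased case: {biased:.2f}")
```

A pooled score far from zero would point to a bias shared across locations, which is the kind of systematic divergence described above. Scatter that averages out would not.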

Figures

Figures reproduced from arXiv: 2605.03802 by Julius Berner, Karthik Kashinath, Mike Pritchard, Noah Brenowitz, Peter Manshausen.

Figure 1: Visualization of the PF ODE. To sample from the data distribution, we would draw a latent from the prior on the right and flow along the ODE velocity to the left. For calculating the probability of a sample, we do the opposite: flowing from the sample back to noise, and integrating the divergence of the velocity along the path. When doing this under two models’ (guided and unguided) velocity fields, the di…
Figure 2: Impact of guidance interval on guidance divergence and samples. Shown is a guided sample using fixed SST and starting noise, denoised with guidance that is switched on in the interval given in each column’s title. The top row plots the instantaneous divergence of the guidance term at each noise level, which integrated over the denoising time gives −∆guidance. The bottom row shows the resulting zonal wind i…
Figure 3: Distribution of TC probabilities and log ORs from different guidance strengths, motivating the use of small guidance. Shown are distributions of a) the classifier evaluated on the final sample, i.e. the detection probability of a TC at the location where guidance was applied, b) the log odds ratio log o(x) of the final sample, which is the sum of c) the difference of final log probabilities ∆latent d) the (…
Figure 4: Comparison of ERA5 and cBottle TC climatology and importance-sampled TC likelihoods. Panel a) shows the estimates of the exceedance frequency of different detection thresholds. We are counting a detection when the TC classifier returns a probability above the threshold in the location of interest. The frequency of detection is computed from different data sets: The guided cBottle model (oversampled), the i…
Figure 5: Calculating likelihoods for ERA5 reanalysis samples, here for heat events in the Antarctic. Panel a) shows a visualization of the heatwave surface temperature anomaly, on 17 March 2022 at 00:00 UTC from ERA5. The red circle shows the unmasked region of interest. Panel b) shows the seasonal cycle of sample probabilities over the entire train set, with the shaded area showing the standard deviation. Panel c)…
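
The likelihood calculation described in the Figure 1 caption (flow from the sample to noise and integrate the divergence of the velocity along the path) can be checked on a toy case where the answer is known. The sketch below assumes a one-dimensional Gaussian "data" distribution and a noise schedule in which the noise standard deviation equals `t`. Both are illustrative choices, not cBottle's configuration, and all names are hypothetical. It uses the instantaneous change-of-variables relation log p_0(x_0) = log p_T(x_T) + ∫ div v dt.

```python
import math
import numpy as np

S0 = 1.5   # std of the toy data distribution N(0, S0^2) (assumption)
T = 10.0   # largest noise level; the marginal at T is N(0, S0^2 + T^2)

def velocity(x, t):
    # Probability-flow ODE for x_t = x_0 + t * eps:
    # dx/dt = -t * score(x, t), with the exact score -x / (S0^2 + t^2).
    return t * x / (S0**2 + t**2)

def divergence(x, t):
    # In one dimension the divergence is d(velocity)/dx.
    return t / (S0**2 + t**2)

def rhs(state, t):
    # Augmented system: the position and the running divergence integral.
    x = state[0]
    return np.array([velocity(x, t), divergence(x, t)])

def log_normal_pdf(x, std):
    return -0.5 * (x / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def log_likelihood(x0, n_steps=2000):
    """Flow x0 from t=0 to t=T with RK4 while accumulating the divergence."""
    h = T / n_steps
    state = np.array([x0, 0.0])
    for i in range(n_steps):
        t = i * h
        k1 = rhs(state, t)
        k2 = rhs(state + 0.5 * h * k1, t + 0.5 * h)
        k3 = rhs(state + 0.5 * h * k2, t + 0.5 * h)
        k4 = rhs(state + h * k3, t + h)
        state = state + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    x_T, div_integral = state
    return log_normal_pdf(x_T, math.sqrt(S0**2 + T**2)) + div_integral

x0 = 2.0
numeric = log_likelihood(x0)
analytic = log_normal_pdf(x0, S0)
print(f"numeric={numeric:.8f}  analytic={analytic:.8f}")
```

In a high-dimensional emulator the velocity comes from a trained network and the divergence is usually estimated rather than computed exactly, so the agreement seen in this toy case should not be read as evidence about the accuracy of the paper's density estimates.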
Original abstract

ML climate model emulators are useful for scenario planning and adaptation, allowing for cost-efficient experimentation. Recently, the diffusion model Climate in a Bottle (cBottle) has been proposed for generation of atmospheric states compatible with boundary conditions of solar position and sea surface temperatures. Crucially, cBottle can be guided to generate extreme events such as Tropical Cyclones (TCs) over locations of interest. Diffusion models such as cBottle work by approximating the probability density of the training data. Here, we show use cases of the probability density estimates of atmospheric states obtained from this climate emulator. Most importantly, these estimates allow us to calculate likelihoods of extreme events under guidance. When guiding the model towards states including TCs, comparing the probability density under the guided and unguided model enables us to quantify how much more likely the guidance has made the TC. We show how these odds ratios allow us to importance-sample from the TC distribution, reducing the standard error of the probability estimate compared to simple Monte Carlo sampling. Furthermore, we discuss results and limitations of the application of model probability densities to extreme event attribution-like experiments. We present these early but encouraging results hoping they will spur more research into probabilistic information that can be gained from diffusion models of the atmosphere.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes using probability density estimates from the diffusion model climate emulator cBottle to compute likelihoods of extreme events such as tropical cyclones (TCs) under guidance. By comparing the density under guided versus unguided conditions, odds ratios are derived to enable importance sampling from the TC distribution, which is claimed to reduce the standard error of probability estimates relative to simple Monte Carlo sampling. The work further discusses applications and limitations of these densities for extreme event attribution-like experiments.

Significance. If the density approximations hold for rare events, the approach could provide an efficient means to extract probabilistic information on extremes directly from climate emulators, reducing the need for large ensembles in scenario planning and attribution. The method capitalizes on the generative and density-estimating properties of diffusion models in a novel way for atmospheric science, though its practical impact depends on validation of the tail estimates.

major comments (3)
  1. [Abstract] The central claim that odds ratios from guided/unguided densities enable importance sampling with reduced standard error is presented without any quantitative demonstration (e.g., reported variance reduction factors, effective sample sizes, or direct Monte Carlo comparisons), leaving the practical benefit of the method unverified.
  2. [Methods (density estimation and guidance)] The approach requires evaluating p_unguided(x) at rare TC states x drawn from the guided distribution. No details are given on the likelihood computation method (e.g., probability-flow ODE integration or ELBO) nor any calibration of absolute or relative density values against empirical frequencies, which is load-bearing because diffusion likelihood approximations are known to have larger relative errors in low-density regions.
  3. [Results and Discussion] The discussion of results and limitations for extreme event attribution-like experiments does not include any benchmark or sensitivity test confirming that guidance does not distort the density ratios in ways that bias the importance weights, undermining the reliability of the derived likelihoods for rare events.
minor comments (1)
  1. [Title] The title uses 'accurate' while the abstract describes 'early but encouraging results'; aligning the title with the preliminary nature of the validation would improve clarity.
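
The diagnostics named in major comment 1 are cheap to compute once importance weights are available. The sketch below shows two of them under stated assumptions: the weights are supplied as log values (for example, negative log odds ratios), the effective sample size is Kish's formula, and the variance reduction factor is the squared ratio of standard errors at equal sample count. The function names are hypothetical and nothing here reproduces a calculation from the paper.

```python
import numpy as np

def effective_sample_size(log_weights):
    """Kish effective sample size, (sum w)^2 / sum(w^2), from log weights."""
    lw = np.asarray(log_weights, dtype=float)
    w = np.exp(lw - lw.max())  # rescaling leaves the ratio unchanged
    return w.sum() ** 2 / (w**2).sum()

def variance_reduction_factor(se_monte_carlo, se_importance):
    """How many times more plain Monte Carlo samples would be needed
    to match the importance-sampled standard error."""
    return (se_monte_carlo / se_importance) ** 2

ess_uniform = effective_sample_size(np.zeros(100))               # equal weights
ess_degenerate = effective_sample_size([0.0, -50.0, -50.0, -50.0])  # one dominant weight
vrf = variance_reduction_factor(2.0e-4, 5.0e-5)                  # placeholder inputs
print(ess_uniform, ess_degenerate, vrf)
```

An effective sample size close to the number of samples indicates well-behaved weights. A value near one means a single sample dominates the estimate, which is the failure mode that major comments 2 and 3 are concerned about.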

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our work. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our methods and results.

Point-by-point responses
  1. Referee: [Abstract] The central claim that odds ratios from guided/unguided densities enable importance sampling with reduced standard error is presented without any quantitative demonstration (e.g., reported variance reduction factors, effective sample sizes, or direct Monte Carlo comparisons), leaving the practical benefit of the method unverified.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the claimed reduction in standard error. The manuscript illustrates the importance sampling approach through examples of TC generation, but does not report specific metrics such as variance reduction factors or effective sample sizes. In the revised version, we will update the abstract to include these quantitative results and expand the results section with direct Monte Carlo comparisons. revision: yes

  2. Referee: [Methods (density estimation and guidance)] The approach requires evaluating p_unguided(x) at rare TC states x drawn from the guided distribution. No details are given on the likelihood computation method (e.g., probability-flow ODE integration or ELBO) nor any calibration of absolute or relative density values against empirical frequencies, which is load-bearing because diffusion likelihood approximations are known to have larger relative errors in low-density regions.

    Authors: We will revise the Methods section to provide explicit details on the likelihood computation, specifying the probability-flow ODE integration approach used to evaluate the densities. We will also add a calibration subsection comparing the estimated densities (both absolute and relative) to empirical frequencies derived from the training and validation data, with particular attention to behavior in low-density regions relevant to rare events. revision: yes

  3. Referee: [Results and Discussion] The discussion of results and limitations for extreme event attribution-like experiments does not include any benchmark or sensitivity test confirming that guidance does not distort the density ratios in ways that bias the importance weights, undermining the reliability of the derived likelihoods for rare events.

    Authors: We acknowledge that the current discussion of limitations is primarily qualitative. To address this, we will add benchmark and sensitivity tests in the revised Results and Discussion sections. These will include comparisons of density ratios with and without guidance for both common and rare events, along with checks on the stability of the resulting importance weights. revision: yes

Circularity Check

0 steps flagged

No circularity: likelihood ratios derived from model's own density estimates without reduction to inputs

full rationale

The paper proposes applying the diffusion model's learned probability density p(x) (approximated via the score function) to compute odds ratios p_guided(x)/p_unguided(x) for importance sampling of TCs. This is a direct use of the model's internal density estimator under different conditioning, not a self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The derivation chain starts from the standard diffusion model training objective and guidance mechanism (already established in prior literature) and applies it to extreme-event likelihoods without equations that equate the output back to the fitted inputs by construction. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known empirical patterns are invoked as load-bearing steps. The method is self-contained against the model's own density approximation, with acknowledged limitations on tail accuracy.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract, the central claim rests on the assumption that the diffusion model provides reliable density estimates for both guided and unguided cases.

axioms (1)
  • domain assumption: Diffusion models can accurately approximate the probability density function of the training data distribution.
    Stated in the abstract as the basis for using the estimates.

pith-pipeline@v0.9.0 · 5533 in / 1246 out tokens · 46430 ms · 2026-05-07T12:35:06.973130+00:00 · methodology

