pith. machine review for the scientific record.

arxiv: 2604.20052 · v2 · submitted 2026-04-21 · 📊 stat.CO

Recognition: unknown

Annealed Langevin Monte Carlo for Flow ODE Sampling

Hanwen Huang

Pith reviewed 2026-05-10 00:15 UTC · model grok-4.3

classification 📊 stat.CO
keywords Annealed Langevin Monte Carlo · Flow ODE sampling · Jarzynski identity · Multimodal distributions · Importance sampling · Velocity field estimation · Stochastic interpolants · Markov chain Monte Carlo

The pith

Annealed Langevin Monte Carlo with Jarzynski reweighting achieves O(1/n) mean squared error in flow ODE velocity estimation for multimodal targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces ALMC-ODE, a method that runs annealed Langevin Markov chains across intermediate distributions to generate particles for estimating the velocity field of a probability-flow ODE that maps a Gaussian to the target density. The core contribution is a general Jarzynski-type reweighting identity for time-inhomogeneous kernels, along with the optimal backward kernel and a proof that the velocity estimator has mean squared error of order 1/n. A reader would care if this holds because it offers a principled way to handle the variance issues in importance sampling for multimodal distributions where standard MCMC fails. The method is shown to outperform direct Monte Carlo and Hamiltonian Monte Carlo on benchmarks like Gaussian mixtures and high-dimensional field systems.

Core claim

The paper proposes using an annealed Langevin Markov chain to evolve particles through intermediate distributions bridging a Gaussian reference to the target, then applies Jarzynski reweighting to obtain a low-variance estimator of the ODE velocity field. It establishes a Jarzynski-type identity for general time-inhomogeneous transition kernels, characterizes the optimal backward kernel minimizing weight variance, and proves an O(1/n) MSE bound for the estimator. Numerical experiments show substantial gains over both direct Monte Carlo ODE sampling and HMC on multimodal targets.

What carries the argument

Annealed Langevin Markov chain with Jarzynski-based importance reweighting to estimate the velocity field of the stochastic-interpolant probability-flow ODE.
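The mechanism can be sketched in a few lines. This is an illustrative toy, not the paper's ALMC-ODE implementation: the bimodal target, the geometric annealing path, the finite-difference gradients, and the step sizes are all assumptions chosen for brevity. The weights follow the standard annealed-importance-sampling form of the Jarzynski identity (accumulate the work increment when the distribution is switched, then take one Langevin step at the new temperature).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Toy bimodal target: equal mixture of N(-3, 1) and N(+3, 1).
    return (np.logaddexp(-0.5 * (x + 3.0)**2, -0.5 * (x - 3.0)**2)
            - 0.5 * np.log(2 * np.pi) - np.log(2.0))

def log_ref(x):
    # Standard Gaussian reference at t = 0.
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def grad_log_pi(x, t, h=1e-5):
    # Gradient of the geometric path log pi_t = (1-t) log pi_0 + t log pi_1,
    # via central finite differences for brevity.
    f = lambda y: (1 - t) * log_ref(y) + t * log_target(y)
    return (f(x + h) - f(x - h)) / (2 * h)

n, K, eps = 2000, 100, 0.05        # particles, annealing steps, Langevin step size
x = rng.standard_normal(n)         # start at the reference distribution
logw = np.zeros(n)                 # accumulated Jarzynski-style log weights
ts = np.linspace(0.0, 1.0, K + 1)
for t0, t1 in zip(ts[:-1], ts[1:]):
    # Work increment for switching the distribution from t0 to t1 ...
    logw += (t1 - t0) * (log_target(x) - log_ref(x))
    # ... then one unadjusted Langevin step targeting pi_{t1}.
    x = x + eps * grad_log_pi(x, t1) + np.sqrt(2 * eps) * rng.standard_normal(n)

w = np.exp(logw - logw.max())
w /= w.sum()
ess = 1.0 / np.sum(w**2)           # effective sample size of the weights
mean_est = np.sum(w * x)           # weighted estimate of E[X] (0 by symmetry)
```

In ALMC-ODE the analogous weighted particles feed the velocity-field estimate of the flow ODE rather than a posterior expectation; the weighting identity is the shared ingredient.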

If this is right

  • The resulting velocity-field estimator has mean squared error that scales as O(1/n) with the number of particles.
  • The optimal backward kernel minimizes the variance of the importance weights.
  • ALMC-ODE outperforms direct Monte Carlo ODE sampling and Hamiltonian Monte Carlo on highly multimodal distributions such as Gaussian mixtures and Allen-Cahn systems.
  • Importance-weighted particles from the annealed chains provide reliable estimates for the continuous transport in the ODE.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reweighting identity could be applied to other sampling algorithms involving time-dependent transitions beyond ODE flows.
  • Combining this with existing flow-matching techniques might further reduce computational cost in high dimensions.
  • Testing the variance bound on targets with known but extreme multimodality, such as mixtures with many separated modes, would provide direct validation.

Load-bearing premise

The annealed Langevin chains can be run with intermediate distributions that keep importance-weight variance manageable even when the target is highly multimodal.

What would settle it

A numerical experiment where the observed mean squared error of the velocity estimator fails to decrease proportionally to 1/n, or where importance weight variance explodes for a multimodal target, would falsify the theoretical guarantees.
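Half of that test is cheap to run on a toy problem: for any self-normalized importance-sampling estimator with square-integrable weights, the log-log slope of MSE against n should sit near −1. The Gaussian proposal/target pair below is an illustrative assumption, not one of the paper's benchmarks; a persistent departure of the fitted slope from −1 on a multimodal target would be the falsifying signal.

```python
import numpy as np

rng = np.random.default_rng(1)

def snis_estimate(n):
    # Self-normalized importance sampling: proposal N(0, 2^2), target N(0, 1),
    # estimating E[X^2] = 1 under the target. Weights are bounded, hence
    # square-integrable, so the O(1/n) MSE rate applies.
    x = rng.normal(0.0, 2.0, size=n)
    logw = -0.5 * x**2 - (-0.5 * (x / 2.0)**2 - np.log(2.0))  # log target - log proposal
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return np.sum(w * x**2)

ns = np.array([200, 800, 3200, 12800])
mse = np.array([np.mean([(snis_estimate(n) - 1.0)**2 for _ in range(200)])
                for n in ns])
slope = np.polyfit(np.log(ns), np.log(mse), 1)[0]  # an O(1/n) rate gives slope near -1
```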

Figures

Figures reproduced from arXiv: 2604.20052 by Hanwen Huang.

Figure 1
Figure 1. Scatter plots of 10,000 samples from the 2-dimensional 20-component Gaussian mixture.
Figure 2
Figure 2. Samples projected onto the first two dimensions for seed 81. Left: ground truth (all 5 …
Figure 3
Figure 3. Representative field configurations sampled by each method. Both ϕ+ and ϕ− modes (configurations close to ϕ ≈ +1 and ϕ ≈ −1, respectively) are visually identifiable in the ALMC-ODE output; HMC samples collapse to a single polarity, confirming mode trapping. The MC-ODE estimator produces numerically degenerate samples in this d = 64 setting (its KSD is undefined).
Original abstract

We propose Annealed Langevin Monte Carlo for Flow ODE Sampling (ALMC-ODE), a method for generating samples from unnormalized target distributions, with a particular emphasis on multimodal densities that are challenging for standard Markov chain Monte Carlo methods. ALMC-ODE is based on a probability-flow ordinary differential equation (ODE) derived from stochastic interpolants, which continuously transports a standard Gaussian reference distribution at $t = 0$ to the target distribution $\rho$ at $t = 1$. The key innovation lies in an annealed Langevin Markov chain that evolves through a sequence of intermediate distributions bridging the reference and the target. The resulting importance-weighted particles, reweighted via a Jarzynski-based scheme, yield a low-variance estimator of the velocity field governing the ODE. On the theoretical side, we establish a Jarzynski-type reweighting identity for general time-inhomogeneous transition kernels, characterize the optimal backward kernel that minimizes the variance of the importance weights, and prove an $\mathcal{O}(1/n)$ mean squared error bound for the resulting velocity-field estimator. Numerical experiments on challenging benchmarks, including Gaussian mixture models and a 64-dimensional Allen--Cahn field system, demonstrate that ALMC-ODE significantly outperforms both direct Monte Carlo ODE approaches and Hamiltonian Monte Carlo when applied to highly multimodal target distributions.
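The transport the abstract describes can be made concrete in the one case where the velocity field is available in closed form: a linear stochastic interpolant between two Gaussians. The target parameters below are illustrative assumptions; the paper's contribution is precisely estimating this velocity when no closed form exists.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 2.0, 0.5               # illustrative target N(mu, sigma^2)

def velocity(t, x):
    # Closed-form velocity b(t, x) = E[X1 - X0 | I_t = x] for the linear
    # interpolant I_t = (1-t) X0 + t X1 with X0 ~ N(0, 1), X1 ~ N(mu, sigma^2):
    # both are jointly Gaussian with I_t, so the conditional mean is linear in x.
    var_t = (1.0 - t)**2 + (t * sigma)**2
    cov = t * sigma**2 - (1.0 - t)
    return mu + (cov / var_t) * (x - t * mu)

n, K = 20000, 1000
x = rng.standard_normal(n)         # reference samples at t = 0
dt = 1.0 / K
for k in range(K):
    x = x + dt * velocity(k * dt, x)   # forward-Euler step of the flow ODE
# x is now approximately distributed as N(mu, sigma^2)
```

Integrating dx/dt = b(t, x) from t = 0 to 1 pushes the Gaussian reference onto the target; in ALMC-ODE the same ODE is integrated with b replaced by the Jarzynski-reweighted particle estimate.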

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Annealed Langevin Monte Carlo for Flow ODE Sampling (ALMC-ODE), which evolves annealed Langevin chains through intermediate distributions to produce importance-weighted particles. These particles are reweighted via a Jarzynski-type identity to yield a low-variance estimator of the velocity field for a probability-flow ODE that transports a Gaussian reference to an unnormalized multimodal target. The paper establishes a general reweighting identity for time-inhomogeneous kernels, characterizes the optimal backward kernel minimizing weight variance, proves an O(1/n) MSE bound for the velocity estimator, and reports superior empirical performance versus direct Monte Carlo ODE sampling and HMC on Gaussian mixture models and a 64-dimensional Allen-Cahn system.

Significance. If the importance-weight second moments remain controlled, the work offers a principled bridge between annealed MCMC and continuous normalizing flows for multimodal sampling. The first-principles derivation of the Jarzynski-type identity for general kernels and the explicit characterization of the optimal backward kernel are notable strengths, as is the O(1/n) rate under square-integrable weights. The empirical results on high-dimensional multimodal benchmarks indicate practical promise, though the overall significance depends on whether the theoretical rate remains informative rather than vacuous in the multimodal regime.

major comments (2)
  1. [theoretical analysis section] The O(1/n) MSE bound for the velocity-field estimator (abstract and theoretical analysis section) follows from standard importance-sampling variance arguments once the weights are square-integrable. However, the implicit constant depends on the second moment of the importance weights, and the manuscript supplies no explicit bounds, sufficient conditions on the annealing schedule, or step-size restrictions that guarantee this moment stays controlled for highly multimodal targets; without such control the stated rate is formally correct but may be vacuous precisely where the method is claimed to be useful.
  2. [theoretical results] The characterization of the optimal backward kernel that minimizes importance-weight variance (theoretical results) is derived from first principles, yet the manuscript does not clarify whether this kernel is tractably computable or approximable within the annealed Langevin implementation; if the optimal kernel cannot be realized without additional approximation error, the low-variance claim and the O(1/n) bound both require re-examination.
minor comments (2)
  1. [Numerical Experiments] The experimental section describes outperformance on Gaussian mixtures and the Allen-Cahn system at a high level but omits error bars, the number of independent runs, and any ablation on annealing parameters or step sizes; these details are needed to assess robustness of the low-variance estimator.
  2. Notation for the forward and backward transition kernels in the reweighting identity could be made more explicit (e.g., consistent use of subscripts or superscripts) to improve readability of the general time-inhomogeneous case.
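The weight-degeneracy reporting that major comment 1 and minor comment 1 ask for is cheap to compute from the log weights alone. The helper below is a generic sketch (the function name is ours, not the paper's): the effective-sample-size fraction and the empirical second moment of the self-normalized weights are the standard diagnostics for whether the O(1/n) constant is staying controlled.

```python
import numpy as np

def weight_diagnostics(logw):
    # Normalize importance weights in a numerically stable way, then report
    # two standard degeneracy diagnostics: the effective-sample-size fraction
    # (1 means uniform weights, ~0 means collapse onto one particle) and the
    # empirical second moment n * sum(w_i^2) >= 1 of the normalized weights.
    logw = np.asarray(logw, dtype=float)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    ess_frac = 1.0 / (len(w) * np.sum(w**2))
    second_moment = len(w) * np.sum(w**2)
    return ess_frac, second_moment

ess, m2 = weight_diagnostics(np.zeros(1000))   # uniform weights: no degeneracy
```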

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [theoretical analysis section] The O(1/n) MSE bound for the velocity-field estimator (abstract and theoretical analysis section) follows from standard importance-sampling variance arguments once the weights are square-integrable. However, the implicit constant depends on the second moment of the importance weights, and the manuscript supplies no explicit bounds, sufficient conditions on the annealing schedule, or step-size restrictions that guarantee this moment stays controlled for highly multimodal targets; without such control the stated rate is formally correct but may be vacuous precisely where the method is claimed to be useful.

    Authors: We agree that the O(1/n) MSE bound is conditional on the second moment of the importance weights remaining finite, as it follows from standard importance-sampling analysis applied to the Jarzynski reweighting identity. The manuscript states the rate under this assumption and supports its practical relevance through experiments on multimodal targets where the estimator exhibits low variance. However, we acknowledge that the current version does not supply explicit sufficient conditions on the annealing schedule or step sizes to guarantee bounded moments for arbitrary highly multimodal densities. In the revised manuscript we will add a dedicated paragraph in the theoretical analysis section discussing this limitation, referencing related results on weight degeneracy in annealed importance sampling, and outlining heuristic guidelines for schedule design that appear effective in our benchmarks. revision: partial

  2. Referee: [theoretical results] The characterization of the optimal backward kernel that minimizes importance-weight variance (theoretical results) is derived from first principles, yet the manuscript does not clarify whether this kernel is tractably computable or approximable within the annealed Langevin implementation; if the optimal kernel cannot be realized without additional approximation error, the low-variance claim and the O(1/n) bound both require re-examination.

    Authors: The characterization identifies the backward kernel that achieves the minimal possible weight variance for any given forward transition, derived directly from the general reweighting identity. The ALMC-ODE implementation employs annealed Langevin dynamics as a computationally tractable surrogate for this optimal kernel, selected because it efficiently samples the sequence of intermediate distributions. The low-variance property and O(1/n) bound are claimed for the implemented estimator whenever the resulting weights are square-integrable, which is verified empirically. We will revise the theoretical results section to explicitly state that the annealed Langevin kernel approximates the optimal one, to discuss the approximation error, and to note that the variance reduction and rate still hold for the practical kernel under the finite-second-moment condition. revision: yes

Circularity Check

0 steps flagged

No circularity: reweighting identity and MSE bound derived from first principles

full rationale

The paper derives a Jarzynski-type reweighting identity for general time-inhomogeneous transition kernels, characterizes the optimal backward kernel, and proves the O(1/n) MSE bound using standard importance-sampling variance arguments once weights are square-integrable. These steps are presented as direct mathematical derivations rather than reductions to fitted parameters, prior self-citations, or ansatzes. No load-bearing step reduces by construction to the paper's own inputs or empirical fits; the theoretical claims remain independent of the specific multimodal targets or numerical experiments. The skeptic concern about weight control for multimodal cases affects practical utility of the bound but does not indicate circularity in the derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method assumes standard existence and uniqueness results for ODEs and SDEs, plus the validity of the Jarzynski equality for the chosen kernels; no new entities or fitted parameters are introduced in the abstract.

axioms (2)
  • standard math Existence and uniqueness of solutions to the probability-flow ODE derived from stochastic interpolants
    Invoked to guarantee a well-defined continuous transport from Gaussian reference to target distribution.
  • domain assumption The Jarzynski equality holds for the time-inhomogeneous transition kernels used in the annealed chains
    Central to the reweighting identity and variance analysis.

pith-pipeline@v0.9.0 · 5523 in / 1412 out tokens · 41045 ms · 2026-05-10T00:15:02.772025+00:00 · methodology

discussion (0)

