arxiv: 2409.02399 · v2 · submitted 2024-09-04 · 📊 stat.CO · math.OC

Guidance for twisted particle filter: a continuous-time perspective

Pith reviewed 2026-05-23 21:04 UTC · model grok-4.3

classification 📊 stat.CO math.OC
keywords twisted particle filter · continuous time · neural network · KL divergence · path measures · importance sampling · Monte Carlo · sequential Monte Carlo

The pith

A neural network trained to minimize KL divergence between path measures guides the Twisted-Path Particle Filter in continuous time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Twisted-Path Particle Filter, which parameterizes a twisting function with a neural network and trains it by minimizing a KL-divergence between path measures. This construction draws from control-based importance sampling methods that operate directly in continuous time. The goal is to lower the variance of Monte Carlo estimates for high-dimensional distributions and their normalizing constants. Numerical experiments are presented to show that the resulting algorithm improves approximation quality over standard particle filters.

Core claim

The Twisted-Path Particle Filter parameterizes its twisting function by a neural network and trains the network parameters to minimize a specific KL-divergence between path measures; the design is guided by existing control-based importance sampling algorithms in the continuous-time setting, and experiments indicate that the trained filter produces lower-variance Monte Carlo approximations than the untwisted particle filter.

What carries the argument

The neural-network-parameterized twisting function trained by minimizing KL divergence between path measures.
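
For orientation, the trained object can be written in the notation commonly used in the twisted particle filter literature. This is a sketch under stated assumptions: the symbols $M_t$ (original transition kernel), $\psi_t^{\theta}$ (network-parameterized twisting function), $\mathbb{P}^{\theta}$ (path measure of the twisted model) and $\mathbb{P}^{*}$ (target path measure) are chosen here for illustration and may not match the paper's notation, and the direction of the KL divergence shown is one common choice rather than the paper's confirmed objective.

```latex
% Twisted transition kernel: reweight the original kernel M_t by the
% twisting function psi_t^theta and renormalize (assumed notation).
M_t^{\theta}(x_{t-1}, \mathrm{d}x_t)
  = \frac{\psi_t^{\theta}(x_t)\, M_t(x_{t-1}, \mathrm{d}x_t)}
         {\int \psi_t^{\theta}(x')\, M_t(x_{t-1}, \mathrm{d}x')}

% Generic path-space training objective (the direction of the KL is an assumption):
\theta^{\star} \in \arg\min_{\theta}\;
  \mathrm{KL}\!\left(\mathbb{P}^{\theta} \,\middle\|\, \mathbb{P}^{*}\right)
  = \mathbb{E}_{\mathbb{P}^{\theta}}\!\left[
      \log \frac{\mathrm{d}\mathbb{P}^{\theta}}{\mathrm{d}\mathbb{P}^{*}}
    \right]
```

If $\mathbb{P}^{\theta} = \mathbb{P}^{*}$ is attainable, the KL divergence is zero in either direction, which is the sense in which an ideal twisting function would leave nothing for the importance weights to correct.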

If this is right

  • Lower-variance Monte Carlo estimates of normalizing constants become available for continuous-time models.
  • The same training procedure can be applied to other path-space importance samplers that admit a twisting function.
  • The continuous-time perspective supplies a principled objective for choosing the twisting function in discrete-time twisted particle filters.
  • High-dimensional filtering problems can be addressed without hand-crafting the twisting function.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may extend to settings where the underlying process is only partially observed, provided the path-measure KL objective can still be estimated.
  • Because the training objective is defined on entire paths, the approach could be combined with existing continuous-time control methods to produce hybrid samplers.
  • If the KL minimization succeeds, the resulting filter may serve as a building block for more accurate sequential Monte Carlo algorithms in non-Markovian or infinite-dimensional state spaces.

Load-bearing premise

Training the neural network to minimize the chosen KL divergence between path measures produces a net reduction in the variance of the particle filter estimator.

What would settle it

A side-by-side run on the same continuous-time model in which the empirical variance of the Twisted-Path Particle Filter estimator, after training, is no smaller than that of the ordinary particle filter.
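
A side-by-side comparison of this kind could be set up roughly as follows. This is a minimal sketch, not the paper's experiment: the one-dimensional linear Gaussian model, its parameters (`a`, `sx`, `sy`, `T`), and all function names are hypothetical, and a fully adapted auxiliary particle filter (FA-APF, one of the competitors named in the figure captions) stands in for the trained TPPF because its one-step look-ahead is available in closed form. The exact log Z from a Kalman filter serves as a reference.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_mean_exp(lw):
    """Numerically stable log of the average of exp(lw)."""
    m = lw.max()
    return m + np.log(np.mean(np.exp(lw - m)))

def resample(lw, rng):
    """Multinomial resampling: ancestor indices drawn in proportion to exp(lw)."""
    w = np.exp(lw - lw.max())
    return rng.choice(lw.size, size=lw.size, p=w / w.sum())

def kalman_log_z(y, a, sx, sy):
    """Exact log marginal likelihood of the linear Gaussian model."""
    m, p, ll = 0.0, sx**2, 0.0          # prior x_0 ~ N(0, sx^2)
    for t, yt in enumerate(y):
        if t > 0:
            m, p = a * m, a * a * p + sx**2
        s = p + sy**2                   # predictive variance of y_t
        ll += log_normal(yt, m, s)
        k = p / s                       # Kalman gain
        m, p = m + k * (yt - m), (1 - k) * p
    return ll

def bootstrap_pf(y, a, sx, sy, n, rng):
    """Bootstrap particle filter estimate of log Z."""
    x = sx * rng.standard_normal(n)
    log_z = 0.0
    for t, yt in enumerate(y):
        if t > 0:
            x = a * x[idx] + sx * rng.standard_normal(n)
        lw = log_normal(yt, x, sy**2)   # weight by the likelihood
        log_z += log_mean_exp(lw)
        idx = resample(lw, rng)
    return log_z

def fully_adapted_apf(y, a, sx, sy, n, rng):
    """Fully adapted APF: propose from p(x_t | x_{t-1}, y_t)."""
    s2 = 1.0 / (1.0 / sx**2 + 1.0 / sy**2)   # var of p(x_t | x_{t-1}, y_t)
    pred_var = sx**2 + sy**2                 # var of p(y_t | x_{t-1})
    log_z = log_normal(y[0], 0.0, pred_var)
    x = s2 * y[0] / sy**2 + np.sqrt(s2) * rng.standard_normal(n)
    for yt in y[1:]:
        lv = log_normal(yt, a * x, pred_var) # look-ahead weights p(y_t | x_{t-1})
        log_z += log_mean_exp(lv)
        idx = resample(lv, rng)
        mean = s2 * (a * x[idx] / sx**2 + yt / sy**2)
        x = mean + np.sqrt(s2) * rng.standard_normal(n)
    return log_z

rng = np.random.default_rng(0)
a, sx, sy, T = 0.9, 1.0, 0.3, 25             # hypothetical model parameters
x_true = np.zeros(T)
x_true[0] = sx * rng.standard_normal()
for t in range(1, T):
    x_true[t] = a * x_true[t - 1] + sx * rng.standard_normal()
y = x_true + sy * rng.standard_normal(T)

n_particles, n_reps = 100, 100
exact = kalman_log_z(y, a, sx, sy)
bpf = np.array([bootstrap_pf(y, a, sx, sy, n_particles, rng) for _ in range(n_reps)])
fa = np.array([fully_adapted_apf(y, a, sx, sy, n_particles, rng) for _ in range(n_reps)])
print(f"exact log Z = {exact:.3f}")
print(f"BPF    : mean {bpf.mean():.3f}, std {bpf.std():.3f}")
print(f"FA-APF : mean {fa.mean():.3f}, std {fa.std():.3f}")
```

To test the premise itself, the FA-APF would be replaced by the trained TPPF with the same particle count and replicate budget; the premise fails if the trained filter's empirical standard deviation is no smaller than the bootstrap filter's.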

Figures

Figures reproduced from arXiv: 2409.02399 by Jianfeng Lu, Yuliang Wang.

Figure 1
Figure 1: Linear Gaussian model: comparison of TPPF (trained with LRE, LCE, or LRECE) and its competitors (BPF, iAPF, and FA-APF). Boxplot of log Z using 1000 replicates, with configurations d ∈ {2, 5, 15, 20}. The red cross represents the mean and the red dashed line represents the median.

             d=2    d=5    d=15   d=20
BPF          0.60   1.14   3.85   5.95
TPPF(RE)     0.27   0.38   0.87   1.23
TPPF(CE)     0.34   0.82   3.51   5.70
TPPF(RECE)   0.31   0.54   0.86   1.21
FA-A…
Figure 2
Figure 2: Lorenz-96 model: comparison of TPPF (trained with LRE, LCE, or LRECE) and its competitors (BPF, iAPF, and FA-APF). (a) Empirical standard deviation for different external force strengths α in dimension d = 3, using 20 replicates. (b) Boxplot of log Z using 20 replicates for d = 3, α = 3.0. The red cross represents the mean and the red dashed line represents the median.
Original abstract

The particle filter (PF), also known as sequential Monte Carlo (SMC), approximates high-dimensional probability distributions and their normalizing constants in the discrete-time setting. To reduce the variance of the Monte Carlo approximation, various twisted particle filters (TPFs) have been proposed, in which a twisting function is chosen or learned to modify the Markov transition kernel. Guided by existing control-based importance sampling algorithms in the continuous-time setting, we propose a novel algorithm called the "Twisted-Path Particle Filter" (TPPF), in which the twisting function is parameterized by a neural network and trained to minimize a specific KL-divergence between path measures. Numerical experiments illustrate the capability of the proposed algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes the Twisted-Path Particle Filter (TPPF), a twisted particle filter whose design is guided by control-based importance sampling methods from the continuous-time setting. A neural network parameterizes the twisting function, which is trained by minimizing a KL divergence between path measures. Numerical experiments are presented as illustrations of the algorithm's capability for Monte Carlo approximation of distributions and normalizing constants.

Significance. If the KL-trained twisting yields a net reduction in estimator variance after accounting for training cost, the continuous-time control perspective could provide a principled route to improved SMC performance on path-space problems. The explicit link to control-based IS is a constructive contribution that may aid future work on learned proposals.

major comments (2)
  1. [§3] §3 (algorithm derivation): the manuscript states that the chosen KL objective between path measures produces an improved twisting function, but supplies no explicit variance bound or bias-variance decomposition showing that the resulting estimator variance is strictly smaller than the untwisted PF (or existing TPF baselines) for the same number of particles; without this, the central claim that the method 'improves Monte Carlo approximation' rests on the illustrative experiments alone.
  2. [§5] §5 (numerical experiments): the reported runs use small state dimensions and short time horizons; no scaling study or comparison against a non-neural twisted filter (e.g., analytically chosen twisting) is given, so it remains unclear whether the NN parameterization delivers a practical advantage once training overhead is included.
minor comments (3)
  1. [§2] Notation for the continuous-time path measure and the twisting function should be introduced with a single consistent symbol table; currently the same symbol appears to be reused for the discrete-time and continuous-time cases.
  2. [§5] Figure captions should explicitly state the number of particles, the dimension of the state, and the training budget (epochs / samples) so that the plots can be reproduced without consulting the main text.
  3. The reference list omits several standard works on continuous-time SMC and on control-based importance sampling (e.g., the original papers on the continuous-time Feynman-Kac framework); adding them would clarify the precise novelty of the KL choice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive comments. We address each major comment below, providing clarifications on the theoretical motivation and the scope of the numerical experiments.

  1. Referee: [§3] §3 (algorithm derivation): the manuscript states that the chosen KL objective between path measures produces an improved twisting function, but supplies no explicit variance bound or bias-variance decomposition showing that the resulting estimator variance is strictly smaller than the untwisted PF (or existing TPF baselines) for the same number of particles; without this, the central claim that the method 'improves Monte Carlo approximation' rests on the illustrative experiments alone.

    Authors: We agree that the manuscript does not derive an explicit finite-particle variance bound. The KL objective is selected because it arises directly from the continuous-time control formulation of importance sampling, where the optimal twisting function minimizes a path-space cost that is known to yield the zero-variance estimator in the limit; this connection is the central guidance provided by the continuous-time perspective. A rigorous bias-variance decomposition for the resulting particle estimator is technically involved and lies beyond the scope of the present work, which focuses on algorithm derivation and the control-theoretic link. We will revise §3 to make this motivation and limitation explicit, while retaining the claim of improvement on the basis of the principled objective and supporting experiments. revision: partial

  2. Referee: [§5] §5 (numerical experiments): the reported runs use small state dimensions and short time horizons; no scaling study or comparison against a non-neural twisted filter (e.g., analytically chosen twisting) is given, so it remains unclear whether the NN parameterization delivers a practical advantage once training overhead is included.

    Authors: The experiments are explicitly described in the abstract and §5 as illustrations of the algorithm's capability rather than a comprehensive benchmark. The neural-network parameterization is intended for regimes in which closed-form twisting functions are unavailable; direct comparison to an analytic baseline is therefore not always feasible and would not demonstrate the method's intended use case. Training cost is acknowledged as part of the procedure, but the paper does not assert net computational superiority. We therefore do not plan revisions to §5, as expanding the experiments would shift the manuscript away from its stated focus on the continuous-time derivation. revision: no
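
The zero-variance property invoked in the first response can be illustrated in a static, one-dimensional analogue. This is only a sketch under stated assumptions, not the paper's path-space setting: the target Z = E[exp(a·X)] with X ~ N(0, 1), the tilt parameter `b`, and the function name `tilted_estimates` are all hypothetical choices. Sampling from N(b, 1) and reweighting gives an unbiased estimate of Z for any `b`, and the choice b = a makes every weighted sample equal to Z, so the variance vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 1.5                      # hypothetical tilt strength in the integrand exp(a * x)
n = 10_000
exact = np.exp(0.5 * a**2)   # E[exp(a X)] for X ~ N(0, 1)

def log_std_normal(x, mean):
    """Log density of N(mean, 1) evaluated at x."""
    return -0.5 * (np.log(2 * np.pi) + (x - mean) ** 2)

def tilted_estimates(b):
    """Per-sample importance sampling estimates of Z using proposal N(b, 1)."""
    z = b + rng.standard_normal(n)
    log_ratio = log_std_normal(z, 0.0) - log_std_normal(z, b)  # target / proposal
    return np.exp(a * z + log_ratio)

plain = tilted_estimates(0.0)       # b = 0: ordinary Monte Carlo
partial = tilted_estimates(a / 2)   # a partially adapted proposal
optimal = tilted_estimates(a)       # b = a: the zero-variance proposal

for name, est in [("plain", plain), ("partial", partial), ("optimal", optimal)]:
    print(f"{name:8s} mean {est.mean():.4f}  std {est.std():.2e}")
print(f"exact    {exact:.4f}")
```

The partially adapted proposal is the practically relevant case: a learned twisting function will generally be imperfect, and how much variance reduction it retains is the question the referee's first comment raises.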

Circularity Check

0 steps flagged

No significant circularity

The derivation introduces a neural-network-parameterized twisting function trained on an external KL-divergence between path measures, guided by prior control-based importance sampling results. This objective is independent of the final particle filter estimator and does not reduce to it by construction; the claimed variance reduction is presented as an empirical consequence rather than a definitional identity. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the abstract or described chain. The numerical experiments are explicitly illustrative, leaving the central proposal self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities. The neural-network weights are implicit fitting parameters of the method rather than part of the claim itself. Standard properties of KL divergence and path measures are assumed but not listed as paper-specific axioms.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Inference for Coupled Hidden Markov Models in Continuous Time and Discrete Space

    stat.ML 2025-10 unverdicted novelty 6.0

    Proposes Latent Interacting Particle Systems with an efficient parameterization of twist potentials to enable approximate posterior inference for coupled continuous-time hidden Markov models via twisted sequential Mon...
