pith. machine review for the scientific record.

arxiv: 2604.21456 · v2 · submitted 2026-04-23 · 💻 cs.LG · cs.RO

Recognition: 2 theorem links


Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics

Heng Yang


Pith reviewed 2026-05-12 00:56 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords tempered sequential monte carlo · trajectory optimization · policy optimization · differentiable dynamics · hamiltonian monte carlo · control as inference · annealed sampling · particle methods

The pith

Tempered sequential Monte Carlo samples optimal controllers by annealing particles from prior to low-cost target distributions using exact gradients from differentiable rollouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper casts finite-horizon controller design as an inference problem that minimizes a KL-regularized expected trajectory cost, producing an optimal Boltzmann-tilted distribution over parameters that concentrates on low-cost solutions as temperature drops. To draw samples from this potentially sharp and multimodal target, it introduces tempered sequential Monte Carlo, an annealing procedure that progressively reweights and resamples particles while inserting Hamiltonian Monte Carlo rejuvenation steps. These steps exploit exact gradients obtained by differentiating through full trajectory rollouts, preserving particle diversity throughout the tempering path. The same machinery extends to policy optimization through a deterministic approximation of the initial-state distribution and an auxiliary-variable construction for rollout stochasticity. If the method works as described, it supplies a broadly applicable sampling alternative to direct gradient or variational approaches for trajectory and policy problems where dynamics are differentiable.

Core claim

The central claim is that tempered sequential Monte Carlo provides an effective sampling scheme for the Boltzmann-tilted distribution arising from KL-regularized trajectory-cost minimization. Particles begin from a prior and are adaptively reweighted and resampled along a tempering schedule toward the target; Hamiltonian Monte Carlo rejuvenation steps, which use gradients obtained by automatic differentiation through the dynamics, maintain diversity even as the distribution sharpens. For policy optimization the construction is enlarged by treating initial states deterministically and rollout noise as auxiliary variables, allowing the same annealed particle filter to optimize policies under stochastic rollouts.
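The pipeline the claim describes can be made concrete with a toy sketch. Nothing below is taken from the paper: the double-well cost standing in for a trajectory cost, the linear tempering schedule, multinomial resampling, and MALA rejuvenation (a one-step gradient-informed stand-in for full HMC) are all illustrative simplifications of the paper's adaptive versions.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.05                                   # temperature of the Boltzmann tilt

def cost(th):                                # double-well stand-in for J(theta)
    return (th**2 - 1.0)**2

def grad_cost(th):
    return 4.0 * th * (th**2 - 1.0)

def log_target(th, beta):                    # log pi_beta = log prior - beta*J/lam
    return -0.5 * th**2 / 4.0 - beta * cost(th) / lam

def grad_log_target(th, beta):
    return -th / 4.0 - beta * grad_cost(th) / lam

n = 500
theta = rng.normal(0.0, 2.0, size=n)         # particles drawn from the N(0, 2^2) prior
betas = np.linspace(0.0, 1.0, 51)            # tempering path from prior to target

for b0, b1 in zip(betas[:-1], betas[1:]):
    # 1) incremental importance weights for the tempering step b0 -> b1
    logw = -(b1 - b0) * cost(theta) / lam
    w = np.exp(logw - logw.max()); w /= w.sum()
    # 2) resample to fight weight degeneracy (the paper resamples adaptively)
    theta = theta[rng.choice(n, size=n, p=w)]
    # 3) rejuvenate with gradient-informed MALA moves targeting pi_{b1}
    eps = 0.05
    for _ in range(5):
        g = grad_log_target(theta, b1)
        prop = theta + 0.5 * eps**2 * g + eps * rng.normal(size=n)
        gp = grad_log_target(prop, b1)
        # Metropolis-Hastings correction with the asymmetric proposal densities
        log_q_fwd = -((prop - theta - 0.5 * eps**2 * g) ** 2) / (2 * eps**2)
        log_q_bwd = -((theta - prop - 0.5 * eps**2 * gp) ** 2) / (2 * eps**2)
        log_a = log_target(prop, b1) - log_target(theta, b1) + log_q_bwd - log_q_fwd
        accept = np.log(rng.uniform(size=n)) < log_a
        theta = np.where(accept, prop, theta)
```

On this toy problem the particle set should end up concentrated near both low-cost modes at θ ≈ ±1 rather than collapsing to one of them, which is the diversity behavior the claim attributes to combining resampling with gradient-informed rejuvenation.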

What carries the argument

Tempered sequential Monte Carlo (TSMC) annealing with Hamiltonian Monte Carlo rejuvenation that exploits exact gradients from differentiating through trajectory rollouts.

If this is right

  • The same TSMC procedure applies without change to both open-loop trajectory optimization and closed-loop policy optimization.
  • Policy optimization receives a deterministic empirical approximation of the initial-state distribution and an extended-space treatment of rollout randomness as auxiliary variables.
  • Experiments across standard trajectory- and policy-optimization benchmarks indicate that TSMC performs at least as well as current state-of-the-art baselines.
  • The combination of adaptive tempering and gradient-informed rejuvenation keeps particle diversity high even for sharp multimodal targets.
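The "exact gradients from differentiating through rollouts" underpinning these points can be illustrated with a hand-written adjoint pass on a toy scalar system. The dynamics, costs, and function names here are hypothetical; in practice an autodiff framework (the paper's setting) produces this reverse pass automatically.

```python
import numpy as np

def rollout_cost(u, x0=0.0, dt=0.1):
    """Roll out x_{t+1} = x_t + dt * (-x_t + u_t) and sum quadratic costs."""
    x, xs = x0, []
    for ut in u:
        xs.append(x)
        x = x + dt * (-x + ut)
    xs.append(x)                              # final state x_T
    J = sum(xt**2 for xt in xs) + 0.1 * sum(ut**2 for ut in u)
    return J, xs

def rollout_grad(u, x0=0.0, dt=0.1):
    """Adjoint (reverse-mode) pass: what autodiff through the rollout computes."""
    _, xs = rollout_cost(u, x0, dt)
    T = len(u)
    lam = 2.0 * xs[T]                         # adjoint lam_T = dJ/dx_T
    g = np.zeros(T)
    for t in reversed(range(T)):
        g[t] = lam * dt + 0.2 * u[t]          # dJ/du_t flows through x_{t+1}
        lam = lam * (1.0 - dt) + 2.0 * xs[t]  # backpropagate to lam_t = dJ/dx_t
    return g
```

A finite-difference check of `rollout_grad` against `rollout_cost` confirms the adjoint recursion; this is the gradient that would drive the HMC rejuvenation steps.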

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be combined with existing automatic-differentiation simulators to reduce the need for hand-crafted policy gradients in control tasks.
  • If the tempering schedule can be learned rather than hand-tuned, sample efficiency might improve further on problems with varying degrees of multimodality.
  • The method's reliance on differentiability through rollouts points to natural integration with learned dynamics models that themselves admit gradients.

Load-bearing premise

Hamiltonian Monte Carlo rejuvenation continues to mix effectively inside the high-dimensional, potentially stiff distributions produced by long-horizon differentiable rollouts.

What would settle it

A decisive test: on a long-horizon benchmark with stiff dynamics, check whether the particle set collapses to a single mode or the sampled controllers fail to reach lower costs than standard gradient-based methods. Either outcome would undermine the central claim.

Figures

Figures reproduced from arXiv: 2604.21456 by Heng Yang.

Figure 1. Tempered Sequential Monte Carlo samples from the Shekel-tilted distribution.
Figure 2. Trajectory optimization results. (a)–(c): boxplots of final-particle trajectory costs.
Figure 3. Policy optimization results. Left to right: episode returns.
Figure 4. Running PPO and SAC on Acrobot for 10⁹ environment steps (1713.00 for PPO, 3121.77 for SAC).
Original abstract

We propose a sampling-based framework for finite-horizon trajectory and policy optimization under differentiable dynamics by casting controller design as inference. Specifically, we minimize a KL-regularized expected trajectory cost, which yields an optimal "Boltzmann-tilted" distribution over controller parameters that concentrates on low-cost solutions as temperature decreases. To sample efficiently from this sharp, potentially multimodal target, we introduce tempered sequential Monte Carlo (TSMC): an annealing scheme that adaptively reweights and resamples particles along a tempering path from a prior to the target distribution, while using Hamiltonian Monte Carlo rejuvenation to maintain diversity and exploit exact gradients obtained by differentiating through trajectory rollouts. For policy optimization, we extend TSMC via (i) a deterministic empirical approximation of the initial-state distribution and (ii) an extended-space construction that treats rollout randomness as auxiliary variables. Experiments across trajectory- and policy-optimization benchmarks show that TSMC is broadly applicable and compares favorably to state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Tempered Sequential Monte Carlo (TSMC) for finite-horizon trajectory and policy optimization under differentiable dynamics by casting controller design as inference. It minimizes a KL-regularized expected trajectory cost to obtain a Boltzmann-tilted target distribution over controller parameters that concentrates on low-cost solutions at low temperature. TSMC performs adaptive annealing from a prior via reweighting and resampling, with HMC rejuvenation steps that exploit exact gradients obtained by differentiating through trajectory rollouts. Extensions handle policy optimization via empirical initial-state approximation and auxiliary variables for rollout randomness. Benchmark experiments indicate that TSMC is broadly applicable and compares favorably to state-of-the-art baselines.

Significance. If the empirical results hold and HMC mixing is reliable, the work provides a principled sampling framework that bridges SMC tempering with gradient-based rejuvenation for control problems exhibiting multimodality. A clear strength is the use of exact autodiff gradients inside HMC and the adaptive tempering construction built from standard, independently justified components.

major comments (1)
  1. [§3] §3 (TSMC construction) and the HMC rejuvenation description: the central claim that HMC rejuvenation maintains particle diversity along the tempering path is load-bearing, yet the manuscript supplies no analysis of leapfrog stability, mass-matrix choice, acceptance rates, or autocorrelation times when applied to the high-dimensional, potentially stiff distributions induced by long-horizon differentiable rollouts. If acceptance rates collapse, the adaptive reweighting/resampling step degenerates and the method cannot reliably reach the claimed low-temperature modes.
minor comments (2)
  1. [Abstract] The abstract states that experiments show favorable performance but supplies no quantitative metrics, error bars, or ablation details; including at least one representative table or figure reference in the abstract would strengthen the summary.
  2. [§2] The Boltzmann-tilted target is referenced as arising from the KL-regularized cost, but an explicit one-paragraph derivation (including the role of temperature) would improve accessibility for readers unfamiliar with the inference framing.
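A sketch of the one-paragraph derivation the second minor comment requests, under assumed notation (prior $p$, cost $J$, temperature $\lambda$, normalizer $Z$); this is an editorial reconstruction, not the paper's own derivation:

```latex
\begin{aligned}
\min_{q}\; \mathbb{E}_{\theta \sim q}\big[J(\theta)\big] + \lambda\, \mathrm{KL}\big(q \,\|\, p\big)
&= \min_{q}\; \lambda\, \mathbb{E}_{\theta \sim q}\!\left[\log \frac{q(\theta)}{p(\theta)\, e^{-J(\theta)/\lambda}}\right] \\
&= \min_{q}\; \lambda\, \mathrm{KL}\big(q \,\|\, q^{\star}\big) - \lambda \log Z,
\qquad q^{\star}(\theta) = \frac{p(\theta)\, e^{-J(\theta)/\lambda}}{Z},
\end{aligned}
```

so the minimizer is the Boltzmann-tilted distribution $q^{\star} \propto p\, e^{-J/\lambda}$, which concentrates on the minimizers of $J$ as the temperature $\lambda \to 0$.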

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the manuscript. We address the major comment on the HMC rejuvenation analysis point by point below.

Point-by-point responses
  1. Referee: [§3] §3 (TSMC construction) and the HMC rejuvenation description: the central claim that HMC rejuvenation maintains particle diversity along the tempering path is load-bearing, yet the manuscript supplies no analysis of leapfrog stability, mass-matrix choice, acceptance rates, or autocorrelation times when applied to the high-dimensional, potentially stiff distributions induced by long-horizon differentiable rollouts. If acceptance rates collapse, the adaptive reweighting/resampling step degenerates and the method cannot reliably reach the claimed low-temperature modes.

    Authors: We agree that the manuscript would benefit from additional empirical diagnostics on the HMC rejuvenation steps, as these are central to maintaining particle diversity during tempering. The current version emphasizes the overall TSMC construction, adaptive annealing, and benchmark performance rather than a dedicated mixing study. In practice, the gradual tempering schedule combined with resampling helps prevent degeneracy even when individual HMC steps are imperfect. For the revised manuscript, we will add an appendix that reports observed acceptance rates (typically tuned to 0.6-0.8), autocorrelation times, and effective sample sizes from the trajectory- and policy-optimization experiments. We will also document the leapfrog integrator settings and mass-matrix choice (a diagonal matrix scaled by per-parameter gradient variances). These additions will allow readers to evaluate mixing behavior on the high-dimensional, rollout-induced distributions without altering the core claims. revision: yes
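The settings the rebuttal describes (leapfrog integrator, diagonal mass matrix, acceptance tuned toward 0.6–0.8) might be exercised with a sketch like the following. The anisotropic Gaussian target, step size, and mass scaling are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_prob(th):                        # stand-in target: anisotropic Gaussian
    return -0.5 * (th[0]**2 / 1.0 + th[1]**2 / 0.01)

def grad_log_prob(th):
    return np.array([-th[0] / 1.0, -th[1] / 0.01])

def hmc_step(th, eps, n_leap, m_diag):
    """One HMC transition with a diagonal mass matrix, as the rebuttal describes."""
    p = rng.normal(size=th.shape) * np.sqrt(m_diag)   # momentum ~ N(0, M)
    th_new, p_new = th.copy(), p.copy()
    # leapfrog integration: half kick, alternating drifts/kicks, half kick
    p_new = p_new + 0.5 * eps * grad_log_prob(th_new)
    for _ in range(n_leap - 1):
        th_new = th_new + eps * p_new / m_diag
        p_new = p_new + eps * grad_log_prob(th_new)
    th_new = th_new + eps * p_new / m_diag
    p_new = p_new + 0.5 * eps * grad_log_prob(th_new)
    # Metropolis accept/reject on the Hamiltonian error
    h_old = -log_prob(th) + 0.5 * np.sum(p**2 / m_diag)
    h_new = -log_prob(th_new) + 0.5 * np.sum(p_new**2 / m_diag)
    if np.log(rng.uniform()) < h_old - h_new:
        return th_new, True
    return th, False

# Diagonal mass scaled to per-coordinate curvature, echoing the rebuttal's choice.
m_diag = np.array([1.0, 100.0])
th, n_acc, samples = np.zeros(2), 0, []
for _ in range(2000):
    th, acc = hmc_step(th, eps=0.05, n_leap=10, m_diag=m_diag)
    n_acc += acc
    samples.append(th.copy())
rate = n_acc / 2000                      # acceptance-rate diagnostic
```

Tracking `rate` alongside autocorrelation times of `samples`, as the revision promises, is what would let readers judge mixing on the rollout-induced distributions.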

Circularity Check

0 steps flagged

No circularity: target distribution and TSMC sampler built from external cost and standard components

full rationale

The paper defines the target as the Boltzmann-tilted distribution induced by a KL-regularized expected trajectory cost (external to the sampler). TSMC is assembled by combining tempered SMC reweighting/resampling, HMC rejuvenation steps, and automatic differentiation through rollouts; none of these steps is shown to reduce to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation. The derivation chain therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the differentiability of the dynamics (to obtain exact gradients) and on the standard convergence properties of tempered SMC and HMC; no new entities are postulated and no parameters are fitted inside the abstract description.

axioms (2)
  • domain assumption The system dynamics are differentiable with respect to controller parameters.
    Required to compute exact gradients for the HMC rejuvenation steps.
  • domain assumption Hamiltonian Monte Carlo mixes sufficiently well on the tempered distributions encountered in trajectory rollouts.
    Implicit in the claim that HMC rejuvenation maintains diversity.

pith-pipeline@v0.9.0 · 5458 in / 1387 out tokens · 35526 ms · 2026-05-12T00:56:05.180159+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · 3 internal anchors
