pith. machine review for the scientific record.

arxiv: 1805.00909 · v3 · submitted 2018-05-02 · 💻 cs.LG · cs.AI · cs.RO · stat.ML

Recognition: 2 Lean theorem links

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 18:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO · stat.ML
keywords reinforcement learning · maximum entropy · probabilistic inference · variational inference · optimal control · policy optimization

The pith

Maximum entropy reinforcement learning is equivalent to exact probabilistic inference for deterministic dynamics and variational inference for stochastic dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a maximum-entropy version of the reinforcement learning or optimal control objective can be rewritten as a probabilistic inference problem. When the system dynamics are deterministic, the optimal policy corresponds exactly to the posterior distribution over actions; when dynamics are stochastic, the same objective yields a variational inference problem. This rewriting matters because it lets researchers import tools from approximate inference, such as variational methods and message passing, directly into policy optimization. The resulting perspective also clarifies how to incorporate uncertainty, partial observability, and compositional structure into control problems without changing the underlying decision-making formalism.
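
A minimal sketch of the construction in symbols, using the common optimality-variable notation rather than equations quoted from the paper, and assuming rewards are scaled or shifted so the exponentiated reward is a valid likelihood: a binary variable O_t marks step t as "optimal", and conditioning the trajectory distribution on optimality at every step tilts it by the exponentiated return.

```latex
% Sketch: optimality variables and the induced trajectory posterior.
\[
p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big), \qquad
p(\tau \mid \mathcal{O}_{1:T}) \propto p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\,
\exp\!\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big).
\]
```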

Core claim

A generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics.
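
Under deterministic dynamics the equivalence can be stated in closed form (a sketch in the usual soft-value notation, not quoted from the paper, where Q_soft is the soft Q-function defined by the backward messages of the model above): the posterior over actions given optimality is a softmax of the soft Q-function, and that softmax is the maximum-entropy optimal policy.

```latex
% Sketch: deterministic dynamics, exact inference.
\[
\pi^{*}(a_t \mid s_t) = p(a_t \mid s_t, \mathcal{O}_{t:T})
= \exp\big(Q_{\mathrm{soft}}(s_t, a_t) - V_{\mathrm{soft}}(s_t)\big), \qquad
V_{\mathrm{soft}}(s_t) = \log \sum_{a} \exp\big(Q_{\mathrm{soft}}(s_t, a)\big).
\]
```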

What carries the argument

The maximum-entropy reinforcement learning objective, which augments the usual expected reward with an entropy term over the policy and thereby converts the control problem into one of inferring a distribution over trajectories.
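
For concreteness, the objective in question is usually written as expected reward plus per-step policy entropy; a temperature weight on the entropy term is common but omitted in this sketch.

```latex
% Sketch: maximum-entropy RL objective.
\[
J_{\mathrm{MaxEnt}}(\pi) = \sum_{t=1}^{T}
\mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\Big[\, r(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big].
\]
```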

If this is right

  • Any approximate inference algorithm can be repurposed as a reinforcement learning algorithm by substituting the appropriate energy function (see the code sketch after this list).
  • Problems with partial observability become standard filtering or smoothing tasks once cast as inference.
  • Compositionality in tasks can be handled by composing the underlying probabilistic models rather than hand-designing reward functions.
  • Uncertainty over dynamics or goals is represented directly as uncertainty in the inferred posterior.
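
To make the first implication concrete, here is a hypothetical tabular sketch, not code from the paper: replacing the hard max of value iteration with a log-sum-exp turns the Bellman backup into a backward message, and the resulting softmax policy is the maximum-entropy optimum. Names such as soft_value_iteration, R, and P are illustrative.

```python
import numpy as np

def soft_value_iteration(R, P, gamma=0.99, iters=500):
    """Soft (inference-style) value iteration on a tabular MDP.

    R: (S, A) reward array; P: (S, A, S) transition probabilities.
    The hard max over actions is replaced by a log-sum-exp, the backward
    message of the inference view. The backup below uses E[V(s')], which
    coincides with exact inference under deterministic dynamics and with
    the variational treatment under stochastic dynamics; the exact but
    risk-seeking stochastic-dynamics backup would use log E[exp V(s')].
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * (P @ V)                 # soft Q backup, shape (S, A)
        m = Q.max(axis=1, keepdims=True)        # stabilized log-sum-exp
        V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True)))[:, 0]
    policy = np.exp(Q - V[:, None])             # maxent policy: exp(Q - V)
    return Q, V, policy

# Usage on a tiny random MDP.
rng = np.random.default_rng(0)
S, A = 4, 2
R = rng.normal(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))      # (S, A, S), rows sum to 1
Q, V, pi = soft_value_iteration(R, P)
assert np.allclose(pi.sum(axis=1), 1.0)         # softmax policy is normalized
```

Swapping the log-sum-exp back to a hard max recovers ordinary value iteration, which is one way to see the entropy term as the only difference between the two problems.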

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same equivalence may allow transfer of inference scaling techniques, such as amortized variational inference, to high-dimensional continuous control.
  • It suggests that multi-task reinforcement learning can be viewed as joint inference over a shared prior and task-specific posteriors.
  • Future work could test whether inference-based regularization improves sample efficiency in model-based control compared with standard entropy bonuses.

Load-bearing premise

The claimed equivalence requires that the reinforcement learning objective is written in maximum-entropy form and that the dynamics are modeled strictly as either deterministic or stochastic.

What would settle it

An explicit counter-example in which the policy that maximizes the maximum-entropy objective differs from the posterior obtained by exact or variational inference on the same trajectory distribution.

read the original abstract

The framework of reinforcement learning or optimal control provides a mathematical formalization of intelligent decision making that is powerful and broadly applicable. While the general form of the reinforcement learning problem enables effective reasoning about uncertainty, the connection between reinforcement learning and inference in probabilistic models is not immediately obvious. However, such a connection has considerable value when it comes to algorithm design: formalizing a problem as probabilistic inference in principle allows us to bring to bear a wide array of approximate inference tools, extend the model in flexible and powerful ways, and reason about compositionality and partial observability. In this article, we will discuss how a generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics. We will present a detailed derivation of this framework, overview prior work that has drawn on this and related ideas to propose new reinforcement learning and control algorithms, and describe perspectives on future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that maximum-entropy reinforcement learning is equivalent to exact probabilistic inference under deterministic dynamics and to variational inference under stochastic dynamics. It supplies a detailed derivation of the equivalence, surveys prior algorithms that exploit the connection, and outlines perspectives for future research.

Significance. The equivalence supplies a principled route for importing approximate-inference machinery into reinforcement learning and control, thereby supporting more flexible handling of uncertainty, compositionality, and partial observability. Because the derivation is standard and the review synthesizes an already influential line of work, the tutorial consolidates a useful conceptual bridge that has demonstrably aided algorithm design.

minor comments (2)
  1. [Abstract] The abstract states the central equivalence but does not explicitly label the manuscript as a tutorial and review; adding this phrase would help readers set expectations for the scope and depth of the material.
  2. [Section 3] Notation for the trajectory distribution p(τ) and the reward-augmented potential is introduced early but is not cross-referenced in the later algorithmic survey; a brief reminder table or consistent equation numbering would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The referee's summary correctly identifies the core contribution: a detailed derivation showing that maximum-entropy reinforcement learning corresponds to exact probabilistic inference for deterministic dynamics and to variational inference for stochastic dynamics, together with a survey of prior algorithms and future perspectives.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives the equivalence between maximum-entropy RL and exact/variational inference by defining the trajectory distribution p(τ) ∝ exp(∑ r_t) directly from the reward function and showing that the maximum-entropy RL objective is a variational lower bound on the log-partition function of this distribution, tight under deterministic dynamics. This construction follows immediately from the given probabilistic model and the max-ent objective without any fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that themselves require the target result. Prior literature is reviewed for context, but the central derivation chain remains independent and self-contained.
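
For reference, the log-partition relation invoked here can be sketched as follows (standard notation, not quoted from the paper): when the variational distribution q(τ) is constrained to use the true dynamics, the evidence lower bound on the log-partition function is exactly the maximum-entropy objective, and the bound is tight under deterministic dynamics.

```latex
% Sketch: the maxent objective as an ELBO on the log-partition function.
\[
\log p(\mathcal{O}_{1:T})
\;\ge\; \mathbb{E}_{q(\tau)}\Big[ \sum_{t=1}^{T} r(s_t, a_t) - \log q(a_t \mid s_t) \Big]
\;=\; J_{\mathrm{MaxEnt}}(q).
\]
```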

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a review paper, the work relies on standard axioms from probabilistic inference and reinforcement learning without introducing new free parameters or invented entities.

axioms (1)
  • standard math: Standard axioms of probabilistic inference and variational methods
    The equivalence derivations invoke core principles of exact and variational inference as background.

pith-pipeline@v0.9.0 · 5469 in / 1029 out tokens · 38326 ms · 2026-05-13T18:23:50.151277+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  2. Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

    cs.LG 2026-05 unverdicted novelty 7.0

    The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...

  3. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  4. Generative Actor-Critic with Soft Bridge Policies

    cs.LG 2026-05 unverdicted novelty 7.0

    SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.

  5. Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...

  6. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  7. Receding-Horizon Control via Drifting Models

    cs.AI 2026-04 unverdicted novelty 7.0

    Drifting MPC produces a unique distribution over trajectories that trades off data support against optimality and enables efficient receding-horizon planning under unknown dynamics.

  8. DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    cs.LG 2025-09 unverdicted novelty 7.0

    DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...

  9. Mutual Information Optimal Density Control of Linear Systems and Generalized Schrödinger Bridges with Reference Refinement

    math.OC 2026-05 unverdicted novelty 6.0

    Alternating optimization for MI-optimal density control of linear systems coincides with that for generalized Schrödinger bridges.

  10. Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

    cs.LG 2026-05 unverdicted novelty 6.0

    DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.

  11. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  12. PISTO: Proximal Inference for Stochastic Trajectory Optimization

    cs.RO 2026-05 unverdicted novelty 6.0

    PISTO augments stochastic trajectory optimization with proximal KL regularization, yielding closed-form mean updates via importance sampling that outperform STOMP, CHOMP, CEM, and MPPI on robot arm and MuJoCo benchmarks.

  13. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.

  14. Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

  15. RL Token: Bootstrapping Online RL with Vision-Language-Action Models

    cs.LG 2026-04 unverdicted novelty 6.0

    RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.

  16. Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics

    cs.LG 2026-04 unverdicted novelty 6.0

    Tempered sequential Monte Carlo samples efficiently from a temperature-annealed distribution over controller parameters to solve trajectory and policy optimization under differentiable dynamics.

  17. Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics

    cs.LG 2026-04 unverdicted novelty 6.0

    Tempered sequential Monte Carlo samples from a Boltzmann-tilted distribution over controllers to optimize trajectories and policies under differentiable dynamics.

  18. DAG-STL: A Hierarchical Framework for Zero-Shot Trajectory Planning under Signal Temporal Logic Specifications

    cs.RO 2026-04 unverdicted novelty 6.0

    DAG-STL decomposes long-horizon STL planning into decomposition, timed waypoint allocation, and diffusion-based trajectory generation to enable zero-shot planning under unknown dynamics.

  19. Reinforcement Learning, Optimal Control, and Bayesian Filtering in Data Assimilation

    math.DS 2026-04 unverdicted novelty 6.0

    A variational hierarchy unifies Bayesian filtering, variational data assimilation, KL-regularized control, and Kalman methods by proving that posteriors minimize a likelihood-plus-KL objective with evidence as the glo...

  20. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 5.0

    An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  21. On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

    cs.AI 2026-05 unverdicted novelty 5.0

    Post-training reweights a pretrained model's behavior distribution either within its existing accessible support (elicitation) or by expanding that support (creation), with both SFT and RL acting as free-energy minimi...

  22. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

    cs.CV 2026-04 unverdicted novelty 5.0

    RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

  23. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    cs.LG 2020-05 unverdicted novelty 2.0

    Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 21 Pith papers

  1. [1]

    Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. (2018). Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR)

  2. [2]

    Attias, H. (2003). Planning by probabilistic inference. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics

  3. [3]

    Bagnell, J. A. and Schneider, J. (2003). Covariant policy search. In International Joint Conference on Artifical Intelligence (IJCAI)

  4. [4]

    Botvinick, M. and An, J. (2009). Goal-directed decision making in prefrontal cortex: a computational framework. In Advances in Neural Information Processing Systems (NIPS)

  5. [5]

    Botvinick, M. and Toussaint, M. (2012). Planning as inference. Trends in Cognitive Sciences , 16(10):485--488

  6. [6]

    Dragan, A. D., Lee, K. C. T., and Srinivasa, S. S. (2013). Legibility and predictability of robot motion. In International Conference on Human-Robot Interaction (HRI)

  7. [7]

    Dvijotham, K. and Todorov, E. (2010). Inverse optimal control with linearly-solvable mdps. In International Conference on International Conference on Machine Learning (ICML)

  8. [8]

    Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016a). A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. CoRR , abs/1611.03852

  9. [9]

    Finn, C., Levine, S., and Abbeel, P. (2016b). Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning (ICML)

  10. [10]

    Friston, K. (2009). The free-energy principle: A rough guide to the brain? Trends in Cognitive Sciences , 13(7):293--301

  11. [11]

    Fu, J., Luo, K., and Levine, S. (2018). Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations (ICLR)

  12. [12]

    Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Neural Information Processing Systems (NIPS)

  13. [13]

    Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. (2018). Meta-reinforcement learning of structured exploration strategies. CoRR , abs/1802.07245

  14. [14]

    Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., and Levine, S. (2018). Composable deep reinforcement learning for robotic manipulation. In International Conference on Robotics and Automation (ICRA)

  15. [15]

    Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. (2018a). Latent space policies for hierarchical reinforcement learning. CoRR , abs/1804.02808

  16. [16]

    Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML)

  17. [17]

    Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018b). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In arXiv

  18. [18]

    Hachiya, H., Peters, J., and Sugiyama, M. (2009). Efficient sample reuse in em-based policy search. In European Conference on Machine Learning (ECML)

  19. [19]

    Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. (2018). Learning an embedding space for transferable robot skills. In International Conference on Learning Representations (ICLR)

  20. [20]

    Heess, N., Silver, D., and Teh, Y. W. (2013). Actor-critic reinforcement learning with energy-based policies. In European Workshop on Reinforcement Learning (EWRL)

  21. [21]

    Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Neural Information Processing Systems (NIPS)

  22. [22]

    Huang, D., Farahmand, A., Kitani, K. M., and Bagnell, J. A. (2015). Approximate MaxEnt inverse optimal control and its application for mental simulation of human interactions. In AAAI Conference on Artificial Intelligence (AAAI)

  23. [23]

    Huang, D. and Kitani, K. M. (2014). Action-reaction: Forecasting the dynamics of human interaction. In European Conference on Computer Vision (ECCV)

  24. [24]

    Javdani, S., Srinivasa, S., and Bagnell, J. A. (2015). Shared autonomy via hindsight optimization. In Robotics: Science and Systems (RSS)

  25. [25]

    Kaelbling, L. P., Littman, M. L., and Moore, A. P. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research , 4:237--285

  26. [26]

    Kalman, R. (1960). A new approach to linear filtering and prediction problems. ASME Transactions journal of basic engineering , 82(1):35--45

  27. [27]

    Kappen, H. J. (2011). Optimal control theory and the linear bellman equation. Inference and Learning in Dynamic Models , pages 363--387

  28. [28]

    Kappen, H. J., Gómez, V., and Opper, M. (2012). Optimal control as a graphical model inference problem. Machine Learning, 87(2):159--182

  29. [29]

    Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques . The MIT Press

  30. [30]

    Levine, S. (2014). Motor skill learning with local trajectory methods . PhD thesis, Stanford University

  31. [31]

    Levine, S. and Abbeel, P. (2014). Learning neural network policies with guided policy search under unknown dynamics. In Neural Information Processing Systems (NIPS)

  32. [32]

    Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research , 17(1)

  33. [33]

    Levine, S. and Koltun, V. (2012). Continuous inverse optimal control with locally optimal examples. In International Conference on Machine Learning (ICML)

  34. [34]

    Levine, S. and Koltun, V. (2013a). Guided policy search. In International Conference on International Conference on Machine Learning (ICML)

  35. [35]

    Levine, S. and Koltun, V. (2013b). Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems (NIPS)

  36. [36]

    Levine, S. and Koltun, V. (2014). Learning complex neural network policies with trajectory optimization. In International Conference on Machine Learning (ICML)

  37. [37]

    Levine, S., Popović, Z., and Koltun, V. (2011). Nonlinear inverse reinforcement learning with Gaussian processes. In Neural Information Processing Systems (NIPS)

  38. [38]

    Minka, T. P. (2001). Expectation propagation for approximate bayesian inference. In Uncertainty in Artificial Intelligence (UAI)

  39. [39]

    Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017a). Bridging the gap between value and policy based reinforcement learning. In arXiv

  40. [40]

    Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017b). Trust-pcl: An off-policy trust region method for continuous control. CoRR , abs/1707.01891

  41. [41]

    Neumann, G. (2011). Variational inference for policy search in changing situations. In International Conference on Machine Learning (ICML)

  42. [42]

    O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2017). Pgq: Combining policy gradient and q-learning. In International Conference on Learning Representations (ICLR)

  43. [43]

    Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In AAAI Conference on Artificial Intelligence (AAAI)

  44. [44]

    Peters, J. and Schaal, S. (2007). Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine Learning (ICML)

  45. [45]

    Rawlik, K., Toussaint, M., and Vijayakumar, S. (2013). On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems (RSS)

  46. [46]

    Sallans, B. and Hinton, G. E. (2004). Reinforcement learning with factored states and actions. Journal of Machine Learning Research , 5

  47. [47]

    Schulman, J., Chen, X., and Abbeel, P. (2017). Equivalence between policy gradients and soft q-learning. In arXiv

  48. [48]

    Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust region policy optimization. In International Conference on Machine Learning (ICML)

  49. [49]

    Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR)

  50. [50]

    Solway, A. and Botvinick, M. (2012). Goal-directed decision making as probabilistic inference: a computational framework and potential neural correlates. Psychol Rev. , 119(1):120--154

  51. [51]

    Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning (ICML)

  52. [52]

    Theodorou, E. A., Buchli, J., and Schaal, S. (2010). Learning policy improvements with path integrals. In International Conference on Artificial Intelligence and Statistics (AISTATS 2010)

  53. [53]

    Todorov, E. (2006). Linearly-solvable markov decision problems. In Advances in Neural Information Processing Systems (NIPS)

  54. [54]

    Todorov, E. (2008). General duality between optimal control and estimation. In Conference on Decision and Control (CDC)

  55. [55]

    Todorov, E. (2010). Policy gradients in linearly-solvable mdps. In Neural Information Processing Systems (NIPS)

  56. [56]

    Toussaint, M. (2009). Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML)

  57. [57]

    Toussaint, M. and Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state markov decision processes. In International Conference on Machine Learning (ICML)

  58. [58]

    Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning , 8(3-4):229--256

  59. [59]

    Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science , 3(3):241--268

  60. [60]

    Wulfmeier, M., Ondruska, P., and Posner, I. (2015). Maximum entropy deep inverse reinforcement learning. In Neural Information Processing Systems Conference, Deep Reinforcement Learning Workshop

  61. [61]

    Ziebart, B. (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy . PhD thesis, Carnegie Mellon University

  62. [62]

    Ziebart, B. D., Bagnell, J. A., and Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. In International Conference on Machine Learning (ICML)

  63. [63]

    Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In International Conference on Artificial Intelligence (AAAI)