pith. machine review for the scientific record.

arxiv: 2605.09157 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

Adam White, Jiamin He, Jincheng Mei, Martha White, Samuel Neumann

Pith reviewed 2026-05-12 04:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: mixture policies · actor-critic · reparameterization estimator · entropy regularization · continuous control · variance reduction · reinforcement learning

The pith

A marginalized reparameterization estimator lets mixture policies match or beat Gaussian performance in entropy-regularized actor-critic methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture policies combine several action distributions to gain more flexibility and better entropy robustness than single-mode policies, yet they rarely appear in leading continuous-control algorithms. The obstacle has been the absence of a low-variance gradient estimator comparable to the reparameterization trick available for Gaussians. The paper introduces a marginalized reparameterization estimator, which integrates out the discrete component choice before differentiation, and proves that it yields lower variance than the conventional likelihood-ratio estimator. Experiments across standard continuous-control benchmarks then show that mixture policies trained with this estimator close the performance gap with Gaussians and sometimes surpass them, turning a theoretical option into a usable alternative.

Core claim

The paper shows that the marginalized reparameterization estimator supplies unbiased gradients of lower variance than likelihood-ratio gradients for mixture policies, allowing entropy-regularized actor-critic agents that use mixtures to reach solution quality and entropy robustness on par with, and in several tasks above, the Gaussian policies that currently dominate practice.

What carries the argument

The marginalized reparameterization (MRP) estimator, which re-expresses the policy gradient by first integrating out the discrete component choice and then differentiating the resulting marginal objective.
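To make the contrast concrete, here is a plausible form of the two estimators, reconstructed from this description and the simulated rebuttal below; the notation (mixture weights w_k, component samplers g_k, objective f) is assumed rather than taken from the paper. For a mixture policy $\pi_\theta(a \mid s) = \sum_k w_k \, \pi_k(a \mid s; \theta)$ whose components admit a reparameterization $a = g_k(\epsilon; \theta)$ with $\epsilon \sim p(\epsilon)$:

\[
\text{LR:}\quad \nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}[f(a)] = \mathbb{E}_{a \sim \pi_\theta}\big[f(a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big],
\]
\[
\text{MRP:}\quad \nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}[f(a)] = \sum_k w_k \, \mathbb{E}_{\epsilon \sim p(\epsilon)}\big[\nabla_\theta f(g_k(\epsilon; \theta))\big].
\]

If the mixture weights are themselves learned, an additional term $\sum_k (\nabla_\theta w_k)\, \mathbb{E}_\epsilon[f(g_k(\epsilon; \theta))]$ appears; how the paper handles weight gradients is not recoverable from this summary.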

If this is right

  • Mixture policies equipped with the MRP estimator significantly outperform the same policies trained with likelihood-ratio gradients.
  • MRP mixture policies reach parity with, and in some environments exceed, the performance of standard Gaussian policies across Gym MuJoCo, DeepMind Control Suite, and MetaWorld.
  • The added representational capacity of mixtures translates into measurable gains once the gradient-variance barrier is removed.
  • Entropy regularization interacts more favorably with the richer support of mixture policies when low-variance gradients are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same marginalization idea may reduce variance for other discrete-continuous hybrid policies where direct reparameterization is unavailable.
  • Tasks whose optimal action distributions are naturally multimodal could become reliably solvable without hand-designed policy classes.
  • The variance reduction may compound with other low-variance techniques such as value-function baselines or control variates.

Load-bearing premise

The estimator can be computed exactly or with negligible bias even when mixture components overlap and when the entropy term couples them.
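A compact way to see that coupling, in SAC-style notation the editor is assuming rather than quoting: the entropy bonus depends on the log of the full mixture density, which does not factor across components, so any approximation made inside the marginalization would enter the entropy gradient through every component at once.

\[
J(\theta) = \mathbb{E}\big[Q(s, a)\big] + \alpha \, \mathbb{E}_s\big[\mathcal{H}(\pi_\theta(\cdot \mid s))\big],
\qquad
\mathcal{H}(\pi_\theta(\cdot \mid s)) = -\, \mathbb{E}_{a \sim \pi_\theta}\Big[\log \sum_k w_k \, \pi_k(a \mid s; \theta)\Big].
\]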

What would settle it

A direct comparison on a simple two-component Gaussian mixture policy showing that the empirical gradient variance of the MRP estimator is not lower than that of the likelihood-ratio estimator would falsify the central variance-reduction claim.
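A minimal sketch of that test, assuming a one-dimensional bandit with a hand-picked bimodal reward; the reward f, the parameter values, and the fixed mixture weights below are illustrative choices, not the paper's setup:

```python
import torch

torch.manual_seed(0)

# Two-component Gaussian mixture policy over a 1-D action.
# Weights are fixed; means and log-stds are the learnable parameters.
w = torch.tensor([0.5, 0.5])
mu = torch.tensor([-1.0, 1.0], requires_grad=True)
log_std = torch.zeros(2, requires_grad=True)

def f(a):
    # Illustrative bimodal "reward" standing in for the critic Q(s, .).
    return torch.exp(-(a - 2.0) ** 2) + 0.5 * torch.exp(-(a + 2.0) ** 2)

def lr_grad(n):
    # Likelihood-ratio (score-function) estimator: E[f(a) * grad log pi(a)].
    with torch.no_grad():
        k = torch.multinomial(w, n, replacement=True)
        a = mu[k] + log_std.exp()[k] * torch.randn(n)
    comp = torch.distributions.Normal(mu, log_std.exp())
    log_pi = torch.logsumexp(w.log() + comp.log_prob(a.unsqueeze(-1)), dim=-1)
    loss = (f(a) * log_pi).mean()
    return torch.autograd.grad(loss, (mu, log_std))

def mrp_grad(n):
    # Marginalized reparameterization: integrate out the component choice,
    # then differentiate through each component's reparameterized sample.
    eps = torch.randn(n, 1)
    a = mu + log_std.exp() * eps      # shape (n, 2): one path per component
    loss = (w * f(a)).sum(-1).mean()  # weight-average before differentiating
    return torch.autograd.grad(loss, (mu, log_std))

def var_trace(grad_fn, batches=500, n=32):
    # Trace of the empirical covariance of the stacked gradient vector.
    gs = [torch.cat(list(grad_fn(n))) for _ in range(batches)]
    return torch.stack(gs).var(dim=0).sum().item()

print("LR  gradient variance (trace):", var_trace(lr_grad))
print("MRP gradient variance (trace):", var_trace(mrp_grad))
```

On a smooth objective like this one the MRP trace should come out markedly smaller; an outcome where it does not would be exactly the falsifying evidence described above.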

Figures

Figures reproduced from arXiv: 2605.09157 by Adam White, Jiamin He, Jincheng Mei, Martha White, Samuel Neumann.

Figure 1: Stationary points of Gaussian and two-component Gaussian Mixture (GM) policies in a bimodal …
Figure 2: Left: examples of synthetic multimodal bandits and learning curves on them. Right: average performance across 100 bandits at different entropy scales. All shaded areas and error bars show 95% bootstrap confidence intervals (CIs).
Figure 3: Learning curves in selected MetaWorld tasks where the performance gap between mixture policies …
Figure 4: Performance comparison of SGM-MRP vs. SG-RP across common benchmarks. Positive values indicate SGM-MRP outperforms SG-RP. Bars represent the mean performance difference across tasks within each benchmark, with error bars denoting 95% bootstrap CIs. Individual points represent the performance delta for each specific task.
Figure 5: Performance with different target entropy coefficients for mixture policies and base policies in five …
Figure 6: Learning and sensitivity curves for classic control environments with unshaped rewards. …
Figure 7: Learning curves, action-value estimates, and policy densities at a starting state from a sample …
Figure 8: Performance comparison of different estimators …
Figure 9: Expected reward and objective value of the regularized objective's stationary points found in …
Figure 10: Reward function and the corresponding learning curves for each synthetic bandit. Each bandit …
Figure 11: Average performance across 100 bandits at different entropy scales. The shaded areas and error bars show 95% bootstrap CIs across 100 bandits (10 runs each). Results comparing different estimators for mixture policies. …
Figure 12: Learning curves in 27 environments from Gym MuJoCo, DMC, and MetaWorld.
Figure 13: Learning curves in 30 additional environments from MetaWorld and MyoSuite.
Figure 14: Learning curves in two MetaWorld environments where mixture policies fail to achieve meaningful …
Figure 15: Performance with different target entropy coefficients for mixture policies and base policies in the …
Figure 16: Learning and sensitivity curves for classic control environments with shaped rewards.
Figure 17: Learning curves of the best hyperparameter setting of different estimators for mixture policies in …
Figure 18: State visitation of SG-RP and SGM-MRP during early training in …
Figure 19: The starting state at the bottom in MountainCar.
Figure 20: Learning curve, action-value estimates, and policy density at a starting state from a sample run …
Figure 21: Mixture policy statistics during training in classic control environments. Results are averaged …
Figure 22: Comparative distribution (30 runs) of average return for squashed Gaussian mixture (SGM) policies in Gym MuJoCo environments, comparing the LR and the ERP estimators.
Figure 23: Gradient covariance trace of different estimators (col 1), percentage of seeds in which gradient …
Figure 24: Learning curves in two robotic Fetch environments from …
Figure 25: Sensitivity curves for mixture policies with different numbers of components. The error bars plot …
Figure 26: Learning curves for mixture policies with different numbers of components in selected MetaWorld …
Figure 27: Learning and sensitivity curves for Cauchy policies and Cauchy Mixture (CM) policies in classic …
Figure 28: Sensitivity analysis of the GumbelRP estimator to the temperature parameter. The shaded areas …
Original abstract

Mixture policies theoretically offer greater flexibility than unimodal policies in continuous action reinforcement learning, but the practical benefits of this complexity remain elusive. Mixture policies are notably absent from most state-of-the-art algorithms, raising a fundamental question: Is the added representational overhead useful? We show that increased flexibility can theoretically enhance solution quality and entropy robustness. Yet standard algorithms like SAC do not leverage these advantages. A core issue is the lack of a low-variance reparameterization trick for mixtures, a luxury Gaussian policies enjoy. We propose a marginalized reparameterization (MRP) estimator to address this, proving it offers lower variance than the standard likelihood-ratio (LR) approach. Our experiments across Gym MuJoCo, DeepMind Control Suite, and MetaWorld show that MRP mixture policies significantly outperform their LR ones, and reach parity (sometimes better) with Gaussian counterparts. In addition, we do find several cases where MRP mixture policies exhibit clear empirical advantages. In this paper, we provide a clearer understanding of the trade-offs involved, elevating MRP mixture policies from theoretical curiosity to a practical tool.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that mixture policies provide theoretical advantages in flexibility and entropy robustness for entropy-regularized actor-critic methods like SAC, but are underused due to the absence of a low-variance reparameterization gradient estimator. The authors introduce a marginalized reparameterization (MRP) estimator, prove that it has strictly lower variance than the standard likelihood-ratio (LR) estimator, and present experiments on Gym MuJoCo, DeepMind Control Suite, and MetaWorld showing that MRP mixture policies outperform LR mixtures and reach parity (sometimes better) with Gaussian policies, with some environments exhibiting clear empirical gains.

Significance. If the MRP estimator delivers unbiased lower-variance gradients for mixture policies in the presence of entropy regularization and overlapping component supports, the work would make mixture policies a practical alternative to unimodal Gaussians, enabling more expressive policies without sacrificing sample efficiency. The multi-benchmark evaluation and explicit comparison to both LR mixtures and Gaussians strengthen the empirical case for revisiting mixture policies.

major comments (2)
  1. [§4] §4 (MRP estimator and variance proof): The claimed proof that MRP has lower variance than LR must explicitly verify that marginalization remains unbiased (or that any bias is negligible) when mixture components have overlapping supports, which is typical for learned policies; the entropy-regularized objective couples the entropy bonus to the full mixture density, so any inner approximation in the marginalization step can interact with the regularization term in ways not covered by a basic variance comparison between estimators.
  2. [§5] §5 (Experiments): The statement that MRP mixtures 'reach parity (sometimes better)' with Gaussian counterparts and exhibit 'clear empirical advantages' in 'several cases' requires quantification of effect sizes, number of environments showing gains, and statistical significance tests across seeds; without these, the practical superiority claim rests on qualitative description rather than load-bearing evidence.
minor comments (2)
  1. Notation for the mixture density and the marginalization operator should be introduced earlier and used consistently to avoid ambiguity when the entropy term is written in terms of the full mixture.
  2. The abstract and introduction would benefit from a short explicit statement of the assumptions under which the variance proof holds (e.g., exact marginalization, non-overlapping supports).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (MRP estimator and variance proof): The claimed proof that MRP has lower variance than LR must explicitly verify that marginalization remains unbiased (or that any bias is negligible) when mixture components have overlapping supports, which is typical for learned policies; the entropy-regularized objective couples the entropy bonus to the full mixture density, so any inner approximation in the marginalization step can interact with the regularization term in ways not covered by a basic variance comparison between estimators.

    Authors: We appreciate this observation. The MRP estimator marginalizes exactly over the discrete component choice using the law of total expectation: the reparameterized sample is drawn conditionally on the component and then averaged with the mixture weights, yielding an unbiased estimator of the policy gradient for the mixture density irrespective of support overlap. The entropy term is computed directly from the closed-form mixture log-density log(∑_k w_k π_k(a|s)) and its gradient is obtained analytically without approximation; the MRP estimator is applied solely to the expected-return portion of the objective. We will revise §4 to include an explicit unbiasedness lemma under overlapping supports together with a short derivation separating the entropy gradient from the MRP term (the identity is written out after these responses). revision: yes

  2. Referee: [§5] §5 (Experiments): The statement that MRP mixtures 'reach parity (sometimes better)' with Gaussian counterparts and exhibit 'clear empirical advantages' in 'several cases' requires quantification of effect sizes, number of environments showing gains, and statistical significance tests across seeds; without these, the practical superiority claim rests on qualitative description rather than load-bearing evidence.

    Authors: We agree that quantitative support is necessary. In the revised version we will augment §5 and the appendix with tables that report, for every environment and seed, mean return ± standard deviation, Cohen’s d effect sizes relative to the Gaussian baseline, the exact count of environments in which MRP mixtures outperform Gaussians, and p-values from paired t-tests (or Wilcoxon signed-rank tests when normality assumptions are violated) across the 5–10 random seeds used in each benchmark. These additions will replace the current qualitative phrasing with concrete statistical evidence. revision: yes
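The identity invoked in response 1, written out under assumed notation as an editorial sketch rather than the paper's lemma: for any mixture with component reparameterizations $a = g_k(\epsilon; \theta)$,

\[
\mathbb{E}_{a \sim \pi_\theta}[f(a)] = \sum_k w_k \, \mathbb{E}_{a \sim \pi_k}[f(a)] = \sum_k w_k \, \mathbb{E}_{\epsilon \sim p(\epsilon)}\big[f(g_k(\epsilon; \theta))\big],
\]

an exact identity that holds whether or not component supports overlap; differentiating both sides under standard regularity conditions gives the unbiasedness claimed in the response, with the entropy gradient handled separately through the closed-form mixture log-density.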

Circularity Check

0 steps flagged

No circularity: MRP variance proof and experiments are self-contained

full rationale

The paper introduces the MRP estimator as a new technical contribution and states that it proves lower variance than the LR estimator. This proof is presented as an independent mathematical argument within the manuscript rather than reducing to a fitted parameter, a self-citation chain, or an ansatz imported from prior work by the same authors. The experimental comparisons are downstream validations of the estimator rather than inputs that define the claimed variance reduction. No load-bearing step in the derivation chain collapses to a tautology or to the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard RL assumptions such as the validity of the policy gradient theorem and the entropy-regularization framework. The MRP estimator is an invented technique whose independent evidence is the claimed variance proof and the experiments.

axioms (1)
  • domain assumption Standard assumptions of the policy gradient theorem and entropy-regularized objective in continuous action spaces hold.
    Invoked implicitly as the foundation for actor-critic methods like SAC.
invented entities (1)
  • Marginalized reparameterization (MRP) estimator · no independent evidence
    purpose: To enable low-variance gradient estimation for mixture policies by marginalizing over components.
    New technique introduced to solve the reparameterization issue for mixtures.

pith-pipeline@v0.9.0 · 5490 in / 1283 out tokens · 39390 ms · 2026-05-12T04:27:02.755417+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 8 internal anchors

  1. [1]

    Reinforcement learning: Theory and algorithms

    Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep., 32: 96, 2019

  2. [2]

    Understanding the impact of entropy on policy optimization

    Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International conference on machine learning, pp. 151–160. PMLR, 2019

  3. [3]

    Maximum entropy reinforcement learning with mixture policies

    Nir Baram, Guy Tennenholtz, and Shie Mannor. Maximum entropy reinforcement learning with mixture policies. arXiv preprint arXiv:2103.10176, 2021

  4. [4]

    On the sample complexity and metastability of heavy-tailed policy search in continuous control

    Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, and Alec Koppel. On the sample complexity and metastability of heavy-tailed policy search in continuous control. Journal of Machine Learning Research, 25(39): 1–58, 2024

  5. [5]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  6. [6]

    JAX: composable transformations of Python+NumPy programs, 2018

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax

  7. [7]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

  8. [8]

    On upper and lower bounds for the variance of a function of a random variable

    Theophilos Cacoullos. On upper and lower bounds for the variance of a function of a random variable. The Annals of Probability, 10(3): 799–809, 1982

  9. [9]

    MyoSuite: A contact-rich simulation suite for musculoskeletal motor control

    Vittorio Caggiano, Huawei Wang, Guillaume Durandau, Massimo Sartori, and Vikash Kumar. MyoSuite: A contact-rich simulation suite for musculoskeletal motor control. In Learning for Dynamics and Control Conference, pp. 492–507. PMLR, 2022

  10. [10]

    Specializing versatile skill libraries using local mixture of experts

    Onur Celik, Dongzhuoran Zhou, Ge Li, Philipp Becker, and Gerhard Neumann. Specializing versatile skill libraries using local mixture of experts. In Conference on Robot Learning, 2022

  11. [11]

    Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution

    Po-Wei Chou, Daniel Maturana, and Sebastian Scherer. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In International conference on machine learning, pp. 834–843. PMLR, 2017

  12. [12]

    Hierarchical relative entropy policy search

    Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pp. 273–281. PMLR, 2012

  13. [13]

    Model-free reinforcement learning with continuous action in practice

    Thomas Degris, Patrick M Pilarski, and Richard S Sutton. Model-free reinforcement learning with continuous action in practice. In 2012 American control conference (ACC), pp. 2177–2182. IEEE, 2012

  14. [14]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1587–1596. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v8...

  15. [15]

    Uncertainty in deep learning

    Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016

  16. [16]

    Acquiring diverse robot skills via maximum entropy deep reinforcement learning

    Tuomas Haarnoja. Acquiring diverse robot skills via maximum entropy deep reinforcement learning. PhD thesis, University of California, Berkeley, 2018

  17. [17]

    Reinforcement learning with deep energy-based policies

    Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International conference on machine learning, pp. 1352–1361. PMLR, 2017

  18. [18]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018a

  19. [19]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018b

  20. [20]

    Soft Actor-Critic Algorithms and Applications

    Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018c

  21. [21]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on machine learning, pp. 2555–2565. PMLR, 2019

  22. [22]

    Learning continuous control policies by stochastic value gradients

    Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. Advances in neural information processing systems, 28, 2015

  23. [23]

    Off-policy maximum entropy reinforcement learning: Soft actor-critic with advantage weighted mixture policy (sac-awmp)

    Zhimin Hou, Kuangen Zhang, Yi Wan, Dongyu Li, Chenglong Fu, and Haoyong Yu. Off-policy maximum entropy reinforcement learning: Soft actor-critic with advantage weighted mixture policy (sac-awmp). arXiv preprint arXiv:2002.02829, 2020

  24. [24]

    Generalization in dexterous manipulation via geometry-aware multi-task learning

    Wenlong Huang, Igor Mordatch, Pieter Abbeel, and Deepak Pathak. Generalization in dexterous manipulation via geometry-aware multi-task learning. arXiv preprint arXiv:2111.03062, 2021

  25. [25]

    Categorical reparameterization with gumbel-softmax

    Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2016

  26. [26]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  27. [27]

    Student-t policy in reinforcement learning to acquire global optimum of robot control

    Taisuke Kobayashi. Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence, 49(12): 4335–4347, 2019

  28. [28]

    Model-free policy learning with reward gradients

    Qingfeng Lan, Samuele Tosatto, Homayoon Farrahi, and Rupam Mahmood. Model-free policy learning with reward gradients. In International Conference on Artificial Intelligence and Statistics, pp. 4217–4234. PMLR, 2022

  29. [29]

    Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model

    Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33: 741–752, 2020

  30. [30]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  31. [31]

    The concrete distribution: A continuous relaxation of discrete random variables

    Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2016

  32. [32]

    Leveraging exploration in off-policy algorithms via normalizing flows

    Bogdan Mazoure, Thang Doan, Audrey Durand, Joelle Pineau, and R Devon Hjelm. Leveraging exploration in off-policy algorithms via normalizing flows. In Conference on Robot Learning, pp. 430–444. PMLR, 2020

  33. [33]

    S2AC: Energy-based reinforcement learning with Stein soft actor critic

    Safa Messaoud, Billel Mokeddem, Zhenghai Xue, Linsey Pang, Bo An, Haipeng Chen, and Sanjay Chawla. S2AC: Energy-based reinforcement learning with Stein soft actor critic. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=rAHcTCMaLc

  34. [34]

    Reducing reparameterization gradient variance

    Andrew Miller, Nick Foti, Alexander D'Amour, and Ryan P Adams. Reducing reparameterization gradient variance. Advances in Neural Information Processing Systems, 30, 2017

  35. [35]

    Monte carlo gradient estimation in machine learning

    Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132): 1–62, 2020

  36. [36]

    Robot skill adaptation via soft actor-critic gaussian mixture models

    Iman Nematollahi, Erick Rosete-Beas, Adrian Röfer, Tim Welschehold, Abhinav Valada, and Wolfram Burgard. Robot skill adaptation via soft actor-critic Gaussian mixture models. In 2022 International Conference on Robotics and Automation (ICRA), pp. 8651–8657. IEEE, 2022

  37. [37]

    Greedy actor-critic: A new conditional cross-entropy method for policy improvement

    Samuel Neumann, Sungsu Lim, Ajin George Joseph, Yangchen Pan, Adam White, and Martha White. Greedy actor-critic: A new conditional cross-entropy method for policy improvement. In The Eleventh International Conference on Learning Representations, 2022

  38. [38]

    PyTorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  39. [39]

    Multi-goal reinforcement learning: Challenging robotics environments and request for research

    Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018

  40. [40]

    Probabilistic mixture-of-experts for efficient deep reinforcement learning

    Jie Ren, Yewen Li, Zihan Ding, Wei Pan, and Hao Dong. Probabilistic mixture-of-experts for efficient deep reinforcement learning. arXiv preprint arXiv:2104.09122, 2021

  41. [41]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  42. [42]

    Strength through diversity: Robust behavior learning via mixture policies

    Tim Seyde, Wilko Schwarting, Igor Gilitschenski, Markus Wulfmeier, and Daniela Rus. Strength through diversity: Robust behavior learning via mixture policies. In Conference on Robot Learning, pp. 1144–1155. PMLR, 2022

  43. [43]

    Reinforcement learning: An introduction

    Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018

  44. [44]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999

  45. [45]

    Implicit policy for reinforcement learning

    Yunhao Tang and Shipra Agrawal. Implicit policy for reinforcement learning. arXiv preprint arXiv:1806.06798, 2018

  46. [46]

    DeepMind Control Suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018

  47. [47]

    SciPy 1.0: fundamental algorithms for scientific computing in Python

    Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3): 261–272, 2020

  48. [48]

    Diffusion policies as an expressive policy class for offline reinforcement learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=AHvFDPi-FA

  49. [49]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8: 229–256, 1992

  50. [50]

    Variance reduction properties of the reparameterization trick

    Ming Xu, Matias Quiroz, Robert Kohn, and Scott A Sisson. Variance reduction properties of the reparameterization trick. In The 22nd international conference on artificial intelligence and statistics, pp. 2711–2720. PMLR, 2019

  51. [51]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pp. 1094–1100. PMLR, 2020

  52. [52]

    Latent state marginalization as a low-cost approach for improving exploration

    Dinghuai Zhang, Aaron Courville, Yoshua Bengio, Qinqing Zheng, Amy Zhang, and Ricky TQ Chen. Latent state marginalization as a low-cost approach for improving exploration. In The Eleventh International Conference on Learning Representations, 2023

  53. [53]

    Model-based reparameterization policy gradient methods: Theory and practical algorithms

    Shenao Zhang, Boyi Liu, Zhaoran Wang, and Tuo Zhao. Model-based reparameterization policy gradient methods: Theory and practical algorithms. Advances in Neural Information Processing Systems, 36, 2024