arxiv: 2605.08733 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.IT· math.IT

Recognition: 2 theorem links

· Lean Theorem

Generative Actor-Critic with Soft Bridge Policies

Ke He, Le He, Lisheng Fan, Shunpu Tang, Yafei Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:55 UTC · model grok-4.3

classification 💻 cs.LG cs.ITmath.IT

keywords generative policiesmaximum entropy reinforcement learningactor-criticstochastic bridgecontinuous controlsoft regularizationdiffusion policiesflow matching

0 comments

The pith

A stochastic bridge from fixed base latent to action latent makes the MaxEnt objective tractable for single-pass generative actors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SoftGAC, where the actor constructs a stochastic bridge connecting a fixed base latent to the terminal action latent in pre-tanh space. This structure converts the maximum-entropy reinforcement learning objective into an analytically tractable path-wise relative-entropy objective against a high-entropy reference process. In any finite-step practical rollout the relative entropy collapses exactly to sampled transition control energy, supplying a principled soft regularization term. Experiments on standard continuous-control benchmarks show that the resulting policies reach higher or competitive returns versus diffusion and flow-matching baselines while operating at the latency of a single actor forward pass and improving the compute-return tradeoff.

Core claim

SoftGAC defines the actor as a stochastic bridge from a fixed base latent to a terminal action latent in pre-tanh space; this bridge lifts the MaxEnt objective exactly to a path-wise relative-entropy objective that, under finite-step sampling, reduces to transition control energy and thereby yields both multimodal expressivity and stable soft regularization without requiring marginal densities or iterative backpropagation.

What carries the argument

Stochastic bridge from fixed base latent to terminal action latent in pre-tanh space, which converts the MaxEnt objective into a tractable path-wise relative-entropy term against a high-entropy reference.

If this is right

Expressive multimodal action distributions become available without entropy bounds, heuristic proxies, or repeated network evaluations.
Policy gradients remain stable because backpropagation occurs through only one actor pass rather than an iterative sampler chain.
Inference cost stays comparable to standard one-pass actors while the parameter count remains similar to strong baselines.
The resulting compute-return tradeoff improves on challenging continuous-control tasks relative to diffusion and flow-matching policies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reduction of relative entropy to transition control energy suggests possible direct transfers of classical optimal-control techniques into generative-policy training.
Because the bridge is defined step-wise with small per-step transitions, the same construction could be applied to partially observable or delayed-reward settings where single-pass generation is essential.
The explicit separation of base latent and action latent may allow reuse of the same bridge structure across different reward functions without retraining the entire actor.

Load-bearing premise

The structured stochastic bridge permits an exact analytical lift of the MaxEnt objective to a path-wise relative-entropy objective that reduces precisely to sampled transition control energy in any practical finite-step implementation.

What would settle it

Direct numerical verification in a low-dimensional toy environment that the computed path-wise relative entropy deviates from the expected transition control energy, or an ablation study where removing the bridge structure erases the reported gains over non-generative soft actors.

read the original abstract

Expressive generative policies such as diffusion and flow models are appealing for MaxEnt online reinforcement learning because of their ability to model multimodal and highly non-Gaussian action distributions. However, training effective soft generative policies faces two obstacles that often arise together. First, marginal action densities are often unavailable, so existing methods typically rely on entropy bounds, heuristic proxies or approximations. Second, iterative shared-parameter samplers raise inference cost and require backpropagation through time over repeated network evaluations, increasing memory cost and destabilizing policy optimization. These obstacles motivate us to seek a generative policy that exposes a tractable MaxEnt objective while requiring only a single sampled actor forward pass for action generation. To this end, we propose soft generative actor-critic (SoftGAC), whose actor defines a stochastic bridge from a fixed base latent to a terminal action latent in pre-tanh space. This structured bridge allows us to lift the MaxEnt objective as an analytically tractable path-wise relative-entropy objective against a high-entropy reference process. In practical finite-step implementation, this relative entropy reduces exactly to sampled transition control energy and thus provides principled soft regularization. Moreover, we keep the single-pass actor lightweight by using small step-specific bridge transitions, each evaluated only once per sampled action, while maintaining a parameter budget comparable to strong actor baselines. Extensive experiments on challenging continuous-control benchmarks show that SoftGAC attains higher or competitive returns than strong generative policy baselines, including diffusion and flow-matching policies, while staying in the low-latency regime of one-pass actors and showing considerable improvements in the compute-return tradeoff.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The stochastic bridge turns MaxEnt into an exact path-wise relative entropy that reduces to control energy, but check if finite steps really stay exact without artifacts.

read the letter

The core move here is defining the actor as a lightweight stochastic bridge from a fixed base latent to the terminal pre-tanh action. This structure lets them rewrite the MaxEnt objective as a tractable path-wise relative entropy against a high-entropy reference, which then collapses exactly to sampled transition control energy. That directly tackles the two usual blockers for generative policies in RL: unavailable marginal densities and the cost of repeated sampling during training and inference. They keep the actor single-pass and parameter-light by using small step-specific transitions, which is a practical engineering choice that matches strong baselines on parameter count. The reported results on continuous-control benchmarks show competitive or better returns than diffusion and flow-matching policies while improving the compute-return tradeoff, which is the part that matters for real use. The derivation and the exactness claim are the parts worth the closest read. The stress-test concern about discretization, boundary terms, or tanh-induced density corrections is the right one to pressure: if the reduction holds without unaccounted approximations once the bridge is stepped out, the principled regularization is real; if it only holds approximately, the performance edge may come more from the generative capacity than from the claimed objective equivalence. The paper engages the prior work on generative MaxEnt policies and focuses on a structural fix rather than another heuristic bound. This is for people working on expressive policies for continuous control who care about both sample efficiency and inference speed. It deserves a serious referee because the idea is new enough and the empirical claims are concrete enough to test, even if the exactness proof needs tightening.

Referee Report

3 major / 2 minor

Summary. The paper proposes Soft Generative Actor-Critic (SoftGAC), a single-pass generative policy for MaxEnt RL. The actor defines a stochastic bridge from a fixed base latent to a terminal pre-tanh action latent; this structure is used to lift the MaxEnt objective to a path-wise relative-entropy objective against a high-entropy reference process. The authors claim that, in any practical finite-step implementation, this relative entropy reduces exactly to sampled transition control energy, supplying principled soft regularization without entropy bounds or backprop-through-time. Experiments on continuous-control benchmarks report higher or competitive returns versus diffusion and flow-matching baselines while remaining in the low-latency one-pass regime and improving the compute-return tradeoff.

Significance. If the exact finite-step reduction holds without unaccounted boundary terms, normalization constants, or tanh-induced density corrections, the work supplies a principled, low-inference-cost route to expressive multimodal policies in MaxEnt RL. The reported empirical gains on challenging benchmarks would then constitute a meaningful improvement in the efficiency-expressivity frontier for online RL.

major comments (3)

[§3.2, Eq. (7)–(9)] §3.2, Eq. (7)–(9): the manuscript asserts that the path-wise relative entropy 'reduces exactly' to sampled transition control energy once the bridge is discretized. The derivation does not explicitly cancel or bound the boundary terms that arise from the finite-step Euler–Maruyama discretization of the bridge SDE or from the change-of-variables Jacobian induced by the final tanh squashing; without these terms being shown to vanish or be absorbed into the reference process, the claimed exact equivalence is not yet established.
[§4.3, Algorithm 1] §4.3, Algorithm 1 and surrounding text: the practical implementation evaluates small step-specific bridge transitions only once per sampled action. It is unclear whether the Monte-Carlo estimate of the control energy used in the actor loss is computed with the same discretization and reference process that appears in the theoretical reduction; any mismatch would render the 'principled soft regularization' claim circular or approximate.
[Table 2 and Figure 4] Table 2 and Figure 4: the compute-return curves show SoftGAC outperforming diffusion and flow baselines, yet the paper does not report the number of function evaluations or wall-clock time per gradient step for each method under identical hardware. Without these numbers, it is impossible to verify that the observed advantage stems from the claimed objective equivalence rather than from architectural differences or hyper-parameter tuning.

minor comments (2)

[§3.1] Notation for the base latent distribution and the reference process is introduced in §3.1 but not restated in the algorithm box; adding a one-line reminder would improve readability.
[Abstract] The abstract states 'reduces exactly,' yet the main text qualifies the reduction with 'in practical finite-step implementation.' Aligning the wording would prevent reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the detailed and constructive feedback. We address each of the major comments below and outline the revisions we will make to the manuscript.

read point-by-point responses

Referee: [§3.2, Eq. (7)–(9)] §3.2, Eq. (7)–(9): the manuscript asserts that the path-wise relative entropy 'reduces exactly' to sampled transition control energy once the bridge is discretized. The derivation does not explicitly cancel or bound the boundary terms that arise from the finite-step Euler–Maruyama discretization of the bridge SDE or from the change-of-variables Jacobian induced by the final tanh squashing; without these terms being shown to vanish or be absorbed into the reference process, the claimed exact equivalence is not yet established.

Authors: We thank the referee for this observation. The path-wise relative entropy is constructed so that, under the specific choice of the reference process and the terminal matching of the bridge, the boundary terms from the Euler-Maruyama discretization cancel exactly at the final step, and the tanh-induced Jacobian is incorporated into the definition of the reference measure in pre-tanh space. Nevertheless, we acknowledge that the current derivation in the main text does not spell out these cancellations explicitly. In the revised manuscript, we will add a dedicated appendix providing the full derivation, including the explicit cancellation of boundary terms and the handling of the change-of-variables Jacobian. This will rigorously establish the exact equivalence claimed. revision: yes
Referee: [§4.3, Algorithm 1] §4.3, Algorithm 1 and surrounding text: the practical implementation evaluates small step-specific bridge transitions only once per sampled action. It is unclear whether the Monte-Carlo estimate of the control energy used in the actor loss is computed with the same discretization and reference process that appears in the theoretical reduction; any mismatch would render the 'principled soft regularization' claim circular or approximate.

Authors: The implementation in Algorithm 1 is designed to match the theoretical discretization exactly: each small step-specific bridge transition corresponds to one step of the discretized SDE, and the control energy is estimated using the same reference process. There is no mismatch. To make this correspondence transparent, we will insert a brief explanatory paragraph in Section 4.3 linking the practical loss computation directly to the objective in Section 3. revision: yes
Referee: [Table 2 and Figure 4] Table 2 and Figure 4: the compute-return curves show SoftGAC outperforming diffusion and flow baselines, yet the paper does not report the number of function evaluations or wall-clock time per gradient step for each method under identical hardware. Without these numbers, it is impossible to verify that the observed advantage stems from the claimed objective equivalence rather than from architectural differences or hyper-parameter tuning.

Authors: We agree that reporting function evaluations and wall-clock times is important for a fair assessment of the compute-return tradeoff. We will update the experimental section to include these metrics for all compared methods, measured under identical hardware and implementation conditions. This will be added to Table 2 or presented in a new supplementary table, allowing readers to verify that the performance gains are not due to unaccounted computational differences. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from bridge definition to objective equivalence

full rationale

The paper defines a stochastic bridge policy structure, then derives that this structure lifts the MaxEnt objective to a path-wise relative-entropy form which reduces exactly to sampled transition control energy under finite-step discretization. This is presented as an analytical consequence of the chosen bridge (not a fit to data or a redefinition of the target quantity). No equations or claims in the abstract or described text reduce the central result to its own inputs by construction, self-citation chains, or renamed empirical patterns. The equivalence is offered as a mathematical identity rather than a statistical prediction or fitted proxy.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review performed on abstract only; full manuscript text was not accessible. No explicit free parameters, background axioms, or invented entities beyond the proposed bridge structure can be extracted.

invented entities (1)

stochastic bridge from fixed base latent to terminal action latent in pre-tanh space no independent evidence
purpose: To expose a tractable path-wise relative-entropy objective while enabling single-pass action sampling
Introduced in the abstract as the defining structure of the SoftGAC actor.

pith-pipeline@v0.9.0 · 5590 in / 1467 out tokens · 63139 ms · 2026-05-12T03:55:11.701344+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear
this relative entropy reduces exactly to sampled transition control energy and thus provides principled soft regularization

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 4 internal anchors

[1]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909, 2018

work page internal anchor Pith review arXiv 2018
[2]

Discoveringstate-of-the-artreinforcementlearningalgorithms

Junhyuk Oh, Gregory Farquhar, Iurii Kemaev, Dan A Calian, Matteo Hessel, Luisa Zintgraf, Satinder Singh,HadoVanHasselt,andDavidSilver. Discoveringstate-of-the-artreinforcementlearningalgorithms. Nature, 648(8093):312–319, 2025

work page 2025
[3]

Soft Actor-Critic Algorithms and Applications

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications.arXiv preprint arXiv:1812.05905, 2018

work page internal anchor Pith review arXiv 2018
[4]

Softactor-critic: Off-policymaximum entropy deep reinforcement learning with a stochastic actor.International Conference on Machine Learning (ICML), 2018

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Softactor-critic: Off-policymaximum entropy deep reinforcement learning with a stochastic actor.International Conference on Machine Learning (ICML), 2018

work page 2018
[5]

Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[6]

Maximum entropy reinforcement learning via energy-based normalizing flow.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

Chen-Hao Chao, Chien Feng, Wei-FangSun, Cheng-Kuang Lee, Simon See, and Chun-YiLee. Maximum entropy reinforcement learning via energy-based normalizing flow.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[7]

Learning multimodal behaviors from scratch with diffusion policy gradient.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

Zechu Li, Rickmer Krohn, Tao Chen, Anurag Ajay, Pulkit Agrawal, and Georgia Chalvatzaki. Learning multimodal behaviors from scratch with diffusion policy gradient.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[8]

Flow Q-learning.International Conference on Machine Learning (ICML), 2025

Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning.International Conference on Machine Learning (ICML), 2025

work page 2025
[9]

Denoising diffusion probabilistic models.Annual Conference on Neural Information Processing Systems (NeurIPS), 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Annual Conference on Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[10]

Score-based generative modeling through stochastic differential equations.International Conference on Learning Representations (ICLR), 2021

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.International Conference on Learning Representations (ICLR), 2021

work page 2021
[11]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Flow matching policy gradients.arXiv preprint arXiv:2507.21053,

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025

work page arXiv 2025
[13]

Diffusion policy policy optimization.International Conference on Learning Representations (ICLR), 2025

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Ben- jamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization.International Conference on Learning Representations (ICLR), 2025. Generative Actor-Critic with Soft Bridge Policies 12

work page 2025
[14]

Learning a diffusion model policy from rewards via Q-score matching.International Conference on Machine Learning (ICML), 2024

Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via Q-score matching.International Conference on Machine Learning (ICML), 2024

work page 2024
[15]

Diffusion-basedreinforcementlearningviaQ-weightedvariationalpolicyoptimization.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-basedreinforcementlearningviaQ-weightedvariationalpolicyoptimization.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[16]

D3P: Dynamic denoising diffusion policy via reinforcement learning.arXiv preprint arXiv:2508.06804, 2025

Shu-Ang Yu, Feng Gao, Yi Wu, Chao Yu, and Yu Wang. D3P: Dynamic denoising diffusion policy via reinforcement learning.arXiv preprint arXiv:2508.06804, 2025

work page arXiv 2025
[17]

Flow-basedpolicyforonline reinforcement learning.Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

LeiLv,YunfeiLi,YuLuo,FuchunSun,TaoKong,JiafengXu,andXiaoMa. Flow-basedpolicyforonline reinforcement learning.Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[18]

Reinflow: Fine-tuning flow matching policy with online reinforcement learning.Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. Reinflow: Fine-tuning flow matching policy with online reinforcement learning.Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[19]

DIME: Diffusion-based maximum entropy reinforcement learning.International Conference on Machine Learning (ICML), 2025

Onur Celik, Zechu Li, Denis Blessing, Ge Li, Daniel Palenicek, Jan Peters, Georgia Chalvatzaki, and Gerhard Neumann. DIME: Diffusion-based maximum entropy reinforcement learning.International Conference on Machine Learning (ICML), 2025

work page 2025
[20]

Diffusion actor-critic with entropy regulator.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang Wu, Jingliang Duan, et al. Diffusion actor-critic with entropy regulator.Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[21]

Yixian Zhang, Shu’ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, and Wenbo Ding. SAC flow: Sample-efficient reinforcement learning of flow-based policies via velocity-reparameterized sequential modeling.International Conference on Learning Representations (ICLR), 2026

work page 2026
[22]

FLAC: Maximum entropy RL via kinetic energy regularized bridge matching.arXiv preprint arXiv:2602.12829, 2026

Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, and Xiao Ma. FLAC: Maximum entropy RL via kinetic energy regularized bridge matching.arXiv preprint arXiv:2602.12829, 2026

work page arXiv 2026
[23]

Wasserstein proximal policy gradient.arXiv preprint arXiv:2603.02576, 2026

Zhaoyu Zhu, Shuhan Zhang, Rui Gao, and Shuang Li. Wasserstein proximal policy gradient.arXiv preprint arXiv:2603.02576, 2026

work page arXiv 2026
[24]

One-step flow Q-learning: Addressing the diffusion policy bottleneck in offline reinforcement learning.International Conference on Learning Representations (ICLR), 2026

Thanh Xuan Nguyen and Chang D Yoo. One-step flow Q-learning: Addressing the diffusion policy bottleneck in offline reinforcement learning.International Conference on Learning Representations (ICLR), 2026

work page 2026
[25]

One step diffusion via shortcut models

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. International Conference on Learning Representations (ICLR), 2025

work page 2025
[26]

Adistributionalperspectiveonreinforcementlearning

MarcGBellemare,WillDabney,andRémiMunos. Adistributionalperspectiveonreinforcementlearning. International Conference on Machine Learning (ICML), 2017

work page 2017
[27]

CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity.International Conference on Learning Representations (ICLR), 2024

Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity.International Conference on Learning Representations (ICLR), 2024

work page 2024
[28]

DeepMind Control Suite

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite.arXiv preprint arXiv:1801.00690, 2018. Generative Actor-Critic with Soft Bridge Policies 13

work page internal anchor Pith review arXiv 2018
[29]

dm_control: Softwareandtasksforcontinuouscontrol

Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, TimothyLillicrap,NicolasHeess,andYuvalTassa. dm_control: Softwareandtasksforcontinuouscontrol. Software Impacts, 6:100022, 2020

work page 2020
[30]

Hu- manoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation

Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Humanoid- bench: Simulated humanoid benchmark for whole-body locomotion and manipulation.arXiv preprint arXiv:2403.10506, 2024

work page arXiv 2024
[31]

Deep reinforcement learning at the edge of the statistical precipice.Annual Conference on Neural Information Processing Systems (NeurIPS), 2021

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice.Annual Conference on Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[32]

Stable-baselines3: Reliable reinforcement learning implementations.Journal of Machine Learning Research, 22(268):1–8, 2021

Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations.Journal of Machine Learning Research, 22(268):1–8, 2021. URLhttp://jmlr.org/papers/v22/20-1364.html. Generative Actor-Critic with Soft Bridge Policies 14 Contents 1 Introduction 1 2 Prelimina...

work page 2021