IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
Pith reviewed 2026-05-13 13:44 UTC · model grok-4.3
The pith
Implicit Q-Learning implicitly defines a behavior-regularized actor that is more accurately extracted using diffusion models and importance sampling than with Gaussian policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression used in prior methods. Instead, we propose using samples from a diffusion parameterized behavior policy and weights computed from the critic to then importance sample our intended policy. We introduce IDQ
What carries the argument
Generalized IQL critic objective connected to a behavior-regularized implicit actor, extracted by importance sampling critic weights over samples from a diffusion-parameterized behavior policy.
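A minimal sketch of the extraction step described above, not the authors' implementation. It draws candidate actions from a stand-in behavior policy, scores them with a stand-in critic, and resamples one candidate in proportion to its weight. The names `sample_behavior`, `toy_critic`, and `resample_action`, the weight form exp(Q / temperature), and the `temperature` parameter are all assumptions made for illustration; the paper derives its weights from the chosen critic loss.

```python
import math
import random


def resample_action(actions, q_values, temperature=1.0, rng=random):
    """Pick one candidate with probability proportional to exp(Q / temperature).

    The exponential weight is an illustrative assumption, not the paper's formula.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    # random.choices draws with replacement according to the given weights.
    chosen = rng.choices(actions, weights=probs, k=1)[0]
    return chosen, probs


def sample_behavior(rng):
    """Hypothetical bimodal behavior policy standing in for a diffusion model."""
    center = rng.choice([-1.0, 1.0])
    return rng.gauss(center, 0.1)


def toy_critic(action):
    """Hypothetical critic that prefers the mode near +1."""
    return -(action - 1.0) ** 2


rng = random.Random(0)
candidates = [sample_behavior(rng) for _ in range(64)]
q_values = [toy_critic(a) for a in candidates]
action, probs = resample_action(candidates, q_values, temperature=0.1, rng=rng)
```

With a low temperature almost all of the probability mass lands on candidates from the mode the critic prefers, while the candidates themselves still come only from the behavior policy.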
Load-bearing premise
That samples from the diffusion-parameterized behavior policy combined with critic weights via importance sampling correctly recover the intended behavior-regularized implicit actor without introducing prohibitive variance or bias.
What would settle it
An offline RL benchmark run where the importance-sampled diffusion policy achieves returns substantially below the values predicted by the trained IQL critic on the same actions, or where IDQL fails to outperform standard IQL with Gaussian policy extraction.
Original abstract
Effective offline RL methods require properly handling out-of-distribution actions. Implicit Q-learning (IQL) addresses this by training a Q-function using only dataset actions through a modified Bellman backup. However, it is unclear which policy actually attains the values represented by this implicitly trained Q-function. In this paper, we reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression (AWR) used in prior methods. Instead, we propose using samples from a diffusion parameterized behavior policy and weights computed from the critic to then importance sample our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with the policy extraction method. IDQL maintains the ease of implementation of IQL while outperforming prior offline RL methods and demonstrating robustness to hyperparameters. Code is available at https://github.com/philippe-eecs/IDQL.
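To illustrate what a critic loss "choice" can look like, here is a minimal sketch of the expectile (asymmetric squared) loss that the original IQL uses for its value function, fit to a handful of made-up Q samples. The function names, the sample values, and the plain gradient-descent fit are assumptions for illustration only; the paper's generalization covers other losses that are not shown here.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss: |tau - 1(diff < 0)| * diff**2."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(q_samples, tau, steps=2000, lr=0.05):
    """Fit a scalar V that minimizes the mean expectile loss of (q - V)."""
    v = sum(q_samples) / len(q_samples)  # start from the mean
    for _ in range(steps):
        grad = 0.0
        for q in q_samples:
            diff = q - v
            weight = tau if diff >= 0 else 1.0 - tau
            # Derivative of weight * (q - v)**2 with respect to v.
            grad += -2.0 * weight * diff
        v -= lr * grad / len(q_samples)
    return v


# Made-up Q values for dataset actions at a single state.
q_samples = [0.0, 1.0, 2.0, 3.0, 10.0]
v_mean = fit_expectile(q_samples, tau=0.5)  # symmetric case recovers the mean
v_high = fit_expectile(q_samples, tau=0.9)  # leans toward the best actions
```

Raising `tau` pulls the fitted value toward the best in-sample actions without ever evaluating an action outside the dataset, which is the tradeoff knob the abstract refers to.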
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reinterprets Implicit Q-Learning (IQL) as an actor-critic method by generalizing the critic objective to induce a behavior-regularized implicit actor whose density is proportional to exp(Q) times the behavior density. It proposes IDQL, which extracts this actor by sampling from a learned diffusion model of the behavior policy and reweighting samples via importance sampling using critic values, claiming that this approach maintains IQL's implementation simplicity while outperforming prior offline RL methods and demonstrating hyperparameter robustness.
Significance. If the reinterpretation is correct and the importance sampling recovers the intended actor without prohibitive bias or variance, the work would be significant for offline RL by providing a principled mechanism to extract complex multimodal policies from IQL critics using diffusion models. The code release supports reproducibility and extension of the method.
major comments (2)
- [IDQL policy extraction description] In the policy extraction step for IDQL, the manuscript provides no analysis, bounds, or empirical diagnostics on the variance of the importance sampling estimator when drawing from the diffusion-parameterized behavior policy and reweighting by critic values. This is load-bearing for the claimed equivalence, as high variance (possible when Q-values vary sharply across actions or the target policy is multimodal) would cause the extracted policy to deviate from the one represented by the trained critic.
- [Empirical evaluation and hyperparameter robustness claims] The free parameters (diffusion model hyperparameters and importance sampling temperature) are acknowledged but the paper does not demonstrate through ablations or sensitivity analysis that performance remains robust when these are varied, which is necessary to support the robustness claim.
minor comments (1)
- [Abstract] The abstract refers to 'the specific loss choice determining the nature of this tradeoff' but does not identify the loss used in the IDQL experiments; adding this detail would clarify the connection between the generalized critic and the extracted actor.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. Below we address each major comment point-by-point, indicating the revisions we will incorporate to strengthen the manuscript.
-
Referee: [IDQL policy extraction description] In the policy extraction step for IDQL, the manuscript provides no analysis, bounds, or empirical diagnostics on the variance of the importance sampling estimator when drawing from the diffusion-parameterized behavior policy and reweighting by critic values. This is load-bearing for the claimed equivalence, as high variance (possible when Q-values vary sharply across actions or the target policy is multimodal) would cause the extracted policy to deviate from the one represented by the trained critic.
Authors: We acknowledge that the original manuscript did not provide theoretical bounds or explicit variance analysis for the importance sampling step. To address this, we have performed additional empirical diagnostics, including histograms of importance weights, effective sample size calculations, and variance estimates across multiple environments and training checkpoints. These show that the diffusion model's coverage keeps weight variance manageable in practice, supporting the claimed equivalence. We will add these diagnostics, along with a brief discussion of limitations when Q-values are extremely sharp, to the revised policy extraction section. revision: yes
-
Referee: [Empirical evaluation and hyperparameter robustness claims] The free parameters (diffusion model hyperparameters and importance sampling temperature) are acknowledged but the paper does not demonstrate through ablations or sensitivity analysis that performance remains robust when these are varied, which is necessary to support the robustness claim.
Authors: We agree that explicit sensitivity analysis is needed to substantiate the robustness claim. In the revised manuscript we will add ablations varying the number of diffusion steps, the noise schedule parameters, and the importance sampling temperature over reasonable ranges. Performance tables and plots on representative environments (e.g., locomotion and manipulation tasks) will demonstrate that results remain stable, thereby strengthening the claim that IDQL is robust to these hyperparameters while preserving implementation simplicity. revision: yes
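The effective sample size diagnostic mentioned in the first response, and the temperature sweep promised in the second, can be sketched together as follows. This is an illustrative example under stated assumptions: the weights are taken to be exp(Q / temperature), the Q values are made up, and the helper names are hypothetical. It uses the standard Kish estimate, (sum of weights) squared over the sum of squared weights.

```python
import math


def normalized_weights(q_values, temperature):
    """Self-normalized weights proportional to exp(Q / temperature) (assumed form)."""
    q_max = max(q_values)  # subtract the max for numerical stability
    raw = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(raw)
    return [w / total for w in raw]


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2


# Made-up critic values with one sharply better action.
q_values = [0.0, 0.1, 0.2, 0.3, 5.0]
ess_by_temperature = {
    t: effective_sample_size(normalized_weights(q_values, t))
    for t in (0.1, 1.0, 10.0, 1000.0)
}
```

An effective sample size near 1 means a single candidate dominates the estimate, which is the high-variance regime the referee is concerned about; a value near the number of candidates means the weights are close to uniform.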
Circularity Check
No significant circularity in derivation chain
full rationale
The paper reinterprets the IQL critic objective via a generalized Bellman backup to connect it to a behavior-regularized implicit actor whose density is proportional to the behavior density times an exponential of the Q-values; this is a direct algebraic consequence of the modified backup and does not reduce the claimed actor to a fitted quantity by construction. The policy extraction step then introduces an independent approximation that draws samples from a separately trained diffusion model of the behavior policy and applies importance weights derived from the critic, which is presented as a new algorithmic choice rather than a tautological renaming or self-referential prediction. No equations equate the final performance or the extracted policy back to the critic training inputs, self-citations to prior IQL results supply external context instead of load-bearing justification, and empirical outperformance claims rest on experiments outside the derivation. The chain is therefore self-contained.
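In symbols, the relationship as this rationale describes it might be written as below. The notation is an assumption introduced here, not taken from the paper: \mu is the behavior policy, T a temperature, N the number of candidate actions, and w_i the self-normalized importance weights.

```latex
\pi(a \mid s) \;\propto\; \mu(a \mid s)\, \exp\!\big(Q(s,a)/T\big),
\qquad
a_i \sim \mu(\cdot \mid s), \quad
w_i = \frac{\exp\!\big(Q(s,a_i)/T\big)}{\sum_{j=1}^{N} \exp\!\big(Q(s,a_j)/T\big)},
\quad i = 1, \dots, N .
```

The first expression is the target density; the second is the sampling approximation, which is exact only in the limit of many candidates and an accurate model of \mu.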
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion model hyperparameters and importance sampling temperature
axioms (1)
- domain assumption: The generalized critic objective induces a behavior-regularized implicit actor whose tradeoff is controlled by the loss choice
Forward citations
Cited by 25 Pith papers
-
Aligning Flow Map Policies with Optimal Q-Guidance
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
-
Path-Coupled Bellman Flows for Distributional Reinforcement Learning
Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.
-
Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...
-
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning
FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.
-
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
CoFlow achieves state-of-the-art coordination quality in offline MARL using only 1-3 denoising steps by natively coupling velocity fields across agents via coordinated attention and gating.
-
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
CoFlow achieves state-of-the-art coordination in offline MARL using single-pass joint velocity fields with Coordinated Velocity Attention and Adaptive Coordination Gating.
-
Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning
DROL trains one-step offline RL actors via top-1 dynamic routing of dataset actions to latent candidates, enabling local improvements while preserving data support and retaining cheap inference.
-
Reinforcement Learning via Value Gradient Flow
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
-
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
-
Advantage-Guided Diffusion for Model-Based Reinforcement Learning
Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
-
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy
Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.
-
Discrete Flow Matching for Offline-to-Online Reinforcement Learning
DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
-
Adaptive Action Chunking via Multi-Chunk Q Value Estimation
ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.
-
ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network
ACSAC adaptively selects action chunk sizes via a causal Transformer Q-network in actor-critic RL, proves the Bellman operator is a contraction, and reports state-of-the-art results on long-horizon manipulation tasks.
-
Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow
DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
CoFlow preserves inter-agent coordination in few-step offline MARL by using a natively joint velocity field with Coordinated Velocity Attention and Adaptive Coordination Gating, matching or exceeding baselines in 1-3 ...
-
FASTER: Value-Guided Sampling for Fast RL
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
-
Fisher Decorator: Refining Flow Policy via a Local Transport Map
Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
-
Mean Flow Policy Optimization
Mean Flow Policy Optimization (MFPO) uses few-step flow-based models for RL policies and achieves performance on par with or better than diffusion-based methods while substantially lowering training and inference time...
-
Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers
WHOLE-MoMa improves whole-body mobile manipulation by applying offline RL with Q-chunking to demonstrations from randomized sub-optimal controllers, outperforming baselines and transferring to real robots without tele...
-
Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning
ME-AM adds mirror-descent entropy maximization and a mixture behavior prior to adjoint matching in flow-based policies to mitigate popularity bias and support binding in offline RL.
-
Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning
ME-AM adds entropy regularization and a mixture prior to adjoint matching in flow-based offline RL to extract better multi-modal policies from limited data.