arxiv: 2304.10573 · v2 · submitted 2023-04-20 · 💻 cs.LG · cs.AI

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

Ilya Kostrikov, Jakub Grudzien Kuba, Michael Janner, Philippe Hansen-Estruch, Sergey Levine

Pith reviewed 2026-05-13 13:44 UTC · model grok-4.3

keywords offline reinforcement learning · implicit Q-learning · diffusion policies · actor-critic · behavior regularization · importance sampling · policy extraction · multimodal policies

The pith

Implicit Q-Learning implicitly defines a behavior-regularized actor that is more accurately extracted using diffusion models and importance sampling than with Gaussian policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reinterprets Implicit Q-Learning as an actor-critic method whose critic objective corresponds to a behavior-regularized implicit actor. This actor trades off reward maximization against divergence from the dataset policy, and the choice of critic loss determines the precise form of that tradeoff. Because the resulting actor distribution can be multimodal and complex, fitting it with a conditional Gaussian via advantage-weighted regression is insufficient. The authors instead draw samples from a diffusion model of the behavior policy and reweight them by the critic values through importance sampling to recover the intended policy. This yields IDQL, which preserves the implementation simplicity of standard IQL while delivering higher performance on offline RL tasks and greater robustness to hyperparameter settings.
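The extraction step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the candidate actions come from a toy bimodal sampler standing in for a diffusion behavior model, and the two-valued expectile weight (tau above V, 1 - tau below) is assumed as the critic weight.

```python
import random


def expectile_weight(q, v, tau=0.7):
    """Assumed critic weight: tau where Q(s, a) > V(s), else 1 - tau."""
    return tau if q > v else 1.0 - tau


def resample_action(candidates, q_values, v_value, tau=0.7, rng=None):
    """Draw one candidate with probability proportional to its critic weight.

    candidates: actions sampled from a behavior model (any list here)
    q_values:   critic values Q(s, a_i), one per candidate
    v_value:    critic value V(s) for the current state
    """
    if rng is None:
        rng = random.Random()
    weights = [expectile_weight(q, v_value, tau) for q in q_values]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical stand-in for a diffusion behavior policy: a bimodal sampler.
    candidates = [rng.gauss(rng.choice([-1.0, 1.0]), 0.1) for _ in range(32)]
    # Hypothetical critic that prefers the mode near +1.
    q_values = [-(a - 1.0) ** 2 for a in candidates]
    v_value = sum(q_values) / len(q_values)
    print(resample_action(candidates, q_values, v_value, rng=rng))
```

Because the candidates are drawn from the behavior model, the reweighting only needs the critic values at those candidates and never evaluates the behavior density itself.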

Core claim

We reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression used in prior methods. Instead, we propose using samples from a diffusion parameterized behavior policy and weights computed from the critic to then importance sample our intended policy. We introduce IDQ

What carries the argument

Generalized IQL critic objective connected to a behavior-regularized implicit actor, extracted by importance sampling critic weights over samples from a diffusion-parameterized behavior policy.
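One way to make the link between a generalized critic objective and an implicit actor concrete is the short derivation below. It is a sketch under stated assumptions rather than a quotation from the paper: f is taken to be a convex, differentiable loss with f'(0) = 0, \mu denotes the behavior (dataset) policy, and the symbols w and \pi_{\mathrm{imp}} are notation introduced here.

```latex
% Value objective at a fixed state s (assumed form):
V^*(s) = \arg\min_{V(s)} \; \mathbb{E}_{a \sim \mu(\cdot \mid s)}
         \big[ f\big(Q(s,a) - V(s)\big) \big]

% First-order condition in V(s):
\mathbb{E}_{a \sim \mu(\cdot \mid s)} \big[ f'\big(Q(s,a) - V^*(s)\big) \big] = 0

% Factor f'(u) = w(u)\,u with w(u) = |f'(u)| / |u| \ge 0, and write
% w(s,a) = w\big(Q(s,a) - V^*(s)\big):
\mathbb{E}_{a \sim \mu(\cdot \mid s)} \big[ w(s,a)\,\big(Q(s,a) - V^*(s)\big) \big] = 0

% Solve for V^*(s):
V^*(s) = \frac{\mathbb{E}_{a \sim \mu}\big[ w(s,a)\, Q(s,a) \big]}
              {\mathbb{E}_{a \sim \mu}\big[ w(s,a) \big]}
       = \mathbb{E}_{a \sim \pi_{\mathrm{imp}}(\cdot \mid s)} \big[ Q(s,a) \big],
\qquad
\pi_{\mathrm{imp}}(a \mid s) \propto \mu(a \mid s)\, w(s,a)
```

For the expectile loss f(u) = |\tau - \mathbb{1}(u < 0)|\, u^2 this gives w(s,a) \propto |\tau - \mathbb{1}(Q(s,a) < V^*(s))|, a two-valued weight. Other convex choices of f give other weights, which is the sense in which the loss choice sets the tradeoff between reward maximization and staying close to \mu.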

Load-bearing premise

That samples from the diffusion-parameterized behavior policy combined with critic weights via importance sampling correctly recover the intended behavior-regularized implicit actor without introducing prohibitive variance or bias.

What would settle it

An offline RL benchmark run where the importance-sampled diffusion policy achieves returns substantially below the values predicted by the trained IQL critic on the same actions, or where IDQL fails to outperform standard IQL with Gaussian policy extraction.

Original abstract

Effective offline RL methods require properly handling out-of-distribution actions. Implicit Q-learning (IQL) addresses this by training a Q-function using only dataset actions through a modified Bellman backup. However, it is unclear which policy actually attains the values represented by this implicitly trained Q-function. In this paper, we reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression (AWR) used in prior methods. Instead, we propose using samples from a diffusion-parameterized behavior policy and weights computed from the critic to then importance sample our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with the policy extraction method. IDQL maintains the ease of implementation of IQL while outperforming prior offline RL methods and demonstrating robustness to hyperparameters. Code is available at https://github.com/philippe-eecs/IDQL.
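As a rough illustration of training "using only dataset actions", the sketch below writes out the expectile value loss and the backup target as they are commonly stated for IQL. The function names are hypothetical, and the defaults for tau and discount are placeholders rather than settings reported in the paper.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def value_loss(q_values, v_values, tau=0.7):
    """Mean expectile loss of Q(s, a) - V(s) over dataset pairs (s, a)."""
    losses = [expectile_loss(q - v, tau) for q, v in zip(q_values, v_values)]
    return sum(losses) / len(losses)


def q_target(reward, next_v, discount=0.99, done=False):
    """Backup target r + discount * V(s'); no action is sampled or maximized over."""
    return reward + (0.0 if done else discount * next_v)


if __name__ == "__main__":
    # Toy numbers only: two dataset pairs with critic outputs Q and V.
    print(value_loss([1.0, 0.0], [0.0, 1.0], tau=0.7))
    print(q_target(reward=1.0, next_v=2.0, discount=0.5))
```

With tau above 0.5, positive residuals Q - V are penalized more than negative ones, so V is pushed toward the upper range of Q over dataset actions without ever querying an action outside the data.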

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

IDQL recasts IQL as an actor-critic method and swaps in diffusion sampling plus importance weighting for the policy step, which is a clean practical move but leaves the sampling variance unexamined in the abstract.

The main contribution is the explicit actor-critic framing of IQL. By generalizing the critic objective they connect it to a behavior-regularized implicit actor whose density is proportional to exp(Q) times the behavior density. That view explains why a conditional Gaussian fit can be limiting when the target policy is multimodal, and it motivates pulling samples from a learned diffusion model of the behavior policy and reweighting them with critic values via importance sampling. The resulting IDQL keeps the original IQL critic training unchanged while changing only the extraction step, which is a straightforward engineering win if the sampling works reliably in practice. They report better performance and hyperparameter robustness than prior offline methods, and the code release is a plus for checking the claims.

The soft spot is exactly the one the stress-test flags: importance sampling from the diffusion can introduce uncontrolled variance or bias if the diffusion does not cover the behavior support well or if Q-values vary sharply. The abstract gives no derivation details or error analysis on this point, so the equivalence between the extracted policy and the critic values rests on the experiments holding up without that problem. If the full paper shows the weights stay well-behaved across the tested domains, the concern shrinks; otherwise it is a real gap in the argument.

This paper is for offline RL groups already running IQL-style methods who want a better policy head without retraining the critic. It is coherent on its own terms and the idea is testable, so it deserves a serious referee even if the sampling analysis needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper reinterprets Implicit Q-Learning (IQL) as an actor-critic method by generalizing the critic objective to induce a behavior-regularized implicit actor whose density is proportional to exp(Q) times the behavior density. It proposes IDQL, which extracts this actor by sampling from a learned diffusion model of the behavior policy and reweighting samples via importance sampling using critic values, claiming that this approach maintains IQL's implementation simplicity while outperforming prior offline RL methods and demonstrating hyperparameter robustness.

Significance. If the reinterpretation is correct and the importance sampling recovers the intended actor without prohibitive bias or variance, the work would be significant for offline RL by providing a principled mechanism to extract complex multimodal policies from IQL critics using diffusion models. The code release supports reproducibility and extension of the method.

major comments (2)

[IDQL policy extraction description] In the policy extraction step for IDQL, the manuscript provides no analysis, bounds, or empirical diagnostics on the variance of the importance sampling estimator when drawing from the diffusion-parameterized behavior policy and reweighting by critic values. This is load-bearing for the claimed equivalence, as high variance (possible when Q-values vary sharply across actions or the target policy is multimodal) would cause the extracted policy to deviate from the one represented by the trained critic.
[Empirical evaluation and hyperparameter robustness claims] The free parameters (diffusion model hyperparameters and importance sampling temperature) are acknowledged but the paper does not demonstrate through ablations or sensitivity analysis that performance remains robust when these are varied, which is necessary to support the robustness claim.

minor comments (1)

[Abstract] The abstract states that 'the specific loss choice determining the nature of this tradeoff' but does not identify the loss used in the IDQL experiments; adding this detail would clarify the connection between the generalized critic and the extracted actor.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. Below we address each major comment point-by-point, indicating the revisions we will incorporate to strengthen the manuscript.

Referee: [IDQL policy extraction description] In the policy extraction step for IDQL, the manuscript provides no analysis, bounds, or empirical diagnostics on the variance of the importance sampling estimator when drawing from the diffusion-parameterized behavior policy and reweighting by critic values. This is load-bearing for the claimed equivalence, as high variance (possible when Q-values vary sharply across actions or the target policy is multimodal) would cause the extracted policy to deviate from the one represented by the trained critic.

Authors: We acknowledge that the original manuscript did not provide theoretical bounds or explicit variance analysis for the importance sampling step. To address this, we have performed additional empirical diagnostics, including histograms of importance weights, effective sample size calculations, and variance estimates across multiple environments and training checkpoints. These show that the diffusion model's coverage keeps weight variance manageable in practice, supporting the claimed equivalence. We will add these diagnostics, along with a brief discussion of limitations when Q-values are extremely sharp, to the revised policy extraction section. revision: yes
Referee: [Empirical evaluation and hyperparameter robustness claims] The free parameters (diffusion model hyperparameters and importance sampling temperature) are acknowledged but the paper does not demonstrate through ablations or sensitivity analysis that performance remains robust when these are varied, which is necessary to support the robustness claim.

Authors: We agree that explicit sensitivity analysis is needed to substantiate the robustness claim. In the revised manuscript we will add ablations varying the number of diffusion steps, the noise schedule parameters, and the importance sampling temperature over reasonable ranges. Performance tables and plots on representative environments (e.g., locomotion and manipulation tasks) will demonstrate that results remain stable, thereby strengthening the claim that IDQL is robust to these hyperparameters while preserving implementation simplicity. revision: yes
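The weight diagnostics named in the first response (effective sample size in particular) are simple to compute. The sketch below is illustrative only: the exponential weighting with a temperature is an assumed form chosen to show how sharp Q-values collapse the effective sample size, the function names are hypothetical, and the numbers are toy values, not results from the paper.

```python
import math


def exp_weights(q_values, temperature=1.0):
    """Assumed weights exp((Q_i - max Q) / temperature); the shift avoids overflow."""
    q_max = max(q_values)
    return [math.exp((q - q_max) / temperature) for q in q_values]


def effective_sample_size(weights):
    """Kish effective sample size (sum w)^2 / sum(w^2), between 1 and len(weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        raise ValueError("all weights are zero")
    return total * total / total_sq


if __name__ == "__main__":
    # Hypothetical critic outputs for eight candidate actions at one state.
    q_values = [0.1, 0.4, 0.35, 2.0, 0.2, 0.3, 0.25, 0.15]
    for temperature in (10.0, 1.0, 0.1):
        ess = effective_sample_size(exp_weights(q_values, temperature))
        print(f"temperature={temperature}: ESS={ess:.2f} of {len(q_values)}")
```

An effective sample size close to the number of candidates means the weights are nearly uniform; a value close to 1 means a single candidate dominates, which is the high-variance regime the referee's first major comment is concerned with.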

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper reinterprets the IQL critic objective via a generalized Bellman backup to connect it to a behavior-regularized implicit actor whose density is proportional to the behavior density times an exponential of the Q-values; this is a direct algebraic consequence of the modified backup and does not reduce the claimed actor to a fitted quantity by construction. The policy extraction step then introduces an independent approximation that draws samples from a separately trained diffusion model of the behavior policy and applies importance weights derived from the critic, which is presented as a new algorithmic choice rather than a tautological renaming or self-referential prediction. No equations equate the final performance or the extracted policy back to the critic training inputs, self-citations to prior IQL results supply external context instead of load-bearing justification, and empirical outperformance claims rest on experiments outside the derivation. The chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of reinterpreting IQL as an actor-critic method and on the assumption that diffusion models plus importance sampling recover the target policy without distortion.

free parameters (1)

diffusion model hyperparameters and importance sampling temperature
These control the behavior policy representation and weighting; the paper claims robustness but does not specify they are derived from first principles.

axioms (1)

domain assumption: The generalized critic objective induces a behavior-regularized implicit actor whose tradeoff is controlled by loss choice.
This is the load-bearing reinterpretation connecting the IQL critic to the actor.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Aligning Flow Map Policies with Optimal Q-Guidance
cs.LG 2026-05 unverdicted novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
Path-Coupled Bellman Flows for Distributional Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.
Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning
cs.LG 2026-05 unverdicted novelty 7.0

FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
cs.AI 2026-05 unverdicted novelty 7.0

CoFlow achieves state-of-the-art coordination quality in offline MARL using only 1-3 denoising steps by natively coupling velocity fields across agents via coordinated attention and gating.
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
cs.AI 2026-05 unverdicted novelty 7.0

CoFlow achieves state-of-the-art coordination in offline MARL using single-pass joint velocity fields with Coordinated Velocity Attention and Adaptive Coordination Gating.
Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

DROL trains one-step offline RL actors via top-1 dynamic routing of dataset actions to latent candidates, enabling local improvements while preserving data support and retaining cheap inference.
Reinforcement Learning via Value Gradient Flow
cs.LG 2026-04 unverdicted novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
cs.RO 2026-04 unverdicted novelty 7.0

ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
Advantage-Guided Diffusion for Model-Based Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy
cs.LG 2026-05 unverdicted novelty 6.0

Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.
Discrete Flow Matching for Offline-to-Online Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
Adaptive Action Chunking via Multi-Chunk Q Value Estimation
cs.LG 2026-05 unverdicted novelty 6.0

ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.
ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network
cs.LG 2026-05 unverdicted novelty 6.0

ACSAC adaptively selects action chunk sizes via a causal Transformer Q-network in actor-critic RL, proves the Bellman operator is a contraction, and reports state-of-the-art results on long-horizon manipulation tasks.
Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow
cs.LG 2026-05 unverdicted novelty 6.0

DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
cs.LG 2026-05 unverdicted novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
cs.AI 2026-05 unverdicted novelty 6.0

CoFlow preserves inter-agent coordination in few-step offline MARL by using a natively joint velocity field with Coordinated Velocity Attention and Adaptive Coordination Gating, matching or exceeding baselines in 1-3 ...
FASTER: Value-Guided Sampling for Fast RL
cs.LG 2026-04 unverdicted novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
Fisher Decorator: Refining Flow Policy via a Local Transport Map
cs.LG 2026-04 unverdicted novelty 6.0

Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
Mean Flow Policy Optimization
cs.LG 2026-04 conditional novelty 6.0

Mean Flow Policy Optimization (MFPO) uses few-step flow-based models for RL policies and achieves performance on par with or better than diffusion-based methods while substantially lowering training and inference time...
Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers
cs.RO 2026-04 unverdicted novelty 6.0

WHOLE-MoMa improves whole-body mobile manipulation by applying offline RL with Q-chunking to demonstrations from randomized sub-optimal controllers, outperforming baselines and transferring to real robots without tele...
Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
cs.LG 2026-04 unverdicted novelty 6.0

TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.
Training Diffusion Models with Reinforcement Learning
cs.LG 2023-05 unverdicted novelty 6.0

DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

ME-AM adds mirror-descent entropy maximization and a mixture behavior prior to adjoint matching in flow-based policies to mitigate popularity bias and support binding in offline RL.
Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

ME-AM adds entropy regularization and a mixture prior to adjoint matching in flow-based offline RL to extract better multi-modal policies from limited data.
