Offline Reinforcement Learning with Implicit Q-Learning
Pith reviewed 2026-05-12 08:41 UTC · model grok-4.3
The pith
Offline reinforcement learning can improve policies beyond the collected data without ever evaluating actions outside that data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling the state-value function as a random variable whose randomness is induced by the action and estimating its state-conditional upper expectile, the method performs an implicit policy improvement step. This value is then backed up into a Q-function, after which the policy is extracted by advantage-weighted behavioral cloning. The resulting algorithm improves over the behavior policy while never requiring the Q-function to be evaluated on actions absent from the dataset, thereby sidestepping distributional shift errors that arise from direct queries to unseen actions.
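As a concrete rendering of this claim, here is a minimal PyTorch-style sketch of the three objectives the abstract describes: expectile value fitting, TD backup, and advantage-weighted cloning. The network interfaces (q_net, v_net, policy.log_prob) and the hyperparameters tau and beta are illustrative assumptions, not the paper's reference implementation; note that every Q evaluation below uses only state-action pairs drawn from the dataset.

```python
# Minimal sketch of the three IQL-style objectives, assuming q_net(s, a) and
# v_net(s) return tensors of shape [batch] and policy.log_prob(s, a) returns
# per-sample log-likelihoods. Hyperparameters tau and beta are assumptions.
import torch
import torch.nn.functional as F


def expectile_loss(diff: torch.Tensor, tau: float) -> torch.Tensor:
    """Asymmetric squared loss: tau > 0.5 penalizes under-estimation more, so the
    minimizer sits above the mean of the targets (an upper expectile)."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def value_loss(q_target, v_net, s, a, tau=0.7):
    """Fit V(s) toward an upper expectile of Q(s, a) using dataset actions only."""
    with torch.no_grad():
        q = q_target(s, a)
    return expectile_loss(q - v_net(s), tau)


def q_loss(q_net, v_net, s, a, r, s_next, done, gamma=0.99):
    """Back the expectile value up into Q with a standard TD target; the next-state
    value needs no action sample, so no unseen action is ever queried."""
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_net(s_next)
    return F.mse_loss(q_net(s, a), target)


def policy_loss(policy, q_target, v_net, s, a, beta=3.0, max_weight=100.0):
    """Advantage-weighted behavioral cloning on dataset actions."""
    with torch.no_grad():
        adv = q_target(s, a) - v_net(s)
        w = torch.clamp(torch.exp(beta * adv), max=max_weight)
    return -(w * policy.log_prob(s, a)).mean()
```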
What carries the argument
Upper-expectile regression on the state-value function, which implicitly identifies the best available actions through generalization of the approximator rather than explicit maximization.
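In symbols, with tau the expectile level and Q-hat a target Q estimate (both notational assumptions beyond the abstract), the mechanism can be written as:

```latex
% Asymmetric expectile loss and the value objective (notation assumed):
\[
  L_2^{\tau}(u) = \bigl|\tau - \mathbf{1}(u < 0)\bigr|\, u^{2},
  \qquad
  L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}
    \Bigl[ L_2^{\tau}\bigl(\hat{Q}(s,a) - V_{\psi}(s)\bigr) \Bigr].
\]
% For tau = 0.5 this is ordinary least squares onto the mean; as tau -> 1 the fitted
% value approaches the largest Q attained by actions the behavior policy supports
% (stated here for the idealized, infinite-data case):
\[
  \lim_{\tau \to 1} V_{\tau}(s) = \max_{a \,:\, \pi_{\beta}(a \mid s) > 0} Q(s,a).
\]
```

Raising tau toward 1 therefore makes the regression behave more like a max over in-support actions, which is the sense in which the improvement step is implicit.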
If this is right
- The method reaches state-of-the-art performance on the D4RL offline RL benchmark.
- After offline pre-training, the same initialization yields strong results under subsequent online fine-tuning.
- Training never requires evaluating the Q-function on actions outside the dataset, removing one source of overestimation error.
- Policy extraction reduces to advantage-weighted behavioral cloning once the implicit value function is learned.
Where Pith is reading between the lines
- The same expectile-based implicit improvement could be tried in other offline sequential decision settings where direct action queries are expensive.
- Because no explicit action constraints are needed, the approach may scale more readily to continuous or high-dimensional action spaces than methods that enforce support constraints.
- One could examine whether the upper-expectile target remains stable when the dataset size is reduced or when the behavior policy is highly suboptimal.
Load-bearing premise
The function approximator can accurately assign higher values to the best actions at each state even when those actions never appear in the training data.
What would settle it
Test the method on a dataset in which the single best action at each state is deliberately withheld; if the learned policy still exceeds the behavior policy's return, the central claim holds.
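A hypothetical sketch of that probe; the oracle best_action, the trainer train_iql, and the evaluator evaluate_return are assumed stand-ins rather than anything specified by the paper.

```python
# Hypothetical protocol: withhold the best action at each state, retrain, and
# check whether generalization still recovers a better-than-behavior policy.
import numpy as np


def withhold_best_actions(dataset, best_action, atol=1e-6):
    """Drop every transition whose action matches the oracle-best action for its state."""
    keep = [
        i
        for i, (s, a) in enumerate(zip(dataset["observations"], dataset["actions"]))
        if not np.allclose(a, best_action(s), atol=atol)
    ]
    return {key: value[keep] for key, value in dataset.items()}


def withheld_action_probe(dataset, best_action, train_iql, evaluate_return, behavior_return):
    filtered = withhold_best_actions(dataset, best_action)
    policy = train_iql(filtered)              # trained without access to the best actions
    learned_return = evaluate_return(policy)  # rollout return of the extracted policy
    # The central claim survives only if the learned policy still beats the behavior data.
    return learned_return > behavior_return
```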
read the original abstract
Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance fine-tuning using online interaction after offline initialization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Implicit Q-Learning (IQL) for offline RL. It proposes fitting a state-value function V(s) as the upper expectile (over actions) of the random variable formed by Q-values from the dataset, backing this up into a Q-function, and then extracting an improved policy via advantage-weighted behavioral cloning. The central claim is that this implicitly performs policy improvement without ever querying out-of-distribution actions, relying on function-approximator generalization to recover near-maximal action values; the method is reported to achieve state-of-the-art results on the D4RL benchmark and strong performance when fine-tuned online after offline initialization.
Significance. If the core mechanism is sound, the work is significant because it provides a simple, constraint-free alternative to prior offline RL methods that must explicitly handle or penalize out-of-distribution actions. It builds directly on standard expectile regression and advantage-weighted cloning, adding only two scalar hyperparameters (the expectile level and the advantage temperature) rather than complex regularizers, and the reported D4RL results plus fine-tuning experiments indicate practical utility on standard benchmarks. The approach is reproducible in principle from the clear algorithmic outline and could be readily implemented and tested.
major comments (2)
- [§3.2] Implicit policy improvement via expectiles: The manuscript states that the upper expectile of the action-conditioned value random variable approximates the value of the best available action through generalization, yet provides no derivation, error bound, or analysis showing that expectile regression on in-distribution actions yields a quantity close to max_a Q(s,a) under neural function approximation. This is load-bearing for the central claim that distributional shift is avoided without explicit constraints.
- [§4] Experiments: The SOTA claim on D4RL is presented without error bars across seeds, full ablation tables on the expectile parameter τ, or direct comparisons against the most recent baselines at the time of submission; this weakens the empirical support for the generalization hypothesis.
minor comments (2)
- [§3.1] The notation for the random variable whose upper expectile is taken (action-induced variation while integrating over the dynamics) is introduced informally; a short clarifying equation or diagram in §3.1 would improve readability.
- [Figure 1] Figure 1 (algorithm overview) and the pseudocode could be cross-referenced more explicitly with the text describing the three alternating steps (expectile V, Q backup, policy extraction).
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below and indicate the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [§3.2] Implicit policy improvement via expectiles: The manuscript states that the upper expectile of the action-conditioned value random variable approximates the value of the best available action through generalization, yet provides no derivation, error bound, or analysis showing that expectile regression on in-distribution actions yields a quantity close to max_a Q(s,a) under neural function approximation. This is load-bearing for the central claim that distributional shift is avoided without explicit constraints.
Authors: We agree that a more rigorous analysis would be beneficial. The current manuscript motivates the use of upper expectiles by noting that they provide a way to estimate the value of the best actions implicitly through generalization of the function approximator, without explicit queries to out-of-distribution actions. While we do not derive formal error bounds (as such bounds are difficult to obtain for neural networks in general), we will revise §3.2 to include a more detailed discussion of the properties of expectile regression and its connection to approximating the max operator. This will include references to related theoretical work on expectiles and additional intuition on why this leads to policy improvement. We believe this addresses the concern while acknowledging the limitations of the current theoretical support. revision: partial
-
Referee: [§4] Experiments: The SOTA claim on D4RL is presented without error bars across seeds, full ablation tables on the expectile parameter τ, or direct comparisons against the most recent baselines at the time of submission; this weakens the empirical support for the generalization hypothesis.
Authors: We will add error bars computed over multiple random seeds to all reported results in the revised manuscript. Additionally, we will include a more comprehensive ablation study on the expectile parameter τ, presenting results across a range of values in the main paper or appendix. For comparisons, we will incorporate any additional baselines that have become available since the original submission to provide a more up-to-date evaluation, while noting the timing of the experiments. revision: yes
Circularity Check
No significant circularity: IQL relies on standard RL backups and uses the expectile definition as a modeling heuristic, not on its own outputs.
full rationale
The paper presents IQL as alternating between upper-expectile regression for V(s) (treating action as the source of randomness in the state-value random variable) and standard temporal-difference backup into Q, followed by advantage-weighted behavioral cloning. This chain invokes the generalization capacity of the function approximator as an explicit modeling assumption rather than deriving the approximation to max_a Q(s,a) from any fitted quantity or prior result within the paper. No equation reduces a claimed prediction to an input parameter by construction, no self-citation is load-bearing for the core mechanism, and the method is evaluated on external benchmarks (D4RL) without internal self-reference loops. The derivation is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The state value function can be treated as a random variable whose randomness is determined by the action while integrating over the dynamics.
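A compact way to state this assumption in symbols (notation assumed rather than quoted from the paper): for a fixed state, the only randomness in the value comes from the dataset action, while the next state is averaged out inside the Q backup.

```latex
% L_2^tau is the asymmetric expectile loss written out earlier.
\[
  V_{\tau}(s) = \arg\min_{v}\;
    \mathbb{E}_{A \sim \pi_{\beta}(\cdot \mid s)}
      \Bigl[ L_2^{\tau}\bigl(Q(s, A) - v\bigr) \Bigr],
  \qquad
  Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\bigl[ V_{\tau}(s') \bigr].
\]
```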
Forward citations
Cited by 29 Pith papers
-
Path-Coupled Bellman Flows for Distributional Reinforcement Learning
Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.
-
Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...
-
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning
FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.
-
SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation
SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
-
Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation
Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.
-
WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.
-
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.
-
Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning
Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based tasks.
-
Predictive but Not Plannable: RC-aux for Latent World Models
RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
-
Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer
Injecting RTG into states outside the autoregressive sequence yields shorter, more efficient Decision Transformers that outperform the original on offline RL tasks.
-
Offline Reinforcement Learning for Rotation Profile Control in Tokamaks
Offline RL policies trained solely on DIII-D historical data were deployed on the tokamak and produced promising real-world control of the plasma rotation profile.
-
On the Role of Language Representations in Auto-Bidding: Findings and Implications
SemBid injects LLM-encoded Task, History, and Strategy semantics as tokens into offline bidding trajectories and uses self-attention to outperform numerical-only baselines in performance, constraint satisfaction, and ...
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning
Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
-
When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning
For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero succe...
-
Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning
Occupancy Reward Shaping extracts goal-reaching rewards from world-model occupancy measures using optimal transport, improving offline goal-conditioned RL performance 2.2x on 13 tasks without changing the optimal policy.
-
Fisher Decorator: Refining Flow Policy via a Local Transport Map
Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
-
Abstract Sim2Real through Approximate Information States
Abstract simulators can be grounded to real tasks by making their dynamics history-dependent and correcting them with real data, enabling RL policy transfer.
-
When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse
KICL completes execution decisions in KOL financial discourse using offline RL, achieving top returns and Sharpe ratios with no unsupported trades or direction changes on YouTube and X data from 2022-2025.
-
MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks
MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.
-
JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing
JD-BP jointly generates bids and pricing corrections via generative models, memory-less return-to-go, trajectory augmentation, and energy-based DPO to improve auto-bidding performance despite prediction errors and latency.
-
Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents
STEP-HRL enables step-level learning in LLM agents via hierarchical task structure and local progress modules, outperforming baselines on ScienceWorld and ALFWorld while cutting token usage.
-
Hierarchical Planning with Latent World Models
Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
-
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.
-
Safe-Support Q-Learning: Learning without Unsafe Exploration
Safe-Support Q-Learning trains Q-functions and policies in reinforcement learning without ever visiting unsafe states by constraining the behavior policy to a safe set and using KL-regularized Bellman targets in a two...
-
Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning
Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.
Reference graph
Works this paper leans on
-
[1]
Offline RL without Off-Policy Evaluation
David Brandfonbrener, William F. Whitney, Rajesh Ranganath, and Joan Bruna. Offline RL without off-policy evaluation. arXiv preprint arXiv:2106.08909, 2021.
-
[2]
Decision Transformer: Reinforcement Learning via Sequence Modeling
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345, 2021.
-
[3]
Distributional Reinforcement Learning with Quantile Regression
Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-
[4]
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
-
[5]
A minimalist approach to offline reinforcement learning
Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. arXiv preprint arXiv:2106.06860, 2021.
-
[6]
Off-policy deep reinforcement learning without exploration
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. PMLR, 2019.
-
[7]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-
[8]
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
- [9]
-
[10]
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
-
[11]
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-
[12]
Critic Regularized Regression
Ziyu Wang, Alexander Novikov, Konrad Zolna, Jost Tobias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Josh Merel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. arXiv preprint arXiv:2006.15134, 2020.
-
[13]
Behavior Regularized Offline Reinforcement Learning
Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.