pith. machine review for the scientific record

arxiv: 2004.07219 · v4 · submitted 2020-04-15 · 💻 cs.LG · stat.ML

Recognition: 1 theorem link · Lean Theorem

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 23:14 UTC · model grok-4.3

classification 💻 cs.LG · stat.ML
keywords offline reinforcement learning · benchmarks · datasets · D4RL · deep RL · static datasets · policy evaluation

The pith

New benchmark datasets for offline RL, drawn from human demonstrations and mixed policies, expose deficiencies in existing algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces benchmark datasets specifically for offline reinforcement learning that reflect real-world data collection scenarios. These include collections from hand-designed controllers, human demonstrators, multitask environments, and mixtures of policies instead of relying only on data from partially trained agents. Testing current methods on these datasets shows important shortcomings that simpler benchmarks had hidden. This matters because offline RL aims to learn policies from large static datasets the way supervised learning does, but progress requires test conditions that match that setting. The work releases the datasets, an evaluation protocol, and baseline results to give the community a shared foundation for addressing those shortcomings.
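
As a concrete anchor, this is roughly how the released artifacts are consumed, assuming the open-source d4rl package and the classic Gym API; the task name is one illustrative entry from the suite, and exact dataset names vary across versions.

```python
# Minimal sketch, assuming `pip install d4rl` plus a working Gym/MuJoCo setup.
import gym
import d4rl  # importing registers the offline tasks with Gym

env = gym.make("maze2d-umaze-v1")  # illustrative task name

# Raw dataset: a dict of aligned arrays.
dataset = env.get_dataset()
print(dataset["observations"].shape, dataset["actions"].shape)

# Convenience view with next_observations, ready for Q-learning-style methods.
qdata = d4rl.qlearning_dataset(env)
print(sorted(qdata.keys()))
```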

Core claim

The authors establish that moving beyond simple benchmark tasks, and beyond data collected by partially trained RL agents, to datasets generated via hand-designed controllers, human demonstrators, multitask settings, and mixtures of policies reveals important and previously unappreciated deficiencies in existing offline RL algorithms. They provide these benchmarks, together with evaluations and open-source examples, to serve as a common starting point.

What carries the argument

The D4RL benchmark suite of datasets and evaluation protocol, built around collection from hand-designed controllers, human demonstrators, multitask environments, and policy mixtures.
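
To make the policy-mixture scheme concrete, a generic collection sketch follows; the collect_episode helper and the policies argument are hypothetical scaffolding written against the classic Gym step API, not D4RL's actual collection code.

```python
import random

def collect_episode(env, policy, max_steps=1000):
    """Hypothetical helper: roll out one episode under a behavior policy
    (classic Gym API) and return its transitions."""
    obs, transitions = env.reset(), []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions

def mixture_dataset(env, policies, episodes_per_policy=100):
    """Pool trajectories from several behavior policies into one static
    dataset, mimicking the mixture collection scheme described above."""
    data = []
    for policy in policies:
        for _ in range(episodes_per_policy):
            data.extend(collect_episode(env, policy))
    random.shuffle(data)  # the learner sees an anonymous, multimodal mixture
    return data
```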

If this is right

  • Algorithms previously considered successful must be re-evaluated, and often improved, when tested on data from human and mixed-policy sources.
  • The community gains a standardized way to measure progress in offline RL that aligns with learning from large static datasets (see the normalized-score sketch after this list).
  • Development can now target handling of realistic data properties such as distribution shift from policy mixtures.
  • Large previously collected datasets become more directly usable for training without online interaction.
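
The standardized measure referenced above is D4RL's normalized score, which rescales raw return so that 0 corresponds to a random policy and 100 to a reference expert. A minimal sketch of the formula; the released package ships per-task reference constants behind env.get_normalized_score, and the example values below are made up, not numbers from the paper.

```python
def normalized_score(score, random_score, expert_score):
    """D4RL's normalized return: ~0 for a random policy, ~100 for the expert."""
    return 100.0 * (score - random_score) / (expert_score - random_score)

# Example with made-up reference values:
print(normalized_score(score=1800.0, random_score=20.0, expert_score=3600.0))
```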

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Methods that succeed on these benchmarks may transfer more reliably to domains where online data collection is expensive or unsafe.
  • The emphasis on dataset diversity points to distribution shift as a central challenge that new offline RL techniques must address.
  • Similar benchmark design principles could be applied to other sequential decision problems that rely on logged data.

Load-bearing premise

Datasets generated via hand-designed controllers, human demonstrators, multitask settings, and mixtures of policies capture the key properties most relevant to real-world offline RL applications.

What would settle it

Running existing offline RL algorithms on the released D4RL datasets and finding no performance deficiencies relative to their results on prior benchmarks would falsify the claim that these new datasets reveal unappreciated shortcomings.
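
Sketched below is that settling experiment, assuming the d4rl package: train an existing offline method on a released dataset, evaluate the frozen policy online, and report the normalized score. The train_offline function is a deliberately trivial placeholder, not a method from the paper, and the task name is illustrative.

```python
import gym
import d4rl
import numpy as np

env = gym.make("hopper-medium-v0")

def train_offline(dataset):
    """Placeholder for an existing offline RL method (BC, BCQ, BEAR, ...);
    returns a trivial random policy so the sketch runs end to end."""
    return lambda obs: env.action_space.sample()

policy = train_offline(d4rl.qlearning_dataset(env))

returns = []
for _ in range(10):  # evaluation episodes
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    returns.append(total)

# get_normalized_score returns the 0-1 scale; multiply by 100 by convention.
print(100 * env.get_normalized_score(np.mean(returns)))
```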

read the original abstract

The offline reinforcement learning (RL) setting (also known as full batch RL), where a policy is learned from a static dataset, is compelling as progress enables RL methods to take advantage of large, previously-collected datasets, much like how the rise of large datasets has fueled results in supervised learning. However, existing online RL benchmarks are not tailored towards the offline setting and existing offline RL benchmarks are restricted to data generated by partially-trained agents, making progress in offline RL difficult to measure. In this work, we introduce benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL. With a focus on dataset collection, examples of such properties include: datasets generated via hand-designed controllers and human demonstrators, multitask datasets where an agent performs different tasks in the same environment, and datasets collected with mixtures of policies. By moving beyond simple benchmark tasks and data collected by partially-trained RL agents, we reveal important and unappreciated deficiencies of existing algorithms. To facilitate research, we have released our benchmark tasks and datasets with a comprehensive evaluation of existing algorithms, an evaluation protocol, and open-source examples. This serves as a common starting point for the community to identify shortcomings in existing offline RL methods and a collaborative route for progress in this emerging area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the D4RL benchmark suite for offline reinforcement learning, consisting of datasets and tasks collected via hand-designed controllers, human demonstrators, multitask settings, and mixtures of policies. It provides a comprehensive evaluation of existing offline RL algorithms on these datasets, an evaluation protocol, and open-source code, claiming that these resources reveal important deficiencies in current methods that are not apparent on simpler benchmarks or data from partially-trained agents.

Significance. If the evaluation results hold, D4RL could serve as a foundational benchmark for offline RL, analogous to how standardized datasets advanced supervised learning. The explicit focus on dataset collection properties relevant to real-world applications, combined with the open release of artifacts and protocol, provides a concrete starting point for identifying and addressing algorithmic limitations.

major comments (2)
  1. [Introduction and Dataset Collection] The claim that the chosen collection methods reveal 'important and unappreciated deficiencies' of existing algorithms rests on the assumption that hand-designed controllers, human demonstrators, multitask settings, and policy mixtures produce datasets whose coverage, multimodality, and distribution-shift properties are representative of real-world offline RL. No quantitative comparisons (e.g., state-action support overlap, behavior-policy return statistics, or entropy measures) against external real-world offline datasets are provided to support this link.
  2. [Experiments] The abstract states that the benchmarks reveal deficiencies and includes a comprehensive evaluation, but the manuscript must explicitly report the quantitative results, baseline comparisons, and statistical significance for the performance gaps to allow verification that the deficiencies are substantial and not artifacts of the specific synthetic collection processes.
minor comments (2)
  1. [Abstract] The abstract could more precisely state the number of tasks, environments, and dataset variants introduced to give readers an immediate sense of scale.
  2. [Dataset Collection] Notation for policy mixtures and multitask data collection should be defined consistently when first introduced to avoid ambiguity in later sections.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on our D4RL benchmark paper. The comments highlight important areas for clarifying the motivation behind our dataset choices and improving the explicit reporting of experimental results. We address each major comment below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Introduction and Dataset Collection] The claim that the chosen collection methods reveal 'important and unappreciated deficiencies' of existing algorithms rests on the assumption that hand-designed controllers, human demonstrators, multitask settings, and policy mixtures produce datasets whose coverage, multimodality, and distribution-shift properties are representative of real-world offline RL. No quantitative comparisons (e.g., state-action support overlap, behavior-policy return statistics, or entropy measures) against external real-world offline datasets are provided to support this link.

    Authors: We agree that direct quantitative comparisons to external real-world datasets would provide stronger support for the representativeness claim. However, standardized public real-world offline RL datasets with comparable metrics are limited or unavailable for direct side-by-side analysis. Our collection methods were explicitly chosen to instantiate key real-world-relevant properties (narrow support and low entropy from expert/human data, multimodality from policy mixtures, and distribution shift from multitask data), which are documented in the paper's dataset descriptions and motivated by applications such as robotics. We will revise the Introduction and Dataset Collection sections to tone down the language from 'representative' to 'capturing important properties relevant to,' add explicit dataset statistics such as return distributions and coverage measures in a new table or subsection (a sketch of such statistics appears after these responses), and include a brief discussion of why these properties matter for real-world offline RL. revision: partial

  2. Referee: [Experiments] The abstract states that the benchmarks reveal deficiencies and includes a comprehensive evaluation, but the manuscript must explicitly report the quantitative results, baseline comparisons, and statistical significance for the performance gaps to allow verification that the deficiencies are substantial and not artifacts of the specific synthetic collection processes.

    Authors: The full manuscript (Section 5 and associated tables/figures) already presents comprehensive quantitative results across all tasks, comparing offline RL algorithms against expert performance, online RL baselines, and behavior cloning, with clear performance gaps highlighted (e.g., many offline methods failing to exceed behavior cloning returns on complex tasks). To make this more explicit and verifiable, we will revise the Experiments section to include a summary paragraph with key numerical results, add error bars or standard deviations from multiple random seeds, report statistical significance for major gaps where appropriate, and ensure baseline details and protocol are stated in the main text (with full details remaining in the appendix and released code). revision: yes
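
A sketch of the dataset statistics promised in response 1, computed directly from a released dataset. It assumes the open-source d4rl package; the task name and the crude grid-based coverage proxy are illustrative choices, not the paper's methodology.

```python
import gym
import d4rl
import numpy as np

env = gym.make("maze2d-umaze-v1")
data = env.get_dataset()

# Episode returns: split the reward stream on terminal/timeout flags.
term = data["terminals"].astype(bool)
timeout = data.get("timeouts", np.zeros_like(term)).astype(bool)
ends = np.where(term | timeout)[0]
returns, start = [], 0
for end in ends:
    returns.append(data["rewards"][start:end + 1].sum())
    start = end + 1
print("return mean/std:", np.mean(returns), np.std(returns))

# Coarse coverage proxy: occupied cells in a discretized state grid.
obs = data["observations"]
cells = np.floor((obs - obs.min(0)) / (np.ptp(obs, 0) + 1e-8) * 20).astype(int)
print("occupied cells:", len({tuple(row) for row in cells}))
```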

standing simulated objections not resolved
  • Direct quantitative comparisons (e.g., state-action support overlap or entropy) to external real-world offline RL datasets remain outstanding, as no standardized public datasets exist for such analysis; one candidate overlap measure is sketched below.
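
One candidate measure for the outstanding comparison, sketched under synthetic stand-in data: the fraction of one dataset's state-action pairs whose nearest neighbor in the other lies within a small radius. Neither array below comes from the paper, and the eps radius is an arbitrary choice.

```python
import numpy as np
from scipy.spatial import cKDTree

def support_overlap(sa_a, sa_b, eps=0.1):
    """Fraction of rows in sa_a within L2 distance eps of some row of sa_b."""
    dists, _ = cKDTree(sa_b).query(sa_a, k=1)
    return float(np.mean(dists <= eps))

rng = np.random.default_rng(0)
benchmark = rng.normal(size=(1000, 6))  # stand-in for a benchmark dataset
external = rng.normal(size=(1000, 6))   # stand-in for a real-world log
print(support_overlap(benchmark, external))
```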

Circularity Check

0 steps flagged

No circularity in derivation or claims

full rationale

The paper introduces external artifacts (new datasets and benchmark tasks) and performs empirical evaluations of existing algorithms on them. It contains no equations, fitted parameters, predictions derived from inputs, self-definitional constructs, or load-bearing self-citations that reduce any central claim to its own inputs by construction. Dataset collection methods are motivated qualitatively as representative of real-world properties, but this is an assumption about external validity rather than a circular reduction within the paper's own logic. The work is self-contained as a benchmark contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

This is a benchmark introduction paper with no mathematical derivations, fitted parameters, or new physical entities. The contribution rests on domain assumptions about what dataset properties matter for real-world offline RL.

axioms (1)
  • domain assumption · Offline RL progress is hindered by lack of benchmarks using realistic static datasets from non-RL sources.
    Invoked in the abstract as the core motivation for creating D4RL.
invented entities (1)
  • D4RL benchmark suite · no independent evidence
    purpose: To provide standardized tasks and datasets for offline RL evaluation
    Newly introduced collection of tasks and data releases.

pith-pipeline@v0.9.0 · 5533 in / 1182 out tokens · 28675 ms · 2026-05-12T23:14:59.570931+00:00 · methodology

discussion (0)


Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Offline Reinforcement Learning with Implicit Q-Learning

    cs.LG 2021-10 unverdicted novelty 8.0

    IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

  2. Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry

    cs.LG 2026-05 unverdicted novelty 7.0

    MSRL represents trajectory segments as PSD matrices to prove additive composition properties and bootstrap value functions for better transfer, reaching 0.73 AUC versus 0.57-0.65 baselines.

  3. Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.

  4. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  5. Muninn: Your Trajectory Diffusion Model But Faster

    cs.RO 2026-05 unverdicted novelty 7.0

    Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.

  6. Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

  7. Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies

    cs.LG 2026-05 unverdicted novelty 7.0

    A hitting-time isomorphism framework learns asymmetric Hilbert-space geometries for offline RL, yielding the IEL algorithm with identifiability proofs and improved maze navigation performance.

  8. Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...

  9. Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.

  10. SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    SpecRLBench is a new benchmark evaluating generalization of LTL-guided RL methods across navigation and manipulation domains with static/dynamic environments and varied robot dynamics.

  11. A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Adapting RFRL objectives as auxiliary tasks with preference-guided exploration outperforms prior MORL methods in performance and data efficiency on MO-Gymnasium tasks.

  12. Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    DROL trains one-step offline RL actors via top-1 dynamic routing of dataset actions to latent candidates, enabling local improvements while preserving data support and retaining cheap inference.

  13. ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    cs.RO 2026-04 unverdicted novelty 7.0

    ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...

  14. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  15. ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    ROAD formulates data mixing as a bi-level optimization problem solved via multi-armed bandit to adaptively balance offline priors and online updates in RL.

  16. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    cs.LG 2026-05 unverdicted novelty 6.0

    Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

  17. Discrete Flow Matching for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.

  18. RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

    cs.AI 2026-05 unverdicted novelty 6.0

    RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.

  19. Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer

    cs.LG 2026-05 unverdicted novelty 6.0

    Injecting RTG into states outside the autoregressive sequence yields shorter, more efficient Decision Transformers that outperform the original on offline RL tasks.

  20. When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.

  21. Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    An adaptive UCB-based policy selection and fine-tuning strategy improves performance over standard O2O-RL baselines under interaction budgets.

  22. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  23. AdamO: A Collapse-Suppressed Optimizer for Offline RL

    cs.LG 2026-05 unverdicted novelty 6.0

    AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

  24. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  25. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  26. Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities

    cs.LG 2026-05 unverdicted novelty 6.0

    Frozen text-pretrained transformer weights transfer across modalities through a thin interface, achieving SOTA on a robotic task and parity on decision-making with far fewer trainable parameters.

  27. When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero succe...

  28. Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

    cs.LG 2026-04 conditional novelty 6.0

    Occupancy Reward Shaping extracts goal-reaching rewards from world-model occupancy measures using optimal transport, improving offline goal-conditioned RL performance 2.2x on 13 tasks without changing the optimal policy.

  29. DAG-STL: A Hierarchical Framework for Zero-Shot Trajectory Planning under Signal Temporal Logic Specifications

    cs.RO 2026-04 unverdicted novelty 6.0

    DAG-STL decomposes long-horizon STL planning into decomposition, timed waypoint allocation, and diffusion-based trajectory generation to enable zero-shot planning under unknown dynamics.

  30. Fisher Decorator: Refining Flow Policy via a Local Transport Map

    cs.LG 2026-04 unverdicted novelty 6.0

    Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

  31. GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control

    cs.LG 2026-04 unverdicted novelty 6.0

    GIRL reduces latent rollout drift by 38-61% versus DreamerV3 in MBRL by grounding transitions with DINOv2 embeddings and using an information-theoretic adaptive bottleneck, yielding better long-horizon returns on cont...

  32. Reinforced Self-Training (ReST) for Language Modeling

    cs.CL 2023-08 unverdicted novelty 6.0

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  33. IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    cs.LG 2023-04 conditional novelty 6.0

    IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.

  34. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    cs.RO 2021-08 accept novelty 6.0

    A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...

  35. Trajectory-Level Data Augmentation for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Trajectory-based data augmentation exploits geometric relationships between rewards, values, and logging policies to enable effective offline RL from few suboptimal trajectories.

  36. ReinVBC: A Model-based Reinforcement Learning Approach to Vehicle Braking Controller

    cs.RO 2026-04 unverdicted novelty 5.0

    ReinVBC applies offline model-based RL to learn vehicle dynamics and braking policies, with results indicating real-world capability and potential to replace production anti-lock braking systems.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 35 Pith papers · 5 internal anchors

  1. [1]

    Optimality and approximation with policy gradient methods in Markov decision processes

    Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019a. Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. CoRR, abs/1907.04543, 2019b. URL http://arx...

  2. [2]

    Scaling data-driven robotics with reward sketching and batch reinforcement learning

    Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Żołna, Yusuf Aytar, David Budden, Mel Vecerik, et al. A framework for data-driven robotics. arXiv preprint arXiv:1909.12200, 2019.

  3. [3]

    End-to-end driving via conditional imitation learning

    Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. IEEE, 2018.

  4. [4]

    Challenges of real-world reinforcement learning

    Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.

  5. [5]

    An empirical investigation of the challenges of real-world reinforcement learning

    Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. An empirical investigation of the challenges of real-world reinforcement learning. arXiv preprint arXiv:2003.11881, 2020.

  6. [6]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018a. Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on ...

  7. [7]

    Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates

    Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389–3396. IEEE, 2017.

  8. [8]

    Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning

    Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956, 2019.

  9. [9]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018a. URL http://arxiv.org/abs/1801.01290. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcemen...

  10. [10]

    A real-time model-based reinforcement learning architecture for robot control

    Todd Hester, Michael Quinlan, and Peter Stone. A real-time model-based reinforcement learning architecture for robot control. arXiv preprint arXiv:1105.1749, 2011.

  11. [11]

    Recsim: A configurable simulation platform for recommender systems

    Eugene Ie, Chih-wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. Recsim: A configurable simulation platform for recommender systems. arXiv preprint arXiv:1909.04847, 2019.

  12. [13]

    Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

    Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019. Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pp. 651–673, 2018.

  13. [14]

    Stabilizing off-policy Q-learning via bootstrapping error reduction

    Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019. Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.

  14. [15]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

  15. [16]

    Algaedice: Policy gradient from arbitrary experience

    Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.

  16. [17]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

  17. [18]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

  18. [19]

    Deep imitative models for flexible inference, planning, and control

    Nicholas Rhinehart, Rowan McAllister, and Sergey Levine. Deep imitative models for flexible inference, planning, and control. arXiv preprint arXiv:1810.06544, 2018.

  19. [20]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.

  20. [21]

    Flow: Architecture and benchmarking for reinforcement learning in traffic control

    Cathy Wu, Aboudy Kreidieh, Kanaad Parvate, Eugene Vinitsky, and Alexandre M Bayen. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465, 2017.

  21. [22]

    Behavior Regularized Offline Reinforcement Learning

    Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
