pith. machine review for the scientific record

arxiv: 2004.07219 · v4 · submitted 2020-04-15 · 💻 cs.LG · stat.ML

Recognition: 1 theorem link · Lean Theorem

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 23:14 UTC · model grok-4.3

classification 💻 cs.LG · stat.ML
keywords offline reinforcement learning · benchmarks · datasets · D4RL · deep RL · static datasets · policy evaluation

The pith

New benchmark datasets for offline RL, drawn from human demonstrations and mixed policies, expose deficiencies in existing algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces benchmark datasets specifically for offline reinforcement learning that reflect real-world data collection scenarios. These include collections from hand-designed controllers, human demonstrators, multitask environments, and mixtures of policies instead of relying only on data from partially trained agents. Testing current methods on these datasets shows important shortcomings that simpler benchmarks had hidden. This matters because offline RL aims to learn policies from large static datasets the way supervised learning does, but progress requires test conditions that match that setting. The work releases the datasets, an evaluation protocol, and baseline results to give the community a shared foundation for addressing those shortcomings.
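
As a concrete anchor, this is roughly how the released artifacts are consumed, assuming the open-source d4rl package and the classic Gym API; the task name is one illustrative entry from the suite, and exact dataset names vary across versions.

```python
# Minimal sketch, assuming `pip install d4rl` plus a working Gym/MuJoCo setup.
import gym
import d4rl  # importing registers the offline tasks with Gym

env = gym.make("maze2d-umaze-v1")  # illustrative task name

# Raw dataset: a dict of aligned arrays.
dataset = env.get_dataset()
print(dataset["observations"].shape, dataset["actions"].shape)

# Convenience view with next_observations, ready for Q-learning-style methods.
qdata = d4rl.qlearning_dataset(env)
print(sorted(qdata.keys()))
```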

Core claim

The authors establish that moving beyond simple benchmark tasks, and beyond data collected by partially trained RL agents, to datasets generated via hand-designed controllers, human demonstrators, multitask settings, and mixtures of policies reveals important and previously unappreciated deficiencies in existing offline RL algorithms. They provide these benchmarks, together with evaluations and open-source examples, to serve as a common starting point.

What carries the argument

The D4RL benchmark suite of datasets and evaluation protocol, built around collection from hand-designed controllers, human demonstrators, multitask environments, and policy mixtures.
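
To make the policy-mixture scheme concrete, a generic collection sketch follows; the collect_episode helper and the policies argument are hypothetical scaffolding written against the classic Gym step API, not D4RL's actual collection code.

```python
import random

def collect_episode(env, policy, max_steps=1000):
    """Hypothetical helper: roll out one episode under a behavior policy
    (classic Gym API) and return its transitions."""
    obs, transitions = env.reset(), []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions

def mixture_dataset(env, policies, episodes_per_policy=100):
    """Pool trajectories from several behavior policies into one static
    dataset, mimicking the mixture collection scheme described above."""
    data = []
    for policy in policies:
        for _ in range(episodes_per_policy):
            data.extend(collect_episode(env, policy))
    random.shuffle(data)  # the learner sees an anonymous, multimodal mixture
    return data
```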

If this is right

  • Algorithms previously considered successful must be re-evaluated, and often improved, when tested on data from human and mixed-policy sources.
  • The community gains a standardized way to measure progress in offline RL that aligns with learning from large static datasets (see the normalized-score sketch after this list).
  • Development can now target handling of realistic data properties such as distribution shift from policy mixtures.
  • Large previously collected datasets become more directly usable for training without online interaction.
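
The standardized measure referenced above is D4RL's normalized score, which rescales raw return so that 0 corresponds to a random policy and 100 to a reference expert. A minimal sketch of the formula; the released package ships per-task reference constants behind env.get_normalized_score, and the example values below are made up, not numbers from the paper.

```python
def normalized_score(score, random_score, expert_score):
    """D4RL's normalized return: ~0 for a random policy, ~100 for the expert."""
    return 100.0 * (score - random_score) / (expert_score - random_score)

# Example with made-up reference values:
print(normalized_score(score=1800.0, random_score=20.0, expert_score=3600.0))
```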

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Methods that succeed on these benchmarks may transfer more reliably to domains where online data collection is expensive or unsafe.
  • The emphasis on dataset diversity points to distribution shift as a central challenge that new offline RL techniques must address.
  • Similar benchmark design principles could be applied to other sequential decision problems that rely on logged data.

Load-bearing premise

Datasets generated via hand-designed controllers, human demonstrators, multitask settings, and mixtures of policies capture the key properties most relevant to real-world offline RL applications.

What would settle it

Running existing offline RL algorithms on the released D4RL datasets and finding no performance deficiencies relative to their results on prior benchmarks would falsify the claim that these new datasets reveal unappreciated shortcomings.
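
Sketched below is that settling experiment, assuming the d4rl package: train an existing offline method on a released dataset, evaluate the frozen policy online, and report the normalized score. The train_offline function is a deliberately trivial placeholder, not a method from the paper, and the task name is illustrative.

```python
import gym
import d4rl
import numpy as np

env = gym.make("hopper-medium-v0")

def train_offline(dataset):
    """Placeholder for an existing offline RL method (BC, BCQ, BEAR, ...);
    returns a trivial random policy so the sketch runs end to end."""
    return lambda obs: env.action_space.sample()

policy = train_offline(d4rl.qlearning_dataset(env))

returns = []
for _ in range(10):  # evaluation episodes
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    returns.append(total)

# get_normalized_score returns the 0-1 scale; multiply by 100 by convention.
print(100 * env.get_normalized_score(np.mean(returns)))
```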

read the original abstract

The offline reinforcement learning (RL) setting (also known as full batch RL), where a policy is learned from a static dataset, is compelling as progress enables RL methods to take advantage of large, previously-collected datasets, much like how the rise of large datasets has fueled results in supervised learning. However, existing online RL benchmarks are not tailored towards the offline setting and existing offline RL benchmarks are restricted to data generated by partially-trained agents, making progress in offline RL difficult to measure. In this work, we introduce benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL. With a focus on dataset collection, examples of such properties include: datasets generated via hand-designed controllers and human demonstrators, multitask datasets where an agent performs different tasks in the same environment, and datasets collected with mixtures of policies. By moving beyond simple benchmark tasks and data collected by partially-trained RL agents, we reveal important and unappreciated deficiencies of existing algorithms. To facilitate research, we have released our benchmark tasks and datasets with a comprehensive evaluation of existing algorithms, an evaluation protocol, and open-source examples. This serves as a common starting point for the community to identify shortcomings in existing offline RL methods and a collaborative route for progress in this emerging area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the D4RL benchmark suite for offline reinforcement learning, consisting of datasets and tasks collected via hand-designed controllers, human demonstrators, multitask settings, and mixtures of policies. It provides a comprehensive evaluation of existing offline RL algorithms on these datasets, an evaluation protocol, and open-source code, claiming that these resources reveal important deficiencies in current methods that are not apparent on simpler benchmarks or data from partially-trained agents.

Significance. If the evaluation results hold, D4RL could serve as a foundational benchmark for offline RL, analogous to how standardized datasets advanced supervised learning. The explicit focus on dataset collection properties relevant to real-world applications, combined with the open release of artifacts and protocol, provides a concrete starting point for identifying and addressing algorithmic limitations.

major comments (2)
  1. [Introduction and Dataset Collection] The claim that the chosen collection methods reveal 'important and unappreciated deficiencies' of existing algorithms rests on the assumption that hand-designed controllers, human demonstrators, multitask settings, and policy mixtures produce datasets whose coverage, multimodality, and distribution-shift properties are representative of real-world offline RL. No quantitative comparisons (e.g., state-action support overlap, behavior-policy return statistics, or entropy measures) against external real-world offline datasets are provided to support this link.
  2. [Experiments] The abstract states that the benchmarks reveal deficiencies and includes a comprehensive evaluation, but the manuscript must explicitly report the quantitative results, baseline comparisons, and statistical significance for the performance gaps to allow verification that the deficiencies are substantial and not artifacts of the specific synthetic collection processes.
minor comments (2)
  1. [Abstract] The abstract could more precisely state the number of tasks, environments, and dataset variants introduced to give readers an immediate sense of scale.
  2. [Dataset Collection] Notation for policy mixtures and multitask data collection should be defined consistently when first introduced to avoid ambiguity in later sections.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on our D4RL benchmark paper. The comments highlight important areas for clarifying the motivation behind our dataset choices and improving the explicit reporting of experimental results. We address each major comment below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Introduction and Dataset Collection] The claim that the chosen collection methods reveal 'important and unappreciated deficiencies' of existing algorithms rests on the assumption that hand-designed controllers, human demonstrators, multitask settings, and policy mixtures produce datasets whose coverage, multimodality, and distribution-shift properties are representative of real-world offline RL. No quantitative comparisons (e.g., state-action support overlap, behavior-policy return statistics, or entropy measures) against external real-world offline datasets are provided to support this link.

    Authors: We agree that direct quantitative comparisons to external real-world datasets would provide stronger support for the representativeness claim. However, standardized public real-world offline RL datasets with comparable metrics are limited or unavailable for direct side-by-side analysis. Our collection methods were explicitly chosen to instantiate key real-world-relevant properties (narrow support and low entropy from expert/human data, multimodality from policy mixtures, and distribution shift from multitask data), which are documented in the paper's dataset descriptions and motivated by applications such as robotics. We will revise the Introduction and Dataset Collection sections to tone down the language from 'representative' to 'capturing important properties relevant to,' add explicit dataset statistics such as return distributions and coverage measures in a new table or subsection (a sketch of such statistics appears after these responses), and include a brief discussion of why these properties matter for real-world offline RL. revision: partial

  2. Referee: [Experiments] The abstract states that the benchmarks reveal deficiencies and includes a comprehensive evaluation, but the manuscript must explicitly report the quantitative results, baseline comparisons, and statistical significance for the performance gaps to allow verification that the deficiencies are substantial and not artifacts of the specific synthetic collection processes.

    Authors: The full manuscript (Section 5 and associated tables/figures) already presents comprehensive quantitative results across all tasks, comparing offline RL algorithms against expert performance, online RL baselines, and behavior cloning, with clear performance gaps highlighted (e.g., many offline methods failing to exceed behavior cloning returns on complex tasks). To make this more explicit and verifiable, we will revise the Experiments section to include a summary paragraph with key numerical results, add error bars or standard deviations from multiple random seeds, report statistical significance for major gaps where appropriate, and ensure baseline details and protocol are stated in the main text (with full details remaining in the appendix and released code). revision: yes
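
A sketch of the dataset statistics promised in response 1, computed directly from a released dataset. It assumes the open-source d4rl package; the task name and the crude grid-based coverage proxy are illustrative choices, not the paper's methodology.

```python
import gym
import d4rl
import numpy as np

env = gym.make("maze2d-umaze-v1")
data = env.get_dataset()

# Episode returns: split the reward stream on terminal/timeout flags.
term = data["terminals"].astype(bool)
timeout = data.get("timeouts", np.zeros_like(term)).astype(bool)
ends = np.where(term | timeout)[0]
returns, start = [], 0
for end in ends:
    returns.append(data["rewards"][start:end + 1].sum())
    start = end + 1
print("return mean/std:", np.mean(returns), np.std(returns))

# Coarse coverage proxy: occupied cells in a discretized state grid.
obs = data["observations"]
cells = np.floor((obs - obs.min(0)) / (np.ptp(obs, 0) + 1e-8) * 20).astype(int)
print("occupied cells:", len({tuple(row) for row in cells}))
```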

standing simulated objections not resolved
  • Direct quantitative comparisons (e.g., state-action support overlap or entropy) to external real-world offline RL datasets remain outstanding, as no standardized public datasets exist for such analysis; one candidate overlap measure is sketched below.
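
One candidate measure for the outstanding comparison, sketched under synthetic stand-in data: the fraction of one dataset's state-action pairs whose nearest neighbor in the other lies within a small radius. Neither array below comes from the paper, and the eps radius is an arbitrary choice.

```python
import numpy as np
from scipy.spatial import cKDTree

def support_overlap(sa_a, sa_b, eps=0.1):
    """Fraction of rows in sa_a within L2 distance eps of some row of sa_b."""
    dists, _ = cKDTree(sa_b).query(sa_a, k=1)
    return float(np.mean(dists <= eps))

rng = np.random.default_rng(0)
benchmark = rng.normal(size=(1000, 6))  # stand-in for a benchmark dataset
external = rng.normal(size=(1000, 6))   # stand-in for a real-world log
print(support_overlap(benchmark, external))
```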

Circularity Check

0 steps flagged

No circularity in derivation or claims

full rationale

The paper introduces external artifacts (new datasets and benchmark tasks) and performs empirical evaluations of existing algorithms on them. It contains no equations, fitted parameters, predictions derived from inputs, self-definitional constructs, or load-bearing self-citations that reduce any central claim to its own inputs by construction. Dataset collection methods are motivated qualitatively as representative of real-world properties, but this is an assumption about external validity rather than a circular reduction within the paper's own logic. The work is self-contained as a benchmark contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

This is a benchmark introduction paper with no mathematical derivations, fitted parameters, or new physical entities. The contribution rests on domain assumptions about what dataset properties matter for real-world offline RL.

axioms (1)
  • domain assumption · Offline RL progress is hindered by lack of benchmarks using realistic static datasets from non-RL sources.
    Invoked in the abstract as the core motivation for creating D4RL.
invented entities (1)
  • D4RL benchmark suite · no independent evidence
    purpose: To provide standardized tasks and datasets for offline RL evaluation
    Newly introduced collection of tasks and data releases.

pith-pipeline@v0.9.0 · 5533 in / 1182 out tokens · 28675 ms · 2026-05-12T23:14:59.570931+00:00 · methodology

discussion (0)


Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Offline Reinforcement Learning with Implicit Q-Learning

    cs.LG 2021-10 unverdicted novelty 8.0

    IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

  2. Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry

    cs.LG 2026-05 unverdicted novelty 7.0

    MSRL represents trajectory segments as PSD matrices to prove additive composition properties and bootstrap value functions for better transfer, reaching 0.73 AUC versus 0.57-0.65 baselines.

  3. Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.

  4. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  5. Muninn: Your Trajectory Diffusion Model But Faster

    cs.RO 2026-05 unverdicted novelty 7.0

    Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.

  6. Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

  7. Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies

    cs.LG 2026-05 unverdicted novelty 7.0

    A hitting-time isomorphism framework learns asymmetric Hilbert-space geometries for offline RL, yielding the IEL algorithm with identifiability proofs and improved maze navigation performance.

  8. Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...

  9. Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.

  10. SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    SpecRLBench is a new benchmark evaluating generalization of LTL-guided RL methods across navigation and manipulation domains with static/dynamic environments and varied robot dynamics.

  11. A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Adapting RFRL objectives as auxiliary tasks with preference-guided exploration outperforms prior MORL methods in performance and data efficiency on MO-Gymnasium tasks.

  12. Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    DROL trains one-step offline RL actors via top-1 dynamic routing of dataset actions to latent candidates, enabling local improvements while preserving data support and retaining cheap inference.

  13. ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    cs.RO 2026-04 unverdicted novelty 7.0

    ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...

  14. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  15. ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    ROAD formulates data mixing as a bi-level optimization problem solved via multi-armed bandit to adaptively balance offline priors and online updates in RL.

  16. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    cs.LG 2026-05 unverdicted novelty 6.0

    Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

  17. Discrete Flow Matching for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.

  18. RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

    cs.AI 2026-05 unverdicted novelty 6.0

    RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.

  19. Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer

    cs.LG 2026-05 unverdicted novelty 6.0

    Injecting RTG into states outside the autoregressive sequence yields shorter, more efficient Decision Transformers that outperform the original on offline RL tasks.

  20. When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.

  21. Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    An adaptive UCB-based policy selection and fine-tuning strategy improves performance over standard O2O-RL baselines under interaction budgets.

  22. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  23. AdamO: A Collapse-Suppressed Optimizer for Offline RL

    cs.LG 2026-05 unverdicted novelty 6.0

    AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

  24. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  25. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  26. Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities

    cs.LG 2026-05 unverdicted novelty 6.0

    Frozen text-pretrained transformer weights transfer across modalities through a thin interface, achieving SOTA on a robotic task and parity on decision-making with far fewer trainable parameters.

  27. When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero succe...

  28. Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

    cs.LG 2026-04 conditional novelty 6.0

    Occupancy Reward Shaping extracts goal-reaching rewards from world-model occupancy measures using optimal transport, improving offline goal-conditioned RL performance 2.2x on 13 tasks without changing the optimal policy.

  29. DAG-STL: A Hierarchical Framework for Zero-Shot Trajectory Planning under Signal Temporal Logic Specifications

    cs.RO 2026-04 unverdicted novelty 6.0

    DAG-STL decomposes long-horizon STL planning into decomposition, timed waypoint allocation, and diffusion-based trajectory generation to enable zero-shot planning under unknown dynamics.

  30. Fisher Decorator: Refining Flow Policy via a Local Transport Map

    cs.LG 2026-04 unverdicted novelty 6.0

    Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

  31. GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control

    cs.LG 2026-04 unverdicted novelty 6.0

    GIRL reduces latent rollout drift by 38-61% versus DreamerV3 in MBRL by grounding transitions with DINOv2 embeddings and using an information-theoretic adaptive bottleneck, yielding better long-horizon returns on cont...

  32. Reinforced Self-Training (ReST) for Language Modeling

    cs.CL 2023-08 unverdicted novelty 6.0

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  33. IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    cs.LG 2023-04 conditional novelty 6.0

    IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.

  34. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    cs.RO 2021-08 accept novelty 6.0

    A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...

  35. Trajectory-Level Data Augmentation for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Trajectory-based data augmentation exploits geometric relationships between rewards, values, and logging policies to enable effective offline RL from few suboptimal trajectories.

  36. ReinVBC: A Model-based Reinforcement Learning Approach to Vehicle Braking Controller

    cs.RO 2026-04 unverdicted novelty 5.0

    ReinVBC applies offline model-based RL to learn vehicle dynamics and braking policies, with results indicating real-world capability and potential to replace production anti-lock braking systems.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 35 Pith papers · 5 internal anchors

  1. [1]

    Optimality and approximation with policy gradient methods in Markov decision processes

    Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019a. Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. CoRR, abs/1907.04543, 2019b. URL http://arx...

  2. [2]

    Scaling data-driven robotics with reward sketching and batch reinforcement learning

    Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Żołna, Yusuf Aytar, David Budden, Mel Vecerik, et al. A framework for data-driven robotics. arXiv preprint arXiv:1909.12200, 2019.

  3. [3]

    End-to-end driving via conditional imitation learning

    Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. IEEE, 2018.

  4. [4]

    Challenges of real-world reinforcement learning

    Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.

  5. [5]

    An empirical investigation of the challenges of real-world reinforcement learning

    Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. An empirical investigation of the challenges of real-world reinforcement learning. arXiv preprint arXiv:2003.11881, 2020.

  6. [6]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018a. Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on ...

  7. [7]

    Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates

    Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389–3396. IEEE, 2017.

  8. [8]

    Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning

    Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956, 2019.

  9. [9]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018a. URL http://arxiv.org/abs/1801.01290. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcemen...

  10. [10]

    A real-time model-based reinforcement learning architecture for robot control

    Todd Hester, Michael Quinlan, and Peter Stone. A real-time model-based reinforcement learning architecture for robot control. arXiv preprint arXiv:1105.1749, 2011.

  11. [11]

    Recsim: A configurable simulation platform for recommender systems

    Eugene Ie, Chih-wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. Recsim: A configurable simulation platform for recommender systems. arXiv preprint arXiv:1909.04847, 2019.

  12. [13]

    Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

    Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019. Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pp. 651–673, 2018.

  13. [14]

    Stabilizing off-policy Q-learning via bootstrapping error reduction

    Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019. Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.

  14. [15]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

  15. [16]

    Algaedice: Policy gradient from arbitrary experience

    Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.

  16. [17]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

  17. [18]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

  18. [19]

    Deep imitative models for flexible inference, planning, and control

    Nicholas Rhinehart, Rowan McAllister, and Sergey Levine. Deep imitative models for flexible inference, planning, and control. arXiv preprint arXiv:1810.06544, 2018.

  19. [20]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.

  20. [21]

    Flow: Architecture and benchmarking for reinforcement learning in traffic control

    Cathy Wu, Aboudy Kreidieh, Kanaad Parvate, Eugene Vinitsky, and Alexandre M Bayen. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465, 2017.

  21. [22]

    Behavior Regularized Offline Reinforcement Learning

    Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
