pith. machine review for the scientific record.

arxiv: 2108.03298 · v2 · submitted 2021-08-06 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 08:47 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords robot manipulation · imitation learning · offline reinforcement learning · human demonstrations · multi-stage tasks · benchmarking

The pith

Learning from offline human demonstrations for robot manipulation is most sensitive to demonstration quality, algorithmic design choices, and evaluation stopping criteria.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks six offline learning algorithms on five simulated and three real-world multi-stage manipulation tasks, using human demonstration datasets of varying quality. It finds that performance depends heavily on the quality of the provided demonstrations, varies with specific algorithmic design decisions, and changes with the training stopping point because training and evaluation objectives differ. The results show these methods can produce proficient policies on complex tasks beyond the reach of current reinforcement learning approaches, and that they scale directly to real-world settings using only raw sensory inputs. The work releases its datasets and implementations to support standardized future comparisons.
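One concrete consequence of the stopping-criterion lesson, sketched below with invented numbers (this is not the paper's code): a checkpoint chosen by validation loss and one chosen by rollout success rate can sit at very different training steps, so an evaluation protocol has to commit to one selection rule.

    # Hedged sketch: checkpoint selection under a training/evaluation mismatch.
    # Both "metrics" below are stand-ins with hypothetical shapes, not real runs.
    import random

    def validation_loss(step: int) -> float:
        # Stand-in for a behavioral-cloning validation loss that keeps falling.
        return 1.0 / (1 + step)

    def rollout_success_rate(step: int, n_rollouts: int = 50) -> float:
        # Stand-in for task success, which can peak mid-training and then degrade.
        peak = max(0.8 - 0.3 * abs(step - 600) / 600, 0.0)  # hypothetical shape
        return sum(random.random() < peak for _ in range(n_rollouts)) / n_rollouts

    checkpoints = list(range(100, 1201, 100))
    by_loss = min(checkpoints, key=validation_loss)          # lands on the last step
    by_rollout = max(checkpoints, key=rollout_success_rate)  # lands near the peak
    print(f"lowest loss at step {by_loss}; best rollout success near step {by_rollout}")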

Core claim

The empirical study demonstrates that offline algorithms for learning robot manipulation from human data exhibit pronounced sensitivity to hyperparameters and design choices, strong dependence on the quality of the demonstration datasets, and high variability in outcomes depending on the stopping criterion, since the objectives used in training and in evaluation differ.

What carries the argument

The large-scale comparative evaluation of six offline algorithms across eight manipulation tasks and datasets of controlled quality levels.

If this is right

  • Different algorithmic design choices produce large differences in final policy performance on the same data.
  • Higher-quality human demonstrations directly yield better learned policies than lower-quality ones.
  • Optimal training stopping points differ from optimal evaluation points because the objectives do not align.
  • Offline methods can succeed on multi-stage tasks where current reinforcement learning approaches fail.
  • The same methods apply directly to real-world manipulation using only raw camera and proprioceptive signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Efforts to improve data collection pipelines may yield larger gains than further algorithm tweaks alone.
  • Standardized open benchmarks of this form could reduce redundant experimentation across research groups.
  • The training-evaluation objective mismatch points to a general need for offline methods that optimize for evaluation metrics explicitly.

Load-bearing premise

The chosen five simulated and three real-world tasks, together with their demonstration datasets of varying quality, are representative enough to support general lessons for robot manipulation.

What would settle it

A new algorithm that maintains high success rates across all task difficulties and all demonstration quality levels, or one whose final performance shows no dependence on the choice of training stopping point in these exact setups, would falsify the main lessons.

read the original abstract

Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation. Based on the study, we derive a series of lessons including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available. We have open-sourced our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Codebase, datasets, trained models, and more available at https://arise-initiative.github.io/robomimic-web/
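Since the abstract points to the open-sourced datasets, a minimal reading sketch may help. It assumes the robomimic HDF5 layout as we understand it (a `data` group of `demo_*` trajectories holding `obs` subgroups and `actions` arrays); the file name is a placeholder, and the released documentation is the authority on the exact schema.

    # Hedged sketch: iterating a robomimic-style HDF5 demonstration file.
    # Group/dataset names assume the released layout (data/demo_*/obs, actions);
    # verify against the project documentation before relying on them.
    import h5py

    with h5py.File("low_dim.hdf5", "r") as f:    # placeholder file name
        demos = sorted(f["data"].keys())         # e.g. ["demo_0", "demo_1", ...]
        print(f"{len(demos)} demonstrations")
        ep = f["data"][demos[0]]
        actions = ep["actions"][()]              # (T, action_dim) array
        obs_keys = list(ep["obs"].keys())        # proprioceptive and/or camera keys
        print(f"first demo: {actions.shape[0]} steps, obs keys: {obs_keys}")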

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper conducts an extensive empirical study of six offline learning algorithms (including behavioral cloning variants and offline RL methods) for robot manipulation. It evaluates them on five simulated and three real-world multi-stage tasks using human demonstration datasets of controlled varying quality, derives lessons on algorithmic sensitivities, demonstration quality dependence, and stopping-criteria variability, and highlights opportunities such as scaling to raw sensory inputs and outperforming current RL on complex tasks. All datasets, code, and models are open-sourced.

Significance. If the empirical findings hold, the work is significant for establishing reproducible benchmarks in offline imitation learning for robotics and providing actionable lessons on practical challenges like hyperparameter sensitivity and data quality. The open-sourcing of datasets, implementations, and trained models is a clear strength that enables fair comparisons and future research, directly addressing the lack of open human datasets noted in the abstract.

major comments (1)
  1. [§4] §4 (Experiments and Tasks): The central lessons on algorithmic design sensitivities, demonstration quality dependence, and stopping-criteria effects are derived from only eight tasks (five sim + three real) that share similar proprioceptive+RGB observations and rigid-body pick/place primitives. This limited span risks making the sensitivities regime-specific rather than fundamental to broader manipulation challenges involving different contact dynamics, longer horizons, or action precisions, weakening the generalization claim in the abstract and conclusion.
minor comments (2)
  1. [Table 1] Table 1 and §5: The stopping-criteria analysis would benefit from an explicit statement of how evaluation horizons and success thresholds are chosen independently of training objectives, to avoid any appearance of post-selection; a hypothetical pre-registered protocol is sketched after this list.
  2. [Figure 2] Figure 2: Axis labels and legend entries for the different algorithm variants could be enlarged for readability in print.
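One way to make that independence explicit, sketched as a hypothetical pre-registered protocol (all names and values invented, not drawn from the paper):

    # Hypothetical pre-registered evaluation protocol (invented names and values).
    # Fixing these before training removes any appearance of post-selecting the
    # stopping point to flatter one algorithm.
    EVAL_PROTOCOL = {
        "rollouts_per_checkpoint": 50,
        "max_horizon_steps": 400,          # set independently of training epochs
        "success_criterion": "task completion signal from the environment",
        "checkpoint_every_n_epochs": 20,
        "selection_rule": "best rollout success, all checkpoints reported",
    }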

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the positive review and the recommendation for minor revision. We are grateful for the feedback highlighting the importance of our benchmark and open-sourcing efforts. We address the major comment as follows.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments and Tasks): The central lessons on algorithmic design sensitivities, demonstration quality dependence, and stopping-criteria effects are derived from only eight tasks (five sim + three real) that share similar proprioceptive+RGB observations and rigid-body pick/place primitives. This limited span risks making the sensitivities regime-specific rather than fundamental to broader manipulation challenges involving different contact dynamics, longer horizons, or action precisions, weakening the generalization claim in the abstract and conclusion.

    Authors: We acknowledge that our evaluation is based on eight tasks sharing similar observation modalities and action spaces. However, these tasks were designed to cover a spectrum of manipulation complexities, including varying numbers of stages (from 1 to 5), objects, and success criteria. The identified sensitivities to algorithmic choices, data quality, and stopping criteria stem from core issues in offline imitation learning, such as the distribution mismatch between human demonstrations and policy rollouts, which are not specific to pick-and-place primitives. Our claims in the abstract and conclusion are framed around 'robot manipulation' in the context of these multi-stage tasks, without asserting universality across all possible dynamics. The open release of all datasets, code, and models allows for easy extension to new tasks with different contact dynamics or longer horizons. We believe this does not weaken the generalization of the lessons within the scope of current offline learning for manipulation, but we are happy to add a discussion on the limitations of the task suite in the revised manuscript.

    revision: partial
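The distribution mismatch the authors invoke has a standard textbook form (our notation, not the paper's): behavioral cloning scores actions on states the human visited, while evaluation scores outcomes on states the learned policy itself reaches.

    \min_{\theta}\; \mathbb{E}_{(s,a)\sim\mathcal{D}_{\mathrm{human}}}\!\left[-\log \pi_{\theta}(a \mid s)\right]
    \qquad \text{vs.} \qquad
    \mathbb{E}_{s \sim d^{\pi_{\theta}}}\!\left[\text{task success}\right]

Small action errors shift the rollout state distribution $d^{\pi_\theta}$ away from $\mathcal{D}_{\mathrm{human}}$, and those errors can compound over a multi-stage horizon regardless of the manipulation primitive involved.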

Circularity Check

0 steps flagged

No circularity: empirical lessons derived from external benchmarks

full rationale

The paper is an empirical benchmark study comparing six offline algorithms on five simulated plus three real multi-stage tasks using human demonstration datasets of controlled quality. Lessons on algorithmic sensitivity, demonstration quality, and stopping-criterion variability are extracted from direct performance measurements against external task benchmarks and datasets. No mathematical derivations, predictions, or uniqueness claims appear; nothing reduces to fitted parameters or self-citations by construction. The central claims remain independent of any internal definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmarking study with no new theoretical derivations, no fitted parameters in the derivational sense, and no invented entities; it relies on standard assumptions from imitation learning and reinforcement learning.

axioms (1)
  • domain assumption Standard Markov decision process assumptions and offline RL evaluation protocols hold for the chosen manipulation tasks.
    The study applies established offline learning algorithms whose validity rests on typical MDP and batch RL assumptions invoked in the methods section.
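For reference, the standard objects this assumption invokes (generic notation, not the paper's): an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ and a fixed offline dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ collected from human demonstrators; each benchmarked algorithm must produce a policy $\pi$ from $\mathcal{D}$ alone, with no further environment interaction during training.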

pith-pipeline@v0.9.0 · 5573 in / 1253 out tokens · 36175 ms · 2026-05-13T08:47:42.874136+00:00 · methodology

discussion (0)


Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    cs.AI 2023-06 conditional novelty 8.0

    LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

  2. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  3. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  4. Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

    cs.RO 2026-05 unverdicted novelty 7.0

    OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 17...

  5. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 conditional novelty 7.0

    Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.

  6. ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    cs.RO 2026-04 unverdicted novelty 7.0

    ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...

  7. Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving hi...

  8. You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

    cs.RO 2026-03 conditional novelty 7.0

    Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.

  9. When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

    cs.RO 2026-02 unverdicted novelty 7.0

    UPS framework uses conformal prediction to calibrate VLM verifiers for choosing between high-confidence action execution, natural language task queries, or policy interventions, then applies residual learning from int...

  10. One Step Diffusion via Shortcut Models

    cs.LG 2024-10 conditional novelty 7.0

    Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.

  11. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    cs.RO 2023-10 unverdicted novelty 7.0

    A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.

  12. VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    cs.RO 2022-09 unverdicted novelty 7.0

    VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

  13. SID: Sliding into Distribution for Robust Few-Demonstration Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.

  14. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  15. DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

    cs.RO 2026-05 unverdicted novelty 6.0

    DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...

  16. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

  17. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.

  18. When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.

  19. Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control

    cs.RO 2026-05 conditional novelty 6.0

    A multi-agent RL high-level planner outputs task-space velocities that a GPU-parallel QP low-level controller converts to joint velocities while enforcing limits and collisions, yielding robust sim-to-real dexterous g...

  20. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  21. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 unverdicted novelty 6.0

    Hydra-DP3 is a lightweight 3D diffusion policy that uses frequency analysis of smooth action trajectories to enable two-step DDIM inference and achieves state-of-the-art results with under 1% of prior parameters.

  22. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 unverdicted novelty 6.0

    Hydra-DP3 achieves SOTA visuomotor performance with under 1% of prior 3D diffusion policy parameters by using frequency analysis to justify a lightweight decoder and two-step DDIM inference.

  23. An Efficient Metric for Data Quality Measurement in Imitation Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.

  24. Learning from the Best: Smoothness-Driven Metrics for Data Quality in Imitation Learning

    cs.RO 2026-04 unverdicted novelty 6.0

    RINSE scores robot demonstration trajectories for smoothness via SAL and TED metrics to curate higher-quality data for behavioral cloning, improving success rates with less data on benchmarks and real robots.

  25. VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...

  26. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  27. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  28. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  29. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  30. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    cs.RO 2024-03 accept novelty 6.0

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  31. To Do or Not to Do: Ensuring the Safety of Visuomotor Policies Learned from Demonstrations

    cs.RO 2026-05 unverdicted novelty 5.0

    Execution guarantee certifies safe regions for IL policies via view synthesis and set invariance so that maximum task success is assured from within those regions even under small execution changes.

  32. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  33. Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.

  34. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  35. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
