pith. machine review for the scientific record.

arxiv: 2108.03298 · v2 · submitted 2021-08-06 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 08:47 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords robot manipulation · imitation learning · offline reinforcement learning · human demonstrations · multi-stage tasks · benchmarking

The pith

Learning from offline human demonstrations for robot manipulation is most sensitive to demonstration quality, algorithmic design choices, and evaluation stopping criteria.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks six offline learning algorithms on five simulated and three real-world multi-stage manipulation tasks, using human demonstration datasets of varying quality. It finds that performance depends heavily on the quality of the provided demonstrations, varies with specific algorithmic design decisions, and changes with the training stopping point because training and evaluation objectives differ. The results show these methods can produce proficient policies on complex tasks beyond the reach of current reinforcement learning approaches, and that they scale directly to real-world settings using only raw sensory inputs. The work releases its datasets and implementations to support standardized future comparisons.
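One concrete consequence of the stopping-criterion lesson, sketched below with invented numbers (this is not the paper's code): a checkpoint chosen by validation loss and one chosen by rollout success rate can sit at very different training steps, so an evaluation protocol has to commit to one selection rule.

    # Hedged sketch: checkpoint selection under a training/evaluation mismatch.
    # Both "metrics" below are stand-ins with hypothetical shapes, not real runs.
    import random

    def validation_loss(step: int) -> float:
        # Stand-in for a behavioral-cloning validation loss that keeps falling.
        return 1.0 / (1 + step)

    def rollout_success_rate(step: int, n_rollouts: int = 50) -> float:
        # Stand-in for task success, which can peak mid-training and then degrade.
        peak = max(0.8 - 0.3 * abs(step - 600) / 600, 0.0)  # hypothetical shape
        return sum(random.random() < peak for _ in range(n_rollouts)) / n_rollouts

    checkpoints = list(range(100, 1201, 100))
    by_loss = min(checkpoints, key=validation_loss)          # lands on the last step
    by_rollout = max(checkpoints, key=rollout_success_rate)  # lands near the peak
    print(f"lowest loss at step {by_loss}; best rollout success near step {by_rollout}")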

Core claim

The empirical study demonstrates that offline algorithms for learning robot manipulation from human data exhibit pronounced sensitivity to hyperparameters and design choices, strong dependence on the quality of the demonstration datasets, and high variability in outcomes depending on the stopping criterion, since the objectives used in training and in evaluation differ.

What carries the argument

The large-scale comparative evaluation of six offline algorithms across eight manipulation tasks and datasets of controlled quality levels.

If this is right

  • Different algorithmic design choices produce large differences in final policy performance on the same data.
  • Higher-quality human demonstrations directly yield better learned policies than lower-quality ones.
  • Optimal training stopping points differ from optimal evaluation points because the objectives do not align.
  • Offline methods can succeed on multi-stage tasks where current reinforcement learning approaches fail.
  • The same methods apply directly to real-world manipulation using only raw camera and proprioceptive signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Efforts to improve data collection pipelines may yield larger gains than further algorithm tweaks alone.
  • Standardized open benchmarks of this form could reduce redundant experimentation across research groups.
  • The training-evaluation objective mismatch points to a general need for offline methods that optimize for evaluation metrics explicitly.

Load-bearing premise

The chosen five simulated and three real-world tasks, together with their demonstration datasets of varying quality, are representative enough to support general lessons for robot manipulation.

What would settle it

A new algorithm that maintains high success rates across all task difficulties and all demonstration quality levels, or one whose final performance shows no dependence on the choice of training stopping point in these exact setups, would falsify the main lessons.

read the original abstract

Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation. Based on the study, we derive a series of lessons including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available. We have open-sourced our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Codebase, datasets, trained models, and more available at https://arise-initiative.github.io/robomimic-web/
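Since the abstract points to the open-sourced datasets, a minimal reading sketch may help. It assumes the robomimic HDF5 layout as we understand it (a `data` group of `demo_*` trajectories holding `obs` subgroups and `actions` arrays); the file name is a placeholder, and the released documentation is the authority on the exact schema.

    # Hedged sketch: iterating a robomimic-style HDF5 demonstration file.
    # Group/dataset names assume the released layout (data/demo_*/obs, actions);
    # verify against the project documentation before relying on them.
    import h5py

    with h5py.File("low_dim.hdf5", "r") as f:    # placeholder file name
        demos = sorted(f["data"].keys())         # e.g. ["demo_0", "demo_1", ...]
        print(f"{len(demos)} demonstrations")
        ep = f["data"][demos[0]]
        actions = ep["actions"][()]              # (T, action_dim) array
        obs_keys = list(ep["obs"].keys())        # proprioceptive and/or camera keys
        print(f"first demo: {actions.shape[0]} steps, obs keys: {obs_keys}")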

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper conducts an extensive empirical study of six offline learning algorithms (including behavioral cloning variants and offline RL methods) for robot manipulation. It evaluates them on five simulated and three real-world multi-stage tasks using human demonstration datasets of controlled varying quality, derives lessons on algorithmic sensitivities, demonstration quality dependence, and stopping-criteria variability, and highlights opportunities such as scaling to raw sensory inputs and outperforming current RL on complex tasks. All datasets, code, and models are open-sourced.

Significance. If the empirical findings hold, the work is significant for establishing reproducible benchmarks in offline imitation learning for robotics and providing actionable lessons on practical challenges like hyperparameter sensitivity and data quality. The open-sourcing of datasets, implementations, and trained models is a clear strength that enables fair comparisons and future research, directly addressing the lack of open human datasets noted in the abstract.

major comments (1)
  1. [§4] §4 (Experiments and Tasks): The central lessons on algorithmic design sensitivities, demonstration quality dependence, and stopping-criteria effects are derived from only eight tasks (five sim + three real) that share similar proprioceptive+RGB observations and rigid-body pick/place primitives. This limited span risks making the sensitivities regime-specific rather than fundamental to broader manipulation challenges involving different contact dynamics, longer horizons, or action precisions, weakening the generalization claim in the abstract and conclusion.
minor comments (2)
  1. [Table 1] Table 1 and §5: The stopping-criteria analysis would benefit from an explicit statement of how evaluation horizons and success thresholds are chosen independently of training objectives, to avoid any appearance of post-selection; a hypothetical pre-registered protocol is sketched after this list.
  2. [Figure 2] Figure 2: Axis labels and legend entries for the different algorithm variants could be enlarged for readability in print.
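One way to make that independence explicit, sketched as a hypothetical pre-registered protocol (all names and values invented, not drawn from the paper):

    # Hypothetical pre-registered evaluation protocol (invented names and values).
    # Fixing these before training removes any appearance of post-selecting the
    # stopping point to flatter one algorithm.
    EVAL_PROTOCOL = {
        "rollouts_per_checkpoint": 50,
        "max_horizon_steps": 400,          # set independently of training epochs
        "success_criterion": "task completion signal from the environment",
        "checkpoint_every_n_epochs": 20,
        "selection_rule": "best rollout success, all checkpoints reported",
    }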

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the positive review and the recommendation for minor revision. We are grateful for the feedback highlighting the importance of our benchmark and open-sourcing efforts. We address the major comment as follows.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments and Tasks): The central lessons on algorithmic design sensitivities, demonstration quality dependence, and stopping-criteria effects are derived from only eight tasks (five sim + three real) that share similar proprioceptive+RGB observations and rigid-body pick/place primitives. This limited span risks making the sensitivities regime-specific rather than fundamental to broader manipulation challenges involving different contact dynamics, longer horizons, or action precisions, weakening the generalization claim in the abstract and conclusion.

    Authors: We acknowledge that our evaluation is based on eight tasks sharing similar observation modalities and action spaces. However, these tasks were designed to cover a spectrum of manipulation complexities, including varying numbers of stages (from 1 to 5), objects, and success criteria. The identified sensitivities to algorithmic choices, data quality, and stopping criteria stem from core issues in offline imitation learning, such as the distribution mismatch between human demonstrations and policy rollouts, which are not specific to pick-and-place primitives. Our claims in the abstract and conclusion are framed around 'robot manipulation' in the context of these multi-stage tasks, without asserting universality across all possible dynamics. The open release of all datasets, code, and models allows for easy extension to new tasks with different contact dynamics or longer horizons. We believe this does not weaken the generalization of the lessons within the scope of current offline learning for manipulation, but we are happy to add a discussion on the limitations of the task suite in the revised manuscript.

    revision: partial
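The distribution mismatch the authors invoke has a standard textbook form (our notation, not the paper's): behavioral cloning scores actions on states the human visited, while evaluation scores outcomes on states the learned policy itself reaches.

    \min_{\theta}\; \mathbb{E}_{(s,a)\sim\mathcal{D}_{\mathrm{human}}}\!\left[-\log \pi_{\theta}(a \mid s)\right]
    \qquad \text{vs.} \qquad
    \mathbb{E}_{s \sim d^{\pi_{\theta}}}\!\left[\text{task success}\right]

Small action errors shift the rollout state distribution $d^{\pi_\theta}$ away from $\mathcal{D}_{\mathrm{human}}$, and those errors can compound over a multi-stage horizon regardless of the manipulation primitive involved.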

Circularity Check

0 steps flagged

No circularity: empirical lessons derived from external benchmarks

full rationale

The paper is an empirical benchmark study comparing six offline algorithms on five simulated plus three real multi-stage tasks using human demonstration datasets of controlled quality. Lessons on algorithmic sensitivity, demonstration quality, and stopping-criterion variability are extracted from direct performance measurements against external task benchmarks and datasets. No mathematical derivations, predictions, or uniqueness claims appear; nothing reduces to fitted parameters or self-citations by construction. The central claims remain independent of any internal definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmarking study with no new theoretical derivations, no fitted parameters in the derivational sense, and no invented entities; it relies on standard assumptions from imitation learning and reinforcement learning.

axioms (1)
  • domain assumption Standard Markov decision process assumptions and offline RL evaluation protocols hold for the chosen manipulation tasks.
    The study applies established offline learning algorithms whose validity rests on typical MDP and batch RL assumptions invoked in the methods section.
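For reference, the standard objects this assumption invokes (generic notation, not the paper's): an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ and a fixed offline dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ collected from human demonstrators; each benchmarked algorithm must produce a policy $\pi$ from $\mathcal{D}$ alone, with no further environment interaction during training.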

pith-pipeline@v0.9.0 · 5573 in / 1253 out tokens · 36175 ms · 2026-05-13T08:47:42.874136+00:00 · methodology

discussion (0)


Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    cs.AI 2023-06 conditional novelty 8.0

    LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

  2. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  3. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  4. Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

    cs.RO 2026-05 unverdicted novelty 7.0

    OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 17...

  5. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 conditional novelty 7.0

    Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.

  6. ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    cs.RO 2026-04 unverdicted novelty 7.0

    ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...

  7. Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving hi...

  8. You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

    cs.RO 2026-03 conditional novelty 7.0

    Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.

  9. When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

    cs.RO 2026-02 unverdicted novelty 7.0

    UPS framework uses conformal prediction to calibrate VLM verifiers for choosing between high-confidence action execution, natural language task queries, or policy interventions, then applies residual learning from int...

  10. One Step Diffusion via Shortcut Models

    cs.LG 2024-10 conditional novelty 7.0

    Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.

  11. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    cs.RO 2023-10 unverdicted novelty 7.0

    A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.

  12. VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    cs.RO 2022-09 unverdicted novelty 7.0

    VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

  13. SID: Sliding into Distribution for Robust Few-Demonstration Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.

  14. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  15. DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

    cs.RO 2026-05 unverdicted novelty 6.0

    DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...

  16. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

  17. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.

  18. When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.

  19. Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control

    cs.RO 2026-05 conditional novelty 6.0

    A multi-agent RL high-level planner outputs task-space velocities that a GPU-parallel QP low-level controller converts to joint velocities while enforcing limits and collisions, yielding robust sim-to-real dexterous g...

  20. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  21. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 unverdicted novelty 6.0

    Hydra-DP3 is a lightweight 3D diffusion policy that uses frequency analysis of smooth action trajectories to enable two-step DDIM inference and achieves state-of-the-art results with under 1% of prior parameters.

  22. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 unverdicted novelty 6.0

    Hydra-DP3 achieves SOTA visuomotor performance with under 1% of prior 3D diffusion policy parameters by using frequency analysis to justify a lightweight decoder and two-step DDIM inference.

  23. An Efficient Metric for Data Quality Measurement in Imitation Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.

  24. Learning from the Best: Smoothness-Driven Metrics for Data Quality in Imitation Learning

    cs.RO 2026-04 unverdicted novelty 6.0

    RINSE scores robot demonstration trajectories for smoothness via SAL and TED metrics to curate higher-quality data for behavioral cloning, improving success rates with less data on benchmarks and real robots.

  25. VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...

  26. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  27. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  28. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  29. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  30. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    cs.RO 2024-03 accept novelty 6.0

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  31. To Do or Not to Do: Ensuring the Safety of Visuomotor Policies Learned from Demonstrations

    cs.RO 2026-05 unverdicted novelty 5.0

    Execution guarantee certifies safe regions for IL policies via view synthesis and set invariance so that maximum task success is assured from within those regions even under small execution changes.

  32. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  33. Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.

  34. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  35. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
