What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
Pith reviewed 2026-05-13 08:47 UTC · model grok-4.3
The pith
Learning from offline human demonstrations for robot manipulation is most sensitive to demonstration quality, algorithmic design choices, and evaluation stopping criteria.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The empirical study shows that offline algorithms for learning robot manipulation from human data are highly sensitive to hyperparameters and other algorithmic design choices, depend strongly on the quality of the demonstration datasets, and yield highly variable outcomes depending on the stopping criterion, because the training objective differs from the evaluation metric.
What carries the argument
The large-scale comparative evaluation of six offline algorithms across eight manipulation tasks and datasets of controlled quality levels.
If this is right
- Different algorithmic design choices produce large differences in final policy performance on the same data.
- Higher-quality human demonstrations directly yield better learned policies than lower-quality ones.
- The best stopping point according to the training objective can differ from the checkpoint with the best evaluation performance, because the two objectives do not align.
- Offline methods can succeed on multi-stage tasks where current reinforcement learning approaches fail.
- The same methods apply directly to real-world manipulation using only raw camera and proprioceptive signals.
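The stopping-criterion point in the list above can be made concrete with a small sketch. It compares two ways of choosing a checkpoint: by lowest validation loss and by highest rollout success rate. The checkpoint log, its field names, and every number in it are hypothetical and invented for illustration; none of them come from the paper.

```python
# Hypothetical per-checkpoint log; all values are invented for illustration only.
checkpoints = [
    {"epoch": 100, "val_loss": 0.052, "success_rate": 0.62},
    {"epoch": 200, "val_loss": 0.041, "success_rate": 0.78},
    {"epoch": 300, "val_loss": 0.038, "success_rate": 0.70},
    {"epoch": 400, "val_loss": 0.045, "success_rate": 0.84},
]


def select_by_loss(ckpts):
    """Pick the checkpoint with the lowest validation loss (training-side criterion)."""
    return min(ckpts, key=lambda c: c["val_loss"])


def select_by_success(ckpts):
    """Pick the checkpoint with the highest rollout success rate (evaluation-side criterion)."""
    return max(ckpts, key=lambda c: c["success_rate"])


by_loss = select_by_loss(checkpoints)
by_success = select_by_success(checkpoints)

# Success rate given up by stopping on the loss-based criterion instead.
gap = by_success["success_rate"] - by_loss["success_rate"]

print(f"loss-selected epoch:    {by_loss['epoch']}")
print(f"success-selected epoch: {by_success['epoch']}")
print(f"success-rate gap:       {gap:.2f}")
```

In this toy log the two criteria pick different epochs, which is the kind of mismatch the lesson describes. In an offline setting the success-based criterion requires policy rollouts, which may not be available at training time.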
Where Pith is reading between the lines
- Efforts to improve data collection pipelines may yield larger gains than further algorithm tweaks alone.
- Standardized open benchmarks of this form could reduce redundant experimentation across research groups.
- The training-evaluation objective mismatch points to a general need for offline methods that optimize for evaluation metrics explicitly.
Load-bearing premise
The chosen five simulated and three real-world tasks together with their demonstration datasets of varying quality are representative enough to support general lessons for robot manipulation.
What would settle it
A new algorithm that maintains high success rates across all task difficulties and all demonstration quality levels, or one whose final performance shows no dependence on the choice of training stopping point in these exact setups, would falsify the main lessons.
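One way to read the two falsification conditions above is as checks over a table of success rates. The sketch below is only an illustration of that reading: the quality labels, the thresholds `min_success` and `max_spread`, and all numbers are assumptions chosen for the example, not values from the paper.

```python
# Hypothetical success rates keyed by (demonstration quality, checkpoint epoch).
# All values are invented for illustration; none come from the paper.
results = {
    ("low", 100): 0.40, ("low", 200): 0.55, ("low", 300): 0.45,
    ("high", 100): 0.80, ("high", 200): 0.90, ("high", 300): 0.85,
}


def rates_for(results, quality):
    """Success rates of all checkpoints trained on one quality level."""
    return [rate for (q, _epoch), rate in results.items() if q == quality]


def checkpoint_spread(results, quality):
    """Gap between the best and worst checkpoint for one quality level."""
    rates = rates_for(results, quality)
    return max(rates) - min(rates)


def would_challenge_lessons(results, min_success=0.8, max_spread=0.05):
    """True if either condition holds (thresholds are assumed, not from the paper):
    high success at every quality level, or negligible dependence on the stopping point."""
    qualities = {q for (q, _epoch) in results}
    robust_to_quality = all(max(rates_for(results, q)) >= min_success for q in qualities)
    insensitive_to_stopping = all(checkpoint_spread(results, q) <= max_spread for q in qualities)
    return robust_to_quality or insensitive_to_stopping


print(would_challenge_lessons(results))
```

With the toy numbers above neither condition holds, so the function returns `False`. A real test would also have to cover every task in the suite, which this sketch leaves out.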
Original abstract
Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation. Based on the study, we derive a series of lessons including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available. We have open-sourced our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Codebase, datasets, trained models, and more available at https://arise-initiative.github.io/robomimic-web/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an extensive empirical study of six offline learning algorithms (including behavioral cloning variants and offline RL methods) for robot manipulation. It evaluates them on five simulated and three real-world multi-stage tasks using human demonstration datasets of controlled varying quality, derives lessons on algorithmic sensitivities, demonstration quality dependence, and stopping-criteria variability, and highlights opportunities such as scaling to raw sensory inputs and outperforming current RL on complex tasks. All datasets, code, and models are open-sourced.
Significance. If the empirical findings hold, the work is significant for establishing reproducible benchmarks in offline imitation learning for robotics and providing actionable lessons on practical challenges like hyperparameter sensitivity and data quality. The open-sourcing of datasets, implementations, and trained models is a clear strength that enables fair comparisons and future research, directly addressing the lack of open human datasets noted in the abstract.
major comments (1)
- [§4] §4 (Experiments and Tasks): The central lessons on algorithmic design sensitivities, demonstration quality dependence, and stopping-criteria effects are derived from only eight tasks (five sim + three real) that share similar proprioceptive+RGB observations and rigid-body pick/place primitives. This limited span risks making the sensitivities regime-specific rather than fundamental to broader manipulation challenges involving different contact dynamics, longer horizons, or action precisions, weakening the generalization claim in the abstract and conclusion.
minor comments (2)
- [Table 1] Table 1 and §5: The stopping-criteria analysis would benefit from an explicit statement of how evaluation horizons and success thresholds are chosen independently of training objectives to avoid any appearance of post-selection.
- [Figure 2] Figure 2: Axis labels and legend entries for the different algorithm variants could be enlarged for readability in print.
Simulated Author's Rebuttal
Thank you for the positive review and the recommendation for minor revision. We are grateful for the feedback highlighting the importance of our benchmark and open-sourcing efforts. We address the major comment as follows.
Point-by-point responses
-
Referee: [§4] §4 (Experiments and Tasks): The central lessons on algorithmic design sensitivities, demonstration quality dependence, and stopping-criteria effects are derived from only eight tasks (five sim + three real) that share similar proprioceptive+RGB observations and rigid-body pick/place primitives. This limited span risks making the sensitivities regime-specific rather than fundamental to broader manipulation challenges involving different contact dynamics, longer horizons, or action precisions, weakening the generalization claim in the abstract and conclusion.
Authors: We acknowledge that our evaluation is based on eight tasks sharing similar observation modalities and action spaces. However, these tasks were designed to cover a spectrum of manipulation complexities, including varying numbers of stages (from 1 to 5), objects, and success criteria. The identified sensitivities to algorithmic choices, data quality, and stopping criteria stem from core issues in offline imitation learning, such as the distribution mismatch between human demonstrations and policy rollouts, which are not specific to pick-and-place primitives. Our claims in the abstract and conclusion are framed around 'robot manipulation' in the context of these multi-stage tasks, without asserting universality across all possible dynamics. The open release of all datasets, code, and models allows for easy extension to new tasks with different contact dynamics or longer horizons. We believe this does not weaken the generalization of the lessons within the scope of current offline learning for manipulation, but we are happy to add a discussion on the limitations of the task suite in the revised manuscript.
revision: partial
Circularity Check
No circularity: empirical lessons derived from external benchmarks
full rationale
The paper is an empirical benchmark study comparing six offline algorithms on five simulated plus three real multi-stage tasks using human demonstration datasets of controlled quality. Lessons on algorithmic sensitivity, demonstration quality, and stopping-criterion variability are extracted from direct performance measurements against external task benchmarks and datasets. No mathematical derivations, predictions, or uniqueness claims appear; nothing reduces to fitted parameters or self-citations by construction. The central claims remain independent of any internal definitional loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard Markov decision process assumptions and offline RL evaluation protocols hold for the chosen manipulation tasks.
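For readers who want the assumption spelled out, one common textbook formulation is sketched below. The notation is standard rather than taken from the paper, and the behavioral cloning objective is shown only as a representative offline objective.

```latex
% Markov decision process: states, actions, transition kernel, reward,
% initial-state distribution, discount factor.
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma),
\qquad s_0 \sim \rho_0, \quad s_{t+1} \sim P(\cdot \mid s_t, a_t)

% Offline setting: the learner only sees a fixed dataset of transitions.
\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}

% Behavioral cloning as a representative objective: maximize the
% log-likelihood of the demonstrated actions.
\max_{\theta} \; \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \log \pi_{\theta}(a \mid s) \right]
```

Under this reading, the training objective is a likelihood over dataset actions, while evaluation measures task success of policy rollouts in the environment, so the two quantities need not move together.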
Forward citations
Cited by 30 Pith papers
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
Aligning Flow Map Policies with Optimal Q-Guidance
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
-
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
-
Beyond Isolation: A Unified Benchmark for General-Purpose Navigation
OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 17...
-
Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.
-
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
-
Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation
ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving hi...
-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.
-
SID: Sliding into Distribution for Robust Few-Demonstration Manipulation
SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.
-
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
-
DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions
DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
-
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.
-
Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control
A multi-agent RL high-level planner outputs task-space velocities that a GPU-parallel QP low-level controller converts to joint velocities while enforcing limits and collisions, yielding robust sim-to-real dexterous g...
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Hydra-DP3 achieves SOTA visuomotor performance with under 1% of prior 3D diffusion policy parameters by using frequency analysis to justify a lightweight decoder and two-step DDIM inference.
-
Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Hydra-DP3 is a lightweight 3D diffusion policy that uses frequency analysis of smooth action trajectories to enable two-step DDIM inference and achieves state-of-the-art results with under 1% of prior parameters.
-
An Efficient Metric for Data Quality Measurement in Imitation Learning
Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.
-
Learning from the Best: Smoothness-Driven Metrics for Data Quality in Imitation Learning
RINSE scores robot demonstration trajectories for smoothness via SAL and TED metrics to curate higher-quality data for behavioral cloning, improving success rates with less data on benchmarks and real robots.
-
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...
-
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...
-
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
-
Unified Video Action Model
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
-
To Do or Not to Do: Ensuring the Safety of Visuomotor Policies Learned from Demonstrations
Execution guarantee certifies safe regions for IL policies via view synthesis and set invariance so that maximum task success is assured from within those regions even under small execution changes.
-
Gated Memory Policy
GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
-
Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning
Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.
-
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.