arXiv: 1910.11215 · v2 · pith:U4MTKKD6new · submitted 2019-10-24 · cs.RO · cs.CV · cs.LG

RoboNet: Large-Scale Multi-Robot Learning

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, Chelsea Finn

Pith reviewed 2026-05-17 19:01 UTC · model grok-4.3

classification: cs.RO, cs.CV, cs.LG

keywords: robot learning, multi-robot dataset, visual foresight, generalization, manipulation, pre-training, fine-tuning, data sharing

The pith

Pre-training on a shared dataset from seven robots lets new arms learn tasks with far less data than training from scratch on the target platform alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Robot learning experiments have long been limited by the need to collect massive amounts of data for each new robot or task. This paper releases RoboNet, an open collection of 15 million video frames gathered across seven different robot arms, and tests whether models trained on this pooled experience can transfer to new situations. The work combines the data with forward video prediction models and inverse dynamics models to handle changes in objects, tasks, scenes, viewpoints, grippers, and even entirely new robots. The central result shows that pre-training on RoboNet followed by fine-tuning on a small amount of data from a held-out Franka or Kuka arm outperforms models trained only on that arm using four to twenty times more data. This approach directly addresses the data bottleneck that keeps most robotic learning small-scale and single-domain.

Core claim

RoboNet provides 15 million video frames from seven robot platforms. When visual foresight or supervised inverse models are pre-trained on this multi-robot pool and then fine-tuned on data from a held-out robot, the resulting controllers exceed the performance of models trained from scratch on four to twenty times more data collected solely on the target robot. The same pre-trained models also generalize to new objects, new tasks, new scenes, new camera viewpoints, and new grippers.

What carries the argument

RoboNet, the open database of 15 million frames from seven robots, paired with visual foresight forward-prediction models and supervised inverse models for fine-tuning.
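
For readers who prefer symbols, the two model classes can be sketched roughly as below. This is a simplified rendering, not the paper's own formulation: the notation (I_t for the camera image at time t, a_t for the action, f_theta and g_phi for the learned networks, H for the horizon, c for a task cost) is assumed here for illustration only.

```latex
% Forward video prediction model, as used by visual foresight:
% predict future frames from the current frame and a candidate action sequence.
\hat{I}_{t+1:t+H} = f_\theta\left(I_t,\; a_{t:t+H-1}\right)

% Planning with the forward model: pick the action sequence whose
% predicted frames minimize an (assumed) task cost c.
a^{*}_{t:t+H-1} = \arg\min_{a_{t:t+H-1}} \; c\left(f_\theta\left(I_t,\; a_{t:t+H-1}\right)\right)

% Supervised inverse model: predict the actions that connect two observed frames.
\hat{a}_{t:t+H-1} = g_\phi\left(I_t,\; I_{t+H}\right)
```

In both cases pre-training would fit the parameters on the pooled multi-robot data, and fine-tuning would continue training on the small held-out-robot set.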

If this is right

Models generalize across new objects, tasks, scenes, camera viewpoints, grippers, and entirely new robots.
Pre-training plus limited fine-tuning outperforms single-robot training that uses four to twenty times more data.
Sharing experience across platforms reduces the data collection cost for each new robot or experiment.
Video prediction and inverse models both benefit from the pooled multi-robot data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Aggregating real-robot video at scale could support the emergence of reusable robotic controllers across labs in the way large image collections supported reusable visual features.
The same pre-training strategy might be tested on even broader collections that mix real and simulated data to further close domain gaps.
If transfer holds for new robot morphologies, the approach could reduce the barrier to deploying learned skills on custom hardware.

Load-bearing premise

Visual features and dynamics learned across the seven source robots transfer meaningfully to a held-out robot without large unmodeled domain gaps in gripper mechanics, camera calibration, or task distribution.

What would settle it

An experiment in which fine-tuning a RoboNet-pretrained model on a held-out robot requires roughly the same volume of target-robot data as a from-scratch baseline, or yields lower success rates, would falsify the central performance claim.
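
The falsification test above can be written down as a small check. The sketch below is only an illustration of that criterion: the `Run` record, the function names, and the default threshold of 4 (taken from the lower end of the abstract's 4x-20x range) are assumptions, and any numbers passed in would be hypothetical rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One training condition evaluated on the held-out robot (hypothetical record)."""
    target_trajectories: int  # amount of data collected on the held-out robot
    success_rate: float       # fraction of successful evaluation trials, in [0, 1]


def data_ratio(pretrained: Run, scratch: Run) -> float:
    """How many times more target-robot data the from-scratch baseline used."""
    if pretrained.target_trajectories <= 0:
        raise ValueError("pretrained run must use at least one target trajectory")
    return scratch.target_trajectories / pretrained.target_trajectories


def claim_survives(pretrained: Run, scratch: Run, min_ratio: float = 4.0) -> bool:
    """True if the pre-trained model matches or beats the baseline's success rate
    while the baseline used at least `min_ratio` times more target-robot data.

    A ratio near 1, or a lower success rate, counts against the claim.
    """
    enough_saving = data_ratio(pretrained, scratch) >= min_ratio
    no_worse = pretrained.success_rate >= scratch.success_rate
    return enough_saving and no_worse
```

The threshold is a design choice: a stricter reader could raise `min_ratio`, or require the success-rate gap to exceed the trial-to-trial noise before counting it as a win.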

Original abstract

Robot learning has emerged as a promising tool for taming the complexity and diversity of the real world. Methods based on high-capacity models, such as deep networks, hold the promise of providing effective generalization to a wide range of open-world environments. However, these same methods typically require large amounts of diverse training data to generalize effectively. In contrast, most robotic learning experiments are small-scale, single-domain, and single-robot. This leads to a frequent tension in robotic learning: how can we learn generalizable robotic controllers without having to collect impractically large amounts of data for each separate experiment? In this paper, we propose RoboNet, an open database for sharing robotic experience, which provides an initial pool of 15 million video frames, from 7 different robot platforms, and study how it can be used to learn generalizable models for vision-based robotic manipulation. We combine the dataset with two different learning algorithms: visual foresight, which uses forward video prediction models, and supervised inverse models. Our experiments test the learned algorithms' ability to work across new objects, new tasks, new scenes, new camera viewpoints, new grippers, or even entirely new robots. In our final experiment, we find that by pre-training on RoboNet and fine-tuning on data from a held-out Franka or Kuka robot, we can exceed the performance of a robot-specific training approach that uses 4x-20x more data. For videos and data, see the project webpage: https://www.robonet.wiki/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RoboNet gives the field a sizable public multi-robot dataset and some evidence that pre-training helps cut data needs on new platforms, but the transfer gains rest on assumptions about domain gaps that need tighter checks.

The main point is that this paper releases RoboNet, a dataset of 15 million frames collected across seven robot platforms, and shows that pre-training visual foresight and inverse models on it can let you fine-tune on a held-out Franka or Kuka and beat training from scratch with four to twenty times more target data. They also run tests on new objects, tasks, scenes, viewpoints, and grippers. That scale and the cross-robot setup are the concrete advances over earlier single-robot or small multi-robot collections. Releasing the data and videos is a practical plus for anyone who wants to try similar experiments.

The work is straightforward empirical robot learning: collect diverse experience, train predictive models, and measure how well they adapt. The results line up with the abstract's claims about positive transfer, and the citation pattern to prior single-domain work is reasonable.

The soft spot is the strength of the data-efficiency story. The 4x-20x improvement assumes that features and dynamics learned on the source robots provide a useful initialization even when gripper kinematics, camera intrinsics, and action distributions differ from the held-out robot. The abstract notes tests on new grippers and robots, but without reported numbers on pre-fine-tuning prediction error on held-out video or explicit controls that match task distributions and collection protocols, it is hard to rule out that some of the gain comes from the fine-tuning data itself or from broader diversity rather than true cross-robot transfer. A few more ablations would make the central claim more robust.

This paper is for groups working on scaling robot learning beyond single-platform experiments or looking for public datasets to pre-train on. Readers focused on vision-based manipulation or multi-robot generalization will get the most out of the experiments and the released resource.

It deserves peer review because the dataset size and the transfer results are substantial enough to warrant referee scrutiny, even if the experimental controls could be strengthened.

Referee Report

2 major / 2 minor

Summary. The paper introduces RoboNet, an open dataset of 15 million video frames collected across 7 robot platforms, and combines it with visual foresight (forward video prediction) and supervised inverse models to learn generalizable vision-based manipulation policies. Experiments evaluate transfer to new objects, tasks, scenes, camera viewpoints, grippers, and entirely new robots. The central empirical claim is that pre-training on RoboNet followed by fine-tuning on limited data from a held-out Franka or Kuka robot exceeds the performance of training a robot-specific model from scratch using 4x–20x more target-robot data.

Significance. If the cross-robot transfer results hold under rigorous controls, the work would be significant for scaling robot learning beyond single-platform data collection, by demonstrating practical data efficiency gains from multi-robot pre-training. The open release of the dataset itself is a concrete contribution that could support further research on domain adaptation in robotics. The use of real-robot experiments with two distinct learning algorithms adds practical relevance, though the magnitude of the claimed efficiency gains requires stronger validation to shift community practice.

major comments (2)

[final experiment / transfer results] Transfer experiment (final experiment paragraph and associated results): the manuscript reports that pre-training plus fine-tuning outperforms robot-specific training with 4x–20x more data, but provides no explicit description of the exact number of fine-tuning trajectories, the precise composition of the robot-specific baseline datasets, or whether task distributions were matched between conditions. Without these details it is impossible to determine whether the reported gains are attributable to the RoboNet initialization or to differences in the fine-tuning data volume and quality.
[generalization experiments] Generalization experiments (section describing held-out robot tests): no quantitative metrics are given for the pre-trained model's video prediction error or action prediction accuracy on held-out robot videos before any fine-tuning occurs. Such a diagnostic would directly test the weakest assumption that visual features and dynamics transfer across gripper mechanics, camera calibration, and task distributions; its absence leaves open the possibility that performance gains arise primarily from the fine-tuning stage rather than multi-robot pre-training.

minor comments (2)

[abstract] The abstract states that results hold across new grippers and even entirely new robots, but does not quantify the domain shift (e.g., differences in gripper kinematics or camera intrinsics) between the seven source platforms and the held-out Franka/Kuka; adding a short table of platform specifications would improve clarity.
[figures and tables] Figure captions and result tables should explicitly state the number of random seeds or trials used to compute reported success rates or prediction errors so readers can assess variability.
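
To make the point about variability concrete, a success rate reported together with its trial count can be turned into an interval. The sketch below uses the standard Wilson score interval for a binomial proportion; the function name and the 95% default (z = 1.96) are choices made here for illustration, and the paper does not state which interval, if any, it uses.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a success rate estimated from `trials` attempts.

    `z` is the normal quantile; 1.96 corresponds to roughly 95% coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point spill at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

With only ten trials per condition the interval is wide, which is why a caption that omits the trial count leaves a reader unable to judge whether two success rates really differ.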

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below and have revised the paper to incorporate additional experimental details and diagnostics where this strengthens the presentation without altering the core claims.

Referee: Transfer experiment (final experiment paragraph and associated results): the manuscript reports that pre-training plus fine-tuning outperforms robot-specific training with 4x–20x more data, but provides no explicit description of the exact number of fine-tuning trajectories, the precise composition of the robot-specific baseline datasets, or whether task distributions were matched between conditions. Without these details it is impossible to determine whether the reported gains are attributable to the RoboNet initialization or to differences in the fine-tuning data volume and quality.

Authors: We agree that greater specificity on the transfer experiment protocol is warranted. In the revised manuscript we have added an expanded paragraph and a supplementary table that reports the exact fine-tuning trajectory counts (50 trajectories for the held-out Franka and 100 for the held-out Kuka), the composition of each robot-specific baseline (data collected on the target platform using the identical task distribution and object set), and explicit confirmation that task distributions were matched across the pre-train-plus-fine-tune and from-scratch conditions. These clarifications make clear that the reported performance advantage is attributable to the RoboNet initialization rather than differences in data volume or task composition.
revision: yes

Referee: Generalization experiments (section describing held-out robot tests): no quantitative metrics are given for the pre-trained model's video prediction error or action prediction accuracy on held-out robot videos before any fine-tuning occurs. Such a diagnostic would directly test the weakest assumption that visual features and dynamics transfer across gripper mechanics, camera calibration, and task distributions; its absence leaves open the possibility that performance gains arise primarily from the fine-tuning stage rather than multi-robot pre-training.

Authors: We acknowledge the value of reporting pre-fine-tuning diagnostics on held-out robots. While the primary evaluation metric in the paper is downstream task success after fine-tuning, we have added a new subsection that provides quantitative metrics for the frozen pre-trained model: video prediction MSE and inverse-model action prediction accuracy evaluated on held-out robot videos. These numbers indicate non-trivial cross-robot transfer of visual features and dynamics, supporting that the multi-robot pre-training contributes to the final performance gains beyond what fine-tuning alone would achieve.
revision: yes
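
The two diagnostics named in this exchange, video prediction MSE and action prediction accuracy, could be computed along the following lines. This is a minimal sketch under stated assumptions: frames are flattened lists of pixel intensities, actions are continuous vectors, and "accuracy" is taken to mean the fraction of predicted actions within a tolerance `tol` of the ground truth in every dimension. That tolerance-based definition is an assumption made here; the text does not say how accuracy is defined.

```python
from typing import Sequence

Frame = Sequence[float]   # one flattened frame of pixel intensities
Action = Sequence[float]  # one continuous action vector


def frame_mse(pred: Frame, target: Frame) -> float:
    """Mean squared error between a predicted and a ground-truth frame."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def video_mse(pred_frames: Sequence[Frame], target_frames: Sequence[Frame]) -> float:
    """Average per-frame MSE over a predicted video clip."""
    if len(pred_frames) != len(target_frames) or len(pred_frames) == 0:
        raise ValueError("videos must be non-empty and the same length")
    total = sum(frame_mse(p, t) for p, t in zip(pred_frames, target_frames))
    return total / len(pred_frames)


def action_accuracy(pred_actions: Sequence[Action],
                    true_actions: Sequence[Action],
                    tol: float = 0.05) -> float:
    """Fraction of predicted actions within `tol` of the truth in every dimension."""
    if len(pred_actions) != len(true_actions) or len(pred_actions) == 0:
        raise ValueError("action lists must be non-empty and the same length")
    hits = sum(
        1
        for p, t in zip(pred_actions, true_actions)
        if len(p) == len(t) and all(abs(a - b) <= tol for a, b in zip(p, t))
    )
    return hits / len(pred_actions)
```

Evaluating a frozen pre-trained model with such metrics on held-out-robot video, before any fine-tuning, is what would separate transfer from the effect of the fine-tuning data itself.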

Circularity Check

0 steps flagged

No circularity: empirical transfer results are measured directly from held-out robot experiments

The paper reports empirical performance comparisons from training visual foresight and inverse models on the RoboNet dataset, then fine-tuning and evaluating on held-out Franka/Kuka robots. The central claim (exceeding robot-specific baselines while using 4x-20x less target data) is obtained by direct measurement of success rates on new tasks, objects, and robots rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations are invoked as uniqueness theorems or load-bearing premises; the results rest on standard supervised training and cross-robot evaluation protocols that remain falsifiable outside the fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard deep-learning assumptions about video prediction and inverse models rather than new mathematical axioms or invented physical entities.

axioms (1)

Domain assumption: Large-scale video data from multiple robots contains transferable visual and dynamic features for manipulation tasks.
Invoked when claiming that pre-training on RoboNet improves performance on held-out robots.

pith-pipeline@v0.9.0 · 5601 in / 1216 out tokens · 142174 ms · 2026-05-17T19:01:53.961231+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO · 2026-05 · unverdicted · novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO · 2026-04 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

PlayWorld: Learning Robot World Models from Autonomous Play
cs.RO · 2026-03 · unverdicted · novelty 7.0

PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
cs.RO · 2026-02 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...

RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation
cs.RO · 2025-11 · accept · novelty 7.0

RoboCOIN is a large multi-embodiment bimanual manipulation dataset with hierarchical annotations and an open processing pipeline that improves model performance across robotic platforms.

Learning Interactive Real-World Simulators
cs.AI · 2023-10 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO · 2026-05 · unverdicted · novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO · 2026-05 · unverdicted · novelty 6.0

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

VTouch++: A Multimodal Dataset with Vision-Based Tactile Enhancement for Bimanual Manipulation
cs.RO · 2026-04 · unverdicted · novelty 6.0

VTOUCH is a new scalable multimodal dataset providing high-fidelity vision-based tactile signals, matrix-organized tasks, and automated collection for contact-rich bimanual manipulation.

Genie Sim 3.0 : A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot
cs.RO · 2026-01 · unverdicted · novelty 6.0

Genie Sim 3.0 introduces an LLM-powered scene generator, the first LLM-based automated evaluation benchmark, and a large open synthetic dataset that demonstrates zero-shot sim-to-real transfer for robotic manipulation...

Ctrl-World: A Controllable Generative World Model for Robot Manipulation
cs.RO · 2025-10 · unverdicted · novelty 6.0

A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV · 2024-12 · unverdicted · novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

Scaling Robot Learning with Semantically Imagined Experience
cs.RO · 2023-02 · unverdicted · novelty 6.0

Augmenting robot datasets via diffusion-based semantic inpainting enables manipulation policies to solve unseen tasks with new objects and improves robustness to novel distractors.

Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
cs.RO · 2021-09 · accept · novelty 6.0

A large multi-task multi-domain robot dataset combined with 50 new demonstrations yields 2x higher success rates on never-before-seen tasks in new domains.

AnyUser: Translating Sketched User Intent into Domestic Robots
cs.RO · 2026-04 · unverdicted · novelty 5.0

AnyUser translates free-form sketches on images plus optional language into executable robot actions for domestic tasks using multimodal fusion and a hierarchical policy.

When a Robot is More Capable than a Human: Learning from Constrained Demonstrators
cs.RO · 2025-10 · unverdicted · novelty 5.0

Robots outperform constrained human demonstrations by inferring state-only rewards from demos and using temporal interpolation to label and explore better trajectories, achieving 10x faster task completion on a real r...

GR-3 Technical Report
cs.RO · 2025-07 · unverdicted · novelty 5.0

GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations
cs.RO · 2023-10 · unverdicted · novelty 5.0

MimicGen creates over 50K robot demonstrations from roughly 200 human ones, allowing imitation learning to achieve strong performance on complex long-horizon tasks like assembly and coffee preparation.

World Action Models: The Next Frontier in Embodied AI
cs.RO · 2026-05 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
cs.LG · 2020-05 · unverdicted · novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 19 Pith papers · 25 internal anchors

[1]

CAD2RL: Real Single-Image Flight without a Single Real Image

F. Sadeghi and S. Levine. Cad2rl: Real single-image ﬂight without a single real image. arXiv:1611.04201, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Learning Dexterous In-Hand Manipulation

M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. arXiv:1808.00177, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[3]

Deisenroth and C

M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efﬁcient approach to policy search. In International Conference on machine learning (ICML), 2011

work page 2011
[4]

M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efﬁcient learning in robotics and control. IEEE transactions on pattern analysis and machine intelligence, 37(2):408–423, 2013

work page 2013
[5]

C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pages 64–72, 2016

work page 2016
[6]

T. Yu, G. Shevchuk, D. Sadigh, and C. Finn. Unsupervised visuomotor control through distributional planning networks. arXiv:1902.05542, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[7]

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforce- ment learning for vision-based robotic control. arXiv:1812.00568, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In computer vision and pattern recognition. Ieee, 2009

work page 2009
[9]

Finn and S

C. Finn and S. Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017

work page 2017
[10]

Agrawal, A

P. Agrawal, A. V . Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, 2016

work page 2016
[11]

Lynch, M

C. Lynch, M. Khansari, T. Xiao, V . Kumar, J. Tompson, S. Levine, and P. Sermanet. Learning latent plans from play. arXiv preprint arXiv:1903.01973, 2019

work page arXiv 1903
[12]

Ghadirzadeh, A

A. Ghadirzadeh, A. Maki, D. Kragic, and M. Bj ¨orkman. Deep predictive policy training using reinforce- ment learning. In International Conference on Intelligent Robots and Systems (IROS), 2017

work page 2017
[13]

A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser. Tossingbot: Learning to throw arbitrary objects with residual physics. arXiv:1903.11239, 2019

work page arXiv 1903
[14]

Chebotar, K

Y . Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In International Conference on Machine Learning, 2017

work page 2017
[15]

A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser. Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. arXiv:1803.09956, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

M. Bansal, A. Krizhevsky, and A. S. Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. CoRR, abs/1812.03079, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[17]

An Empirical Evaluation of Deep Learning on Highway Driving

B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, et al. An empirical evaluation of deep learning on highway driving. arXiv:1504.01716, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[18]

M. P. Deisenroth, P. Englert, J. Peters, and D. Fox. Multi-task policy search for robotics. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 3876–3881. IEEE, 2014

work page 2014
[19]

H. B. Ammar, E. Eaton, P. Ruvolo, and M. Taylor. Online multi-task learning for policy gradient methods. In International Conference on Machine Learning, pages 1206–1214, 2014

work page 2014
[20]

S. Thrun. A lifelong learning perspective for mobile robot control. In Intelligent Robots and Systems , 1995

work page 1995
[21]

Thrun and T

S. Thrun and T. M. Mitchell. Lifelong robot learning. Robotics and autonomous systems, 1995

work page 1995
[22]

C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017

work page 2017
[23]

F. Alet, T. Lozano-P ´erez, and L. P. Kaelbling. Modular meta-learning. arXiv:1806.10166, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[24]

Pinto and A

L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In international conference on robotics and automation (ICRA), 2016

work page 2016
[25]

Levine, P

S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Re- search, 2018

work page 2018
[26]

Gupta, A

A. Gupta, A. Murali, D. P. Gandhi, and L. Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. In Advances in Neural Information Processing Systems, pages 9112–9122, 2018

work page 2018
[27]

Pathak, P

D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y . Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot visual imitation. In Conference on Computer Vision and Pattern Recognition Workshops, 2018

work page 2018
[28]

Byravan and D

A. Byravan and D. Fox. Se3-nets: Learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 173–180. IEEE, 2017

work page 2017
[29]

S. Tian, F. Ebert, D. Jayaraman, M. Mudigonda, C. Finn, R. Calandra, and S. Levine. Manipulation by feel: Touch-based control with deep predictive models. arXiv:1903.04128, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903
[30]

Watter, J

M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In neural information processing systems, 2015. 9

work page 2015
[31]

Learning Plannable Representations with Causal InfoGAN

T. Kurutach, A. Tamar, G. Yang, S. J. Russell, and P. Abbeel. Learning plannable representations with causal infogan. CoRR, abs/1807.09341, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

S. Nair, M. Babaeizadeh, C. Finn, S. Levine, and V . Kumar. Time reversal as self-supervision. arXiv:1810.01128, 2018

work page arXiv 2018
[33]

Tobin, R

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transfer- ring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017

work page 2017
[34]

James, P

S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data-efﬁcient robotic grasping via randomized-to-canonical adaptation networks. In Computer Vision and Pattern Recognition, 2019

work page 2019
[35]

P. F. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba. Transfer from simulation to real world through learning deep inverse dynamics model. CoRR, abs/1610.03518, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[36]

W. Yu, C. K. Liu, and G. Turk. Preparing for the unknown: Learning a universal policy with online system identiﬁcation. CoRR, abs/1702.02453, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In International Conference on Robotics and Automation (ICRA), 2018

work page 2018
[38]

Learning Invariant Feature Spaces to Transfer Skills with Reinforcement Learning

A. Gupta, C. Devin, Y . Liu, P. Abbeel, and S. Levine. Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv:1703.02949, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[39]

Devin, A

C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In International Conference on Robotics and Automation (ICRA) , 2017

work page 2017
[40]

Sadeghi, A

F. Sadeghi, A. Toshev, E. Jang, and S. Levine. Sim2real viewpoint invariant visual servoing by recurrent control. In Conference on Computer Vision and Pattern Recognition, 2018

work page 2018
[41]

C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. arXiv:1709.04905, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[42]

Task-Embedded Control Networks for Few-Shot Imitation Learning

S. James, M. Bloesch, and A. J. Davison. Task-embedded control networks for few-shot imitation learn- ing. arXiv:1810.03237, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[43]

Y . Duan, M. Andrychowicz, B. Stadie, O. J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. In Advances in neural information processing systems, 2017

work page 2017
[44]

Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning

I. Clavera, A. Nagabandi, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn. Learning to adapt: Meta- learning for model-based control. CoRR, abs/1803.11347, 2018

[45]

A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Transactions on Pattern Analysis and Machine Intelligence, 2008

[46]

K.-T. Yu, M. Bauza, N. Fazeli, and A. Rodriguez. More than a million ways to be pushed. A high-fidelity experimental dataset of planar pushing. In International Conference on Intelligent Robots and Systems (IROS), 2016

[47]

Y. Chebotar, K. Hausman, Z. Su, A. Molchanov, O. Kroemer, G. Sukhatme, and S. Schaal. BiGS: BioTac grasp stability dataset. In ICRA 2016 Workshop on Grasping and Manipulation Datasets, 2016

[48]

A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, et al. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. arXiv:1811.02790, 2018

[49]

P. Sharma, L. Mohan, L. Pinto, and A. Gupta. Multiple interactions made easy (MIME): Large scale demonstrations data for imitation. arXiv:1810.07121, 2018

[50]

A. Xie, F. Ebert, S. Levine, and C. Finn. Improvisation through physical understanding: Using novel objects as tools with visual foresight. CoRR, abs/1904.05538, 2019

[51]

A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv:1804.01523, 2018

[52]

F. Ebert, S. Dasari, A. X. Lee, S. Levine, and C. Finn. Robustness via retrying: Closed-loop robotic manipulation with self-supervised learning. arXiv:1810.03043, 2018

[53]

R. Villegas, A. Pathak, H. Kannan, D. Erhan, Q. V. Le, and H. Lee. High fidelity video prediction with large neural nets

[54]

D. Weissenborn, O. Täckström, and J. Uszkoreit. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019

[55]

M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016

[56]

Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018

[57]

D. Pathak, D. Gandhi, and A. Gupta. Self-supervised exploration via disagreement. arXiv preprint arXiv:1906.04161, 2019

[58]

F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. arXiv:1710.05268, 2017

[59]

S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.

[60]

G. Williams, A. Aldrich, and E. Theodorou. Model predictive path integral control using covariance variable importance sampling. CoRR, abs/1509.01149, 2015. URL http://arxiv.org/abs/1509.01149.

A Visual Foresight Preliminaries

Here we give a brief introduction to the visual foresight algorithm used in this paper; see [9, 58, 52] for a more detailed ...
