pith. machine review for the scientific record.

arxiv: 2109.13396 · v1 · submitted 2021-09-27 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links

Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

Bernadette Bucher, Chelsea Finn, Frederik Ebert, Georgios Georgakis, Karl Schmeckpeper, Kostas Daniilidis, Sergey Levine, Yanlai Yang

Pith reviewed 2026-05-13 19:51 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords robot learning · generalization · cross-domain data · multi-task dataset · demonstration learning · transfer learning · robotic skills

The pith

A shared multi-task, multi-domain robot dataset doubles success rates for new tasks in new environments when added to just 50 demonstrations of the target task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors collect and release a dataset of 7,200 demonstrations covering 71 tasks performed in 10 different environments. They test whether including this data during training helps a robot learn an entirely new task in an entirely new setting. When the shared dataset is combined with only 50 demonstrations of the new task, average success rates double compared with training on the target-domain data alone. Even a small number of demonstrations from the new domain suffice to let the robot perform many of its previously learned tasks there. The results indicate that reusable cross-domain collections can reduce the need to gather large task-specific datasets for each new robot project.

Core claim

By collecting a large multi-domain, multi-task dataset with 7,200 demonstrations of 71 tasks across 10 environments, the authors demonstrate that jointly training with this dataset plus 50 demonstrations of a never-before-seen task in a new domain leads to a 2x improvement in success rate compared with using target-domain data alone. Data for only a few tasks in a new domain can bridge the domain gap and make it possible for a robot to perform a variety of prior tasks that were seen only in other domains.
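The joint-training recipe is simple in outline: mix the large shared dataset with the small target-task set during training. A minimal sketch, assuming a batch sampler that oversamples the 50 target demonstrations; the paper does not specify its exact mixing scheme, so `target_fraction` and all names here are illustrative:

```python
import random

def make_joint_sampler(bridge_demos, target_demos, target_fraction=0.3, seed=0):
    """Return a function that draws training batches mixing a large shared
    (Bridge-style) dataset with a small set of target-task demonstrations.

    The per-batch mixing ratio is a free choice in this sketch; the paper
    only says the datasets are trained on jointly, so `target_fraction`
    is an illustrative knob, not a value taken from the paper.
    """
    rng = random.Random(seed)

    def sample_batch(batch_size=32):
        n_target = round(batch_size * target_fraction)
        batch = rng.choices(target_demos, k=n_target)
        batch += rng.choices(bridge_demos, k=batch_size - n_target)
        rng.shuffle(batch)
        return batch

    return sample_batch

# Toy usage mirroring the paper's scale: 7,200 shared demos vs. 50 target demos.
bridge = [("bridge", i) for i in range(7200)]
target = [("target", i) for i in range(50)]
sample_batch = make_joint_sampler(bridge, target)
batch = sample_batch(32)
# Oversampling keeps the 50 target demos visible in every batch even
# though they are under 1% of the combined data.
```

The design point is that without some form of oversampling, the target task would appear in roughly 1 of every 145 samples and could be drowned out by the shared data.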

What carries the argument

The Bridge Data collection, which supplies cross-task and cross-domain demonstrations so that end-to-end policies trained on it generalize to unseen tasks and environments.

If this is right

  • Robots can acquire new skills with far less per-project data collection.
  • A small amount of data from a new environment allows reuse of many previously learned skills in that environment.
  • Shared datasets become a practical way to bootstrap learning instead of starting from scratch each time.
  • Generalization improves without exhaustive data collection in every new setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Growing the dataset with additional domains would likely further reduce the number of demonstrations needed for new tasks.
  • The same bridging approach could extend to different robot hardware or sensor suites.
  • If the dataset continues to expand, reliance on simulation for initial training may decrease.

Load-bearing premise

The collected tasks and domains are representative enough that cross-domain data produces positive transfer rather than interference for arbitrary new tasks and environments.

What would settle it

A new task and new domain in which adding the Bridge Data to the 50 target demonstrations lowers success rate below the level achieved with the 50 demonstrations alone.
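The falsification criterion above reduces to a per-task comparison of two success rates. A hypothetical sketch (the `margin` tolerance is an assumption of this sketch, not something the paper defines):

```python
def transfer_verdict(target_only, joint, margin=0.0):
    """Classify cross-domain transfer for one held-out task.

    target_only: success rate with the 50 target demonstrations alone.
    joint: success rate when Bridge Data is added to those demonstrations.
    margin: illustrative tolerance before calling a difference real.
    """
    if joint < target_only - margin:
        return "negative"   # the falsifying outcome described above
    if joint > target_only + margin:
        return "positive"   # the paper's reported regime (~2x on average)
    return "neutral"

# e.g. a task that doubles from 25% to 50% success counts as positive transfer.
verdict = transfer_verdict(0.25, 0.50)
```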

read the original abstract

Robot learning holds the promise of learning policies that generalize broadly. However, such generalization requires sufficiently diverse datasets of the task of interest, which can be prohibitively expensive to collect. In other fields, such as computer vision, it is common to utilize shared, reusable datasets, such as ImageNet, to overcome this challenge, but this has proven difficult in robotics. In this paper, we ask: what would it take to enable practical data reuse in robotics for end-to-end skill learning? We hypothesize that the key is to use datasets with multiple tasks and multiple domains, such that a new user that wants to train their robot to perform a new task in a new domain can include this dataset in their training process and benefit from cross-task and cross-domain generalization. To evaluate this hypothesis, we collect a large multi-domain and multi-task dataset, with 7,200 demonstrations constituting 71 tasks across 10 environments, and empirically study how this data can improve the learning of new tasks in new environments. We find that jointly training with the proposed dataset and 50 demonstrations of a never-before-seen task in a new domain on average leads to a 2x improvement in success rate compared to using target domain data alone. We also find that data for only a few tasks in a new domain can bridge the domain gap and make it possible for a robot to perform a variety of prior tasks that were only seen in other domains. These results suggest that reusing diverse multi-task and multi-domain datasets, including our open-source dataset, may pave the way for broader robot generalization, eliminating the need to re-collect data for each new robot learning project.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Bridge Data, a multi-domain multi-task robotic dataset of 7,200 demonstrations spanning 71 tasks across 10 environments. Its central empirical claim is that jointly training on this dataset together with 50 demonstrations of a previously unseen task in a new domain produces an average 2x improvement in success rate relative to training on the 50 target-domain demonstrations alone; it further reports that limited data in a new domain can enable a robot to perform tasks previously observed only in other domains.

Significance. If the reported gains are robust, the work supplies concrete evidence that large-scale, reusable cross-domain datasets can materially reduce per-task data collection costs in robot learning, mirroring the role of ImageNet-style resources in vision. The open release of the dataset itself constitutes a reusable asset for the community.

major comments (2)
  1. [Experimental Evaluation] Experimental section: the manuscript reports an average 2x success-rate gain but supplies insufficient detail on training procedures, baseline implementations, number of independent runs per condition, observed variance, and whether statistical tests were used to establish significance of the improvement over the target-only baseline. These omissions make it difficult to rule out post-hoc selection effects or implementation differences.
  2. [§5] §5 (held-out evaluation): all reported test tasks are drawn from the same overall collection protocol and visual regimes as the training environments. This limits the strength of the claim that the dataset produces positive transfer for arbitrary new domains; the current results do not yet demonstrate robustness to substantial changes in lighting, object appearance, robot kinematics, or task structure outside the 10 environments.

minor comments (2)
  1. [Abstract] Abstract: the phrase 'on average leads to a 2x improvement' should be accompanied by the precise mean and a measure of spread (standard deviation or range) across the evaluated tasks.
  2. [Dataset Description] Dataset description: the selection criteria for the 10 environments and 71 tasks should be stated more explicitly so readers can assess how representative they are of typical manipulation scenarios.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment below and will revise the manuscript to improve experimental transparency and clarify the scope of our claims.

read point-by-point responses
  1. Referee: [Experimental Evaluation] Experimental section: the manuscript reports an average 2x success-rate gain but supplies insufficient detail on training procedures, baseline implementations, number of independent runs per condition, observed variance, and whether statistical tests were used to establish significance of the improvement over the target-only baseline. These omissions make it difficult to rule out post-hoc selection effects or implementation differences.

    Authors: We agree that additional experimental details are required for reproducibility and to strengthen confidence in the results. In the revised manuscript we will expand the experimental section to provide: a full description of training procedures including all hyperparameters, network architectures, and optimization settings; explicit implementation details for each baseline; the number of independent runs per condition (five runs were performed); observed variance reported as standard deviations; and results from statistical significance tests (paired t-tests) confirming the 2x improvement over the target-only baseline. These additions will directly address concerns about implementation differences and selection effects. revision: yes
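The promised paired t-test compares, task by task, the target-only baseline against joint training. A self-contained sketch of that statistic on hypothetical per-task success rates (the numbers below are illustrative, not the paper's results):

```python
import math

def paired_t_statistic(baseline, treatment):
    """Paired t statistic on per-task success-rate differences.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the pairwise differences
    and sd uses the sample standard deviation (n - 1 denominator).
    Compared against the t distribution with n - 1 degrees of freedom.
    """
    assert len(baseline) == len(treatment) and len(baseline) > 1
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical per-task success rates (NOT from the paper), one pair per
# held-out task: target-only baseline vs. joint training with Bridge Data.
baseline  = [0.20, 0.15, 0.30, 0.25, 0.10]
treatment = [0.45, 0.30, 0.55, 0.50, 0.25]
t_stat = paired_t_statistic(baseline, treatment)
# A large positive t supports a significant improvement over the baseline.
```

Pairing by task matters here: per-task difficulty varies widely, and the paired test removes that between-task variance from the comparison.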

  2. Referee: [§5] §5 (held-out evaluation): all reported test tasks are drawn from the same overall collection protocol and visual regimes as the training environments. This limits the strength of the claim that the dataset produces positive transfer for arbitrary new domains; the current results do not yet demonstrate robustness to substantial changes in lighting, object appearance, robot kinematics, or task structure outside the 10 environments.

    Authors: We acknowledge that the held-out tasks share the same overall collection protocol and visual regimes as the training environments. While the ten environments already include meaningful diversity in settings, objects, and lighting, the results do not demonstrate robustness to arbitrary new domains involving major shifts such as different robot kinematics or extreme lighting changes outside the collected data. In the revision we will update §5 and the discussion to more precisely scope our claims to positive transfer across the diversity present in Bridge Data, while explicitly noting this limitation for broader generalization. This clarification will better contextualize the empirical findings. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical success rates are measured outcomes, not reductions to fitted inputs

full rationale

The paper collects a multi-task, multi-domain dataset of 7,200 demonstrations and reports measured success rates on held-out tasks when training with the dataset plus 50 target demonstrations. These results are direct experimental measurements rather than predictions derived from equations or parameters fitted inside the work. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain; the central claim rests on independent robot trials whose outcomes are not tautological with the data-collection protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that joint training on the collected data improves performance; it assumes standard imitation or reinforcement learning algorithms can leverage cross-domain demonstrations without negative transfer.

axioms (1)
  • domain assumption Standard policy learning algorithms can effectively utilize demonstrations from multiple tasks and domains without negative interference.
    Implicit in the joint training setup described in the abstract.

pith-pipeline@v0.9.0 · 5626 in / 1215 out tokens · 43714 ms · 2026-05-13T19:51:16.787575+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  2. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  3. BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

    cs.RO 2026-04 conditional novelty 7.0

    BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

  4. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    cs.RO 2023-04 conditional novelty 7.0

    Low-cost imprecise robots achieve 80-90% success on six fine bimanual manipulation tasks using imitation learning with a new Action Chunking with Transformers algorithm trained on only 10 minutes of demonstrations.

  5. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  6. BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    BEACON uses discrepancy-aware importance reweighting to jointly train diffusion-based robot policies and source sample weights, improving performance over target-only and fixed-ratio baselines in cross-domain manipula...

  7. BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    BEACON uses discrepancy-aware importance reweighting to co-train generative robot policies from abundant source and limited target demonstrations, yielding better robustness and implicit feature alignment.

  8. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  9. Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

  10. Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.

  11. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    cs.RO 2025-06 unverdicted novelty 6.0

    RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

  12. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  13. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  14. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  15. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  16. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  17. Octo: An Open-Source Generalist Robot Policy

    cs.RO 2024-05 unverdicted novelty 6.0

    Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.

  18. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    cs.RO 2024-03 accept novelty 6.0

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  19. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    cs.RO 2023-12 conditional novelty 6.0

    A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

  20. Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

    cs.RO 2026-04 unverdicted novelty 5.0

    Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...

  21. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  22. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  23. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 21 Pith papers · 1 internal anchor

  1. [1]

    ImageNet classification with deep convolutional neural networks

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012

  2. [2]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018

  3. [3]

    ImageNet: A large-scale hierarchical image database

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Conference on Computer Vision and Pattern Recognition, 2009

  4. [4]

    Gradient surgery for multi-task learning,

    T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” arXiv preprint arXiv:2001.06782, 2020

  5. [5]

    Mt-opt: Continuous multi-task robotic reinforcement learning at scale,

    D. Kalashnikov, J. Varley, Y. Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman, “Mt-opt: Continuous multi-task robotic reinforcement learning at scale,” arXiv preprint arXiv:2104.08212, 2021

  6. [6]

    Robonet: Large-scale multi-robot learning,

    S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, “Robonet: Large-scale multi-robot learning,” arXiv preprint arXiv:1910.11215, 2019

  7. [7]

    One-shot visual imitation learning via meta-learning,

    C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine, “One-shot visual imitation learning via meta-learning,” in Conference on Robot Learning. PMLR, 2017, pp. 357–368

  8. [8]

    One-Shot Imitation Learning

    Y. Duan, M. Andrychowicz, B. C. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba, “One-shot imitation learning,” arXiv preprint arXiv:1703.07326, 2017

  9. [9]

    Generative adversarial imitation learning,

    J. Ho and S. Ermon, “Generative adversarial imitation learning,” arXiv preprint arXiv:1606.03476, 2016

  10. [10]

    One-shot imitation from observing humans via domain-adaptive meta-learning,

    T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine, “One-shot imitation from observing humans via domain-adaptive meta-learning,” arXiv preprint arXiv:1802.01557, 2018

  11. [11]

    Imitation from observation: Learning to imitate behaviors from raw video via context translation

    Y. Liu, A. Gupta, P. Abbeel, and S. Levine, “Imitation from observation: Learning to imitate behaviors from raw video via context translation,” in International Conference on Robotics and Automation (ICRA), 2018

  12. [12]

    Time-contrastive networks: Self-supervised learning from video,

    P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain, “Time-contrastive networks: Self-supervised learning from video,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1134–1141

  13. [13]

    Human-centered collaborative robots with deep reinforcement learning

    A. Ghadirzadeh, X. Chen, W. Yin, Z. Yi, M. Bjorkman, and D. Kragic, “Human-centered collaborative robots with deep reinforcement learning,” IEEE Robotics and Automation Letters, 2020

  14. [14]

    Model-based visual planning with self-supervised functional distances

    S. Tian, S. Nair, F. Ebert, S. Dasari, B. Eysenbach, C. Finn, and S. Levine, “Model-based visual planning with self-supervised functional distances,” arXiv preprint arXiv:2012.15373, 2020

  15. [15]

    Deep imitation learning for complex manipulation tasks from virtual reality teleoperation,

    T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel, “Deep imitation learning for complex manipulation tasks from virtual reality teleoperation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 5628–5635

  16. [16]

    Multiple interactions made easy (mime): Large scale demonstrations data for imitation,

    P. Sharma, L. Mohan, L. Pinto, and A. Gupta, “Multiple interactions made easy (mime): Large scale demonstrations data for imitation,” in Conference on Robot Learning. PMLR, 2018, pp. 906–915

  17. [17]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation,

    A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, S. Savarese, and L. Fei-Fei, “Roboturk: A crowdsourcing platform for robotic skill learning through imitation,” in Conference on Robot Learning, 2018

  18. [18]

    Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity,

    A. Mandlekar, J. Booher, M. Spero, A. Tung, A. Gupta, Y. Zhu, A. Garg, S. Savarese, and L. Fei-Fei, “Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity,” arXiv:1911.04052, 2019

  19. [19]

    Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,

    L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in International Conference on Robotics and Automation (ICRA). IEEE, 2016

  20. [20]

    Deep visual foresight for planning robot motion,

    C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2786–2793

  21. [21]

    Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,

    S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018

  22. [22]

    Scalable deep reinforcement learning for vision-based robotic manipulation,

    D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning. PMLR, 2018, pp. 651–673

  23. [23]

    Visual foresight: Model-based deep reinforcement learning for vision-based robotic control,

    F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine, “Visual foresight: Model-based deep reinforcement learning for vision-based robotic control,” arXiv preprint arXiv:1812.00568, 2018

  24. [24]

    TossingBot: Learning to throw arbitrary objects with residual physics

    A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser, “TossingBot: Learning to throw arbitrary objects with residual physics,” IEEE Transactions on Robotics, vol. 36, no. 4, pp. 1307–1319, 2020

  25. [25]

    Visual imitation made easy,

    S. Young, D. Gandhi, S. Tulsiani, A. Gupta, P. Abbeel, and L. Pinto, “Visual imitation made easy,” arXiv e-prints, pp. arXiv–2008, 2020

  26. [26]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Conference on Computer Vision and Pattern Recognition, 2016

  27. [27]

    Deep spatial autoencoders for visuomotor learning,

    C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, “Deep spatial autoencoders for visuomotor learning,” in International Conference on Robotics and Automation (ICRA), 2016

  28. [28]

    End-to-end training of deep visuomotor policies,

    S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016