pith. machine review for the scientific record.

arxiv: 2302.11550 · v1 · pith:AWTLHGQFnew · submitted 2023-02-22 · 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG

Scaling Robot Learning with Semantically Imagined Experience

Pith reviewed 2026-05-17 18:52 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.CL cs.CV cs.LG
keywords robot manipulation, data augmentation, diffusion models, generalization, inpainting, text-to-image, policy robustness, semantically imagined experience

The pith

Robot policies trained on data augmented by text-to-image inpainting solve unseen tasks with new objects and resist novel distractors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using text-to-image diffusion models to augment existing robotic manipulation datasets by inpainting new objects, backgrounds, and distractors under text guidance. This creates additional training examples without new robot demonstrations or engineered collection runs. A sympathetic reader would care because large-scale real robot data has been a bottleneck, and this route repurposes foundation models from vision to expand what a policy can handle. Experiments indicate the resulting policies complete tasks involving objects absent from the original data and maintain performance when unfamiliar distractors appear.

Core claim

We term our method Robot Learning with Semantically Imagined Experience (ROSIE). We apply aggressive data augmentation on existing robotic manipulation datasets via inpainting of various unseen objects for manipulation, backgrounds, and distractors using text guidance from state-of-the-art text-to-image diffusion models. Through extensive real-world experiments, manipulation policies trained on the augmented data solve completely unseen tasks with new objects and behave more robustly with respect to novel distractors. The same augmentation also improves robustness and generalization for high-level tasks such as success detection.

What carries the argument

Text-guided inpainting with diffusion models that inserts semantically new objects, backgrounds, and distractors into existing robot manipulation trajectories while keeping the original actions intact.
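
The mechanism above can be sketched in a few lines, under stated assumptions. Everything here is illustrative: `inpaint_stub` is a hypothetical stand-in for a text-guided diffusion inpainting call (it is not the model or API used in the paper), frames are tiny grayscale grids, and the names `Step` and `augment_trajectory` are invented for this example. The point it shows is only that the image changes inside a mask while the recorded action for each timestep is carried over unchanged.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]   # tiny grayscale grid, for illustration only
Mask = List[List[bool]]   # True marks pixels the model may repaint


@dataclass
class Step:
    image: Image
    action: List[float]   # robot command recorded at this timestep


def inpaint_stub(image: Image, mask: Mask, prompt: str) -> Image:
    """Hypothetical stand-in for a text-guided diffusion inpainting call.

    Masked pixels get a value derived from the prompt so the sketch stays
    deterministic; a real pipeline would query a diffusion model here.
    """
    fill = sum(prompt.encode("utf-8")) % 256
    return [
        [fill if masked else pixel for pixel, masked in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def augment_trajectory(
    steps: List[Step],
    masks: List[Mask],
    prompt: str,
    inpaint: Callable[[Image, Mask, str], Image] = inpaint_stub,
) -> List[Step]:
    """Repaint the masked region of every frame; copy actions unchanged."""
    if len(steps) != len(masks):
        raise ValueError("need exactly one mask per step")
    return [
        Step(image=inpaint(step.image, mask, prompt), action=list(step.action))
        for step, mask in zip(steps, masks)
    ]
```

Passing the inpainting function as a parameter keeps the sketch independent of any particular model: swapping in a real inpainting backend would not change how actions are paired with frames.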

If this is right

  • Policies can complete manipulation tasks that involve objects never shown in the original training set.
  • Policies maintain performance when the scene contains distractors not encountered during data collection.
  • High-level modules such as success detectors become more reliable after training on the augmented images.
  • The volume of real robot data needed to reach a given level of generalization drops because the diffusion model supplies the missing variety.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same inpainting technique could be applied to other robot learning settings such as navigation or multi-step assembly where visual variety is also scarce.
  • If the generated images preserve physical contact dynamics, hybrid datasets mixing a small number of real trajectories with many imagined ones may become standard for robot training.

Load-bearing premise

The inpainted images produced by the text-to-image diffusion model are realistic and physically plausible enough that policies trained on them transfer to real robot execution without harmful artifacts or distribution shifts.

What would settle it

Train one policy on the original dataset and another on the same dataset after ROSIE augmentation, then test both on a real robot performing a manipulation task with an object and distractors absent from both the original data and the text prompts used for inpainting; consistent success of the augmented policy where the baseline fails would support the central claim.
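
One way to score such a head-to-head comparison, offered as a sketch rather than the paper's protocol: report each policy's success rate with a binomial standard error, then ask how surprising the augmented policy's success count would be if both policies shared one underlying success rate. The function names and the choice of a one-sided Fisher exact test are assumptions made for this example; the source does not specify a test for this experiment.

```python
import math
from typing import Tuple


def success_rate(successes: int, trials: int) -> Tuple[float, float]:
    """Return (rate, standard error) for a binomial success count."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    rate = successes / trials
    return rate, math.sqrt(rate * (1.0 - rate) / trials)


def fisher_one_sided(
    aug_succ: int, aug_trials: int, base_succ: int, base_trials: int
) -> float:
    """One-sided Fisher exact p-value.

    Probability that the augmented policy gets at least `aug_succ`
    successes if both policies share one success rate, given the pooled
    number of successes (a hypergeometric upper tail).
    """
    total = aug_trials + base_trials
    total_succ = aug_succ + base_succ
    denom = math.comb(total, aug_trials)
    upper = min(total_succ, aug_trials)
    tail = sum(
        math.comb(total_succ, k) * math.comb(total - total_succ, aug_trials - k)
        for k in range(aug_succ, upper + 1)
    )
    return tail / denom
```

The exact test is chosen here because real-robot evaluations typically have few trials per condition, where normal approximations are unreliable. Any counts fed to these functions would come from the actual robot trials; none are supplied here.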

Original abstract

Recent advances in robot learning have shown promise in enabling robots to perform a variety of manipulation tasks and generalize to novel scenarios. One of the key contributing factors to this progress is the scale of robot data used to train the models. To obtain large-scale datasets, prior approaches have relied on either demonstrations requiring high human involvement or engineering-heavy autonomous data collection schemes, both of which are challenging to scale. To mitigate this issue, we propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing to obtain meaningful data for robot learning without requiring additional robot data. We term our method Robot Learning with Semantically Imagined Experience (ROSIE). Specifically, we make use of state-of-the-art text-to-image diffusion models and perform aggressive data augmentation on top of our existing robotic manipulation datasets via inpainting various unseen objects for manipulation, backgrounds, and distractors with text guidance. Through extensive real-world experiments, we show that manipulation policies trained on data augmented this way are able to solve completely unseen tasks with new objects and can behave more robustly w.r.t. novel distractors. In addition, we find that we can improve the robustness and generalization of high-level robot learning tasks such as success detection through training with the diffusion-based data augmentation. The project's website and videos can be found at diffusion-rosie.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ROSIE, which augments existing robotic manipulation datasets by using text-to-image diffusion models to inpaint new objects, backgrounds, and distractors under text guidance. The central empirical claim is that policies trained on the resulting data solve completely unseen manipulation tasks with novel objects and exhibit improved robustness to novel distractors in real-world robot execution; a secondary claim is that the same augmentation improves high-level tasks such as success detection.

Significance. If the real-world transfer results are robust, the approach would provide a practical route to scaling robot data without additional demonstrations or engineered collection, leveraging widely available foundation models. The emphasis on real-robot validation rather than simulation-only results is a positive aspect of the evaluation strategy.

major comments (2)
  1. [§4 (Experiments), abstract] The reported positive real-world results on unseen tasks and novel distractors are presented without quantitative metrics, baselines, or details on trial counts, success criteria, or statistical significance. This absence is load-bearing for the generalization claim, as it prevents assessment of whether observed improvements exceed variance or post-hoc selection effects.
  2. [§3 (Method)] The inpainting procedure is described as operating on individual frames with text guidance but without explicit temporal consistency constraints across a trajectory. If object positions, scales, or contact geometry vary unnaturally between frames, policies could succeed by exploiting these transient artifacts rather than learning stable affordances, directly threatening the robustness-to-distractors claim.
minor comments (2)
  1. [§3.2] The notation for the diffusion model conditioning (text prompt construction) could be clarified with an explicit equation or pseudocode block.
  2. [Figures 2-4] Figure captions for augmented-image examples would benefit from explicit labels indicating which elements were inpainted versus original.
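
The temporal-consistency concern in major comment 2 can be made concrete with a simple diagnostic, sketched here under stated assumptions. The metric (mean absolute pixel change inside the inpainted region between consecutive frames) and the names `masked_frame_delta` and `trajectory_flicker` are illustrative choices for this example, not something the paper or the referee specifies. Raw pixel differences also mix genuine object motion with generation artifacts, so this is at best a coarse first check.

```python
from typing import List

Image = List[List[int]]   # tiny grayscale grid, for illustration only
Mask = List[List[bool]]   # True marks inpainted pixels


def masked_frame_delta(prev: Image, curr: Image, mask: Mask) -> float:
    """Mean absolute pixel change between two frames inside the mask."""
    total, count = 0, 0
    for prev_row, curr_row, mask_row in zip(prev, curr, mask):
        for p, c, m in zip(prev_row, curr_row, mask_row):
            if m:
                total += abs(c - p)
                count += 1
    return total / count if count else 0.0


def union(a: Mask, b: Mask) -> Mask:
    """Pixels inpainted in either of two consecutive frames."""
    return [[x or y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def trajectory_flicker(frames: List[Image], masks: List[Mask]) -> List[float]:
    """Per-transition change inside the inpainted region of a trajectory."""
    if len(frames) != len(masks):
        raise ValueError("need exactly one mask per frame")
    return [
        masked_frame_delta(frames[i], frames[i + 1], union(masks[i], masks[i + 1]))
        for i in range(len(frames) - 1)
    ]
```

Large values concentrated in the inpainted region, while the unmasked background stays stable, would be one signal that the generated object is changing appearance between frames.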

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us strengthen the presentation of our results and method. We address each major point below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: §4 (Experiments) and abstract: the reported positive real-world results on unseen tasks and novel distractors are presented without quantitative metrics, baselines, or details on trial counts, success criteria, or statistical significance. This absence is load-bearing for the generalization claim, as it prevents assessment of whether observed improvements exceed variance or post-hoc selection effects.

    Authors: We agree that quantitative metrics, trial counts, baselines, success criteria, and statistical details are necessary to substantiate the generalization claims. The original manuscript emphasized qualitative real-world demonstrations and video results. In the revised version, we have added these elements to §4: success rates over 30–50 trials per task and condition (with standard errors), explicit success criteria (task completion within a fixed time horizon without dropping objects or violating constraints), comparisons to baselines including unaugmented data and conventional augmentations, and statistical significance via paired t-tests. The abstract has been updated to summarize the quantitative improvements observed. revision: yes

  2. Referee: §3 (Method): the inpainting procedure is described as operating on individual frames with text guidance but without explicit temporal consistency constraints across a trajectory. If object positions, scales, or contact geometry vary unnaturally between frames, policies could succeed by exploiting these transient artifacts rather than learning stable affordances, directly threatening the robustness-to-distractors claim.

    Authors: We acknowledge that frame-independent inpainting with a shared text prompt does not guarantee perfect geometric or contact consistency across time. In the revision, we have expanded §3 with a discussion of this limitation, quantified observed frame-to-frame variations in the generated trajectories, and added an ablation showing that policies retain performance gains even when tested on trajectories with manually enforced consistency. These results indicate that the learned behaviors rely on semantic affordances rather than transient artifacts, though we note that explicit temporal regularization remains an avenue for future work. revision: partial
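
The paired t-test mentioned in the first response can be sketched as follows, assuming one success rate per task for each policy, with tasks as the pairing unit. The function name `paired_t` and that pairing choice are assumptions for illustration; the simulated rebuttal does not say how the pairs are formed. The sketch returns only the statistic and degrees of freedom, since the Python standard library has no t-distribution CDF; `scipy.stats.ttest_rel` would supply a p-value if SciPy is available.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(
    baseline: Sequence[float], augmented: Sequence[float]
) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched scores.

    Each position holds the same task evaluated under both policies.
    A positive statistic means the augmented policy scored higher.
    """
    if len(baseline) != len(augmented) or len(baseline) < 2:
        raise ValueError("need two equally long sequences with >= 2 pairs")
    diffs = [a - b for a, b in zip(augmented, baseline)]
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only a handful of tasks the test has little power, so the statistic is best read alongside the per-task success rates and their standard errors rather than on its own.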

Circularity Check

0 steps flagged

Empirical data-augmentation method with no load-bearing derivations or self-referential reductions

Full rationale

The paper describes an empirical pipeline: apply an external text-to-image diffusion model for inpainting-based augmentation of existing robot datasets, then train and evaluate manipulation policies on real robots. No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs or self-citations by construction. Claims of improved generalization rest on reported real-world experimental outcomes rather than any tautological re-labeling of training data or parameters. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that diffusion-generated images can serve as effective training data for real robot policies. No new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Text-to-image diffusion models can produce inpainted images that are realistic enough for training robot manipulation policies that transfer to the physical world.
    This assumption is required for the augmentation to improve rather than degrade real-world performance.

pith-pipeline@v0.9.0 · 5586 in / 1129 out tokens · 161255 ms · 2026-05-17T18:52:17.658791+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 accept novelty 7.0

    3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.

  2. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  3. DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    cs.RO 2025-05 unverdicted novelty 7.0

    DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...

  4. Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    cs.RO 2023-10 conditional novelty 7.0

    SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.

  5. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  6. What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

    cs.RO 2026-05 conditional novelty 6.0

    PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and ligh...

  7. ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

    cs.CV 2026-05 unverdicted novelty 6.0

    ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.

  8. Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    Lucid-XR uses XR-headset physics simulation and physics-guided video generation to create synthetic data that trains robot policies transferring zero-shot to unseen real-world manipulation tasks.

  9. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  10. GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

    cs.RO 2025-05 unverdicted novelty 6.0

    GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.

  11. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  12. Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    cs.RO 2024-09 unverdicted novelty 6.0

    Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.

  13. Octo: An Open-Source Generalist Robot Policy

    cs.RO 2024-05 unverdicted novelty 6.0

    Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.

  14. BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like t...

  15. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

  16. MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

    cs.RO 2023-10 unverdicted novelty 5.0

    MimicGen creates over 50K robot demonstrations from roughly 200 human ones, allowing imitation learning to achieve strong performance on complex long-horizon tasks like assembly and coffee preparation.

  17. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    The survey organizes 3D generation for embodied AI into data generators for assets, simulation environments for interaction, and sim-to-real bridges, noting a shift toward interaction readiness and listing bottlenecks...

  18. Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

    cs.RO 2026-04 unverdicted novelty 3.0

    A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...

  19. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 2.0

    The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and...

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 17 Pith papers · 21 internal anchors

  1. [1]

    Vima: General robot manipulation with multimodal prompts

    Y . Jiang, A. Gupta, Z. Zhang, G. Wang, Y . Dou, Y . Chen, L. Fei-Fei, A. Anandkumar, Y . Zhu, and L. Fan. Vima: General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094, 2022

  2. [2]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. 12

  3. [3]

    Shridhar, L

    M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, 2022

  4. [4]

    Shridhar, L

    M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. arXiv preprint arXiv:2209.05451, 2022

  5. [5]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. In arXiv:2204.06125, 2022

  6. [6]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022

  7. [7]

    X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022

  8. [8]

    Mt-opt: Continuous multi-task robotic reinforcement learning at scale.arXiv preprint arXiv:2104.08212, 2021

    D. Kalashnikov, J. Varley, Y . Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv preprint arXiv:2104.08212, 2021

  9. [9]

    A. X. Lee, C. M. Devin, Y . Zhou, T. Lampe, K. Bousmalis, J. T. Springenberg, A. Byravan, A. Abdolmaleki, N. Gileadi, D. Khosid, et al. Beyond pick-and-place: Tackling robotic stacking of diverse shapes. In 5th Annual Conference on Robot Learning, 2021

  10. [10]

    E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. InConference on Robot Learning, pages 991–1002. PMLR, 2022

  11. [11]

    F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9068–9079, 2018

  12. [12]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y . Zhu, A. Gupta, and A. Farhadi. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

  13. [13]

    Savva, A

    M. Savva, A. Kadian, O. Maksymets, Y . Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V . Koltun, J. Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019

  14. [14]

    Mehta, M

    B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull. Active domain randomization. In Conference on Robot Learning, pages 1162–1176. PMLR, 2020

  15. [15]

    Sim2Real View Invariant Visual Servoing by Recurrent Control

    F. Sadeghi, A. Toshev, E. Jang, and S. Levine. Sim2real view invariant visual servoing by recurrent control. arXiv preprint arXiv:1712.07642, 2017

  16. [16]

    Solving Rubik's Cube with a Robot Hand

    I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving rubik’s cube with a robot hand.arXiv preprint arXiv:1910.07113, 2019

  17. [17]

    Laskin, A

    M. Laskin, A. Srinivas, and P. Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pages 5639–5650. PMLR, 2020

  18. [18]

    Rombach, A

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. InCVPR, 2022

  19. [19]

    J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514, 2018

  20. [20]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 13

  21. [21]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  22. [22]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311, 2022

  23. [23]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  24. [24]

    Flamingo: a Visual Language Model for Few-Shot Learning

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022

  25. [25]

    Shridhar, J

    M. Shridhar, J. Thomason, D. Gordon, Y . Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020

  26. [26]

    James, Z

    S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  27. [27]

    T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020

  28. [28]

    Mittal, C

    M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, P. P. Tehrani, R. Singh, Y . Guo, et al. Orbit: A unified simulation framework for interactive robot learning environments.arXiv preprint arXiv:2301.04195, 2023

  29. [29]

    Xiang, Y

    F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020

  30. [30]

    Mandlekar, J

    A. Mandlekar, J. Booher, M. Spero, A. Tung, A. Gupta, Y . Zhu, A. Garg, S. Savarese, and L. Fei-Fei. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1048–1055. IEEE, 2019

  31. [31]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    F. Ebert, Y . Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021

  32. [32]

    Kalashnikov, A

    D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V . Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673. PMLR, 2018

  33. [33]

    RoboNet: Large-Scale Multi-Robot Learning

    S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019

  34. [34]

    Tobin, R

    J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017

  35. [35]

    Tremblay, A

    J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V . Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 969–977, 2018

  36. [36]

    Laskin, K

    M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020

  37. [37]

    Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,

    I. Kostrikov, D. Yarats, and R. Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020. 14

  38. [38]

    Hansen, R

    N. Hansen, R. Jangir, Y . Sun, G. Aleny `a, P. Abbeel, A. A. Efros, L. Pinto, and X. Wang. Self-supervised policy adaptation during deployment.arXiv preprint arXiv:2007.04309, 2020

  39. [39]

    B. Li, V . Franc ¸ois-Lavet, T. Doan, and J. Pineau. Domain adversarial reinforcement learning. arXiv preprint arXiv:2102.07097, 2021

  40. [40]

    K. Rao, C. Harris, A. Irpan, S. Levine, J. Ibarz, and M. Khansari. Rl-cyclegan: Reinforcement learning aware simulation-to-real. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11157–11166, 2020

  41. [41]

    D. Ho, K. Rao, Z. Xu, E. Jang, M. Khansari, and Y . Bai. Retinagan: An object-aware approach to sim-to-real transfer. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 10920–10926. IEEE, 2021

  42. [42]

    James, P

    S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12627–12637, 2019

  43. [43]

    Sohl-Dickstein, E

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning , pages 2256–2265. PMLR, 2015

  44. [44]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  45. [45]

    Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based gener- ative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

  46. [46]

    Song and S

    Y . Song and S. Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020

  47. [47]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  48. [48]

    A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021

  49. [49]

    Dhariwal and A

    P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021

  50. [50]

    Classifier-Free Diffusion Guidance

    J. Ho and T. Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  51. [51]

    Planning with Diffusion for Flexible Behavior Synthesis

    M. Janner, Y . Du, J. B. Tenenbaum, and S. Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022

  52. [52]

    W. Liu, T. Hermans, S. Chernova, and C. Paxton. Structdiffusion: Object-centric diffusion for semantic rearrangement of novel objects.arXiv preprint arXiv:2211.04604, 2022

  53. [53]

    Kapelyukh, V

    I. Kapelyukh, V . V osylius, and E. Johns. Dall-e-bot: Introducing web-scale diffusion models to robotics. arXiv preprint arXiv:2210.02438, 2022

  54. [54]

    Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning. arXiv preprint arXiv:2212.05711, 2022

  55. [55]

    Z. Chen, S. Kiami, A. Gupta, and V. Kumar. Genaug: Retargeting behaviors to unseen situations via generative augmentation. arXiv preprint arXiv:2302.06671, 2023

  56. [56]

    A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 341–346, 2001

  57. [57]

    D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016

  58. [58]

    S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):1–14, 2017

  59. [59]

    S. Wang, C. Saharia, C. Montgomery, J. Pont-Tuset, S. Noy, S. Pellegrini, Y. Onoe, S. Laszlo, D. J. Fleet, R. Soricut, et al. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. arXiv preprint arXiv:2212.06909, 2022

  60. [60]

    D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988

  61. [61]

    M. Tan and Q. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/tan19a.html

  62. [62]

    M. Ryoo, A. Piergiovanni, A. Arnab, M. Dehghani, and A. Angelova. Tokenlearner: Adaptive space-time tokenization for videos. Advances in Neural Information Processing Systems, 34:12786–12797, 2021

  63. [63]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  64. [64]

    T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022

  65. [65]

    M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, et al. Simple open-vocabulary object detection with vision transformers. arXiv preprint arXiv:2205.06230, 2022

  66. [66]

    K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. doi:10.1109/ICCV.2017.322

  67. [67]

    A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2018

  68. [68]

    R. Benenson, S. Popov, and V. Ferrari. Large-scale interactive object segmentation with human annotators. In CVPR, 2019

  69. [69]

    M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018. URL http://arxiv.org/abs/1801.04381

  70. [70]

    R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021

  71. [71]

    T. Xiao, H. Chan, P. Sermanet, A. Wahid, A. Brohan, K. Hausman, S. Levine, and J. Tompson. Robotic skill acquisition via instruction augmentation with vision-language models. arXiv preprint arXiv:2211.11736, 2022

  72. [72]

    D. Kalashnikov, J. Varley, Y. Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv, 2021

  73. [73]

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  74. [74]

    J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

  75. [75]

    U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022

  76. [76]

    R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022

  77. [77]

    E. Molad, E. Horwitz, D. Valevski, A. R. Acha, Y. Matias, Y. Pritch, Y. Leviathan, and Y. Hoshen. Dreamix: Video diffusion models are general video editors, 2023. URL https://arxiv.org/abs/2302.01329

  78. [78]

    H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023

Appendices

A Experiment Details

A.1 Implementation Details and Hyperparameters

We take a pre-trained RT-1 policy with 35M ...