pith. machine review for the scientific record.

arxiv: 2406.02523 · v1 · submitted 2024-06-04 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 2 Lean theorem links

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Aaron Lo, Abhiram Maddukuri, Abhishek Joshi, Adeet Parikh, Ajay Mandlekar, Lance Zhang, Soroush Nasiriany, Yuke Zhu

Pith reviewed 2026-05-12 23:41 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords robot simulation · imitation learning · synthetic data · generalist robots · kitchen environments · scaling trends · simulation to real · everyday tasks

The pith

Large-scale kitchen simulation enables scaling imitation learning for generalist robots using synthetic data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RoboCasa as a simulation framework to scale robot learning by creating realistic kitchen environments and tasks. It claims that this approach can generate massive synthetic datasets for imitation learning, overcoming the lack of real robot data. Experiments demonstrate a clear scaling trend where more synthetic data improves performance, and policies show promise for real-world tasks. A sympathetic reader would care because this points to a practical way to train versatile robots for daily activities without collecting enormous real-world datasets. If the claim holds, simulation could become the main source for training generalist robots.

Core claim

RoboCasa is a large-scale simulation framework for training generalist robots in everyday kitchen environments. It provides thousands of 3D assets spanning more than 150 object categories, dozens of interactable furniture pieces and appliances, generative-AI enrichment of assets and textures, and a set of 100 tasks including composite tasks guided by large language models. It pairs high-quality human demonstrations with automated trajectory generation to enlarge datasets, and its experiments show clear scaling trends in imitation learning on synthetic data along with promise for harnessing that data in real-world tasks.

What carries the argument

The RoboCasa simulation framework that provides realistic scenes, diverse assets, tasks, and methods to generate large synthetic robot datasets for imitation learning.
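The dataset-enlargement machinery can be illustrated with a toy object-centric re-targeting of demonstration waypoints, the idea behind MimicGen-style trajectory generation (a planar sketch under assumed conventions, not the paper's implementation; `retarget_waypoint` and all numbers are hypothetical):

```python
import math

def retarget_waypoint(wp_xy, src_obj_xy, src_obj_yaw, tgt_obj_xy, tgt_obj_yaw):
    """Map a demonstration waypoint, given in the world frame, from a
    source object's pose to a target object's pose (2D toy version).
    Real systems work in SE(3) and re-plan collision-free segments."""
    # 1. express the waypoint in the source object's local frame
    dx = wp_xy[0] - src_obj_xy[0]
    dy = wp_xy[1] - src_obj_xy[1]
    c, s = math.cos(-src_obj_yaw), math.sin(-src_obj_yaw)
    lx, ly = c * dx - s * dy, s * dx + c * dy
    # 2. re-express that local offset around the target object's pose
    c, s = math.cos(tgt_obj_yaw), math.sin(tgt_obj_yaw)
    return (tgt_obj_xy[0] + c * lx - s * ly,
            tgt_obj_xy[1] + s * lx + c * ly)

# a waypoint 1 m in front of an object at the origin, re-targeted to an
# object moved to (2, 3) and rotated 90 degrees:
new_wp = retarget_waypoint((1.0, 0.0), (0.0, 0.0), 0.0, (2.0, 3.0), math.pi / 2)
# new_wp is approximately (2.0, 4.0)
```

One human demonstration can thus seed many synthetic variants by re-sampling object placements, which is what lets the framework enlarge datasets with minimal human burden.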

Load-bearing premise

The simulation's physical fidelity, asset diversity, and task coverage are sufficient for policies trained entirely in simulation to transfer meaningfully to real robots without extensive additional real-world data.

What would settle it

A real-world experiment where increasing the volume of synthetic training data produces no corresponding increase in task success rates on physical robots, or where real-world performance stays significantly below simulation performance.
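Such a check could be sketched as a regression of real-robot success rate against the logarithm of synthetic dataset size (a minimal illustration with placeholder numbers, not results from the paper):

```python
import math

# Hypothetical (dataset size, real-robot success rate) pairs -- illustrative
# placeholders only, not numbers reported in the paper.
results = [(100, 0.12), (300, 0.18), (1000, 0.27), (3000, 0.35)]

def log_linear_slope(points):
    """Least-squares slope of success rate against log10(dataset size).
    A slope indistinguishable from zero would be the refuting outcome:
    more synthetic data buying no real-world improvement."""
    xs = [math.log10(n) for n, _ in points]
    ys = [s for _, s in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

slope = log_linear_slope(results)  # ~0.16 success-rate gain per 10x data here
```

Error bars from repeated rollouts and a null test on the slope would be needed before reading much into any single fit.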

read the original abstract

Recent advancements in Artificial Intelligence (AI) have largely been propelled by scaling. In Robotics, scaling is hindered by the lack of access to massive robot datasets. We advocate using realistic physical simulation as a means to scale environments, tasks, and datasets for robot learning methods. We present RoboCasa, a large-scale simulation framework for training generalist robots in everyday environments. RoboCasa features realistic and diverse scenes focusing on kitchen environments. We provide thousands of 3D assets across over 150 object categories and dozens of interactable furniture and appliances. We enrich the realism and diversity of our simulation with generative AI tools, such as object assets from text-to-3D models and environment textures from text-to-image models. We design a set of 100 tasks for systematic evaluation, including composite tasks generated by the guidance of large language models. To facilitate learning, we provide high-quality human demonstrations and integrate automated trajectory generation methods to substantially enlarge our datasets with minimal human burden. Our experiments show a clear scaling trend in using synthetically generated robot data for large-scale imitation learning and show great promise in harnessing simulation data in real-world tasks. Videos and open-source code are available at https://robocasa.ai/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces RoboCasa, a large-scale physical simulation framework focused on kitchen environments for training generalist robots. It provides thousands of 3D assets across over 150 categories, dozens of interactable furniture and appliances, 100 tasks (including LLM-guided composite tasks), high-quality human demonstrations, and automated trajectory generation methods to scale datasets with minimal human effort. The central empirical claims are a clear scaling trend in large-scale imitation learning from synthetically generated robot data and great promise for harnessing such simulation data in real-world tasks.

Significance. If the reported scaling trends and sim-to-real transfer results hold, RoboCasa could provide a valuable open resource for addressing data scarcity in robotics by enabling scalable synthetic data generation. The integration of generative AI tools for assets and textures, combined with the release of code and videos, supports reproducibility and community use.

major comments (2)
  1. Abstract: the claim that 'experiments show a clear scaling trend' and 'great promise in harnessing simulation data in real-world tasks' is presented without any quantitative metrics, baselines, error bars, exact data volumes, or real-robot success rates, preventing verification of the central empirical assertions.
  2. Experiments section (implied by abstract claims): the sim-to-real transfer component is load-bearing for the 'great promise' statement yet lacks supporting details on physical fidelity (contact dynamics, friction, object properties), domain randomization, asset quality from text-to-3D models, or ablation results showing real-robot performance improving with synthetic data scale.
minor comments (3)
  1. Provide explicit comparisons to existing simulation frameworks (e.g., AI2-THOR, Habitat) in terms of asset diversity, task coverage, and data generation scale to better situate the contribution.
  2. Expand the description of the 100 tasks and LLM-guided composite task generation with concrete examples and statistics on task complexity.
  3. Ensure all experimental figures and tables include error bars, statistical tests, and clear axis labels for the scaling curves.
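On the error-bar point above: per-task success rates are binomial proportions, so a Wilson score interval is one standard choice for the requested error bars (an illustrative sketch, not the paper's evaluation protocol):

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial success rate -- a standard
    way to put error bars on per-task success rates estimated from a
    small number of evaluation rollouts."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials
                                   + z ** 2 / (4 * trials ** 2))
    return (center - half, center + half)

# 27 successes in 50 rollouts (illustrative numbers):
lo, hi = wilson_interval(27, 50)  # roughly (0.40, 0.67)
```

Unlike a normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small rollout counts, which is the regime most robot evaluations live in.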

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and have made revisions to the paper where necessary to improve the clarity and completeness of our empirical claims.

read point-by-point responses
  1. Referee: Abstract: the claim that 'experiments show a clear scaling trend' and 'great promise in harnessing simulation data in real-world tasks' is presented without any quantitative metrics, baselines, error bars, exact data volumes, or real-robot success rates, preventing verification of the central empirical assertions.

    Authors: We acknowledge the referee's concern regarding the lack of quantitative details in the abstract. To address this, we have revised the abstract to incorporate references to specific quantitative results from our experiments, such as the scaling behavior observed with varying dataset sizes and the success rates in real-world tasks. The full details, including baselines, error bars, exact data volumes, and real-robot performance metrics, are provided in the Experiments section, and the abstract now points to these for verification. revision: yes

  2. Referee: Experiments section (implied by abstract claims): the sim-to-real transfer component is load-bearing for the 'great promise' statement yet lacks supporting details on physical fidelity (contact dynamics, friction, object properties), domain randomization, asset quality from text-to-3D models, or ablation results showing real-robot performance improving with synthetic data scale.

    Authors: We agree that additional details on the sim-to-real aspects would strengthen the manuscript. In the revised version, we have added descriptions of the physical fidelity aspects, including the modeling of contact dynamics, friction, and object properties in the simulator. We have also elaborated on the domain randomization strategies employed and the quality assurance for assets generated via text-to-3D models. Furthermore, we include ablation studies that demonstrate the improvement in real-robot performance with increasing scales of synthetic data. These revisions provide the necessary supporting information for the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical scaling claims

full rationale

The paper introduces the RoboCasa simulation framework and reports direct empirical results from training imitation learning policies on data generated inside it, including scaling trends with synthetic data volume and some real-robot transfer observations. These are observed outcomes of running the described pipelines rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations serve as load-bearing justifications for uniqueness or ansatzes, and no mathematical claims are present that would trigger self-definitional or renaming patterns. The work is self-contained as a new tool plus its experimental evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions of physics engines and generative models rather than introducing new fitted constants or invented physical entities.

axioms (1)
  • domain assumption: Physics simulation in the chosen engine produces trajectories sufficiently close to real-world dynamics for policy transfer
    Invoked implicitly when claiming promise for real-world tasks

pith-pipeline@v0.9.0 · 5543 in / 1267 out tokens · 51519 ms · 2026-05-12T23:41:18.402343+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 8.0

    SafeManip is a new benchmark that applies LTLf monitors to assess temporal safety properties across eight categories in robotic manipulation, demonstrating that task success frequently fails to ensure safe execution i...

  2. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  3. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  4. DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation

    cs.RO 2026-04 unverdicted novelty 7.0

    DockAnywhere lifts single demonstrations to diverse docking points via structure-preserving augmentation and point-cloud spatial editing to improve viewpoint generalization in visuomotor policies for mobile manipulation.

  5. AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    AffordSim is the first simulation framework integrating open-vocabulary 3D affordance detection into scalable manipulation data generation, with a 50-task benchmark showing imitation learning succeeds on grasping but ...

  6. Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving hi...

  7. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  8. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  9. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

    cs.RO 2026-05 unverdicted novelty 6.0

    RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

  10. Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.

  11. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...

  12. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.

  13. LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios

    cs.RO 2026-04 unverdicted novelty 6.0

    LeHome is a simulation platform offering high-fidelity dynamics for robotic manipulation of varied deformable objects in household settings, with support for multiple robot embodiments including low-cost hardware.

  14. Exploring High-Order Self-Similarity for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.

  15. Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...

  16. From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

    cs.RO 2026-04 unverdicted novelty 6.0

    Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.

  17. A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.

  18. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  19. AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    AffordSim integrates open-vocabulary 3D affordance prediction into simulation trajectory generation to create a 50-task benchmark that reaches 93% of manual annotation success rates and enables 24% average zero-shot s...

  20. RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.

  21. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  22. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  23. What Will Happen Next: Large Models-Driven Deduction for Emergency Instances

    cs.AI 2026-05 unverdicted novelty 5.0

    WLDS applies large models with factual and logical calibration to produce diverse text-and-image deductions of emergency scenarios beyond what traditional fixed simulations can generate.

  24. EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development

    cs.RO 2026-04 unverdicted novelty 5.0

    EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.

  25. Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

    cs.CV 2026-04 unverdicted novelty 5.0

    UniSplat learns consistent 3D geometry, appearance, and semantics from unposed images using dual masking, progressive Gaussian splatting, and recalibration to align predictions across tasks.

  26. CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

    cs.RO 2026-04 unverdicted novelty 5.0

    CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...

  27. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  28. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

  29. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

  30. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 28 Pith papers · 8 internal anchors

  1. [1]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  2. [2]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  3. [3]

    Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions

    Yevgen Chebotar, Quan Vuong, Karol Hausman, Fei Xia, Yao Lu, Alex Irpan, Aviral Kumar, Tianhe Yu, Alexander Herzog, Karl Pertsch, et al. Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions. In Conference on Robot Learning, pages 3909–3928. PMLR, 2023

  4. [4]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023

  5. [5]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration et al. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023

  6. [6]

    Imitating task and motion planning with visuomotor transformers

    Murtaza Dalal, Ajay Mandlekar, Caelan Garrett, Ankur Handa, Ruslan Salakhutdinov, and Dieter Fox. Imitating task and motion planning with visuomotor transformers. arXiv preprint arXiv:2305.16309, 2023

  7. [7]

    Robonet: Large-scale multi-robot learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, 2019

  8. [8]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. arXiv preprint arXiv:2212.08051, 2022

  9. [9]

    Bridge data: Boosting generalization of robotic skills with cross-domain datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In Robotics: Science and Systems (RSS), 2022

  10. [10]

    Maniskill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023

  11. [11]

    Benchmarking offline reinforcement learning on real-robot hardware

    Nico Gürtler, Sebastian Blaes, Pavel Kolev, Felix Widmaier, Manuel Wüthrich, Stefan Bauer, Bernhard Schölkopf, and Georg Martius. Benchmarking offline reinforcement learning on real-robot hardware. arXiv preprint arXiv:2307.15690, 2023

  12. [12]

    Scaling up and distilling down: Language-guided robot skill acquisition

    Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning, pages 3766–3777. PMLR, 2023

  13. [13]

    A holistic approach to reactive mobile manipulation

    Jesse Haviland, Niko Sünderhauf, and Peter Corke. A holistic approach to reactive mobile manipulation. IEEE Robotics and Automation Letters, 7(2):3122–3129, 2022

  14. [14]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  15. [15]

    Rlbench: The robot learning benchmark & learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  16. [16]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, 2021

  17. [17]

    Vima: General robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts. In International Conference on Machine Learning, 2023

  18. [18]

    Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018

  19. [19]

    Mt-opt: Continuous multi-task robotic reinforcement learning at scale

    Dmitry Kalashnikov, Jacob Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv preprint arXiv:2104.08212, 2021

  20. [20]

    Droid: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset, 2024

  21. [21]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. AI2-THOR: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017

  22. [22]

    A workflow for offline model-free robotic reinforcement learning

    Aviral Kumar, Anikait Singh, Stephen Tian, Chelsea Finn, and Sergey Levine. A workflow for offline model-free robotic reinforcement learning. arXiv preprint arXiv:2109.10813, 2021

  23. [23]

    Pre-training for robots: Offline RL enables learning new tasks from a handful of trials

    Aviral Kumar, Anikait Singh, Frederik Ebert, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-training for robots: Offline RL enables learning new tasks from a handful of trials. arXiv preprint arXiv:2210.05178, 2022

  24. [24]

    Learning hand-eye coordination for robotic grasping with large-scale data collection

    Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with large-scale data collection. In ISER, pages 173–184, 2016

  25. [25]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020

  26. [26]

    igibson 2.0: Object-centric simulation for robot learning of everyday household tasks

    Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, et al. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272, 2021

  27. [27]

    Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning, pages 80–93. PMLR, 2023

  28. [28]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023

  29. [29]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, 2018

  30. [30]

    Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity

    Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. arXiv preprint arXiv:1911.04052, 2019

  31. [31]

    Learning to generalize across long-horizon tasks from human demonstrations

    Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Silvio Savarese, and Li Fei-Fei. Learning to generalize across long-horizon tasks from human demonstrations. In Robotics: Science and Systems (RSS), 2020

  32. [32]

    Human-in-the-loop imitation learning using remote teleoperation

    Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Human-in-the-loop imitation learning using remote teleoperation. https://arxiv.org/abs/2012.06733, 2020

  34. [34]

    What matters in learning from offline human demonstrations for robot manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning, 2021

  35. [35]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. arXiv preprint arXiv:2310.17596, 2023

  36. [36]

    OpenAI et al. GPT-4 technical report, 2024.

  37. [37]

    Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016.

  38. [38]

    Dean A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.

  39. [39]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.

  40. [40]

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022.

  41. [41]

    Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. Advances in Neural Information Processing Systems, 34:251–266, 2021.

  42. [42]

    Andrew Szot, Max Schwarzer, Harsh Agrawal, Bogdan Mazoure, Walter Talbott, Katherine Metcalf, Natalie Mackraz, Devon Hjelm, and Alexander Toshev. Large language models as generalizable policies for embodied tasks. arXiv preprint arXiv:2310.17722, 2023.

  43. [43]

    Gemini Team. Gemini: A family of highly capable multimodal models, 2024.

  44. [44]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

  45. [45]

    Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, and Xiaolong Wang. GenSim: Generating robotic simulation tasks via large language models. arXiv preprint, 2023.

  46. [46]

    Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation. arXiv preprint arXiv:2311.01455, 2023.

  47. [47]

    Kuan-Ting Yu, Maria Bauza, Nima Fazeli, and Alberto Rodriguez. More than a million ways to be pushed: A high-fidelity experimental dataset of planar pushing. In International Conference on Intelligent Robots and Systems, 2016.

  48. [48]

    Kevin Zakka, Yuval Tassa, and MuJoCo Menagerie Contributors. MuJoCo Menagerie: A collection of high-quality simulation models for MuJoCo, 2022. URL http://github.com/google-deepmind/mujoco_menagerie.

  49. [49]

    Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani, and Johnny Lee. Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, 2020.

  50. [50]

    Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In IEEE International Conference on Robotics and Automation (ICRA), 2018.

  51. [51]

    Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool, János Kramár, Raia Hadsell, Nando de Freitas, et al. Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564, 2018.

  52. [52]

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, and Roberto Martín-Martín. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.

VII. SIMULATOR

We benchmark the speed of our simulator on the PickPlaceCounterToCab task, running for 10 episodes, with each episode spawned in a random scene. We use n...
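The benchmarking procedure described above (time a fixed number of episodes, resetting into a fresh scene each episode, and report simulation throughput) can be sketched as a small harness. This is a minimal illustration, not RoboCasa's actual tooling: the `reset`/`step`/`sample_action` interface and the `DummyEnv` stand-in are assumptions made for a self-contained example.

```python
import time


def benchmark_env(env, n_episodes=10, steps_per_episode=100):
    """Measure average simulation steps per second over several episodes.

    Each episode begins with a reset, mirroring the setup above where
    every episode is spawned in a (randomized) scene.
    """
    total_steps = 0
    start = time.perf_counter()
    for _ in range(n_episodes):
        env.reset()  # per-episode scene spawn
        for _ in range(steps_per_episode):
            env.step(env.sample_action())
            total_steps += 1
    elapsed = time.perf_counter() - start
    return total_steps / elapsed


class DummyEnv:
    """Stand-in with the minimal interface assumed by benchmark_env."""

    def reset(self):
        return 0.0

    def step(self, action):
        return 0.0

    def sample_action(self):
        return None


if __name__ == "__main__":
    sps = benchmark_env(DummyEnv(), n_episodes=10)
    print(f"{sps:.0f} steps/sec")
```

Swapping `DummyEnv` for a real simulator environment (e.g. a robosuite-style task) would reproduce the measurement; the steps-per-episode count and action sampling would then come from the task definition rather than the placeholders here.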