
arxiv: 2606.17511 · v1 · pith:6V7W2EJKnew · submitted 2026-06-16 · 💻 cs.RO · cs.AI · cs.CV

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

Pith reviewed 2026-06-27 00:58 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords embodied simulation · robot learning · unified runtime · Markov decision process · automatic trajectory generation · planner-in-the-loop · YAML world specification · multimodal data collection

The pith

MagicSim unifies world construction, embodied execution, evaluation, rollout generation, and agent interaction in one deterministic runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MagicSim to address how robot learning simulations are currently split across disconnected layers that use magic actions or forward-only renders. It builds everything around one deterministic batched runtime and a shared Markov decision process. YAML specifications define contents, placement, behavior, and agent exposure separately, then the system constructs executable worlds that support multiple task families and embodiments in a single reset-and-step loop. High-level commands are grounded through skills and planners into actual robot actions rather than simulator edits. The same task definition then enables benchmarking, automatic trajectory collection, and interactive interfaces while saving structured multimodal data from successful episodes.

Core claim

MagicSim constructs diverse executable worlds from YAML-first specifications and realizes high-level commands as robot actions inside one deterministic batched runtime and shared MDP. A common execution interface routes commands through controllers, atomic skills, planner primitives, and asynchronous planning. One task definition supports benchmark and RL evaluation, an autocollect interface that turns commands into grounded trajectories, and agent or VLM-facing interaction. Commands advance through a Command->Skill->Planner->Robot->Record pipeline while per-environment states progress independently above the shared physics tick, and successful rollouts are recorded as structured multimodal trajectories.

What carries the argument

The deterministic batched runtime and shared MDP, which together execute a Command->Skill->Planner->Robot->Record pipeline and ground high-level commands as robot actions rather than direct state edits.
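As a rough illustration of that pipeline, the Python sketch below pushes one command through skill lookup, planning, robot actuation, and recording on a toy one-dimensional robot. The command names, the `skill_for` table, and the step-bounded planner are all assumptions made for the example and are not MagicSim's API. The point it illustrates is that the robot's state changes only through `apply`, never by writing the goal position directly.

```python
from dataclasses import dataclass, field


@dataclass
class Robot:
    """Toy 1-D robot: its state only changes through apply()."""
    position: float = 0.0

    def apply(self, action: float) -> None:
        self.position += action


@dataclass
class Recorder:
    steps: list = field(default_factory=list)

    def log(self, command: str, action: float, position: float) -> None:
        self.steps.append(
            {"command": command, "action": action, "position": position}
        )


def skill_for(command: str) -> float:
    """Command -> Skill: map a high-level command to a target position."""
    targets = {"go_to_shelf": 3.0, "go_home": 0.0}  # hypothetical commands
    return targets[command]


def plan(start: float, goal: float, max_step: float = 1.0) -> list:
    """Skill -> Planner: split the motion into bounded per-step actions."""
    actions = []
    remaining = goal - start
    while abs(remaining) > 1e-9:
        step = max(-max_step, min(max_step, remaining))
        actions.append(step)
        remaining -= step
    return actions


def execute(command: str, robot: Robot, recorder: Recorder) -> None:
    """Planner -> Robot -> Record: apply each action and log the result."""
    goal = skill_for(command)
    for action in plan(robot.position, goal):
        robot.apply(action)
        recorder.log(command, action, robot.position)


robot, recorder = Robot(), Recorder()
execute("go_to_shelf", robot, recorder)
```

A "magic" action, by contrast, would amount to assigning `robot.position = 3.0` in one line, which leaves no executed actions to record.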

If this is right

  • One task definition supports three distinct uses: benchmark evaluation, automatic rollout collection, and interactive agent interfaces.
  • Commands are turned into grounded robot trajectories that align language supervision with action, visual, and task status representations.
  • Per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick.
  • Successful episodes are saved as structured multimodal trajectories for downstream training or analysis.
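The last two points can be pictured as a record type that keeps the instruction, per-step actions, a pointer to the visual observation, and task status together, with only successful episodes written out. The field names and the JSON serialization below are assumptions for illustration; the summary does not specify the paper's storage format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Step:
    action: list   # action representation, e.g. joint deltas
    rgb_path: str  # pointer to the visual observation for this step
    status: str    # task-level status at this step


@dataclass
class Trajectory:
    instruction: str  # language supervision for the episode
    steps: list
    success: bool


def save_successful(trajectories: list) -> list:
    """Serialize only successful episodes, one JSON string per episode."""
    return [
        json.dumps(asdict(t), sort_keys=True)
        for t in trajectories
        if t.success
    ]


episodes = [
    Trajectory(
        instruction="pick up the mug",
        steps=[Step([0.1, 0.0], "ep0/000.png", "reaching"),
               Step([0.0, 0.2], "ep0/001.png", "grasped")],
        success=True,
    ),
    Trajectory(
        instruction="open the drawer",
        steps=[Step([0.3, 0.0], "ep1/000.png", "reaching")],
        success=False,
    ),
]

records = save_successful(episodes)
```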

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The unified loop could simplify scaling of language-conditioned robot policies by removing the need to maintain separate collection and evaluation codebases.
  • It might enable tighter closed-loop testing of planner primitives directly inside the same environment used for data generation.
  • Future work could test whether adding new sensor models or physics variants requires changes only to the YAML layer or also to the core execution loop.

Load-bearing premise

A single deterministic batched runtime and shared MDP can support all diverse task families, interaction regimes, physics, sensors, and embodiments without significant trade-offs in performance or fidelity.

What would settle it

A head-to-head test on a complex multi-embodiment task where MagicSim produces measurably lower physics fidelity or slower per-step throughput than a specialized simulator built only for that task family.
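The throughput half of such a test could be measured with a harness along these lines. This is a minimal sketch: `batched_step` is a toy stand-in for a simulator's batched physics tick, not any real simulator call, and the environment counts are arbitrary. A real comparison would run the same harness against each simulator's own step function on the same task and hardware.

```python
import time


def batched_step(states: list) -> list:
    """Stand-in for one physics tick over all environments (toy dynamics)."""
    return [s + 1 for s in states]


def measure_throughput(num_envs: int, num_ticks: int) -> float:
    """Return environment-steps per second for the toy batched step."""
    states = [0] * num_envs
    start = time.perf_counter()
    for _ in range(num_ticks):
        states = batched_step(states)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return (num_envs * num_ticks) / max(elapsed, 1e-9)


# Sweep a few environment counts, as a scaling study would.
results = {n: measure_throughput(n, num_ticks=200) for n in (1, 16, 256)}
```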

Original abstract

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents MagicSim, a unified infrastructure for embodied interaction in robotics. It is built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, the system constructs diverse executable worlds spanning task families, interaction regimes, physics, sensors, avatars, and robot embodiments. A common Command->Skill->Planner->Robot execution interface grounds high-level commands as robot actions. One task definition supports benchmark/RL evaluation, automatic rollout generation via autocollect, and interactive agent/VLM interfaces, with per-environment states advancing independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories. The paper claims this unifies world construction, embodied execution, task evaluation, automatic rollout generation, and interactive interfaces in one planner-in-the-loop runtime.

Significance. If the system performs as described without the hypothesized fidelity or throughput trade-offs, MagicSim would offer a meaningful contribution to robot learning by replacing fragmented simulation pipelines with a single shared substrate that consistently links control, skills, planning, evaluation, and data collection across heterogeneous tasks and embodiments.

major comments (1)
  1. [Abstract] Abstract and overall manuscript: the central claim that one deterministic batched runtime and shared MDP can instantiate and execute worlds spanning diverse task families, physics, sensors, and embodiments without significant performance or fidelity trade-offs is load-bearing for the contribution, yet the manuscript supplies no implementation details, throughput measurements, error rates, fidelity comparisons, or ablation studies to support it.
minor comments (1)
  1. The description of independent per-environment states advancing above the shared tick would benefit from a diagram or pseudocode to clarify the separation between command/skill/planning layers and the physics tick.
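The separation the referee asks about might look something like the sketch below, in which every environment shares one tick loop but keeps its own high-level phase. The `Phase` values, the tick counters, and the `advance` function are invented for this illustration and are not taken from the manuscript. The idea it tries to capture is that a slow planner in one environment does not stall another environment that is already executing.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PLANNING = "planning"
    EXECUTING = "executing"
    DONE = "done"


@dataclass
class EnvState:
    plan_ticks: int  # ticks the (asynchronous) planner still needs
    exec_ticks: int  # ticks of robot motion still to execute
    phase: Phase = Phase.PLANNING


def advance(env: EnvState) -> None:
    """Advance one environment's high-level state by one shared tick."""
    if env.phase is Phase.PLANNING:
        env.plan_ticks -= 1
        if env.plan_ticks <= 0:
            env.phase = Phase.EXECUTING
    elif env.phase is Phase.EXECUTING:
        env.exec_ticks -= 1
        if env.exec_ticks <= 0:
            env.phase = Phase.DONE


def run(envs: list, max_ticks: int) -> list:
    """Step all environments on one shared tick; return phases per tick."""
    history = []
    for _ in range(max_ticks):
        # A real runtime would step batched physics here, once for all envs.
        for env in envs:
            advance(env)
        history.append([env.phase for env in envs])
    return history


envs = [
    EnvState(plan_ticks=1, exec_ticks=2),  # fast planner, longer motion
    EnvState(plan_ticks=3, exec_ticks=1),  # slow planner, short motion
]
history = run(envs, max_ticks=4)
```

After the first tick the two environments are already in different phases (one executing, one still planning), even though both were advanced by the same loop iteration.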

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for identifying the load-bearing nature of our central claim and the absence of supporting empirical evidence. We agree this requires strengthening and will revise the manuscript to include the requested details.

Point-by-point responses
  1. Referee: [Abstract] Abstract and overall manuscript: the central claim that one deterministic batched runtime and shared MDP can instantiate and execute worlds spanning diverse task families, physics, sensors, and embodiments without significant performance or fidelity trade-offs is load-bearing for the contribution, yet the manuscript supplies no implementation details, throughput measurements, error rates, fidelity comparisons, or ablation studies to support it.

    Authors: We agree that the claim is central and that the current manuscript does not provide the requested quantitative support. The manuscript emphasizes the architectural unification via the YAML-first specifications, shared MDP, and Command->Skill->Planner->Robot pipeline but lacks implementation specifics on the batched runtime, performance metrics, or comparisons. In revision we will add: (1) detailed implementation of the deterministic batched runtime and per-environment state advancement; (2) throughput measurements (steps/sec across environment counts and task types); (3) error rates for rollout generation and task success; (4) fidelity comparisons against standard simulators for physics, sensors, and embodiments; and (5) ablations isolating the effects of batching and the shared MDP. These additions will directly address whether significant trade-offs exist.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: system-description paper with no derivations, predictions, or load-bearing equations

Full rationale

The manuscript is an infrastructure/system paper whose central claim is the existence and unification of a deterministic batched runtime + shared MDP that supports diverse embodied tasks. No equations, fitted parameters, predictions, or derivation chain appear in the abstract or described full text. The architecture (YAML decoupling, Command->Skill->Planner->Robot pipeline, per-env state above shared tick) is presented descriptively; success is not claimed via reduction to prior self-defined quantities or self-citations. The reader's assessment of score 1.0 is consistent with the absence of any of the enumerated circularity patterns. The paper is self-contained against external benchmarks in the sense that its claims are architectural assertions open to empirical validation outside any internal fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or data-fitting steps are described; the contribution is a software infrastructure rather than a parameterized model.

pith-pipeline@v0.9.1-grok · 5878 in / 1200 out tokens · 56129 ms · 2026-06-27T00:58:09.721798+00:00 · methodology


Reference graph

Works this paper leans on

123 extracted references · 8 canonical work pages

  1. [1]

    Pi-0.7: A steerable generalist robotic foundation model with emergent capabilities

    Physical Intelligence. Pi-0.7: A steerable generalist robotic foundation model with emergent capabilities. arXiv preprint, 2026. CorpusID: 287607456

  2. [2]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

  3. [3]

    URLhttps://api.semanticscholar.org/CorpusID:277993634

  4. [4]

    Gen-0: Embodied foundation models that scale with physical interaction

    Generalist AI Team. Gen-0: Embodied foundation models that scale with physical interaction. Generalist AI Blog, 2025. November 4, 2025

  5. [5]

    Gr00t n1: An open foundation model for generalist humanoid robots

    NVIDIA. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv:2503.14734, 2025

  6. [6]

    World action models are zero-shot policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, and Joel Jang. World action models are zero-shot policies. arXiv:2602.15922, 2026. 53

  7. [7]

    Fast-wam: Do world action models need test-time future imagination?, 2026

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?, 2026. URLhttps://arxiv.org/abs/2603.16666

  8. [8]

    Learning to feel the future: Dreamtacvla for contact-rich manipulation.ArXiv, abs/2512.23864, 2025

    Guo Ye, Zexi Zhang, Xu Zhao, Shang Wu, Haoran Lu, Shihan Lu, and Han Liu. Learning to feel the future: Dreamtacvla for contact-rich manipulation.ArXiv, abs/2512.23864, 2025. URLhttps://api.semanticscholar. org/CorpusID:284350273

  9. [9]

    Vagen: Reinforcing world model reasoning for multi-turn vlm agents.ArXiv, abs/2510.16907, 2025

    Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Fei-Fei Li, Yejin Choi, and Manling Li. Vagen: Reinforcing world model reasoning for multi-turn vlm agents.ArXiv, abs/2510.16907, 2025. URL https://api.semanticscholar.org/CorpusID:282210682

  10. [10]

    Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Fei-Fei Li, Lijuan Wang, Yejin Choi, and Manling Li

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Monica S. Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Fei-Fei Li, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.ArXiv, abs/2504.20073, 2025....

  11. [11]

    Embodied ai agents: Modeling the world.ArXiv, abs/2506.22355, 2025

    Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hervé Jégou, Alessandro Lazaric, Arjun Majumdar, Andrea Madotto, Franziska Meier, Florian Metze, Théo Moutakanni, Juan Pino, Basile Terver, Joseph Tighe, and Jitendra Malik. Embodied ai agents: Modeling the world.ArXiv, abs/2506.22355, 2025. URLh...

  12. [12]

    MuJoCo: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. InIEEE/RSJ International Conference on Intelligent Robots and Systems, 2012

  13. [13]

    Isaac Gym: High performance GPU-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac Gym: High performance GPU-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

  14. [14]

    Chang, Leonidas J

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part-based interactive environment. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

  15. [15]

    Learning part-aware dense 3d feature field for generalizable articulated object manipulation, 2026

    Yue Chen, Muqing Jiang, Kaifeng Zheng, Jiaqi Liang, Chenrui Tie, Haoran Lu, Ruihai Wu, and Hao Dong. Learning part-aware dense 3d feature field for generalizable articulated object manipulation, 2026. URL https://arxiv.org/abs/2602.14193

  16. [17]

    Broadcasting support relations recursively from local dynamics for object retrieval in clutters.ArXiv, abs/2406.02283, 2024

    Yitong Li, Ruihai Wu, Haoran Lu, Chuanruo Ning, Yan Shen, Guanqi Zhan, and Hao Dong. Broadcasting support relations recursively from local dynamics for object retrieval in clutters.ArXiv, abs/2406.02283, 2024. URLhttps://api.semanticscholar.org/CorpusID:270226492

  17. [18]

    Neural dynamics augmented diffusion policy

    Ruihai Wu, Haozhe Chen, Mingtong Zhang, Haoran Lu, Yitong Li, and Yunzhu Li. Neural dynamics augmented diffusion policy. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13234–13241,

  18. [19]

    doi: 10.1109/ICRA55743.2025.11128651

  19. [20]

    Garmentlab: A unified simulation and benchmark for garment manipula- tion

    Haoran Lu, Ruihai Wu, Yitong Li, Sijie Li, Ziyu Zhu, Chuanruo Ning, Yan Shen, Longzan Luo, Yuan- pei Chen, and Hao Dong. Garmentlab: A unified simulation and benchmark for garment manipula- tion. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, page...

  20. [21]

    Unigarment: A unified simulation and benchmark for garment manipulation, 2025

    Haoran Lu, Yitong Li, Ruihai Wu, Chuanruo Ning, Yan Shen, and Hao Dong. Unigarment: A unified simulation and benchmark for garment manipulation, 2025. URLhttps://api.semanticscholar.org/CorpusID:275782214. Manuscript

  21. [22]

    Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning.arXiv preprint arXiv:2504.18904, 2025

    Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning.arXiv preprint arXiv:2504.18904, 2025

  22. [23]

    Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025. 54

  23. [24]

    Maniskill2: A unified benchmark for generalizable manipulation skills.arXiv preprint arXiv:2302.04659, 2023

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills.arXiv preprint arXiv:2302.04659, 2023

  24. [25]

    Habitat: A platform for embodied ai research.arXiv preprint arXiv:1904.01201, 2019

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research.arXiv preprint arXiv:1904.01201, 2019

  25. [26]

    Tchapmi, Micael E

    Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Claudia Pérez-D’Arpino, Shyamal Buch, Sanjana Srivastava, Lyne P. Tchapmi, Micael E. Tchapmi, Kent Vainio, Josiah Wong, Li Fei-Fei, and Silvio Savarese. igibson 1.0: A simulation environment for interactive tasks in large realistic scenes.arXiv preprint arXiv:2012.02924, 2020

  26. [27]

    Karen Liu, Jiajun Wu, and Li Fei-Fei

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, Hang Yin, Michael Lingelbach, Minjune Hwang, Ayano Hiranaka, Sujay Garlanka, Arman Aydin, Sharon Lee, Jiankai Sun, Mona Anvari, Manasi Sharma, Dhruva Bansal, Samuel Hunter, Kyu-Young Kim, Alan Lou, Caleb R ...

  27. [28]

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark & learning environment.arXiv preprint arXiv:1909.12271, 2019

  28. [29]

    CALVIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.arXiv preprint arXiv:2112.03227, 2021

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.arXiv preprint arXiv:2112.03227, 2021

  29. [30]

    RoboCasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

  30. [31]

    Open x-embodiment: Robotic learning datasets and RT-X models.arXiv preprint arXiv:2310.08864, 2023

    Open X-Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and RT-X models.arXiv preprint arXiv:2310.08864, 2023

  31. [32]

    Bridgedata v2: A dataset for robot learning at scale.arXiv preprint arXiv:2308.12952, 2023

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale.arXiv preprint arXiv:2308.12952, 2023

  32. [33]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  33. [34]

    Vima: General robot manipulation with multimodal prompts.arXiv preprint arXiv:2210.03094, 2022

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts.arXiv preprint arXiv:2210.03094, 2022

  34. [35]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In Conference on Robot Learning (CoRL), 2023. arXiv:2310.17596

  35. [36]

    Sucan, Mark Moll, and Lydia E

    Ioan A. Sucan, Mark Moll, and Lydia E. Kavraki. The open motion planning library.IEEE Robotics & Automation Magazine, 19(4):72–82, 2012

  36. [37]

    Reducing the barrier to entry of complex robotic software: a MoveIt! case study.arXiv preprint arXiv:1404.3785, 2014

    David Coleman, Ioan Sucan, Sachin Chitta, and Nikolaus Correll. Reducing the barrier to entry of complex robotic software: a MoveIt! case study.arXiv preprint arXiv:1404.3785, 2014

  37. [38]

    Hierarchical task and motion planning in the now

    Leslie Pack Kaelbling and Tomás Lozano-Pérez. Hierarchical task and motion planning in the now. In2011 IEEE International Conference on Robotics and Automation, pages 1470–1477, 2011

  38. [39]

    A survey of optimization- based task and motion planning: From classical to learning approaches.arXiv preprint arXiv:2404.02817, 2024

    Zhigen Zhao, Shuo Cheng, Yan Ding, Ziyi Zhou, Shiqi Zhang, Danfei Xu, and Ye Zhao. A survey of optimization- based task and motion planning: From classical to learning approaches.arXiv preprint arXiv:2404.02817, 2024

  39. [40]

    curobo: Parallelized collision-free minimum-jerk robot motion generation.arXiv preprint arXiv:2310.17274, 2023

    Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. curobo: Parallelized collision-free minimum-jerk robot motion generation.arXiv preprint arXiv:2310.17274, 2023. 55

  40. [41]

    RT-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  41. [42]

    RT-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023

  42. [43]

    Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

  43. [44]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

  44. [45]

    Orbit: A unified simulation framework for interactive robot learning environments.IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023

    Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, and Animesh Garg. Orbit: A unified simulation framework for interactive robot learning environments.IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023

  45. [46]

    Domain random- ization for transferring deep neural networks from simulation to the real world

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain random- ization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017

  46. [47]

    Openai gym.arXiv preprint arXiv:1606.01540, 2016

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym.arXiv preprint arXiv:1606.01540, 2016

  47. [48]

    Hybridflow: A flexible and efficient RLHF framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. InProceedings of the Twentieth European Conference on Computer Systems (EuroSys), 2025. The verl library implements HybridFlow

  48. [49]

    RLinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation.arXiv preprint arXiv:2509.15965, 2025

    Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, et al. RLinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation.arXiv preprint arXiv:2509.15965, 2025

  49. [50]

    Scenesmith: Agentic generation of simulation-ready indoor scenes, 2026

    Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, and Russ Tedrake. Scenesmith: Agentic generation of simulation-ready indoor scenes, 2026. URLhttps://arxiv.org/abs/2602.09153

  50. [51]

    Holodeck: Language guided generation of 3d embodied ai environments

    Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied ai environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog...

  51. [52]

    Pink: Python inverse kinematics based on Pinocchio, 2026

    Stéphane Caron, Yann De Mont-Marin, Rohan Budhiraja, Seung Hyeon Bang, Ivan Domrachev, Simeon Nedelchev, Peter Du, Adrien Escande, Joris Vaillant, Bruce Wingo, Santosh Patapati, Daniel San José Pro, and Nicolas Guillermo Marticorena Vidal. Pink: Python inverse kinematics based on Pinocchio, 2026. URL https://github.com/stephane-caron/pink

  52. [53]

    HOMIE: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

    Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. HOMIE: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

  53. [54]

    Agile: A comprehensive workflow for humanoid loco-manipulation learning, 2026

    Huihua Zhao*, Rafael Cathomen*, Lionel Gulich, Wei Liu, Efe Arda Ongan, Michael Lin, Shalin Jain, Soha Pouya, and Yan Chang. Agile: A comprehensive workflow for humanoid loco-manipulation learning, 2026. URL https://arxiv.org/abs/2603.20147

  54. [55]

    The dynamic window approach to collision avoidance

    Dieter Fox, Wolfram Burgard, and Sebastian Thrun. The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine, 4(1):23–33, 1997

  55. [56]

    Gonzalez, Clark Barrett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. InAdvances in Neural Information Processing Systems, 2024

  56. [57]

    MindCube: Spatial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025

    Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. MindCube: Spatial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025. 56

  57. [58]

    Phys4D: Fine-grained physics-consistent 4D modeling from video diffusion.arXiv preprint arXiv:2603.03485, 2026

    Haoran Lu, Shang Wu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, and Han Liu. Phys4D: Fine-grained physics-consistent 4D modeling from video diffusion.arXiv preprint arXiv:2603.03485, 2026

  58. [59]

    Wenzhen Yuan, Siyuan Dong, and Edward H. Adelson. GelSight: High-resolution robot tactile sensors for estimating geometry and force.Sensors, 17(12):2762, 2017. doi: 10.3390/s17122762

  59. [60]

    Taxim: An example-based simulation model for GelSight tactile sensors.IEEE Robotics and Automation Letters, 7(2):2361–2368, 2022

    Zilin Si and Wenzhen Yuan. Taxim: An example-based simulation model for GelSight tactile sensors.IEEE Robotics and Automation Letters, 7(2):2361–2368, 2022

  60. [61]

    TacSL: A library for visuotactile sensor simulation and learning.arXiv preprint arXiv:2408.06506, 2024

    Iretiayo Akinola, Jie Xu, Jan Carius, Dieter Fox, and Yashraj Narang. TacSL: A library for visuotactile sensor simulation and learning.arXiv preprint arXiv:2408.06506, 2024

  61. [62]

    FlexiTac: A low-cost, open-source, scalable tactile sensing solution for robotic systems.arXiv preprint arXiv:2604.28156, 2026

    Binghao Huang and Yunzhu Li. FlexiTac: A low-cost, open-source, scalable tactile sensing solution for robotic systems.arXiv preprint arXiv:2604.28156, 2026

  62. [63]

    Tacmap: Bridging the tactile sim-to-real gap via geometry-consistent penetration depth map. arXiv preprint arXiv:2602.21625, 2026

    Lei Su, Zhijie Peng, Renyuan Ren, Shengping Mao, Juan Du, Kaifeng Zhang, and Xuezhou Zhu. Tacmap: Bridging the tactile sim-to-real gap via geometry-consistent penetration depth map. arXiv preprint arXiv:2602.21625, 2026

  63. [64]

    AnnotateAnything: Automatic annotation of 3D assets for robot manipulation, 2026

    AnnotateAnything Team. AnnotateAnything: Automatic annotation of 3D assets for robot manipulation, 2026. Companion paper, under review. Citation to be updated upon publication

  64. [65]

    Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai, Yuheng Cai, Ruisheng Chen, Kai Chen, Xi Chen, Zesen Cheng, Lianghao Deng, Wenyu Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  65. [66]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents. Official release post, February 2026. URL https://www.alibabacloud.com/blog/602894. Accessed 2026-06-10

  66. [67]

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  67. [68]

    P3-SAM: Native 3D part segmentation. arXiv preprint arXiv:2509.06784, 2025

    Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, and Chunchao Guo. P3-SAM: Native 3D part segmentation. arXiv preprint arXiv:2509.06784, 2025

  68. [69]

    X-Part: High fidelity and structure coherent shape decomposition. arXiv preprint arXiv:2509.08643, 2025

    Xinhao Yan, Jiachen Xu, Yang Li, Changfeng Ma, Yunhan Yang, Chunshi Wang, Zibo Zhao, Zeqiang Lai, Yunfei Zhao, Zhuo Chen, et al. X-Part: High fidelity and structure coherent shape decomposition. arXiv preprint arXiv:2509.08643, 2025

  69. [70]

    NVIDIA Isaac Sim documentation

    NVIDIA. NVIDIA Isaac Sim documentation. https://docs.isaacsim.omniverse.nvidia.com, 2025. Accessed 2026-06-10

  70. [71]

    ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, Yuan Gao, Xuanlin Li, Tongzhou Mu, Nan Xiao, Arnav Gurha, Viswesh Nagaswamy Rajesh, Yong Woo Choi, Yen-Ru Chen, Zhiao Huang, Roberto Calandra, Rui Chen, Shan Luo, and Hao Su. ManiSkill3: GPU parallelized robotics simulation and r...

  71. [72]

    Turner, Oleksandr Maksymets, Zsolt Kira, Mrinal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Akshara Rai, and Roozbeh Mottaghi

    Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, Vladimír Vondruš, Theophile Gervet, Vincent-Pierre Berges, John M. Turner, Oleksandr Maksymets, Zsolt Kira, Mrinal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, ...

  72. [73]

    Learning to walk in minutes using massively parallel deep reinforcement learning

    Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Proceedings of the 5th Conference on Robot Learning (CoRL), volume 164 of Proceedings of Machine Learning Research, pages 91–100. PMLR, 2022

  73. [74]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), 2023. arXiv:2304.13705

  74. [75]

    DexMimicGen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. DexMimicGen: Automated data generation for bimanual dexterous manipulation via imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), 2025. arXiv:2410.24185

  75. [76]

    SkillMimicGen: Automated demonstration generation for efficient skill learning and deployment

    Caelan Garrett, Ajay Mandlekar, Bowen Wen, and Dieter Fox. SkillMimicGen: Automated demonstration generation for efficient skill learning and deployment. In Conference on Robot Learning (CoRL), 2024. arXiv:2410.18907

  76. [77]

    Softmimicgen: A data generation system for scalable robot learning in deformable object manipulation. arXiv preprint arXiv:2603.25725, 2026

    Masoud Moghani, Mahdi Azizian, Animesh Garg, Yuke Zhu, Sean Huver, and Ajay Mandlekar. Softmimicgen: A data generation system for scalable robot learning in deformable object manipulation. arXiv preprint arXiv:2603.25725, 2026

  77. [78]

    RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. RoboTwin 2.0: A scalable d...

  78. [79]

    GenSim: Generating robotic simulation tasks via large language models

    Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, and Xiaolong Wang. GenSim: Generating robotic simulation tasks via large language models. In International Conference on Learning Representations (ICLR), 2024. arXiv:2310.01361

  79. [80]

    GenSim2: Scaling robot data generation with multi-modal and reasoning LLMs

    Pu Hua, Minghuan Liu, Annabella Macaluso, Yunfeng Lin, Weinan Zhang, Huazhe Xu, and Lirui Wang. GenSim2: Scaling robot data generation with multi-modal and reasoning LLMs. In Conference on Robot Learning (CoRL),

  80. [81]

    RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation. arXiv preprint arXiv:2311.01455, 2023

    Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation. arXiv preprint arXiv:2311.01455, 2023
