pith. machine review for the scientific record.

arxiv: 2603.15759 · v2 · submitted 2026-03-16 · 💻 cs.RO · cs.AI · cs.LG

Recognition: no theorem link

Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:45 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.LG
keywords simulation distillation · world models · sim-to-real · robot adaptation · latent dynamics · online planning · contact-rich tasks

The pith

Simulation Distillation pretrains world models in simulation so that only the latent dynamics need updating for fast real-world robot adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Simulation Distillation to solve the problem of training robotic world models without impractical amounts of real-world data. It pretrains the full world model in a physics simulator to extract structural priors, then transfers the encoder, reward model, and value function to the real robot while adapting only the latent dynamics model using real-world prediction losses. This setup turns real-world adaptation into a supervised system identification task that still supports dense long-horizon planning. Sympathetic readers would care because it enables reliable online improvement in challenging contact-rich manipulation and quadruped tasks where end-to-end methods stall or degrade.
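
To make the division of labor concrete, here is a minimal sketch of the transfer step in PyTorch, assuming hypothetical attribute names (encoder, reward_head, value_head, dynamics) that the later sketches reuse; the paper's actual interfaces may differ.

    import torch

    # Hedged sketch: freeze the components transferred from simulation and
    # leave only the latent dynamics model trainable on real-world data.
    def prepare_for_real_world_adaptation(world_model, lr=1e-4):
        for module in (world_model.encoder,
                       world_model.reward_head,
                       world_model.value_head):
            for p in module.parameters():
                p.requires_grad_(False)  # transferred as-is from simulation
        # Only the latent dynamics parameters receive gradient updates.
        return torch.optim.Adam(world_model.dynamics.parameters(), lr=lr)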

Core claim

By distilling priors from simulators into a world model, SimDist allows the encoder, reward model, and value function to transfer directly from simulation to real observations. Only the latent dynamics are updated online with prediction losses on real data. This preserves planning signals for counterfactual reasoning and leads to rapid performance gains across tasks, unlike prior adaptation approaches that struggle or degrade.

What carries the argument

The transferred encoder, reward model, and value function, combined with an adaptable latent dynamics model that is updated solely via real-world prediction losses.
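
A hedged sketch of that update, assuming real transitions (obs, action, next_obs) and the hypothetical interfaces above; the section does not spell out the exact loss, so treat this as an illustration of supervised system identification rather than the released training code.

    import torch

    # One adaptation step on a batch of real transitions. The frozen encoder
    # supplies both the input latent and the prediction target, so the update
    # is purely supervised: no rewards or returns are needed.
    def adaptation_step(world_model, optimizer, obs, action, next_obs):
        with torch.no_grad():
            z = world_model.encoder(obs)              # frozen encoder
            z_target = world_model.encoder(next_obs)  # prediction target
        z_pred = world_model.dynamics(z, action)      # trainable dynamics
        loss = torch.nn.functional.mse_loss(z_pred, z_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()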

If this is right

  • World models can be adapted to real robots by supervised updates to dynamics only.
  • Online planning becomes feasible with limited real-world experience in long-horizon tasks (see the planning sketch after this list).
  • Contact-rich manipulation and locomotion improve rapidly during real-world finetuning.
  • Prior methods that update more components tend to degrade rather than improve.
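
The planning sketch referenced above: a random-shooting planner (the paper's sampler may be more sophisticated, cf. Figure 7d) in which the frozen reward and value heads score latent rollouts produced by the adapted dynamics.

    import torch

    # Score candidate action sequences by rolling them out in latent space,
    # summing predicted rewards and adding a terminal value estimate.
    def plan(world_model, obs, num_candidates=256, horizon=8):
        with torch.no_grad():
            z = world_model.encoder(obs).expand(num_candidates, -1)
            actions = torch.randn(num_candidates, horizon,
                                  world_model.action_dim)
            returns = torch.zeros(num_candidates)
            for t in range(horizon):
                z = world_model.dynamics(z, actions[:, t])
                returns += world_model.reward_head(z).squeeze(-1)
            returns += world_model.value_head(z).squeeze(-1)  # frozen value
        return actions[returns.argmax(), 0]  # execute first action, replan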

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the approach to tasks with larger visual domain gaps might require partial encoder updates.
  • Similar distillation could apply to other model-based methods beyond world models.
  • Testing on more diverse real-world scenarios would show if prediction losses alone suffice broadly.

Load-bearing premise

The components pretrained in simulation, specifically the encoder, reward model, and value function, stay effective on real-world observations without any retraining.

What would settle it

Observing that real-world performance stops improving or worsens after several adaptation episodes when only the latent dynamics are updated, while methods that update the full model succeed.

Figures

Figures reproduced from arXiv: 2603.15759 by Abhishek Gupta, David Fridovich-Keil, Dong-Ki Kim, Fernando Palafox, Jacob Levy, Kevin Huang, Patrick Yin, Shayegan Omidshafiei, Tyler Westenbroek.

Figure 1
Figure 1. Failures of zero-shot sim-to-real policies (left). Our framework SimDist rapidly overcomes the dynamics gap and improves performance with minimal real-world interaction. We demonstrate substantial gains in task execution on both precise manipulation and quadrupedal locomotion tasks with only 15-30 minutes of real-world data, substantially outperforming baselines. view at source ↗
Figure 2
Figure 2. SimDist overview. 1) An expert policy, policy checkpoints, and a value function are trained in simulation using privileged state. 2) Large-scale training data are generated by combining expert and sub-optimal policies with contiguous action perturbations, yielding diverse trajectories with dense reward and value supervision. 3) A planning-oriented latent world model is pretrained on this data, learning rep… view at source ↗
Figure 3
Figure 3. World model architecture. The most recent observation is encoded into a latent representation while a history encoder processes a history of observations and actions. These jointly condition a transformer-based latent dynamics model that predicts future latent trajectories under candidate action sequences. Transformer-based reward and value heads evaluate predicted trajectories to produce reward and value … view at source ↗
Figure 4
Figure 4. Real-world results. Success rate for two manipulation tasks, computed over 20 trials, and average forward progress for two quadruped locomotion tasks, averaged across all 15 trials (3 speeds, 5 trials each), as a function of real-world finetuning data. SimDist exhibits rapid and consistent improvement with limited data by finetuning only the latent dynamics model while planning with frozen reward and value… view at source ↗
Figure 5
Figure 5. Value predictions from SimDist along successful and failed real-world Peg trajectories starting from the same initial condition. The predicted values track task progress and clearly distinguish successful from failed trajectories. view at source ↗
Figure 6
Figure 6. Scatter plot showing successful and failed attempts at solving the Peg Wide task for Diffusion Policy (right) and the final trained policy for SimDist (left). The broad coverage of pretraining data for SimDist enables efficiently learning policies which are far more robust than baselines. view at source ↗
Figure 7
Figure 7. (a) Finetuning drastically lowers dynamics prediction loss during a quadruped Slippery Slope trial. (b) Overlaid frames showing the front left foot slipping during the trial. (c) Foot-trajectory predictions from the world model at the same instant: the finetuned model correctly anticipates future slippage, while the pretrained model fails to do so. (d) Visualization of sampling-based planning. Candidate ac… view at source ↗
Figure 8
Figure 8. Detailed world model architecture for the quadruped. Table VII (world model architectural parameters for the quadruped): embedding dimension 64; proprioceptive observations MLP hidden dims 128, 128; CNN kernel size 3; CNN strides 2, 2, 2; CNN features 8, 16, 32; all transformers MLP hidden size 256; dynamics transformer layers 2; dynamics transformer heads 8; reward transformer layers 1; reward trans… view at source ↗
read the original abstract

Robot learning requires adaptation methods that improve reliably from limited, mixed-quality interaction data. This is especially challenging in long-horizon, contact-rich tasks, where end-to-end policy finetuning remains inefficient and brittle. World models offer a compelling alternative: by predicting the outcomes of candidate action sequences, they enable online planning through counterfactual reasoning. However, training action-conditioned robotic world models directly in the real world requires diverse data at impractical scale. We introduce Simulation Distillation (SimDist), a framework that uses physics simulators as a scalable source of action-conditioned robot experience. During pretraining, SimDist distills structural priors from the simulator into a world model that enables planning from raw real-world observations. During real-world adaptation, SimDist transfers the encoder, reward model, and value function learned in simulation, and updates only the latent dynamics model using real-world prediction losses. This reduces adaptation to supervised system identification while preserving dense, long-horizon planning signals for online improvement. Across contact-rich manipulation and quadruped locomotion tasks, SimDist rapidly improves with experience, while prior adaptation methods struggle to make progress or degrade during online finetuning. Project website and code: https://sim-dist.github.io
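
For readers who want the moving parts of Figure 3 in one place, a hedged skeleton follows, with sizes loosely borrowed from the quadruped parameters in Figure 8 (embedding dimension 64, a 2-layer, 8-head dynamics transformer); the history encoder is omitted and every interface is an assumption, not the released architecture.

    import torch
    import torch.nn as nn

    class LatentDynamics(nn.Module):
        # Transformer that predicts the next latent from a (latent, action)
        # pair; the paper additionally conditions on an encoded history.
        def __init__(self, embed_dim=64, action_dim=12):
            super().__init__()
            self.action_proj = nn.Linear(action_dim, embed_dim)
            layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=8, dim_feedforward=256,
                batch_first=True)
            self.core = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, z, action):
            x = torch.stack([z, self.action_proj(action)], dim=1)
            return self.core(x)[:, -1]

    class WorldModel(nn.Module):
        # The encoder and reward/value heads are pretrained in simulation
        # and frozen at transfer; only `dynamics` is adapted on real data.
        def __init__(self, obs_dim, action_dim, embed_dim=64):
            super().__init__()
            self.action_dim = action_dim
            self.encoder = nn.Sequential(
                nn.Linear(obs_dim, 128), nn.ReLU(),
                nn.Linear(128, embed_dim))
            self.dynamics = LatentDynamics(embed_dim, action_dim)
            self.reward_head = nn.Linear(embed_dim, 1)
            self.value_head = nn.Linear(embed_dim, 1)

With this skeleton the earlier sketches compose: prepare_for_real_world_adaptation freezes everything except dynamics, adaptation_step fits the dynamics to real transitions, and plan scores rollouts with the frozen heads.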

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Simulation Distillation (SimDist), a framework that pretrains an action-conditioned world model in physics simulation to distill structural priors, then transfers the encoder, reward model, and value function to the real world while updating only the latent dynamics model via supervised prediction losses on real interaction data. This is claimed to enable rapid online improvement via planning in long-horizon contact-rich tasks such as manipulation and quadruped locomotion, outperforming prior adaptation methods that struggle or degrade.

Significance. If the empirical claims hold with rigorous validation, the approach could meaningfully advance sim-to-real transfer for model-based planning by reducing adaptation to system identification while retaining dense planning signals. This addresses data-efficiency bottlenecks in contact-rich robotics where end-to-end finetuning is brittle.

major comments (3)
  1. [Abstract and Experiments] The central claim of rapid improvement across tasks while prior methods struggle or degrade is stated without any quantitative metrics, error bars, ablation studies, baseline comparisons, or experimental protocols. This prevents assessment of whether the data support the superiority assertion.
  2. [Method, adaptation phase] The load-bearing assumption that the fixed sim-pretrained encoder, reward model, and value function produce usable signals on real observations without retraining is not supported by described ablations or transfer-success metrics. In contact-rich regimes, sim-to-real gaps in perception or contact dynamics typically invalidate such fixed components, yet only the latent dynamics are updated.
  3. [Adaptation and Planning] The assertion that real-world prediction losses alone suffice to adapt the latent dynamics for successful long-horizon planning assumes the transferred reward/value remain aligned with real outcomes. No concrete test (e.g., planning success with frozen vs. adapted components) is reported to address the risk of optimizing against misaligned objectives.
minor comments (1)
  1. [Abstract] The provision of a project website and code link is a positive step toward reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas to strengthen the clarity and rigor of our empirical claims. We address each major comment below and will incorporate revisions to provide additional quantitative support, ablations, and validation experiments.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The central claim of rapid improvement across tasks while prior methods struggle or degrade is stated without any quantitative metrics, error bars, ablation studies, baseline comparisons, or experimental protocols. This prevents assessment of whether the data support the superiority assertion.

    Authors: The Experiments section reports results via performance curves with error bars across tasks, including direct comparisons to prior adaptation baselines and ablations on adaptation components. To make these claims more explicit, we will revise the abstract to include key quantitative metrics (e.g., success rates and adaptation speed) and add a summary table of baseline comparisons with error bars. We will also expand the experimental protocols subsection to detail data collection, evaluation metrics, and hyperparameters. revision: yes

  2. Referee: [Method, adaptation phase] The load-bearing assumption that the fixed sim-pretrained encoder, reward model, and value function produce usable signals on real observations without retraining is not supported by described ablations or transfer-success metrics. In contact-rich regimes, sim-to-real gaps in perception or contact dynamics typically invalidate such fixed components, yet only the latent dynamics are updated.

    Authors: We agree this assumption requires stronger validation. The manuscript already includes ablations showing that full finetuning degrades performance while updating only the dynamics succeeds. We will add explicit transfer-success metrics (e.g., encoder and reward prediction accuracy on held-out real data) and an ablation comparing planning performance when freezing versus updating the reward and value functions; a minimal version of such a check is sketched after these responses. These additions will be placed in the revised Experiments section. revision: yes

  3. Referee: [Adaptation and Planning] The assertion that real-world prediction losses alone suffice to adapt the latent dynamics for successful long-horizon planning assumes the transferred reward/value remain aligned with real outcomes. No concrete test (e.g., planning success with frozen vs. adapted components) is reported to address the risk of optimizing against misaligned objectives.

    Authors: We recognize the value of directly testing alignment. While overall task success with SimDist provides indirect evidence, we will add a concrete experiment reporting planning success rates using frozen versus adapted dynamics (with reward/value held fixed). This will quantify any misalignment risk and be included in the revised manuscript. revision: yes
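
A plain sketch of the transfer-success check promised in response 2, assuming held-out real observations with ground-truth rewards and the hypothetical interfaces from the sketches above; the frozen-vs-adapted planning comparison of response 3 would wrap the same model in real trial rollouts.

    import torch

    # How well do the frozen, sim-pretrained components predict held-out
    # real-world rewards? Low error supports the load-bearing premise.
    def reward_transfer_error(world_model, real_obs, real_rewards):
        with torch.no_grad():
            z = world_model.encoder(real_obs)              # frozen encoder
            pred = world_model.reward_head(z).squeeze(-1)  # frozen head
        return torch.nn.functional.mse_loss(pred, real_rewards).item()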

Circularity Check

0 steps flagged

No circularity in derivation chain; method is standard transfer with independent adaptation

full rationale

The paper presents Simulation Distillation as pretraining a world model (encoder, reward, value, dynamics) in simulation, then transferring the encoder/reward/value while adapting only the latent dynamics via standard supervised prediction losses on real data. This reduces adaptation to system identification without any self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations in the abstract or described chain. No equations or steps reduce outputs to inputs by construction; the approach relies on external simulation data and real-world losses as independent signals, making the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that simulation-derived structural priors transfer usefully to real observations and that dynamics-only updates suffice for planning; no free parameters or invented entities are explicitly named in the abstract.

axioms (1)
  • domain assumption: Physics simulators supply structural priors that remain useful after transfer to real robot observations
    Invoked in the pretraining and transfer steps described in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1256 out tokens · 58036 ms · 2026-05-15T09:45:47.899076+00:00 · methodology

discussion (0)

