pith. machine review for the scientific record.

arxiv: 2603.15759 · v2 · submitted 2026-03-16 · 💻 cs.RO · cs.AI · cs.LG

Recognition: no theorem link

Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:45 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.LG
keywords simulation distillation · world models · sim-to-real · robot adaptation · latent dynamics · online planning · contact-rich tasks

The pith

Simulation Distillation pretrains world models in simulation so that only the latent dynamics need updating for fast real-world robot adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Simulation Distillation to solve the problem of training robotic world models without impractical amounts of real-world data. It pretrains the full world model in a physics simulator to extract structural priors, then transfers the encoder, reward model, and value function to the real robot while adapting only the latent dynamics model using real-world prediction losses. This setup turns real-world adaptation into a supervised system identification task that still supports dense long-horizon planning. Sympathetic readers would care because it enables reliable online improvement in challenging contact-rich manipulation and quadruped tasks where end-to-end methods stall or degrade.
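
To make the division of labor concrete, here is a minimal sketch of the transfer step in PyTorch, assuming hypothetical attribute names (encoder, reward_head, value_head, dynamics) that the later sketches reuse; the paper's actual interfaces may differ.

    import torch

    # Hedged sketch: freeze the components transferred from simulation and
    # leave only the latent dynamics model trainable on real-world data.
    def prepare_for_real_world_adaptation(world_model, lr=1e-4):
        for module in (world_model.encoder,
                       world_model.reward_head,
                       world_model.value_head):
            for p in module.parameters():
                p.requires_grad_(False)  # transferred as-is from simulation
        # Only the latent dynamics parameters receive gradient updates.
        return torch.optim.Adam(world_model.dynamics.parameters(), lr=lr)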

Core claim

By distilling priors from simulators into a world model, SimDist allows the encoder, reward model, and value function to transfer directly from simulation to real observations. Only the latent dynamics are updated online with prediction losses on real data. This preserves planning signals for counterfactual reasoning and leads to rapid performance gains across tasks, unlike prior adaptation approaches that struggle or degrade.

What carries the argument

The transferred encoder, reward model, and value function, combined with an adaptable latent dynamics model that is updated solely via real-world prediction losses.
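
A hedged sketch of that update, assuming real transitions (obs, action, next_obs) and the hypothetical interfaces above; the section does not spell out the exact loss, so treat this as an illustration of supervised system identification rather than the released training code.

    import torch

    # One adaptation step on a batch of real transitions. The frozen encoder
    # supplies both the input latent and the prediction target, so the update
    # is purely supervised: no rewards or returns are needed.
    def adaptation_step(world_model, optimizer, obs, action, next_obs):
        with torch.no_grad():
            z = world_model.encoder(obs)              # frozen encoder
            z_target = world_model.encoder(next_obs)  # prediction target
        z_pred = world_model.dynamics(z, action)      # trainable dynamics
        loss = torch.nn.functional.mse_loss(z_pred, z_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()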

If this is right

  • World models can be adapted to real robots by supervised updates to dynamics only.
  • Online planning becomes feasible with limited real-world experience in long-horizon tasks (see the planning sketch after this list).
  • Contact-rich manipulation and locomotion improve rapidly during real-world finetuning.
  • Prior methods that update more components tend to degrade rather than improve.
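
The planning sketch referenced above: a random-shooting planner (the paper's sampler may be more sophisticated, cf. Figure 7d) in which the frozen reward and value heads score latent rollouts produced by the adapted dynamics.

    import torch

    # Score candidate action sequences by rolling them out in latent space,
    # summing predicted rewards and adding a terminal value estimate.
    def plan(world_model, obs, num_candidates=256, horizon=8):
        with torch.no_grad():
            z = world_model.encoder(obs).expand(num_candidates, -1)
            actions = torch.randn(num_candidates, horizon,
                                  world_model.action_dim)
            returns = torch.zeros(num_candidates)
            for t in range(horizon):
                z = world_model.dynamics(z, actions[:, t])
                returns += world_model.reward_head(z).squeeze(-1)
            returns += world_model.value_head(z).squeeze(-1)  # frozen value
        return actions[returns.argmax(), 0]  # execute first action, replan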

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the approach to tasks with larger visual domain gaps might require partial encoder updates.
  • Similar distillation could apply to other model-based methods beyond world models.
  • Testing on more diverse real-world scenarios would show if prediction losses alone suffice broadly.

Load-bearing premise

The components pretrained in simulation, specifically the encoder, reward model, and value function, stay effective on real-world observations without any retraining.

What would settle it

Observing that real-world performance stops improving or worsens after several adaptation episodes when only the latent dynamics are updated, while methods that update the full model succeed.

Figures

Figures reproduced from arXiv: 2603.15759 by Abhishek Gupta, David Fridovich-Keil, Dong-Ki Kim, Fernando Palafox, Jacob Levy, Kevin Huang, Patrick Yin, Shayegan Omidshafiei, Tyler Westenbroek.

Figure 1
Figure 1. Failures of zero-shot sim-to-real policies (left). Our framework SimDist rapidly overcomes the dynamics gap and improves performance with minimal real-world interaction. We demonstrate substantial gains in task execution on both precise manipulation and quadrupedal locomotion tasks with only 15-30 minutes of real-world data, substantially outperforming baselines. view at source ↗
Figure 2
Figure 2. SimDist overview. 1) An expert policy, policy checkpoints, and a value function are trained in simulation using privileged state. 2) Large-scale training data are generated by combining expert and sub-optimal policies with contiguous action perturbations, yielding diverse trajectories with dense reward and value supervision. 3) A planning-oriented latent world model is pretrained on this data, learning rep… view at source ↗
Figure 3
Figure 3. World model architecture. The most recent observation is encoded into a latent representation while a history encoder processes a history of observations and actions. These jointly condition a transformer-based latent dynamics model that predicts future latent trajectories under candidate action sequences. Transformer-based reward and value heads evaluate predicted trajectories to produce reward and value … view at source ↗
Figure 4
Figure 4. Real-world results. Success rate for two manipulation tasks, computed over 20 trials, and average forward progress for two quadruped locomotion tasks, averaged across all 15 trials (3 speeds, 5 trials each), as a function of real-world finetuning data. SimDist exhibits rapid and consistent improvement with limited data by finetuning only the latent dynamics model while planning with frozen reward and value… view at source ↗
Figure 5
Figure 5. Value predictions from SimDist along successful and failed real-world Peg trajectories starting from the same initial condition. The predicted values track task progress and clearly distinguish successful from failed trajectories. view at source ↗
Figure 6
Figure 6. Scatter plot showing successful and failed attempts at solving the Peg Wide task for Diffusion Policy (right) and the final trained policy for SimDist (left). The broad coverage of pretraining data for SimDist enables efficiently learning policies which are far more robust than baselines. view at source ↗
Figure 7
Figure 7. (a) Finetuning drastically lowers dynamics prediction loss during a quadruped Slippery Slope trial. (b) Overlaid frames showing the front left foot slipping during the trial. (c) Foot-trajectory predictions from the world model at the same instant: the finetuned model correctly anticipates future slippage, while the pretrained model fails to do so. (d) Visualization of sampling-based planning. Candidate ac… view at source ↗
Figure 8
Figure 8. Detailed world model architecture for the quadruped. Table VII (world model architectural parameters for the quadruped): embedding dimension 64; proprioceptive observations MLP hidden dims 128, 128; CNN kernel size 3; CNN strides 2, 2, 2; CNN features 8, 16, 32; all transformers MLP hidden size 256; dynamics transformer layers 2; dynamics transformer heads 8; reward transformer layers 1; reward trans… view at source ↗
read the original abstract

Robot learning requires adaptation methods that improve reliably from limited, mixed-quality interaction data. This is especially challenging in long-horizon, contact-rich tasks, where end-to-end policy finetuning remains inefficient and brittle. World models offer a compelling alternative: by predicting the outcomes of candidate action sequences, they enable online planning through counterfactual reasoning. However, training action-conditioned robotic world models directly in the real world requires diverse data at impractical scale. We introduce Simulation Distillation (SimDist), a framework that uses physics simulators as a scalable source of action-conditioned robot experience. During pretraining, SimDist distills structural priors from the simulator into a world model that enables planning from raw real-world observations. During real-world adaptation, SimDist transfers the encoder, reward model, and value function learned in simulation, and updates only the latent dynamics model using real-world prediction losses. This reduces adaptation to supervised system identification while preserving dense, long-horizon planning signals for online improvement. Across contact-rich manipulation and quadruped locomotion tasks, SimDist rapidly improves with experience, while prior adaptation methods struggle to make progress or degrade during online finetuning. Project website and code: https://sim-dist.github.io
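
For readers who want the moving parts of Figure 3 in one place, a hedged skeleton follows, with sizes loosely borrowed from the quadruped parameters in Figure 8 (embedding dimension 64, a 2-layer, 8-head dynamics transformer); the history encoder is omitted and every interface is an assumption, not the released architecture.

    import torch
    import torch.nn as nn

    class LatentDynamics(nn.Module):
        # Transformer that predicts the next latent from a (latent, action)
        # pair; the paper additionally conditions on an encoded history.
        def __init__(self, embed_dim=64, action_dim=12):
            super().__init__()
            self.action_proj = nn.Linear(action_dim, embed_dim)
            layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=8, dim_feedforward=256,
                batch_first=True)
            self.core = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, z, action):
            x = torch.stack([z, self.action_proj(action)], dim=1)
            return self.core(x)[:, -1]

    class WorldModel(nn.Module):
        # The encoder and reward/value heads are pretrained in simulation
        # and frozen at transfer; only `dynamics` is adapted on real data.
        def __init__(self, obs_dim, action_dim, embed_dim=64):
            super().__init__()
            self.action_dim = action_dim
            self.encoder = nn.Sequential(
                nn.Linear(obs_dim, 128), nn.ReLU(),
                nn.Linear(128, embed_dim))
            self.dynamics = LatentDynamics(embed_dim, action_dim)
            self.reward_head = nn.Linear(embed_dim, 1)
            self.value_head = nn.Linear(embed_dim, 1)

With this skeleton the earlier sketches compose: prepare_for_real_world_adaptation freezes everything except dynamics, adaptation_step fits the dynamics to real transitions, and plan scores rollouts with the frozen heads.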

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Simulation Distillation (SimDist), a framework that pretrains an action-conditioned world model in physics simulation to distill structural priors, then transfers the encoder, reward model, and value function to the real world while updating only the latent dynamics model via supervised prediction losses on real interaction data. This is claimed to enable rapid online improvement via planning in long-horizon contact-rich tasks such as manipulation and quadruped locomotion, outperforming prior adaptation methods that struggle or degrade.

Significance. If the empirical claims hold with rigorous validation, the approach could meaningfully advance sim-to-real transfer for model-based planning by reducing adaptation to system identification while retaining dense planning signals. This addresses data-efficiency bottlenecks in contact-rich robotics where end-to-end finetuning is brittle.

major comments (3)
  1. [Abstract and Experiments] The central claim of rapid improvement across tasks while prior methods struggle or degrade is stated without any quantitative metrics, error bars, ablation studies, baseline comparisons, or experimental protocols. This prevents assessment of whether the data support the superiority assertion.
  2. [Method, adaptation phase] The load-bearing assumption that the fixed sim-pretrained encoder, reward model, and value function produce usable signals on real observations without retraining is not supported by described ablations or transfer-success metrics. In contact-rich regimes, sim-to-real gaps in perception or contact dynamics typically invalidate such fixed components, yet only the latent dynamics are updated.
  3. [Adaptation and Planning] The assertion that real-world prediction losses alone suffice to adapt the latent dynamics for successful long-horizon planning assumes the transferred reward/value remain aligned with real outcomes. No concrete test (e.g., planning success with frozen vs. adapted components) is reported to address the risk of optimizing against misaligned objectives.
minor comments (1)
  1. [Abstract] The provision of a project website and code link is a positive step toward reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas to strengthen the clarity and rigor of our empirical claims. We address each major comment below and will incorporate revisions to provide additional quantitative support, ablations, and validation experiments.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The central claim of rapid improvement across tasks while prior methods struggle or degrade is stated without any quantitative metrics, error bars, ablation studies, baseline comparisons, or experimental protocols. This prevents assessment of whether the data support the superiority assertion.

    Authors: The Experiments section reports results via performance curves with error bars across tasks, including direct comparisons to prior adaptation baselines and ablations on adaptation components. To make these claims more explicit, we will revise the abstract to include key quantitative metrics (e.g., success rates and adaptation speed) and add a summary table of baseline comparisons with error bars. We will also expand the experimental protocols subsection to detail data collection, evaluation metrics, and hyperparameters. revision: yes

  2. Referee: [Method, adaptation phase] The load-bearing assumption that the fixed sim-pretrained encoder, reward model, and value function produce usable signals on real observations without retraining is not supported by described ablations or transfer-success metrics. In contact-rich regimes, sim-to-real gaps in perception or contact dynamics typically invalidate such fixed components, yet only the latent dynamics are updated.

    Authors: We agree this assumption requires stronger validation. The manuscript already includes ablations showing that full finetuning degrades performance while updating only the dynamics succeeds. We will add explicit transfer-success metrics (e.g., encoder and reward prediction accuracy on held-out real data) and an ablation comparing planning performance when freezing versus updating the reward and value functions; a minimal version of such a check is sketched after these responses. These additions will be placed in the revised Experiments section. revision: yes

  3. Referee: [Adaptation and Planning] The assertion that real-world prediction losses alone suffice to adapt the latent dynamics for successful long-horizon planning assumes the transferred reward/value remain aligned with real outcomes. No concrete test (e.g., planning success with frozen vs. adapted components) is reported to address the risk of optimizing against misaligned objectives.

    Authors: We recognize the value of directly testing alignment. While overall task success with SimDist provides indirect evidence, we will add a concrete experiment reporting planning success rates using frozen versus adapted dynamics (with reward/value held fixed). This will quantify any misalignment risk and be included in the revised manuscript. revision: yes
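
A plain sketch of the transfer-success check promised in response 2, assuming held-out real observations with ground-truth rewards and the hypothetical interfaces from the sketches above; the frozen-vs-adapted planning comparison of response 3 would wrap the same model in real trial rollouts.

    import torch

    # How well do the frozen, sim-pretrained components predict held-out
    # real-world rewards? Low error supports the load-bearing premise.
    def reward_transfer_error(world_model, real_obs, real_rewards):
        with torch.no_grad():
            z = world_model.encoder(real_obs)              # frozen encoder
            pred = world_model.reward_head(z).squeeze(-1)  # frozen head
        return torch.nn.functional.mse_loss(pred, real_rewards).item()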

Circularity Check

0 steps flagged

No circularity in derivation chain; method is standard transfer with independent adaptation

full rationale

The paper presents Simulation Distillation as pretraining a world model (encoder, reward, value, dynamics) in simulation, then transferring the encoder/reward/value while adapting only the latent dynamics via standard supervised prediction losses on real data. This reduces adaptation to system identification without any self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations in the abstract or described chain. No equations or steps reduce outputs to inputs by construction; the approach relies on external simulation data and real-world losses as independent signals, making the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that simulation-derived structural priors transfer usefully to real observations and that dynamics-only updates suffice for planning; no free parameters or invented entities are explicitly named in the abstract.

axioms (1)
  • domain assumption: Physics simulators supply structural priors that remain useful after transfer to real robot observations
    Invoked in the pretraining and transfer steps described in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1256 out tokens · 58036 ms · 2026-05-15T09:45:47.899076+00:00 · methodology

discussion (0)

