arxiv: 2606.21406 · v1 · pith:DGE2GOMPnew · submitted 2026-06-19 · 💻 cs.RO · cs.CV

Robot Self-Improvement via Human-Video Dynamics Models

Pith reviewed 2026-06-26 14:38 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords robot self-improvement · human video priors · dynamics models · action correction · embodiment transfer · manipulation tasks · policy improvement · failure-based learning

The pith

Human videos supply transferable models that let robots correct their own failures and improve policies without extra data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that human videos can supply more than initial policies: they can yield action, dynamics, and value representations that remain useful when a robot attempts tasks itself. These representations then support a method that turns each robot failure into ranked corrective actions, which are used to update the policy. A sympathetic reader would care because combining abundant human video with cheap robot failure data could reduce the need for expensive robot-specific training collections. The result is demonstrated across seven real-world manipulation tasks on two robot platforms, a mobile manipulator and a static arm, where success rates rise from 40 percent to 81 percent.

Core claim

Human videos can be used to learn embodiment-agnostic action, dynamics, and value representations that transfer across robot embodiments, providing the predictive foundation required for robots to autonomously improve from their own rollouts and failures. We introduce Dynamics-Guided Action Correction (DGAC), a training-free approach that leverages these adapted models to repair failed states: each failure becomes a query for which the learned models propose and rank corrective actions, turning failures into supervision for the next policy update. Across seven real-world manipulation tasks spanning both a mobile manipulator and a static manipulator arm, our approach improves success rates from 40% to 81% across multiple policy backbones.

What carries the argument

Dynamics-Guided Action Correction (DGAC), a training-free procedure that queries the human-video models to propose and rank corrective actions for each observed robot failure state.
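
The propose-and-rank step can be illustrated with a minimal Python sketch: sample candidate actions at a failed state, roll each out with a dynamics model, and rank the predicted outcomes with a value model. Everything concrete here is an assumption for illustration and is not taken from the paper: the 2D point state, the toy `toy_dynamics` and `toy_value` functions, Gaussian candidate sampling, and the names `propose_corrections`, `num_candidates`, `horizon`, and `noise_scale`. In the paper the dynamics and value models are learned from human video.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[float, float]
Action = Tuple[float, float]


def propose_corrections(
    failed_state: State,
    dynamics: Callable[[State, Action], State],
    value: Callable[[State], float],
    num_candidates: int = 64,
    horizon: int = 3,
    noise_scale: float = 0.5,
    seed: int = 0,
) -> List[Tuple[float, Action]]:
    """Sample candidate actions, roll each out with the dynamics model,
    score the predicted final state with the value model, and return
    (score, action) pairs sorted best first."""
    rng = random.Random(seed)
    scored = []
    for _ in range(num_candidates):
        action = (rng.gauss(0.0, noise_scale), rng.gauss(0.0, noise_scale))
        state = failed_state
        for _ in range(horizon):  # apply the same action at every step
            state = dynamics(state, action)
        scored.append((value(state), action))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical stand-ins for the learned models (assumptions, not the paper's).
GOAL = (1.0, 1.0)


def toy_dynamics(state: State, action: Action) -> State:
    return (state[0] + action[0], state[1] + action[1])


def toy_value(state: State) -> float:
    # Higher is better: negative Euclidean distance to the goal.
    return -(((state[0] - GOAL[0]) ** 2 + (state[1] - GOAL[1]) ** 2) ** 0.5)


ranked = propose_corrections((0.0, 0.0), toy_dynamics, toy_value)
best_score, best_action = ranked[0]
```

The top-ranked action is the candidate correction for that failure state. How many candidates are sampled, and from which distribution, is a design choice this sketch does not settle.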

If this is right

  • Robot policies can be improved iteratively using only their own failed rollouts and the fixed human-video models.
  • The same human-video representations work for both mobile and fixed-base manipulators without embodiment-specific changes.
  • Multiple different policy backbones receive large gains from the same correction procedure.
  • Self-improvement no longer requires fresh robot data collection or model retraining after the initial human-video stage.
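
The iterative loop implied by these points can be sketched in Python under strong simplifying assumptions. The 1D state, the single policy parameter `theta`, the frozen `model_step` and `model_value` functions, and the least-squares update are all hypothetical stand-ins chosen so the example runs on its own; they are not the paper's models or update rule.

```python
import random

GOAL, TOLERANCE = 1.0, 0.1


def model_step(state, action):
    # Frozen dynamics model (toy stand-in); never updated inside the loop.
    return state + action


def model_value(state):
    # Frozen value model (toy stand-in): higher means closer to the goal.
    return -abs(state - GOAL)


def env_step(state, action):
    # "Real" robot step. It matches model_step here, which is exactly the
    # transfer assumption the approach depends on.
    return state + action


def policy_action(theta, state):
    return theta - state


def correct(theta, state, rng, num_candidates=32, noise=0.3):
    # Propose perturbations of the failed action and keep the one the
    # frozen models rank highest; the original action stays in the pool.
    base = policy_action(theta, state)
    candidates = [base] + [base + rng.gauss(0.0, noise) for _ in range(num_candidates)]
    return max(candidates, key=lambda a: model_value(model_step(state, a)))


def self_improve(theta, starts, rounds=5, seed=0):
    rng = random.Random(seed)
    failures_per_round = []
    for _ in range(rounds):
        failed = [s for s in starts
                  if abs(env_step(s, policy_action(theta, s)) - GOAL) > TOLERANCE]
        failures_per_round.append(len(failed))
        if not failed:
            break
        # Corrected actions become supervision: least-squares fit of theta
        # for the policy form action = theta - state.
        targets = [s + correct(theta, s, rng) for s in failed]
        theta = sum(targets) / len(targets)
    return theta, failures_per_round


starts = [-0.5 + 0.05 * i for i in range(21)]
theta_final, failures = self_improve(theta=0.4, starts=starts)
```

Only the policy parameter changes between rounds. The dynamics and value functions stay fixed, mirroring the claim that no model retraining happens after the human-video stage.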

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robots could keep refining behaviors in homes or factories by treating everyday failures as new supervision signals.
  • The approach suggests shifting data collection emphasis from robot demonstrations toward large-scale human video archives.
  • Similar failure-to-correction loops might appear in other embodied domains such as mobile navigation once suitable video priors exist.

Load-bearing premise

The action, dynamics, and value representations extracted from human videos remain accurate enough on robot bodies to identify and rank useful corrective actions without any robot-specific retraining.

What would settle it

Running DGAC on a new robot embodiment or task and finding that success rates stay flat or drop compared with the uncorrected policy.
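
One hedged way to make this test concrete is a two-proportion z-test on success counts with and without DGAC. In the Python sketch below, the trial counts (100 per condition) are hypothetical placeholders; only the 40% and 81% rates come from the abstract. The function name `two_proportion_z` is illustrative.

```python
import math


def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Return (z, two_sided_p) for the null hypothesis that both
    conditions share one underlying success rate."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0.0:
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: uncorrected policy vs. policy after DGAC.
z, p = two_proportion_z(40, 100, 81, 100)
# A flat outcome, as in the falsifying scenario described above.
z_flat, p_flat = two_proportion_z(40, 100, 40, 100)
```

A z near zero or below zero on a new embodiment or task would correspond to the flat-or-dropping outcome. Real per-task trial counts would replace the placeholders.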

Figures

Figures reproduced from arXiv: 2606.21406 by Anran Zhang, Daniel Cremers, Hanzhi Chen, Kejia Chen, Oier Mees, Shi Chen, Simon Schaefer, Stefan Leutenegger.

Figure 1: We learn transferable models from human videos, enabling different robots to ground human priors through real-world rollouts and improve from failures. Humans rarely master physical skills from a single successful demonstration. Instead, they learn through observation, practice, and failure: watching others interact with the world, attempting the task, recognizing mistakes, and adjusting future behavior. …
Figure 2: Overview of our framework. We first pretrain reusable policy, dynamics, and value models from …
Figure 3: Given a failed state, DGAC samples candidate actions (left), rolls them out with the dynamics model, …
Figure 4: We design seven real-world manipulation tasks across two robot platforms, covering reaching, grasp …
Figure 5: DGAC repairs failure-inducing actions from the original policy (top) by steering predicted futures …
Figure 6: Ablation of human-video pre-training scale
Original abstract

A central question in robot learning is how to acquire skills from the kinds of data that humans learn from: passive observation, embodied practice, and the experience of failure. Human videos provide the first of these in abundance, and prior work has shown they can initialize useful policies. Far less clear is whether they can support the second and third: whether priors extracted from human videos can ground a robot's own attempts well enough to evaluate them, correct them, and improve from them. In this work, we show that human videos can be used to learn embodiment-agnostic action, dynamics, and value representations that transfer across robot embodiments, providing the predictive foundation required for robots to autonomously improve from their own rollouts and failures. We introduce Dynamics-Guided Action Correction (DGAC), a training-free approach that leverages these adapted models to repair failed states: each failure becomes a query for which the learned models propose and rank corrective actions, turning failures into supervision for the next policy update. Across seven real-world manipulation tasks spanning both a mobile manipulator and a static manipulator arm, our approach improves success rates from 40% to 81% across multiple policy backbones, demonstrating cross-embodiment robot self-improvement from human-video priors. These results show that human priors and robot failures can be combined to enable scalable autonomous policy improvement. Project page: https://ethz-mrl.github.io/robot-self-improvement-website/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that human videos can be used to learn embodiment-agnostic action, dynamics, and value representations that transfer across robot embodiments. It introduces Dynamics-Guided Action Correction (DGAC), a training-free method that leverages these models to repair failed robot states by proposing and ranking corrective actions from failures, enabling autonomous policy improvement. Empirical results across seven real-world manipulation tasks on a mobile manipulator and a static arm show success rates rising from 40% to 81% across multiple policy backbones.

Significance. If the central empirical claim holds with rigorous validation, the work would demonstrate a scalable route to robot self-improvement that combines abundant passive human video data with robot-specific failures, reducing reliance on embodiment-specific training data.

major comments (2)
  1. [Abstract] The central claim of reliable cross-embodiment transfer and corrective-action ranking rests on the assertion that human-video models remain accurate on robot states without adaptation, yet the abstract supplies no methods, observation-alignment procedure, or failure analysis. This makes it impossible to judge whether the 40%-to-81% improvement is sound or reflects post-hoc selection.
  2. [Abstract] The weakest assumption (human-video dynamics and value estimates suffice to rank corrective actions on robot states) is load-bearing for the self-improvement result; without explicit quantification of embodiment gap or ablation on alignment steps, it is unclear whether the reported gains are attributable to the claimed priors or to unstated robot-specific components.
minor comments (1)
  1. The project page is referenced but no quantitative results, code, or additional experimental details appear in the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. The comments highlight opportunities to strengthen the presentation of our cross-embodiment claims. We address each point below and will revise the abstract to improve clarity and evaluability while preserving its concise nature.

Point-by-point responses
  1. Referee: [Abstract] The central claim of reliable cross-embodiment transfer and corrective-action ranking rests on the assertion that human-video models remain accurate on robot states without adaptation, yet the abstract supplies no methods, observation-alignment procedure, or failure analysis, rendering the 40%-to-81% improvement impossible to evaluate for soundness or post-hoc selection.

    Authors: The abstract is a high-level summary; the full manuscript details the training-free DGAC procedure, the embodiment-agnostic representations, and the observation alignment steps used to apply human-video models to robot states. We agree the abstract can better support evaluation and will revise it to briefly reference the alignment procedure (e.g., visual feature matching) and note that results are aggregated across seven tasks and multiple policy backbones with no post-hoc selection. Full failure analysis and per-task breakdowns appear in Section 5. revision: yes

  2. Referee: [Abstract] The weakest assumption (human-video dynamics and value estimates suffice to rank corrective actions on robot states) is load-bearing for the self-improvement result; without explicit quantification of embodiment gap or ablation on alignment steps, it is unclear whether the reported gains are attributable to the claimed priors or to unstated robot-specific components.

    Authors: The manuscript includes ablations across policy backbones and comparisons with and without the human-video priors to attribute gains to the cross-embodiment models. The DGAC method is explicitly training-free on robot data. We will revise the abstract to state that no robot-specific fine-tuning or adaptation of the dynamics/value models occurs, and we will add a parenthetical reference to the alignment ablations in the main text. Explicit numerical quantification of the embodiment gap (e.g., prediction error deltas) is not currently reported and would require additional analysis. revision: partial
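
The "prediction error deltas" mentioned in this response could be computed along the following lines. This is a minimal Python sketch under stated assumptions: scalar states, a one-step dynamics model, and held-out `(state, action, next_state)` transitions from both human video and robot rollouts. The names `one_step_error` and `embodiment_gap` and the toy data are hypothetical, not from the manuscript.

```python
from statistics import mean


def one_step_error(model, transitions):
    """Mean absolute one-step prediction error over
    (state, action, next_state) triples."""
    return mean(abs(model(s, a) - s_next) for s, a, s_next in transitions)


def embodiment_gap(model, human_transitions, robot_transitions):
    """Compare the same frozen model's error on human vs. robot data."""
    human_err = one_step_error(model, human_transitions)
    robot_err = one_step_error(model, robot_transitions)
    return {"human": human_err, "robot": robot_err, "delta": robot_err - human_err}


# Toy stand-ins: the model is exact on "human" data, while the "robot"
# consistently moves 0.1 short of what the model predicts.
def toy_model(state, action):
    return state + action


human = [(0.0, 1.0, 1.0), (1.0, 0.5, 1.5)]
robot = [(0.0, 1.0, 0.9), (1.0, 0.5, 1.4)]
gap = embodiment_gap(toy_model, human, robot)
```

A large positive `delta` would indicate that the frozen model degrades on robot states, which is the quantity the referee asks to see reported.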

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The manuscript text (abstract plus full description) presents an empirical method (DGAC) and reports success-rate improvements on real-world tasks. No equations, derivations, fitted parameters, or mathematical claims appear that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The central claim is an observed performance gain from human-video priors transferred without embodiment-specific adaptation; this is framed as an experimental outcome rather than a closed-form derivation that collapses to its inputs by construction. No load-bearing steps matching the enumerated circularity patterns are present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, modeling choices, or dataset details from which free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5805 in / 1119 out tokens · 31066 ms · 2026-06-26T14:38:17.195694+00:00 · methodology
