Recognition: 2 theorem links · Lean Theorem
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
Pith reviewed 2026-05-16 16:57 UTC · model grok-4.3
The pith
A world model pretrained on 44k hours of human videos transfers to robots with accurate physics and control after minimal fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DreamDojo learns diverse interactions and dexterous controls from 44k hours of egocentric human videos using continuous latent actions as unified proxy actions. After post-training on small-scale target robot data, the model shows strong understanding of physics and precise action controllability on multiple challenging out-of-distribution benchmarks. A distillation pipeline further accelerates the model to real-time operation at 10.81 FPS while enhancing context consistency.
What carries the argument
Continuous latent actions: unified proxy actions learned from unlabeled videos that bridge human data to robot control.
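To make the mechanism concrete: a small encoder infers a continuous latent action from a pair of consecutive frames, and the world model learns to predict the next frame conditioned on that latent, so unlabeled video effectively supplies its own action labels. Below is a minimal sketch under assumed interfaces; the module names, dimensions, and the simple MSE objective are illustrative placeholders, not the paper's architecture.

```python
# Hedged sketch of continuous latent actions as proxy controls (illustrative, not the paper's code).
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Infers a continuous latent action from features of two consecutive frames."""
    def __init__(self, feat_dim: int = 512, action_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.GELU(), nn.Linear(256, action_dim))

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([feat_t, feat_t1], dim=-1))

class WorldModel(nn.Module):
    """Predicts next-frame features from current features plus a (latent or robot) action."""
    def __init__(self, feat_dim: int = 512, action_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + action_dim, 512), nn.GELU(), nn.Linear(512, feat_dim))

    def forward(self, feat_t: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([feat_t, action], dim=-1))

# Pretraining on unlabeled video: the encoder supplies the action label the data lacks.
encoder, model = LatentActionEncoder(), WorldModel()
feat_t, feat_t1 = torch.randn(8, 512), torch.randn(8, 512)      # placeholder frame features
z = encoder(feat_t, feat_t1)                                     # continuous latent action
loss = nn.functional.mse_loss(model(feat_t, z), feat_t1)         # next-frame prediction objective
loss.backward()
```

In post-training, small amounts of labeled robot data would be mapped into or aligned with the same latent action interface, which is where the claimed human-to-robot transfer gets exercised.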
If this is right
- Supports live teleoperation of robots using the generative world model.
- Facilitates policy evaluation in simulated environments.
- Enables model-based planning for complex robotic tasks.
- Provides real-time inference at over 10 FPS after distillation.
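One rough way the planning and real-time applications above fit together: with a fast enough world model, a simple shooting planner can score imagined rollouts and replan every step. The sketch below assumes the toy `WorldModel` interface from the earlier snippet; the random-shooting strategy and goal-distance cost are placeholders, not the paper's planner.

```python
# Hedged sketch of model-based planning with a learned world model (random shooting; illustrative only).
import torch

@torch.no_grad()
def plan_one_step(world_model, feat_t, goal_feat, action_dim=32, horizon=8, n_candidates=256):
    """Sample latent-action sequences, roll each out in imagination, return the best first action."""
    best_cost, best_first_action = float("inf"), None
    for _ in range(n_candidates):
        actions = torch.randn(horizon, action_dim)          # one candidate latent-action sequence
        feat = feat_t
        for a in actions:                                    # imagined rollout, no real-robot interaction
            feat = world_model(feat, a.unsqueeze(0))
        cost = torch.linalg.norm(feat - goal_feat).item()    # placeholder goal-reaching cost
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action  # execute on the robot, observe the new frame, then replan (MPC style)
```

If each imagined frame of the full model costs on the order of 0.1 s (the reported ~10.81 FPS), a naive loop like this would need batched candidates or a short horizon to replan at frame rate; that budget is exactly what the distillation speedup is meant to ease.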
Where Pith is reading between the lines
- Scaling to even larger video corpora could further improve generalization across environments.
- The latent action representation might apply to non-robotic control domains with similar data constraints.
- Integration with existing robot policies could reduce the need for real-world trial-and-error learning.
Load-bearing premise
Latent actions derived from human videos can serve as effective proxies for robot actions without introducing domain gaps that impair accurate physics modeling.
What would settle it
Demonstrating that post-training fails to produce reliable predictions of contact-rich dynamics on out-of-distribution robot benchmarks would falsify the central claim.
Original abstract
Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DreamDojo, a foundation world model pretrained on the largest reported video dataset (44k hours of egocentric human videos) by learning continuous latent actions as unified proxy controls to overcome the lack of action labels. After post-training on small-scale target robot data, the model is claimed to exhibit strong physics understanding and precise action controllability on multiple challenging out-of-distribution benchmarks. A distillation pipeline is presented to accelerate inference to 10.81 FPS while improving context consistency, enabling applications including live teleoperation, policy evaluation, and model-based planning.
Significance. If the empirical transfer results hold under rigorous controls, the work would mark a meaningful step toward scalable generalist robot world models by showing that large-scale unlabeled human video can supply interaction priors that reduce reliance on robot-specific labeled data. The real-time distillation component adds practical value for deployment. However, the absence of isolated transfer metrics in the abstract leaves the magnitude of the advance difficult to gauge against prior video-pretrained world models.
major comments (3)
- [Abstract] Abstract: The central claim of 'strong understanding of physics and precise action controllability' on OOD benchmarks is unsupported by any quantitative metrics, error bars, baseline comparisons, or ablation results; this omission is load-bearing because the abstract is the only location where the post-training transfer performance is summarized.
- [§4] §4 (Evaluation): No quantitative isolation experiment (e.g., forward-dynamics prediction error or controllability success rate with vs. without the 44k-hour human pretraining stage) is reported, leaving open whether continuous latent actions successfully bridge embodiment gaps or merely overfit to the small robot fine-tuning set.
- [§3.2] §3.2 (Latent Action Model): The training objective and architecture for continuous latent actions contain no reported regularization, diversity metrics, or mode-collapse diagnostics; without these, it is impossible to verify that the latent space preserves accurate contact-rich dynamics rather than learning spurious correlations that would invalidate OOD controllability claims.
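The isolation experiment the second comment asks for comes down to a concrete measurement: run the same post-training recipe with and without the 44k-hour pretraining stage and compare one-step prediction error on held-out OOD robot episodes. A minimal sketch of such a metric, assuming a one-step prediction interface and frames scaled to [0, 1]; the PSNR choice and function names are ours, not the paper's protocol.

```python
# Hedged sketch of a pretraining-ablation metric: per-frame prediction error via PSNR (illustrative only).
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between a predicted and a ground-truth frame scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

def forward_dynamics_error(model, episodes):
    """Mean PSNR of one-step predictions over held-out robot episodes (higher is better)."""
    scores = []
    for frames, actions in episodes:              # frames: (T, C, H, W); actions: (T-1, action_dim)
        for t in range(len(actions)):
            pred = model(frames[t], actions[t])   # assumed one-step prediction interface
            scores.append(psnr(pred, frames[t + 1]))
    return sum(scores) / len(scores)

# The ablation then compares the same recipe with and without the human-video pretraining stage:
#   delta = forward_dynamics_error(pretrained_then_finetuned, ood_episodes) \
#         - forward_dynamics_error(finetuned_only, ood_episodes)
```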
minor comments (2)
- [Abstract] Abstract: '44k hours' should be expanded to '44,000 hours' for immediate readability.
- [§5] §5 (Distillation): The description of the teacher-student distillation pipeline would benefit from an explicit statement of the loss terms used to preserve context consistency.
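On the distillation point, the usual shape of such an objective can be written down even without the paper's exact terms: a student-matches-teacher rollout loss plus a weighted auxiliary term for context consistency. The sketch below is a generic stand-in; in particular, the feature-drift penalty is our own placeholder for whatever consistency term the authors actually use.

```python
# Hedged sketch of a teacher-student distillation objective with a context-consistency term (illustrative).
import torch
import torch.nn.functional as F

def distillation_loss(student_frames, teacher_frames, context_feats, lambda_ctx: float = 0.1):
    """student_frames, teacher_frames: (T, C, H, W) rollouts; context_feats: (T, D) per-frame scene features."""
    # 1) Distillation term: the fast student imitates the slow teacher's rollout frame by frame.
    match = F.mse_loss(student_frames, teacher_frames)
    # 2) Placeholder context-consistency term: penalize drift of scene features across the rollout.
    drift = F.mse_loss(context_feats[1:], context_feats[:-1])
    return match + lambda_ctx * drift

# Dummy usage with random tensors:
s, t = torch.randn(16, 3, 64, 64), torch.randn(16, 3, 64, 64)
ctx = torch.randn(16, 256)
print(distillation_loss(s, t, ctx))
```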
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have revised the paper to strengthen the quantitative support for our claims, including updates to the abstract and additional analyses in the evaluation and method sections. Below we respond point by point to the major comments.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim of 'strong understanding of physics and precise action controllability' on OOD benchmarks is unsupported by any quantitative metrics, error bars, baseline comparisons, or ablation results; this omission is load-bearing because the abstract is the only location where the post-training transfer performance is summarized.
Authors: We agree that the abstract should include key quantitative results to support the central claims. In the revised manuscript, we have updated the abstract to report specific metrics from our OOD benchmarks, including success rates with error bars and comparisons against baselines, to better substantiate the claims of physics understanding and action controllability. revision: yes
-
Referee: [§4] §4 (Evaluation): No quantitative isolation experiment (e.g., forward-dynamics prediction error or controllability success rate with vs. without the 44k-hour human pretraining stage) is reported, leaving open whether continuous latent actions successfully bridge embodiment gaps or merely overfit to the small robot fine-tuning set.
Authors: We acknowledge the importance of an explicit isolation experiment to demonstrate the benefit of large-scale human pretraining. We have added a new ablation study in §4 that quantitatively compares forward-dynamics prediction error and controllability success rates with and without the 44k-hour human pretraining stage on the OOD benchmarks, showing that pretraining provides substantial gains beyond the small robot fine-tuning data alone. revision: yes
-
Referee: [§3.2] §3.2 (Latent Action Model): The training objective and architecture for continuous latent actions contain no reported regularization, diversity metrics, or mode-collapse diagnostics; without these, it is impossible to verify that the latent space preserves accurate contact-rich dynamics rather than learning spurious correlations that would invalidate OOD controllability claims.
Authors: We have revised §3.2 to explicitly describe the regularization terms in the training objective for continuous latent actions. We now also report diversity metrics (e.g., latent space entropy) and mode-collapse diagnostics (e.g., reconstruction fidelity on contact-rich sequences) to confirm that the latent space captures meaningful dynamics. revision: yes
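The diagnostics promised in this last response are cheap to compute once the latent actions are collected into a matrix. A minimal sketch with two obvious checks, a Gaussian entropy proxy and a collapsed-dimension ratio; these are generic choices of ours, not necessarily the metrics the revised §3.2 reports.

```python
# Hedged sketch of latent-action diversity and mode-collapse diagnostics (illustrative only).
import math
import torch

def gaussian_entropy_proxy(latents: torch.Tensor) -> float:
    """Differential entropy of a Gaussian fit to the latents: 0.5 * (d * log(2*pi*e) + logdet(Cov))."""
    d = latents.shape[1]
    cov = torch.cov(latents.T) + 1e-6 * torch.eye(d)      # (d, d) covariance, regularized for stability
    return float(0.5 * (d * math.log(2 * math.pi * math.e) + torch.logdet(cov)))

def collapse_ratio(latents: torch.Tensor, tol: float = 1e-3) -> float:
    """Fraction of latent dimensions with near-zero variance; values near 1.0 suggest mode collapse."""
    return float((latents.var(dim=0) < tol).float().mean())

# Dummy usage: 10k latent actions of dimension 32.
z = torch.randn(10_000, 32)
print(gaussian_entropy_proxy(z), collapse_ratio(z))
```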
Circularity Check
No significant circularity; claims rest on empirical pretraining and post-training pipeline
Full rationale
The paper describes a standard two-stage training process: unsupervised learning of continuous latent actions from 44k hours of human video as proxy controls, followed by post-training on limited robot data and evaluation on OOD benchmarks. No equations, uniqueness theorems, or fitted parameters are presented that reduce the final physics prediction or controllability claims to the inputs by construction. The central results are framed as experimental outcomes from the data mixture and distillation pipeline rather than self-definitional identities or self-citation chains. The abstract and method summary contain no load-bearing self-citations or ansatzes smuggled from prior author work that would force the reported transfer performance.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human videos provide sufficient coverage of contact-rich dynamics for downstream robot tasks
invented entities (1)
- continuous latent actions (no independent evidence)
Forward citations
Cited by 17 Pith papers
-
TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video
EgoTouch is a new multi-view egocentric dataset with dense bimanual tactile supervision, and TouchAnything is a baseline framework showing that wrist views improve vision-based tactile prediction over egocentric input alone.
-
DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
-
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
Grounded World Model for Semantically Generalizable Planning
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation
STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...
-
Lifting Embodied World Models for Planning and Control
Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-eff...
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
Reference graph
Works this paper leans on
-
[1]
World Simulation with Video Foundation Models for Physical AI
Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World Simulation with Video Foundation Models for Physical AI. arXiv preprint arXiv:2511.00062, 2025. 2, 3, 5, 9
-
[2]
Diffusion for World Modeling: Visual Details Matter in Atari
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for World Modeling: Visual Details Matter in Atari. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 16
-
[3]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning.arXiv preprint arXiv:2506.09985, 2025. 16
-
[4]
Whole-Body Conditioned Egocentric Video Prediction.arXiv preprint arXiv:2506.21552, 2025
Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-Body Conditioned Egocentric Video Prediction.arXiv preprint arXiv:2506.21552, 2025
-
[5]
Genie 3: A New Frontier for World Models, 2025
Philip Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, et al. Genie 3: A New Frontier for World Models, 2025. URL https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/. 2, 15, 16
-
[6]
Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation World Models. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2025. 16
-
[7]
H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation
Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation. InProc. of the Conf. on Artificial Intelligence (AAAI), 2025. 4
-
[8]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.arXiv preprint arXiv:2503.14734, 2025. 13, 15
-
[9]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164, 2024. 15
-
[10]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale.arXiv preprint arXiv:2212.06817, 2022
-
[11]
Genie: Generative Interactive Environments
Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative Interactive Environments. InProc. of the International Conf. on Machine learning (ICML), 2024. 6, 7, 9, 16
-
[12]
Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems.arXiv preprint arXiv:2503.06669, 2025. 2
-
[13]
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. InProc. Robotics: Science and Systems (RSS), 2025. 16
-
[14]
Jiahang Cao, Yize Huang, Hanzhong Guo, Rui Zhang, Mu Nan, Weijian Mai, Jiaxu Wang, Hao Cheng, Jingkai Sun, Gang Han, et al. Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-Time Distribution-Level Composition. arXiv preprint arXiv:2510.01068, 2025. 14
-
[15]
Large Video Planner Enables Generalizable Robot Control
Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large Video Planner Enables Generalizable Robot Control.arXiv preprint arXiv:2512.15840, 2025. 4
-
[16]
Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI.arXiv preprint arXiv:2411.00785, 2024. 16
-
[17]
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. villa-X: Enhancing Latent Action Modeling in Vision-Language- Action Models.arXiv preprint arXiv:2507.23682, 2025. 4
-
[18]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.arXiv preprint arXiv:2507.06261, 2025. 10
-
[19]
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation. arXiv preprint arXiv:2510.02283, 2025. 17
-
[20]
RoboNet: Large-Scale Multi-Robot Learning.arXiv preprint arXiv:1910.11215, 2019
Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-Scale Multi-Robot Learning.arXiv preprint arXiv:1910.11215, 2019
-
[21]
DynaGuide: Steering Diffusion Polices with Active Dynamic Guidance
Maximilian Du and Shuran Song. DynaGuide: Steering Diffusion Polices with Active Dynamic Guidance. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 16
-
[22]
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 7, 10
-
[23]
AdaWorld: Learning Adaptable World Models with Latent Actions
Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. AdaWorld: Learning Adaptable World Models with Latent Actions. InProc. of the International Conf. on Machine learning (ICML), 2025. 2, 6, 7, 16
-
[24]
Learning Latent Action World Models In The Wild.arXiv preprint arXiv:2601.05230, 2026
Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, and Michael Rabbat. Learning Latent Action World Models In The Wild.arXiv preprint arXiv:2601.05230, 2026. 16
-
[25]
Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World Models Can Leverage Human Videos for Dexterous Manipulation. arXiv preprint arXiv:2512.13644, 2025. 16
-
[26]
Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft. arXiv preprint arXiv:2504.08388, 2025.
-
[27]
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-World: A Controllable Generative World Model for Robot Manipulation.arXiv preprint arXiv:2510.10125, 2025. 6, 16
-
[28]
Recurrent World Models Facilitate Policy Evolution
David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. InAdvances in Neural Information Processing Systems (NeurIPS), 2018. 16
-
[29]
Mastering Diverse Domains through World Models.Nature, 2025
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering Diverse Domains through World Models. Nature, 2025. 16
-
[30]
Training Agents Inside of Scalable World Models
Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training Agents Inside of Scalable World Models. arXiv preprint arXiv:2509.24527, 2025. 16
-
[31]
1X World Model: Evaluating Bits, not Atoms
Daniel Ho, Jack Monas, Juntao Ren, and Christina Yu. 1X World Model: Evaluating Bits, not Atoms. URL https://www.1x.tech/1x-world-model.pdf. 15
-
[33]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. arXiv preprint arXiv:2207.12598, 2022.
-
[34]
RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025
Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025. 15
-
[35]
EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Ryan Hoque, Peide Huang, David Yoon, Mouli Sivapurapu, and Jian Zhang. EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video.arXiv preprint arXiv:2505.11709, 2025. 5
-
[36]
Image Quality Metrics: PSNR vs. SSIM
Alain Hore and Djemel Ziou. Image Quality Metrics: PSNR vs. SSIM. In Proc. of the International Conf. on Pattern Recognition (ICPR), 2010. 10
-
[37]
GAIA-1: A Generative World Model for Autonomous Driving
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A Generative World Model for Autonomous Driving.arXiv preprint arXiv:2309.17080, 2023. 16
-
[38]
LoRA: Low-Rank Adaptation of Large Language Models
Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. InProc. of the International Conf. on Learning Representations (ICLR), 2022. 15
-
[39]
Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, et al. Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis.arXiv preprint arXiv:2312.08782, 2023. 2
-
[40]
Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting Video Diffusion Models to Interactive World Models.arXiv preprint arXiv:2505.14357, 2025. 5, 6
-
[41]
Towards Video World Models, 2025
Xun Huang. Towards Video World Models, 2025. URL https://www.xunhuang.me/blogs/world_model.html. 8, 16
-
[42]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 2, 8, 17
-
[43]
A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search
Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, and Gokul Swamy. A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 16
-
[44]
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. DreamGen: Unlocking Generalization in Robot Learning through Video World Models. InProc. Conf. on Robot Learning (CoRL), 2025. 16
-
[45]
Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, et al. EnerVerse-AC: Envisioning Embodied Environments with Action Condition.arXiv preprint arXiv:2505.09723, 2025. 16
-
[46]
Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation. arXiv preprint arXiv:2509.19080, 2025. 16
-
[47]
World and Human Action Models Towards Gameplay Ideation.Nature, 2025
Anssi Kanervisto, Dave Bignell, Linda Yilin Wen, Martin Grayson, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Tabish Rashid, Tim Pearce, Yuhan Cao, et al. World and Human Action Models Towards Gameplay Ideation. Nature, 2025. 16
-
[48]
Simar Kareer, Karl Pertsch, James Darpinian, Judy Hoffman, Danfei Xu, Sergey Levine, Chelsea Finn, and Suraj Nair. Emergence of Human to Robot Transfer in Vision-Language-Action Models.arXiv preprint arXiv:2512.22414, 2025. 4
-
[49]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv preprint arXiv:2403.12945, 2024. 2
-
[50]
Learning to Simulate Dynamic Environments with GameGAN
Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to Simulate Dynamic Environments with GameGAN. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020. 16
-
[51]
DriveGAN: Towards a Controllable High-Quality Neural Simulation
Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. DriveGAN: Towards a Controllable High-Quality Neural Simulation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2021.
-
[52]
Auto-Encoding Variational Bayes
Diederik Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
-
[53]
3D and 4D World Modeling: A Survey. arXiv preprint arXiv:2509.07996, 2025
Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3D and 4D World Modeling: A Survey. arXiv preprint arXiv:2509.07996, 2025.
-
[54]
A Path Towards Autonomous Machine Intelligence.Open Review, 2022
Yann LeCun. A Path Towards Autonomous Machine Intelligence.Open Review, 2022. 2
-
[55]
Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, et al. VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators.arXiv preprint arXiv:2510.00406, 2025. 16
-
[56]
Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, et al. Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos.arXiv preprint arXiv:2510.21571, 2025. 4
-
[57]
Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified Video Action Model. InProc. Robotics: Science and Systems (RSS), 2025. 16
-
[58]
Evaluating Real-World Robot Manipulation Policies in Simulation
Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating Real-World Robot Manipulation Policies in Simulation.arXiv preprint arXiv:2405.05941, 2024. 13
-
[59]
WorldEval: World Model as Real-World Robot Policies Evaluator.arXiv preprint arXiv:2505.19017, 2025
Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. WorldEval: World Model as Real-World Robot Policies Evaluator.arXiv preprint arXiv:2505.19017, 2025. 13, 16
-
[60]
Anthony Liang, Pavel Czempin, Matthew Hong, Yutai Zhou, Erdem Biyik, and Stephen Tu. CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations.arXiv preprint arXiv:2505.04999, 2025. 16
-
[61]
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation. In Advances in Neural Information Processing Systems (NeurIPS), 2025. 16
-
[62]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling.arXiv preprint arXiv:2210.02747, 2022. 3
-
[63]
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling Forcing: Autoregressive Long Video Diffusion in Real Time.arXiv preprint arXiv:2509.25161, 2025. 17
-
[64]
StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
Mingyu Liu, Jiuhe Shu, Hui Chen, Zeju Li, Canyu Zhao, Jiange Yang, Shenyuan Gao, Hao Chen, and Chunhua Shen. StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation.arXiv preprint arXiv:2510.05057, 2025. 16
-
[65]
EgoZero: Robot Learning from Smart Glasses.arXiv preprint arXiv:2505.20290, 2025
Vincent Liu, Ademi Adeniji, Haotian Zhan, Siddhant Haldar, Raunaq Bhirangi, Pieter Abbeel, and Lerrel Pinto. EgoZero: Robot Learning from Smart Glasses.arXiv preprint arXiv:2505.20290, 2025. 4
-
[66]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. InProc. of the International Conf. on Learning Representations (ICLR), 2019. 9
-
[67]
Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos.arXiv preprint arXiv:2507.15597, 2025. 4
-
[68]
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive Language: Talking to Robots in Real Time.IEEE Robotics and Automation Letters (RA-L), 2023
-
[69]
Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild
Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, et al. Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild. InProc. of the European Conf. on Computer Vision (ECCV), 2024
-
[70]
Structured World Models from Human Videos
Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured World Models from Human Videos. In Proc. Robotics: Science and Systems (RSS), 2023. 16
-
[71]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision.arXiv preprint arXiv:2304.07193, 2023. 20
-
[72]
Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, and Juergen Gall. EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses.arXiv preprint arXiv:2511.18173, 2025
-
[73]
Genie 2: A Large-Scale Foundation World Model, 2024
Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, et al. Genie 2: A Large-Scale Foundation World Model, 2024. URL https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/. 5, 16
-
[74]
Reconstructing Hands in 3D with Transformers
Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing Hands in 3D with Transformers. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024. 6, 11
-
[75]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. InProc. of the IEEE International Conf. on Computer Vision (ICCV), 2023. 3
-
[76]
Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng Yang. Strengthening Generative Robot Policies through Predictive World Modeling. arXiv preprint arXiv:2502.00622, 2025. 13, 16
-
[77]
Humanoid Policy ~ Human Policy
Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid Policy ~ Human Policy. In Proc. Conf. on Robot Learning (CoRL), 2025. 4
-
[78]
Evaluating Robot Policies in a World Model.arXiv preprint arXiv:2506.00613, 2025
Julian Quevedo, Percy Liang, and Sherry Yang. Evaluating Robot Policies in a World Model.arXiv preprint arXiv:2506.00613, 2025. 13, 16
-
[79]
General Agents Need World Models
Jonathan Richens, Tom Everitt, and David Abel. General Agents Need World Models. InProc. of the International Conf. on Machine learning (ICML), 2025. 2, 16
-
[80]
Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied Hands: Modeling and Capturing Hands and Bodies Together.arXiv preprint arXiv:2201.02610, 2022. 11