pith. machine review for the scientific record.

arxiv: 2511.00062 · v2 · submitted 2025-10-28 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Recognition: 3 Lean theorem links

World Simulation with Video Foundation Models for Physical AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 22:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords world foundation models · video generation · physical AI · robotics simulation · flow-based models · synthetic data · Sim2Real translation · embodied intelligence

The pith

Cosmos-Predict2.5 unifies text, image, and video inputs into controllable world generation for robotics simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cosmos-Predict2.5 as the next iteration of world foundation models that combine Text2World, Image2World, and Video2World tasks inside one flow-based system. It adds Cosmos-Reason1 to supply stronger text-based control and trains on 200 million curated clips before applying reinforcement learning refinement. The result is higher-fidelity video output that follows instructions more closely than the prior version, at both 2B and 14B scales. A companion Cosmos-Transfer2.5 model handles Sim2Real and Real2Real translation while being 3.5× smaller than its predecessor. The authors position these tools as practical support for generating synthetic training data and running closed-loop policy tests in robotics and autonomous systems.

Core claim

Cosmos-Predict2.5 is a flow-based model that unifies Text2World, Image2World, and Video2World generation while integrating Cosmos-Reason1 for richer text grounding and finer control. Trained on 200M curated video clips and refined with RL post-training, it produces substantial gains in video quality and instruction alignment over Cosmos-Predict1 at 2B and 14B scales. Paired with Cosmos-Transfer2.5, a world-translation model 3.5× smaller than its predecessor, the family is presented as a set of open tools that enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for embodied intelligence.

What carries the argument

The flow-based architecture that unifies Text2World, Image2World, and Video2World generation, augmented by integration with Cosmos-Reason1 for text grounding and control.
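
To make the unification concrete: a single flow-matching objective can serve all three tasks if the conditioning signal, rather than the network, encodes the task. The sketch below is a minimal illustration under that assumption; flow_matching_step, the tensor shapes, and the cond dictionary are hypothetical conveniences, not the paper's API.

    # Minimal conditional flow-matching training step (Python / PyTorch).
    # Hypothetical shapes and names; the paper does not publish this interface.
    import torch

    def flow_matching_step(model, x1, cond):
        """One training step of conditional flow matching.

        x1:   clean video latents, shape (B, T, C, H, W)
        cond: dict with optional 'text', 'image', 'video' embeddings; the keys
              present select the task (Text2World, Image2World, Video2World).
        """
        b = x1.shape[0]
        t = torch.rand(b, device=x1.device)            # time in (0, 1)
        x0 = torch.randn_like(x1)                      # noise endpoint
        t_ = t.view(b, 1, 1, 1, 1)
        xt = (1 - t_) * x0 + t_ * x1                   # linear interpolation path
        target = x1 - x0                               # constant velocity along the path
        pred = model(xt, t, cond)                      # network predicts the velocity
        return torch.mean((pred - target) ** 2)        # simple regression loss

Sampling then integrates the learned velocity field from noise to data; supplying different conditioning keys switches the task without touching the weights.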

If this is right

  • More reliable synthetic data can be generated for training physical AI systems.
  • Policy evaluation becomes feasible inside longer, higher-fidelity simulated episodes.
  • Closed-loop simulation supports iterative testing of robotics and autonomous driving agents.
  • A smaller control-net style model delivers robust Sim2Real and Real2Real video translation (sketched just below).
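
The control-net style claim in the last bullet follows a well-known pattern: a small trainable branch reads the control video (for example a simulation render) and injects residuals into a frozen generator through zero-initialized projections. A minimal sketch, assuming generic base_blocks of width hidden_dim; none of the module names come from the paper.

    # Sketch of a control-net style conditioning branch (Python / PyTorch).
    import copy
    import torch.nn as nn

    class ControlBranch(nn.Module):
        """Trainable clone of the base blocks driven by a control signal;
        its outputs are added back into the frozen base as residuals."""

        def __init__(self, base_blocks, hidden_dim):
            super().__init__()
            self.blocks = copy.deepcopy(base_blocks)          # trainable copy
            self.zero_proj = nn.ModuleList(
                nn.Linear(hidden_dim, hidden_dim) for _ in base_blocks
            )
            for proj in self.zero_proj:                       # zero init: the branch
                nn.init.zeros_(proj.weight)                   # contributes nothing at
                nn.init.zeros_(proj.bias)                     # the start of training

        def forward(self, control_feats):
            residuals, h = [], control_feats
            for block, proj in zip(self.blocks, self.zero_proj):
                h = block(h)
                residuals.append(proj(h))                     # one residual per base block
            return residuals

Because the base stays frozen and only the branch trains, the translation model can stay compact relative to the generation stack; note the 3.5× size reduction claimed for Cosmos-Transfer2.5 is relative to Cosmos-Transfer1, not to the base generator.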

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Open release of the models and benchmarks could let independent groups build hybrid training pipelines that mix simulated and real data more aggressively.
  • If the simulated worlds hold up under long-horizon prediction, they may reduce the volume of real-world robot trials needed during development.
  • The same generation stack might later support multi-agent or multi-view scenarios once the training distribution expands.

Load-bearing premise

That gains in generated video quality and instruction following will produce simulations accurate enough for downstream robotics tasks such as policy evaluation.

What would settle it

A controlled test in which robot policies trained or evaluated inside Cosmos-Predict2.5 simulations show no measurable improvement in real-world success rate compared with policies trained inside prior simulators or with real data.
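
Stated as a protocol, that test is a two-arm comparison with real-world success rate as the endpoint. A hedged sketch using a standard two-proportion z-test; the function name, trial counts, and the use of statsmodels are illustrative choices, not anything specified in the paper.

    # Hypothetical harness for the settling experiment (Python).
    from statsmodels.stats.proportion import proportions_ztest

    def compare_sim2real(success_a, trials_a, success_b, trials_b, alpha=0.05):
        """Two-sided z-test on real-world success counts.

        arm a: policy trained/evaluated in Cosmos-Predict2.5 rollouts
        arm b: policy trained in a prior simulator or on real data
        """
        stat, p = proportions_ztest([success_a, success_b], [trials_a, trials_b])
        better = p < alpha and (success_a / trials_a) > (success_b / trials_b)
        return {"z": stat, "p": p, "significant_improvement": better}

    # Example: 72/100 vs 61/100 successes on matched real trials.
    print(compare_sim2real(72, 100, 61, 100))

A null result under this design, with adequate power, is what would undercut the load-bearing premise above.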

read the original abstract

We introduce Cosmos-Predict2.5, the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation in a single model and leverages Cosmos-Reason1, a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, Cosmos-Predict2.5 achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with Cosmos-Transfer2.5, a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than Cosmos-Transfer1, it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish Cosmos-Predict2.5 and Cosmos-Transfer2.5 as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Cosmos-Predict2.5, a flow-based world foundation model for Physical AI that unifies Text2World, Image2World, and Video2World generation in a single architecture. It incorporates Cosmos-Reason1 for richer text grounding and control, trains on 200M curated video clips with reinforcement learning post-training, and releases 2B and 14B parameter models. The authors claim substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, enabling more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics. They also present Cosmos-Transfer2.5, a 3.5x smaller control-net style model for Sim2Real and Real2Real translation with higher fidelity, and release source code, pretrained checkpoints, and benchmarks under an open license.

Significance. If the claimed gains in controllability and physical fidelity are substantiated, the work could meaningfully advance embodied AI by supplying scalable open-source world simulators for robotics research. The explicit release of code, checkpoints, and curated benchmarks under the NVIDIA Open Model License is a concrete strength that supports reproducibility and community adoption.

major comments (2)
  1. [Abstract] The central claim that Cosmos-Predict2.5 'achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment' is presented without any quantitative metrics, baselines, ablation studies, or evaluation details. This directly undermines assessment of the downstream assertion that the models enable 'more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.'
  2. [Training and post-training description] No ablations isolate the contributions of the flow-based unification, 200M-clip curation, RL post-training, or Cosmos-Reason1 integration to any performance metric. Without such evidence, the weakest assumption—that these elements produce simulations sufficiently accurate and controllable for policy evaluation—remains untested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the abstract and training sections can be strengthened with more explicit quantitative support and component analysis. We will revise the manuscript accordingly while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] The central claim that Cosmos-Predict2.5 'achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment' is presented without any quantitative metrics, baselines, ablation studies, or evaluation details. This directly undermines assessment of the downstream assertion that the models enable 'more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.'

    Authors: We agree that the abstract should include concrete quantitative metrics to support the claims of improvement. In the revised version, we will expand the abstract to report key metrics from the experimental evaluation, including specific gains in video quality (e.g., perceptual and temporal consistency scores) and instruction alignment (e.g., text-video matching accuracy) relative to Cosmos-Predict1, along with brief mention of the evaluation protocols used. These details already appear in the body of the paper and will now be summarized upfront to allow readers to better assess the downstream utility for synthetic data generation and policy evaluation.
    revision: yes

  2. Referee: [Training and post-training description] No ablations isolate the contributions of the flow-based unification, 200M-clip curation, RL post-training, or Cosmos-Reason1 integration to any performance metric. Without such evidence, the weakest assumption—that these elements produce simulations sufficiently accurate and controllable for policy evaluation—remains untested.

    Authors: We acknowledge that isolating the individual contributions of the flow-based unification, data curation scale, RL post-training, and Cosmos-Reason1 integration would provide stronger evidence. The current manuscript demonstrates overall gains via direct comparisons to Cosmos-Predict1, but we agree targeted ablations would be valuable. In the revision, we will add a dedicated ablation subsection (or supplementary material) that quantifies the incremental impact of each component on metrics such as video fidelity and controllability. This will directly address the concern about untested assumptions for policy evaluation use cases.
    revision: partial
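
If the promised ablations materialize, the natural design is a leave-one-out grid over the four components the referee names. A minimal sketch; evaluate is a placeholder for whatever fidelity or controllability metric the revision adopts, and the component names simply mirror the rebuttal.

    # Hypothetical leave-one-out ablation grid (Python).
    COMPONENTS = ["flow_unification", "curated_200m_data",
                  "rl_post_training", "reason1_grounding"]

    def ablation_grid(evaluate):
        """Score the full system, then each variant with one component off."""
        full = {c: True for c in COMPONENTS}
        rows = [("full", evaluate(full))]
        for c in COMPONENTS:
            rows.append((f"-{c}", evaluate({**full, c: False})))
        return rows

    # Example with a stub metric standing in for video fidelity:
    stub = lambda cfg: sum(cfg.values()) / len(cfg)
    for name, score in ablation_grid(stub):
        print(f"{name:24s} {score:.2f}")

This isolates incremental contributions only to first order; interactions between components would need paired toggles.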

Circularity Check

0 steps flagged

No circularity: purely empirical model description and release

full rationale

The paper introduces Cosmos-Predict2.5 and Cosmos-Transfer2.5 as trained video foundation models, describing their flow-based architecture, training on 200M clips, RL post-training, integration with Cosmos-Reason1, and empirical improvements in quality/alignment. No mathematical derivation chain, predictive equations, uniqueness theorems, or fitted parameters are presented that could reduce to inputs by construction. Claims rest on training procedures and released checkpoints rather than any self-referential logic. This matches the default expectation of a non-circular empirical release paper.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard large-scale video model training assumptions rather than new theoretical derivations; no invented physical entities are introduced.

free parameters (2)
  • model scales = 2B, 14B
    2B and 14B parameter counts chosen for different compute and deployment needs
  • training dataset size = 200M
    200M curated video clips selected as the training corpus
axioms (2)
  • domain assumption: Flow-based generative models can capture physical dynamics in video sufficiently well for robotics simulation
    Invoked by the choice of architecture for world simulation
  • domain assumption: Curated video data plus RL post-training yields controllable physical behavior
    Basis for claiming improved instruction alignment and reliability

pith-pipeline@v0.9.0 · 5990 in / 1421 out tokens · 44490 ms · 2026-05-12T22:56:44.832776+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation... Trained on 200M curated video clips and refined with reinforcement learning-based post-training... enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics

  • IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We further extend the family with Cosmos-Transfer2.5, a control-net style framework for Sim2Real and Real2Real world translation

  • IndisputableMonolith.Foundation.PhiForcing phi_equation · unclear

    Relation between the paper passage and the cited Recognition theorem.

    These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  2. Coding Agent Is Good As World Simulator

    cs.AI 2026-05 unverdicted novelty 7.0

    A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.

  3. CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.

  4. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  5. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  6. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  7. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  8. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  9. MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.

  10. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  11. Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

    cs.RO 2026-05 unverdicted novelty 6.0

    Pelican-Unified 1.0 trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...

  12. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

  13. Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

  14. Learning physically grounded traffic accident reconstruction from public accident reports

    cs.LG 2026-04 unverdicted novelty 6.0

    A multimodal learning model with a new dataset of 6,217 cases reconstructs lane-consistent pre-impact motion and collision interactions from public accident reports, outperforming baselines in accuracy and consistency.

  15. Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.

  16. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  17. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  18. From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

    cs.RO 2026-04 unverdicted novelty 6.0

    Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.

  19. ShapeGen: Robotic Data Generation for Category-Level Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ShapeGen generates shape-diverse 3D robotic manipulation demonstrations without simulators by curating a functional shape library and applying a minimal-annotation pipeline for novel, physically plausible data.

  20. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  21. SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

    cs.CV 2026-04 unverdicted novelty 6.0

    SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...

  22. Lifting Unlabeled Internet-level Data for 3D Scene Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.

  23. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  24. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  25. Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations

    cs.LG 2026-05 unverdicted novelty 5.0

    Di-BiLPS combines a variational autoencoder, latent diffusion, and contrastive learning to achieve state-of-the-art accuracy on PDE problems with as little as 3% observations while supporting zero-shot super-resolutio...

  26. Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

  27. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  28. SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization

    cs.CV 2026-04 unverdicted novelty 5.0

    SyncFix improves 3D reconstructions by synchronizing multi-view latent representations in a diffusion refinement process, generalizing from pair-wise training to arbitrary view counts at inference.

  29. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  30. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

  31. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  32. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 29 Pith papers · 27 internal anchors

  1. [1]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025. 35

  2. [2]

    Edify image: High-quality image generation with pixel space laplacian diffusion models.arXiv preprint arXiv:2411.07126, 2024

    Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, et al. Edify image: High-quality image generation with pixel space laplacian diffusion models.arXiv preprint arXiv:2411.07126, 2024. 8

  3. [3]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, 2025. 31

  4. [4]

    Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints

    Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, and Di Zhang. Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints. In ICLR, 2025. 31

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 4, 5, 6, 7, 32

  6. [6]

    Genie 3: A new frontier for world models, 2025

    Philip J Ball, J Bauer, F Belletti, et al. Genie 3: A new frontier for world models, 2025. 35

  7. [7]

    Videophy: Evaluating Physical Commonsense for Video Generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520, 2024. 36

  8. [8]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025. 36

  9. [9]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 6, 36

  10. [10]

    bloc97. NTK-aware scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/, 2023. Reddit post, r/LocalLLaMA. 9

  11. [11]

    Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025

    Florian Bordes, Quentin Garrido, Justine T Kao, Adina Williams, Michael Rabbat, and Emmanuel Dupoux. Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments. arXiv preprint arXiv:2506.09849, 2025. 36

  12. [12]

    Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In IROS, 2025. 6, 31

  13. [13]

    Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025

    Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025. 35

  14. [14]

    Video depth anything: Consistent depth estimation for super-long videos

    Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In CVPR, 2025. 19

  15. [15]

    On the importance of noise scheduling for diffusion models

    Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.

  16. [16]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin CM Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In RSS, 2023. 22

  17. [17]

    Delta Lake: Open-source storage framework that enables building lakehouses

    Databricks. Delta lake: Open-source storage framework that enables building lakehouses. https://delta.io/, 2019. Open-source project, Delta Lake. 6

  18. [18]

    Veo 3, May 2025

    Google DeepMind. Veo 3, May 2025. URL https://deepmind.google/technologies/veo/veo-3/. 35

  19. [19]

    Worldscore: A Unified Evaluation Benchmark for World Generation

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025. 36

  20. [20]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 8, 11

  21. [21]

    LLM-based Realistic Safety-Critical Driving Video Generation

    Yongjie Fu, Ruijian Zha, Pei Tian, and Xuan Di. LLM-based realistic safety-critical driving video generation. arXiv preprint arXiv:2507.01264, 2025. 36

  22. [22]

    Diffusion models and gaussian flow matching: Two sides of the same coin

    Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin Patrick Murphy, and Tim Salimans. Diffusion models and gaussian flow matching: Two sides of the same coin. In The Fourth Blogpost Track at ICLR 2025, 2025. 8

  23. [23]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025. 35

  24. [24]

    YOLOX: Exceeding YOLO Series in 2021

    Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021. 7

  25. [25]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 13

  26. [26]

    T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation.arXiv preprint arXiv:2505.00337, 2025

    Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation.arXiv preprint arXiv:2505.00337, 2025. 36

  27. [27]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. 35

  28. [28]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 35

  29. [29]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019. 35

  30. [30]

    Generalized neighborhood attention: Multi-dimensional sparse attention at the speed of light

    Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, et al. Generalized neighborhood attention: Multi-dimensional sparse attention at the speed of light. arXiv preprint arXiv:2504.16922, 2025. 14

  31. [31]

    Unirelight: Learning joint decomposition and synthesis for video relighting.arXiv preprint arXiv:2506.15673, 2025

    Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar, Alexander Keller, Sanja Fidler, Igor Gilitschenski, Zan Gojcic, and Zian Wang. Unirelight: Learning joint decomposition and synthesis for video relighting. arXiv preprint arXiv:2506.15673, 2025. 36

  32. [32]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023. 8

  33. [33]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025. 31

  34. [34]

    LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

    Wei-Chih Hung, Vincent Casser, Henrik Kretzschmar, Jyh-Jing Hwang, and Dragomir Anguelov. LET-3D-AP: Longitudinal error tolerant 3d average precision for camera-only 3d detection, 2024. URL https://arxiv.org/abs/2206.07705. 25

  35. [35]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  36. [36]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705, 2025. 3, 32, 36

  37. [37]

    RTMPose: Real-Time Multi-Person Pose Estimation Based on MMPose

    Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, and Kai Chen. Rtmpose: Real-time multi-person pose estimation based on mmpose. arXiv preprint arXiv:2303.07399, 2023. 7

  38. [38]

    Elucidating the design space of diffusion-based generative models.NeurIPS, 2022

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.NeurIPS, 2022. 8

  39. [39]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. 6

  40. [40]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 32, 35

  41. [41]

    Kling, 2024

    KuaiShou. Kling, 2024. URL https://klingai.com/. 35

  42. [42]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers.arXiv preprint arXiv:2203.17270, 2022

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers, 2022. URL https://arxiv.org/abs/2203.17270. 28

  43. [43]

    WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

    Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Herrmann, Gordon Wetzstein, and Jiajun Wu. Wonderplay: Dynamic 3d scene generation from a single image and actions. arXiv preprint arXiv:2505.18151, 2025.

  44. [44]

    Torchtitan: One-stop pytorch native solution for production ready LLM pretraining

    Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready LLM pretraining. In ICLR, 2025. 14

  45. [45]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025. 36

  46. [46]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 8

  47. [47]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025. 13

  48. [48]

    Dynamicscaler: Seamless and scalable video generation for panoramic scenes

    Jinxiu Liu, Shaoheng Lin, Yinxiao Li, and Ming-Hsuan Yang. Dynamicscaler: Seamless and scalable video generation for panoramic scenes. In CVPR, 2025. 35

  49. [49]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 22

  50. [50]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024. 14

  51. [51]

    Latr: 3d lane detection from monocular images with transformer, 2023

    Yueru Luo, Chaoda Zheng, Xu Yan, Tang Kun, Chao Zheng, Shuguang Cui, and Zhen Li. Latr: 3d lane detection from monocular images with transformer, 2023. URL https://arxiv.org/abs/2308.04583. 28

  52. [52]

    Hailuo, 2024

    MiniMax. Hailuo, 2024. URL https://hailuoai.com/video. 35

  53. [53]

    Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038, 2025

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038, 2025. 36

  54. [54]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024. 35

  55. [55]

    Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025

    NVIDIA. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 3, 9, 35

  56. [56]

    Cosmos-transfer1: Conditional world generation with adaptive multimodal control, 2025

    NVIDIA. Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025. 3, 18, 19, 28, 36

  57. [57]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. 3, 4, 8, 9, 31, 35, 36

  59. [59]

    Sora, 2024

    OpenAI. Sora, 2024. URL https://openai.com/sora/. 35

  60. [60]

    Training language models to follow instructions with human feedback.NeurIPS, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.NeurIPS, 2022. 13

  61. [61]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models.arXiv preprint arXiv:2309.00071, 2023. 9

  62. [62]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024. 35

  63. [63]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD, 2020. 14

  64. [64]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 19, 22

  65. [65]

    Diffusion Policy Policy Optimization

    Allen Z. Ren, Justin Lidard, Lars Lien Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. In ICLR, 2025.

  66. [66]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024. 22

  67. [67]

    Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models

    Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models. arXiv preprint arXiv:2506.09042, 2025. 3, 25, 28, 36

  68. [68]

    Gen3c: 3d-informed world-consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025. 36

  69. [69]

    Gen 3, 2024

    Runway. Gen 3, 2024. URL https://runwayml.com/research/introducing-gen-3-alpha. 35

  70. [70]

    Fitting conic sections to “very scattered” data: An iterative refinement of the bookstein algorithm

    Paul D Sampson. Fitting conic sections to “very scattered” data: An iterative refinement of the bookstein algorithm. Computer Graphics and Image Processing, 1982. ISSN 0146-664X. doi: https://doi.org/10.1016/0146-664X(82)90101-0. URL https://www.sciencedirect.com/science/article/pii/0146664X82901010. 25

  71. [71]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 13

  72. [72]

    Text-to-4D Dynamic Scene Generation

    Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023. 35

  73. [73]

    Light field networks: Neural scene representations with single-evaluation rendering.NeurIPS, 2021

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering.NeurIPS, 2021. 31

  74. [74]

    Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2, 2021

    Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2, 2021. 25

  75. [75]

    cuRobo: Parallelized collision-free minimum-jerk robot motion generation.arXiv preprint arXiv:2310.17274, 2023

    Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. cuRobo: Parallelized collision-free minimum-jerk robot motion generation.arXiv preprint arXiv:2310.17274, 2023. 21

  76. [76]

    1x technologies | safe humanoids for the home, 2025

    1X Technologies. 1x technologies | safe humanoids for the home, 2025. URL https://www.1x.tech/. 6

  77. [77]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Quan Vuong, Sergey Levine, Homer Rich Walke, Karl Pertsch, Anikait Singh, Ria Doshi, Charles Xu, Jianlan Luo, Liam Tan, Dhruv Shah, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL 2023, 2023.

  78. [78]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In CoRL, 2023. 6, 33

  79. [79]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 9, 32, 35

  80. [80]

    A comprehensive study of decoder-only llms for text-to-image generation

    Andrew Z Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, and Yogesh Balaji. A comprehensive study of decoder-only llms for text-to-image generation. InCVPR, 2025. 9

Showing first 80 references.