pith. machine review for the scientific record.

arxiv: 2511.00062 · v2 · submitted 2025-10-28 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Recognition: 3 Lean theorem links

World Simulation with Video Foundation Models for Physical AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 22:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords world foundation models · video generation · physical AI · robotics simulation · flow-based models · synthetic data · Sim2Real translation · embodied intelligence

The pith

Cosmos-Predict2.5 unifies text, image, and video inputs into controllable world generation for robotics simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cosmos-Predict2.5 as the next iteration of world foundation models that combine Text2World, Image2World, and Video2World tasks inside one flow-based system. It adds Cosmos-Reason1 to supply stronger text-based control and trains on 200 million curated clips before applying reinforcement learning refinement. The result is higher-fidelity video output that follows instructions more closely than the prior version, at both 2B and 14B scales. A companion Cosmos-Transfer2.5 model handles Sim2Real and Real2Real translation while being 3.5× smaller than its predecessor. The authors position these tools as practical support for generating synthetic training data and running closed-loop policy tests in robotics and autonomous systems.

Core claim

Cosmos-Predict2.5 is a flow-based model that unifies Text2World, Image2World, and Video2World generation while integrating Cosmos-Reason1 for richer text grounding and finer control. Trained on 200M curated video clips and refined with RL post-training, it produces substantial gains in video quality and instruction alignment over Cosmos-Predict1 at 2B and 14B scales. Paired with Cosmos-Transfer2.5, a world-translation model 3.5× smaller than its predecessor, the family is presented as a set of open tools that enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for embodied intelligence.

What carries the argument

The flow-based architecture that unifies Text2World, Image2World, and Video2World generation, augmented by integration with Cosmos-Reason1 for text grounding and control.
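
To make the unification concrete: a single flow-matching objective can serve all three tasks if the conditioning signal, rather than the network, encodes the task. The sketch below is a minimal illustration under that assumption; flow_matching_step, the tensor shapes, and the cond dictionary are hypothetical conveniences, not the paper's API.

    # Minimal conditional flow-matching training step (Python / PyTorch).
    # Hypothetical shapes and names; the paper does not publish this interface.
    import torch

    def flow_matching_step(model, x1, cond):
        """One training step of conditional flow matching.

        x1:   clean video latents, shape (B, T, C, H, W)
        cond: dict with optional 'text', 'image', 'video' embeddings; the keys
              present select the task (Text2World, Image2World, Video2World).
        """
        b = x1.shape[0]
        t = torch.rand(b, device=x1.device)            # time in (0, 1)
        x0 = torch.randn_like(x1)                      # noise endpoint
        t_ = t.view(b, 1, 1, 1, 1)
        xt = (1 - t_) * x0 + t_ * x1                   # linear interpolation path
        target = x1 - x0                               # constant velocity along the path
        pred = model(xt, t, cond)                      # network predicts the velocity
        return torch.mean((pred - target) ** 2)        # simple regression loss

Sampling then integrates the learned velocity field from noise to data; supplying different conditioning keys switches the task without touching the weights.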

If this is right

  • More reliable synthetic data can be generated for training physical AI systems.
  • Policy evaluation becomes feasible inside longer, higher-fidelity simulated episodes.
  • Closed-loop simulation supports iterative testing of robotics and autonomous driving agents.
  • A smaller control-net style model delivers robust Sim2Real and Real2Real video translation (sketched just below).
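
The control-net style claim in the last bullet follows a well-known pattern: a small trainable branch reads the control video (for example a simulation render) and injects residuals into a frozen generator through zero-initialized projections. A minimal sketch, assuming generic base_blocks of width hidden_dim; none of the module names come from the paper.

    # Sketch of a control-net style conditioning branch (Python / PyTorch).
    import copy
    import torch.nn as nn

    class ControlBranch(nn.Module):
        """Trainable clone of the base blocks driven by a control signal;
        its outputs are added back into the frozen base as residuals."""

        def __init__(self, base_blocks, hidden_dim):
            super().__init__()
            self.blocks = copy.deepcopy(base_blocks)          # trainable copy
            self.zero_proj = nn.ModuleList(
                nn.Linear(hidden_dim, hidden_dim) for _ in base_blocks
            )
            for proj in self.zero_proj:                       # zero init: the branch
                nn.init.zeros_(proj.weight)                   # contributes nothing at
                nn.init.zeros_(proj.bias)                     # the start of training

        def forward(self, control_feats):
            residuals, h = [], control_feats
            for block, proj in zip(self.blocks, self.zero_proj):
                h = block(h)
                residuals.append(proj(h))                     # one residual per base block
            return residuals

Because the base stays frozen and only the branch trains, the translation model can stay compact relative to the generation stack; note the 3.5× size reduction claimed for Cosmos-Transfer2.5 is relative to Cosmos-Transfer1, not to the base generator.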

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Open release of the models and benchmarks could let independent groups build hybrid training pipelines that mix simulated and real data more aggressively.
  • If the simulated worlds hold up under long-horizon prediction, they may reduce the volume of real-world robot trials needed during development.
  • The same generation stack might later support multi-agent or multi-view scenarios once the training distribution expands.

Load-bearing premise

That gains in generated video quality and instruction following will produce simulations accurate enough for downstream robotics tasks such as policy evaluation.

What would settle it

A controlled test in which robot policies trained or evaluated inside Cosmos-Predict2.5 simulations show no measurable improvement in real-world success rate compared with policies trained inside prior simulators or with real data.
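
Stated as a protocol, that test is a two-arm comparison with real-world success rate as the endpoint. A hedged sketch using a standard two-proportion z-test; the function name, trial counts, and the use of statsmodels are illustrative choices, not anything specified in the paper.

    # Hypothetical harness for the settling experiment (Python).
    from statsmodels.stats.proportion import proportions_ztest

    def compare_sim2real(success_a, trials_a, success_b, trials_b, alpha=0.05):
        """Two-sided z-test on real-world success counts.

        arm a: policy trained/evaluated in Cosmos-Predict2.5 rollouts
        arm b: policy trained in a prior simulator or on real data
        """
        stat, p = proportions_ztest([success_a, success_b], [trials_a, trials_b])
        better = p < alpha and (success_a / trials_a) > (success_b / trials_b)
        return {"z": stat, "p": p, "significant_improvement": better}

    # Example: 72/100 vs 61/100 successes on matched real trials.
    print(compare_sim2real(72, 100, 61, 100))

A null result under this design, with adequate power, is what would undercut the load-bearing premise above.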

read the original abstract

We introduce Cosmos-Predict2.5, the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation in a single model and leverages Cosmos-Reason1, a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, Cosmos-Predict2.5 achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with Cosmos-Transfer2.5, a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than Cosmos-Transfer1, it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish Cosmos-Predict2.5 and Cosmos-Transfer2.5 as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Cosmos-Predict2.5, a flow-based world foundation model for Physical AI that unifies Text2World, Image2World, and Video2World generation in a single architecture. It incorporates Cosmos-Reason1 for richer text grounding and control, trains on 200M curated video clips with reinforcement learning post-training, and releases 2B and 14B parameter models. The authors claim substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, enabling more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics. They also present Cosmos-Transfer2.5, a 3.5x smaller control-net style model for Sim2Real and Real2Real translation with higher fidelity, and release source code, pretrained checkpoints, and benchmarks under an open license.

Significance. If the claimed gains in controllability and physical fidelity are substantiated, the work could meaningfully advance embodied AI by supplying scalable open-source world simulators for robotics research. The explicit release of code, checkpoints, and curated benchmarks under the NVIDIA Open Model License is a concrete strength that supports reproducibility and community adoption.

major comments (2)
  1. [Abstract] The central claim that Cosmos-Predict2.5 'achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment' is presented without any quantitative metrics, baselines, ablation studies, or evaluation details. This directly undermines assessment of the downstream assertion that the models enable 'more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.'
  2. [Training and post-training description] No ablations isolate the contributions of the flow-based unification, 200M-clip curation, RL post-training, or Cosmos-Reason1 integration to any performance metric. Without such evidence, the weakest assumption—that these elements produce simulations sufficiently accurate and controllable for policy evaluation—remains untested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the abstract and training sections can be strengthened with more explicit quantitative support and component analysis. We will revise the manuscript accordingly while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] The central claim that Cosmos-Predict2.5 'achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment' is presented without any quantitative metrics, baselines, ablation studies, or evaluation details. This directly undermines assessment of the downstream assertion that the models enable 'more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.'

    Authors: We agree that the abstract should include concrete quantitative metrics to support the claims of improvement. In the revised version, we will expand the abstract to report key metrics from the experimental evaluation, including specific gains in video quality (e.g., perceptual and temporal consistency scores) and instruction alignment (e.g., text-video matching accuracy) relative to Cosmos-Predict1, along with brief mention of the evaluation protocols used. These details already appear in the body of the paper and will now be summarized upfront to allow readers to better assess the downstream utility for synthetic data generation and policy evaluation.
    revision: yes

  2. Referee: [Training and post-training description] No ablations isolate the contributions of the flow-based unification, 200M-clip curation, RL post-training, or Cosmos-Reason1 integration to any performance metric. Without such evidence, the weakest assumption—that these elements produce simulations sufficiently accurate and controllable for policy evaluation—remains untested.

    Authors: We acknowledge that isolating the individual contributions of the flow-based unification, data curation scale, RL post-training, and Cosmos-Reason1 integration would provide stronger evidence. The current manuscript demonstrates overall gains via direct comparisons to Cosmos-Predict1, but we agree targeted ablations would be valuable. In the revision, we will add a dedicated ablation subsection (or supplementary material) that quantifies the incremental impact of each component on metrics such as video fidelity and controllability. This will directly address the concern about untested assumptions for policy evaluation use cases.
    revision: partial
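
If the promised ablations materialize, the natural design is a leave-one-out grid over the four components the referee names. A minimal sketch; evaluate is a placeholder for whatever fidelity or controllability metric the revision adopts, and the component names simply mirror the rebuttal.

    # Hypothetical leave-one-out ablation grid (Python).
    COMPONENTS = ["flow_unification", "curated_200m_data",
                  "rl_post_training", "reason1_grounding"]

    def ablation_grid(evaluate):
        """Score the full system, then each variant with one component off."""
        full = {c: True for c in COMPONENTS}
        rows = [("full", evaluate(full))]
        for c in COMPONENTS:
            rows.append((f"-{c}", evaluate({**full, c: False})))
        return rows

    # Example with a stub metric standing in for video fidelity:
    stub = lambda cfg: sum(cfg.values()) / len(cfg)
    for name, score in ablation_grid(stub):
        print(f"{name:24s} {score:.2f}")

This isolates incremental contributions only to first order; interactions between components would need paired toggles.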

Circularity Check

0 steps flagged

No circularity: purely empirical model description and release

full rationale

The paper introduces Cosmos-Predict2.5 and Cosmos-Transfer2.5 as trained video foundation models, describing their flow-based architecture, training on 200M clips, RL post-training, integration with Cosmos-Reason1, and empirical improvements in quality/alignment. No mathematical derivation chain, predictive equations, uniqueness theorems, or fitted parameters are presented that could reduce to inputs by construction. Claims rest on training procedures and released checkpoints rather than any self-referential logic. This matches the default expectation of a non-circular empirical release paper.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard large-scale video model training assumptions rather than new theoretical derivations; no invented physical entities are introduced.

free parameters (2)
  • model scales = 2B, 14B
    2B and 14B parameter counts chosen for different compute and deployment needs
  • training dataset size = 200M
    200M curated video clips selected as the training corpus
axioms (2)
  • domain assumption: Flow-based generative models can capture physical dynamics in video sufficiently well for robotics simulation
    Invoked by the choice of architecture for world simulation
  • domain assumption: Curated video data plus RL post-training yields controllable physical behavior
    Basis for claiming improved instruction alignment and reliability

pith-pipeline@v0.9.0 · 5990 in / 1421 out tokens · 44490 ms · 2026-05-12T22:56:44.832776+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation... Trained on 200M curated video clips and refined with reinforcement learning-based post-training... enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics

  • IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We further extend the family with Cosmos-Transfer2.5, a control-net style framework for Sim2Real and Real2Real world translation

  • IndisputableMonolith.Foundation.PhiForcing phi_equation · unclear

    Relation between the paper passage and the cited Recognition theorem.

    These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  2. Coding Agent Is Good As World Simulator

    cs.AI 2026-05 unverdicted novelty 7.0

    A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.

  3. CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.

  4. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  5. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  6. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  7. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  8. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  9. MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.

  10. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  11. Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

    cs.RO 2026-05 unverdicted novelty 6.0

    Pelican-Unified 1.0 trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...

  12. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

  13. Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

  14. Learning physically grounded traffic accident reconstruction from public accident reports

    cs.LG 2026-04 unverdicted novelty 6.0

    A multimodal learning model with a new dataset of 6,217 cases reconstructs lane-consistent pre-impact motion and collision interactions from public accident reports, outperforming baselines in accuracy and consistency.

  15. Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.

  16. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  17. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  18. From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

    cs.RO 2026-04 unverdicted novelty 6.0

    Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.

  19. ShapeGen: Robotic Data Generation for Category-Level Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ShapeGen generates shape-diverse 3D robotic manipulation demonstrations without simulators by curating a functional shape library and applying a minimal-annotation pipeline for novel, physically plausible data.

  20. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  21. SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

    cs.CV 2026-04 unverdicted novelty 6.0

    SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...

  22. Lifting Unlabeled Internet-level Data for 3D Scene Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.

  23. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  24. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  25. Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations

    cs.LG 2026-05 unverdicted novelty 5.0

    Di-BiLPS combines a variational autoencoder, latent diffusion, and contrastive learning to achieve state-of-the-art accuracy on PDE problems with as little as 3% observations while supporting zero-shot super-resolutio...

  26. Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

  27. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  28. SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization

    cs.CV 2026-04 unverdicted novelty 5.0

    SyncFix improves 3D reconstructions by synchronizing multi-view latent representations in a diffusion refinement process, generalizing from pair-wise training to arbitrary view counts at inference.

  29. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  30. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

  31. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  32. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 29 Pith papers · 27 internal anchors

  1. [1]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025. 35

  2. [2]

    Edify image: High-quality image generation with pixel space laplacian diffusion models.arXiv preprint arXiv:2411.07126, 2024

    Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, et al. Edify image: High-quality image generation with pixel space laplacian diffusion models.arXiv preprint arXiv:2411.07126, 2024. 8

  3. [3]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, 2025. 31

  4. [4]

    Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints

    Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, and Di Zhang. Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints. In ICLR, 2025. 31

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 4, 5, 6, 7, 32

  6. [6]

    Genie 3: A new frontier for world models, 2025

    Philip J Ball, J Bauer, F Belletti, et al. Genie 3: A new frontier for world models, 2025. 35

  7. [7]

    Videophy: Evaluating Physical Commonsense for Video Generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520, 2024. 36

  8. [8]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025. 36

  9. [9]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 6, 36

  10. [10]

    bloc97. NTK-aware scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/, 2023. Reddit post, r/LocalLLaMA. 9

  11. [11]

    Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025

    Florian Bordes, Quentin Garrido, Justine T Kao, Adina Williams, Michael Rabbat, and Emmanuel Dupoux. Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments. arXiv preprint arXiv:2506.09849, 2025. 36

  12. [12]

    Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In IROS, 2025. 6, 31

  13. [13]

    Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025

    Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025. 35

  14. [14]

    Video depth anything: Consistent depth estimation for super-long videos

    Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In CVPR, 2025. 19

  15. [15]

    On the importance of noise scheduling for diffusion models

    Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.

  16. [16]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin CM Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In RSS, 2023. 22

  17. [17]

    Delta Lake: Open-source storage framework that enables building lakehouses

    Databricks. Delta lake: Open-source storage framework that enables building lakehouses. https://delta.io/, 2019. Open-source project, Delta Lake. 6

  18. [18]

    Veo 3, May 2025

    Google DeepMind. Veo 3, May 2025. URL https://deepmind.google/technologies/veo/veo-3/. 35

  19. [19]

    Worldscore: A Unified Evaluation Benchmark for World Generation

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025. 36

  20. [20]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 8, 11

  21. [21]

    LLM-based Realistic Safety-Critical Driving Video Generation

    Yongjie Fu, Ruijian Zha, Pei Tian, and Xuan Di. LLM-based realistic safety-critical driving video generation. arXiv preprint arXiv:2507.01264, 2025. 36

  22. [22]

    Diffusion models and gaussian flow matching: Two sides of the same coin

    Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin Patrick Murphy, and Tim Salimans. Diffusion models and gaussian flow matching: Two sides of the same coin. In The Fourth Blogpost Track at ICLR 2025, 2025. 8

  23. [23]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025. 35

  24. [24]

    YOLOX: Exceeding YOLO Series in 2021

    Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021. 7

  25. [25]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 13

  26. [26]

    T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation.arXiv preprint arXiv:2505.00337, 2025

    Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation.arXiv preprint arXiv:2505.00337, 2025. 36

  27. [27]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. 35

  28. [28]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 35

  29. [29]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019. 35

  30. [30]

    Generalized neighborhood attention: Multi-dimensional sparse attention at the speed of light

    Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, et al. Generalized neighborhood attention: Multi-dimensional sparse attention at the speed of light. arXiv preprint arXiv:2504.16922, 2025. 14

  31. [31]

    Unirelight: Learning joint decomposition and synthesis for video relighting.arXiv preprint arXiv:2506.15673, 2025

    Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar, Alexander Keller, Sanja Fidler, Igor Gilitschenski, Zan Gojcic, and Zian Wang. Unirelight: Learning joint decomposition and synthesis for video relighting. arXiv preprint arXiv:2506.15673, 2025. 36

  32. [32]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023. 8

  33. [33]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025. 31

  34. [34]

    LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

    Wei-Chih Hung, Vincent Casser, Henrik Kretzschmar, Jyh-Jing Hwang, and Dragomir Anguelov. LET-3D-AP: Longitudinal error tolerant 3d average precision for camera-only 3d detection, 2024. URL https://arxiv.org/abs/2206.07705. 25

  35. [35]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  36. [36]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705, 2025. 3, 32, 36

  37. [37]

    RTMPose: Real-Time Multi-Person Pose Estimation Based on MMPose

    Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, and Kai Chen. Rtmpose: Real-time multi-person pose estimation based on mmpose. arXiv preprint arXiv:2303.07399, 2023. 7

  38. [38]

    Elucidating the design space of diffusion-based generative models.NeurIPS, 2022

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.NeurIPS, 2022. 8

  39. [39]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. 6

  40. [40]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 32, 35

  41. [41]

    Kling, 2024

    KuaiShou. Kling, 2024. URL https://klingai.com/. 35

  42. [42]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers.arXiv preprint arXiv:2203.17270, 2022

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers, 2022. URL https://arxiv.org/abs/2203.17270. 28

  43. [43]

    WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

    Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Herrmann, Gordon Wetzstein, and Jiajun Wu. Wonderplay: Dynamic 3d scene generation from a single image and actions. arXiv preprint arXiv:2505.18151, 2025.

  44. [44]

    Torchtitan: One-stop pytorch native solution for production ready LLM pretraining

    Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready LLM pretraining. In ICLR, 2025. 14

  45. [45]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025. 36

  46. [46]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 8

  47. [47]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025. 13

  48. [48]

    Dynamicscaler: Seamless and scalable video generation for panoramic scenes

    Jinxiu Liu, Shaoheng Lin, Yinxiao Li, and Ming-Hsuan Yang. Dynamicscaler: Seamless and scalable video generation for panoramic scenes. In CVPR, 2025. 35

  49. [49]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 22

  50. [50]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024. 14

  51. [51]

    Latr: 3d lane detection from monocular images with transformer, 2023

    Yueru Luo, Chaoda Zheng, Xu Yan, Tang Kun, Chao Zheng, Shuguang Cui, and Zhen Li. Latr: 3d lane detection from monocular images with transformer, 2023. URL https://arxiv.org/abs/2308.04583. 28

  52. [52]

    Hailuo, 2024

    MiniMax. Hailuo, 2024. URL https://hailuoai.com/video. 35

  53. [53]

    Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038, 2025

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038, 2025. 36

  54. [54]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024. 35

  55. [55]

    Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025

    NVIDIA. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 3, 9, 35

  56. [56]

    Cosmos-transfer1: Conditional world generation with adaptive multimodal control, 2025

    NVIDIA. Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025. 3, 18, 19, 28, 36

  57. [57]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. 3, 4, 8, 9, 31, 35, 36

  59. [59]

    Sora, 2024

    OpenAI. Sora, 2024. URL https://openai.com/sora/. 35

  60. [60]

    Training language models to follow instructions with human feedback.NeurIPS, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.NeurIPS, 2022. 13

  61. [61]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models.arXiv preprint arXiv:2309.00071, 2023. 9

  62. [62]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024. 35

  63. [63]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD, 2020. 14

  64. [64]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 19, 22

  65. [65]

    Diffusion Policy Policy Optimization

    Allen Z. Ren, Justin Lidard, Lars Lien Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. In ICLR, 2025.

  66. [66]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024. 22

  67. [67]

    Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models

    Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models. arXiv preprint arXiv:2506.09042, 2025. 3, 25, 28, 36

  68. [68]

    Gen3c: 3d-informed world-consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025. 36

  69. [69]

    Gen 3, 2024

    Runway. Gen 3, 2024. URL https://runwayml.com/research/introducing-gen-3-alpha. 35

  70. [70]

    Fitting conic sections to “very scattered” data: An iterative refinement of the bookstein algorithm

    Paul D Sampson. Fitting conic sections to “very scattered” data: An iterative refinement of the bookstein algorithm. Computer Graphics and Image Processing, 1982. ISSN 0146-664X. doi: https://doi.org/10.1016/0146-664X(82)90101-0. URL https://www.sciencedirect.com/science/article/pii/0146664X82901010. 25

  71. [71]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 13

  72. [72]

    Text-to-4D Dynamic Scene Generation

    Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023. 35

  73. [73]

    Light field networks: Neural scene representations with single-evaluation rendering.NeurIPS, 2021

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering.NeurIPS, 2021. 31

  74. [74]

    Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2, 2021

    Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2, 2021. 25

  75. [75]

    cuRobo: Parallelized collision-free minimum-jerk robot motion generation.arXiv preprint arXiv:2310.17274, 2023

    Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. cuRobo: Parallelized collision-free minimum-jerk robot motion generation.arXiv preprint arXiv:2310.17274, 2023. 21

  76. [76]

    1x technologies | safe humanoids for the home, 2025

    1X Technologies. 1x technologies | safe humanoids for the home, 2025. URL https://www.1x.tech/. 6

  77. [77]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Quan Vuong, Sergey Levine, Homer Rich Walke, Karl Pertsch, Anikait Singh, Ria Doshi, Charles Xu, Jianlan Luo, Liam Tan, Dhruv Shah, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL 2023, 2023.

  78. [78]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In CoRL, 2023. 6, 33

  79. [79]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 9, 32, 35

  80. [80]

    A comprehensive study of decoder-only llms for text-to-image generation

    Andrew Z Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, and Yogesh Balaji. A comprehensive study of decoder-only llms for text-to-image generation. InCVPR, 2025. 9

Showing first 80 references.