pith. machine review for the scientific record.

arxiv: 2501.03575 · v3 · submitted 2025-01-07 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Recognition: 2 theorem links · Lean Theorem

Cosmos World Foundation Model Platform for Physical AI

NVIDIA: Alice Luo, Anqi Li, Arsalan Mousavian, Arslan Ali, Artur Zolkowski, Bartosz Stefaniak, Chen-Hsuan Lin, Daniel Dworakowski, David Page, Despoina Paschalidou, Dieter Fox, Ed Schmerling, Erik Barker, Ethan He, Fangyin Wei, Fitsum Reda, Francesco Ferroni, Gergely Klár, Grace Lam, Hanzi Mao, Hao Wang, Haoxiang Wang, Heng Wang, Huan Ling, Jacob Huffman, Jay Zhangjie Wu, Jiahui Huang, Jiaojiao Fan, Jiashu Xu, Jibin Varghese, Jingyi Jin, Jing Zhang, Jinwei Gu, Kaichun Mo, Laura Leal-Taixe, Lindsey Pavao, Lin Yen-Chen, Lyne Tchapmi, Maciej Bala, Michele Fenzi, Ming-Yu Liu, Morteza Ramezanali, Niket Agarwal, Pooya Jannaty, Prithvijit Chattopadhyay, Przemek Tredak, Qianli Ma, Qingqing Zhao, Qinsheng Zhang, Sanja Fidler, Seungjun Nah, Seung Wook Kim, Shitao Tang, Shiyi Lan, Siddharth Gururani, Songwei Ge, Sriharsha Niverty, Stella Shi, Tiffany Cai, Ting-Chun Wang, Tsung-Yi Lin, Vasanth Rao Naik Sabavat, Wei-Cheng Tseng, Wei Yang, Xian Liu, Xiaohui Zeng, Xiaowei Ren, Xinyue Wei, Yifan Ding, Yin Cui, Yogesh Balaji, Yongxin Chen, Yunhao Ge, Yuxuan Zhang, Yu Zeng, Zeeshan Patel, Zhaoshuo Li

Pith reviewed 2026-05-10 23:34 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords world foundation model · physical AI · fine-tuning · world model · video tokenizer · digital twin · video curation · open-source platform

The pith

A platform supplies pre-trained world foundation models that developers can fine-tune for specific physical AI applications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a platform to help create world models for training physical AI systems digitally before real-world use. It positions a world foundation model as a general-purpose starting point that can be adapted through fine-tuning to fit particular downstream tasks. The platform includes a video curation pipeline, pre-trained models, post-training examples, and video tokenizers to support this customization process. Releasing the models open-weight with permissive licenses is meant to let developers address important societal problems by building digital twins of the world.
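The pretrain-then-adapt flow described above can be sketched in miniature. This is an invented, stdlib-only illustration of the idea that post-training specializes a general-purpose model rather than training from scratch; every name here (`WorldFoundationModel`, `post_train`) is hypothetical and does not reflect the actual Cosmos API.

```python
# Illustrative sketch only: a pre-trained, general-purpose world model is
# specialized for a downstream task via post-training. All class and method
# names are hypothetical, not the real Cosmos interfaces.

from dataclasses import dataclass, field

@dataclass
class WorldFoundationModel:
    """Stand-in for a pre-trained, general-purpose world model."""
    name: str
    domains: set = field(default_factory=lambda: {"general video"})

    def post_train(self, task: str, clips: list) -> "WorldFoundationModel":
        # Post-training starts from the pre-trained weights and adds
        # task-specific capability; `clips` stands in for curated task data.
        return WorldFoundationModel(
            name=f"{self.name}+{task}",
            domains=self.domains | {task},
        )

base = WorldFoundationModel("cosmos-wfm")   # downloaded pre-trained checkpoint
robot_wm = base.post_train("robot-manipulation", clips=["demo_001.mp4"])

print(robot_wm.name)                         # cosmos-wfm+robot-manipulation
print("general video" in robot_wm.domains)   # True: general capability retained
```

The point of the sketch is the asymmetry of effort: the expensive general model is built once, and each downstream user pays only the (much smaller) post-training cost.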

Core claim

Physical AI needs to be trained digitally first, with a digital twin of the policy model and a digital twin of the world serving as the world model. The paper presents the platform to help developers build customized world models for their physical AI setups: it positions a world foundation model as a general-purpose world model that can be fine-tuned into customized versions for downstream applications, with components covering a video curation pipeline, pre-trained models, post-training examples, and video tokenizers.

What carries the argument

The world foundation model, positioned as a general-purpose world model that supports fine-tuning into customized models for specific applications.

If this is right

  • Developers can build customized world models for their physical AI setups by starting from the pre-trained foundation models.
  • The video curation pipeline and tokenizers reduce the effort required to prepare data for model adaptation.
  • Post-training examples show how to adapt the general models to specific tasks with limited additional work.
  • Open-weight availability with permissive licenses allows wider use in creating digital twins for physical AI.
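The tokenizer bullet above deserves a concrete picture. The toy code below shows what a video tokenizer does in such a pipeline in the simplest possible terms: it compresses raw pixel values into a short sequence of discrete codes that a world model can be trained on. The codebook and the per-pixel nearest-neighbor scheme are invented for illustration; Cosmos's actual tokenizers are learned neural networks operating on spatiotemporal patches.

```python
# Toy illustration of discrete tokenization: map raw values to codebook
# indices (lossy compression), then reconstruct approximately. The codebook
# and "frame" here are invented, not Cosmos's learned tokenizers.

def tokenize_frame(frame, codebook):
    """Map each pixel value to the index of its nearest codebook entry."""
    tokens = []
    for px in frame:
        nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - px))
        tokens.append(nearest)
    return tokens

def detokenize(tokens, codebook):
    """Reconstruct an approximate frame from discrete codes."""
    return [codebook[t] for t in tokens]

codebook = [0, 64, 128, 192, 255]   # 5 discrete "visual codes"
frame = [10, 70, 200, 250]          # a 4-pixel toy frame

codes = tokenize_frame(frame, codebook)
print(codes)                         # [0, 1, 3, 4]
print(detokenize(codes, codebook))   # [0, 64, 192, 255] -- lossy but compact
```

The reconstruction is imperfect by design; the trade-off a real tokenizer makes is between code compactness (which makes world-model training cheaper) and reconstruction fidelity.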

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could shorten the development cycle for training physical AI policies by providing ready simulation bases instead of requiring full retraining.
  • It may connect to challenges in scalable simulation where accurate long-horizon world predictions determine policy safety.
  • A testable extension would be measuring how well fine-tuned models handle multi-agent interactions or rare events in physical scenarios.

Load-bearing premise

That the pre-trained models and post-training examples will transfer effectively to diverse physical AI tasks with only modest additional effort.

What would settle it

A demonstration that fine-tuned models from the platform fail to predict physical interactions accurately in new environments outside the provided examples would disprove the central positioning.

read the original abstract

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces the Cosmos World Foundation Model Platform for Physical AI applications. It describes a video curation pipeline, pre-trained world foundation models, examples of post-training, and video tokenizers. The authors position the world foundation model as a general-purpose model that can be fine-tuned into customized world models for downstream tasks and release the models as open-weight with permissive licenses via GitHub.

Significance. If the pre-trained models and post-training pipeline transfer effectively as claimed, the open release could accelerate Physical AI development by providing accessible tools for building digital twins of the world. Explicit credit is given for the open-source code and open-weight models under permissive licenses, which lowers barriers for the community. However, the significance remains prospective without demonstrated performance.

major comments (1)
  1. [Abstract] The positioning statement that the world foundation model 'can be fine-tuned into customized world models for downstream applications' is load-bearing for the paper's contribution but is unsupported by any quantitative benchmarks, ablation studies, error analysis, or transfer results on Physical AI tasks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment point by point below and have prepared revisions to strengthen the clarity of our positioning.

read point-by-point responses
  1. Referee: [Abstract] The positioning statement that the world foundation model 'can be fine-tuned into customized world models for downstream applications' is load-bearing for the paper's contribution but is unsupported by any quantitative benchmarks, ablation studies, error analysis, or transfer results on Physical AI tasks.

    Authors: We agree that the abstract's positioning statement is forward-looking and would benefit from greater precision. The manuscript's primary contribution is the open platform itself, encompassing the video curation pipeline, pre-trained world foundation models, illustrative post-training examples, and video tokenizers. These elements are designed to enable developers to build and fine-tune customized world models. The post-training examples demonstrate the adaptation process in practice, but we acknowledge the absence of comprehensive quantitative benchmarks, ablations, or error analyses on specific Physical AI downstream tasks. In the revised version, we will update the abstract to state that the platform supplies the foundation and tools for such fine-tuning, with examples provided to illustrate the workflow, while clarifying that rigorous transfer performance evaluations on end-user Physical AI tasks are prospective and left to downstream applications. We will also add a dedicated limitations subsection discussing the current scope and the need for task-specific validation by users. revision: yes

Circularity Check

0 steps flagged

No significant circularity; platform announcement with no derivations

full rationale

The document is a platform announcement and positioning statement for the Cosmos World Foundation Model Platform. It describes components (video curation pipeline, pre-trained models, post-training examples, video tokenizers) and states that a general-purpose world foundation model can be fine-tuned for downstream Physical AI tasks. No mathematical derivations, equations, predictions, fitted parameters, or first-principles results are present. The central claim is definitional positioning rather than a derived result, with no self-referential reductions, self-citations as load-bearing premises, or renamings of known results. The transfer performance to tasks is left as an empirical question for users. This is self-contained with no internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work relies on standard practices in large-scale video model training and fine-tuning.

pith-pipeline@v0.9.0 · 5779 in / 984 out tokens · 38737 ms · 2026-05-10T23:34:02.306215+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  3. GenAI Powered Dynamic Causal Inference with Unstructured Data

    stat.ME 2026-05 unverdicted novelty 7.0

    A GenAI-based method extracts representations from unstructured data and uses a neural network to fit marginal structural models that recover causal effects of treatment feature sequences including their positions.

  4. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  5. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD 2026-05 unverdicted novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  6. LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

    eess.IV 2026-05 unverdicted novelty 7.0

    LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constr...

  7. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  8. Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

    cs.RO 2026-05 unverdicted novelty 7.0

    Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.

  9. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  10. OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.

  11. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  12. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  13. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  14. EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

  15. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  16. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  17. SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.

  18. A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

    cs.CV 2026-04 conditional novelty 7.0

    Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.

  19. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  20. Qwen-Image-VAE-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 6.0

    Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.

  21. VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

    cs.SE 2026-05 unverdicted novelty 6.0

    VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...

  22. SceneFactory: GPU-Accelerated Multi-Agent Driving Simulation with Physics-Based Vehicle Dynamics

    cs.MA 2026-05 accept novelty 6.0

    SceneFactory delivers a batched GPU platform for physics-based multi-agent autonomous driving simulation that achieves 127x higher throughput than non-vectorized PhysX while supporting articulated dynamics and road-co...

  23. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  24. Earth-o1: A Grid-free Observation-native Atmospheric World Model

    cs.CV 2026-05 unverdicted novelty 6.0

    Earth-o1 learns continuous atmospheric dynamics from ungridded observations and matches operational IFS forecast skill in hindcasts.

  25. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  26. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  27. Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics

    cs.LG 2026-04 unverdicted novelty 6.0

    Tempered sequential Monte Carlo samples from a Boltzmann-tilted distribution over controllers to optimize trajectories and policies under differentiable dynamics.

  28. Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics

    cs.LG 2026-04 unverdicted novelty 6.0

    Tempered sequential Monte Carlo samples efficiently from a temperature-annealed distribution over controller parameters to solve trajectory and policy optimization under differentiable dynamics.

  29. EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.

  30. Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception

    cs.CV 2026-04 unverdicted novelty 6.0

    Infrastructure-centric world models use roadside sensors' temporal depth to complement vehicle spatial breadth for better traffic simulation and prediction.

  31. MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.

  32. Active World-Model with 4D-informed Retrieval for Exploration and Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    AW4RE is a generative world model that estimates action-conditioned observations via 4D-informed evidence retrieval, geometric support, and conditional completion to enable better exploration under partial observability.

  33. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  34. Lyra 2.0: Explorable Generative 3D Worlds

    cs.CV 2026-04 unverdicted novelty 6.0

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  35. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  36. WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.

  37. Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

    cs.CV 2026-04 unverdicted novelty 6.0

    A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.

  38. Latent-Compressed Variational Autoencoder for Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.

  39. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  40. ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

    cs.CV 2026-04 unverdicted novelty 6.0

    ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.

  41. DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    DiffHDR converts LDR videos to HDR by formulating the task as generative radiance inpainting in a video diffusion model's latent space, using Log-Gamma encoding and synthesized training data to achieve better fidelity...

  42. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  43. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  44. SkyReels-V2: Infinite-length Film Generative Model

    cs.CV 2025-04 unverdicted novelty 6.0

    SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.

  45. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  46. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  47. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  48. Latte: Latent Diffusion Transformer for Video Generation

    cs.CV 2024-01 unverdicted novelty 6.0

    Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-pract...

  49. Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

    cs.CV 2026-05 unverdicted novelty 5.0

    Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.

  50. LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

    cs.CV 2026-05 conditional novelty 5.0

    The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos f...

  51. What Matters in Practical Learned Image Compression

    cs.CV 2026-05 unverdicted novelty 5.0

    A practical learned image codec delivers 2.3-3x bitrate savings over AV1/VVC and 20-40% over prior learned codecs while encoding 12MP images in 230ms on iPhone.

  52. Lifting Embodied World Models for Planning and Control

    cs.CV 2026-04 unverdicted novelty 5.0

    Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-eff...

  53. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 5.0

    The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

  54. Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

    cs.RO 2026-04 unverdicted novelty 5.0

    Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...

  55. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  56. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  57. PAT-VCM: Plug-and-Play Auxiliary Tokens for Video Coding for Machines

    cs.CV 2026-04 unverdicted novelty 5.0

    PAT-VCM adds lightweight auxiliary tokens to a shared baseline video stream to support multiple downstream machine tasks without task-specific codecs.

  58. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

  59. Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

    cs.CV 2026-04 unverdicted novelty 5.0

    Phantom generates visually realistic and physically consistent videos by jointly modeling visual content and latent physical dynamics via an abstract physics-aware representation.

  60. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

Reference graph

Works this paper leans on

253 extracted references · 200 canonical work pages · cited by 65 Pith papers · 39 internal anchors

  1. [1]

    Abbas, K

    Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication.arXiv preprint arXiv:2303.09540, 2023. 10

  2. [2]

    Nemotron-4 340b technical report

    Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, et al. Nemotron-4 340b technical report.arXiv preprint arXiv:2406.11704, 2024. 29

  3. [3]

    Pixtral 12b.arXiv preprint arXiv:2410.07073,

    Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024. 26

  4. [4]

    Bbc planet earth dataset, 2016

    AI Image Lab, University of Modena. Bbc planet earth dataset, 2016. URLhttps://aimagelab.ing. unimore.it/imagelab/researchActivity.asp?idActivity=19. Accessed: 2024-10-17. 7

  5. [5]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. InNeurIPS, 2024. 55, 56

  6. [6]

    Latent-Shift: Latent diffusion with temporal shift for efficient text-to-video generation.arXiv preprint arXiv:2304.08477,

    Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, and Xi Yin. Latent-shift: La- tent diffusion with temporal shift for efficient text-to-video generation.arXiv preprint arXiv:2304.08477,

  7. [7]

    Edify image: High-quality image generation with pixel space laplacian diffusion models.arXiv preprint arXiv:2411.07126, 2024

    Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, et al. Edify image: High-quality image generation with pixel space laplacian diffusion models.arXiv preprint arXiv:2411.07126, 2024. 22

  8. [8]

    eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022. 23

  9. [9]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. arXiv preprint arXiv:2412.03572, 2024. 55

  10. [10]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2023. 57

  11. [11]

    Zero-shot robotic manipulation with pre-trained image-editing diffusion models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. In NeurIPS Workshops, 2023. 57

  12. [12]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 56, 57

  13. [13]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.

  14. [14]

    11, 36, 37, 46, 49, 56, 57

  15. [15]

    Video generation models as world simulators, 2024

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators. 56, 57

  16. [16]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 27, 29

  17. [17]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In ICML, 2024. 55, 56

  18. [18]

    Medusa: Simple LLM inference acceleration framework with multiple decoding heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In ICML, 2024. 30, 31

  19. [19]

    Pyscenedetect, 2024

    Brandon Castellano. Pyscenedetect, 2024. URL https://www.scenedetect.com. Video Cut Detection and Analysis Tool. 7

  20. [20]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. In CVPR, 2022. 57

  21. [21]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In CVPR, 2024. 56

  22. [22]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 57

  23. [23]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016. 23

  24. [24]

    On the importance of noise scheduling for diffusion models

    Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.

  25. [25]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In CVPR, 2024. 7, 16

  26. [26]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. RSS, 2023. 57

  27. [27]

    Emu: Enhancing image generation models using photogenic needles in a haystack

    Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023. 22, 57

  28. [28]

    The z-loss: a shift and scale invariant classification loss belonging to the spherical family

    Alexandre de Brébisson and Pascal Vincent. The z-loss: a shift and scale invariant classification loss belonging to the spherical family. arXiv preprint arXiv:1604.08859, 2016. 28

  29. [29]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In ICML, 2023. 21

  30. [30]

    Autoregressive Video Generation Without Vector Quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169, 2024. 56

  31. [31]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 12, 16

  32. [32]

    Retinaface: Single-stage dense face localisation in the wild

    Jiankang Deng, Jia Guo, Yuxiang Zhou, Jinke Yu, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-stage dense face localisation in the wild. CVPR, 2020. 55

  33. [33]

    Superpoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR Workshops, 2018. 36

  34. [34]

    Diffusion world model: Future modeling beyond step-by-step rollout for offline reinforcement learning

    Zihan Ding, Amy Zhang, Yuandong Tian, and Qinqing Zheng. Diffusion world model: Future modeling beyond step-by-step rollout for offline reinforcement learning. arXiv preprint arXiv:2402.03570, 2024. 55, 56

  35. [35]

    An image is worth 16x16 words: transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, 2021. 56

  36. [36]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In NeurIPS, 2024. 57

  37. [37]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 24, 27, 29, 30

  38. [38]

    Bridge data: Boosting generalization of robotic skills with cross-domain datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In RSS, 2022. 43

  39. [39]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021. 57

  40. [40]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 21

  41. [41]

    Two-frame motion estimation based on polynomial expansion

    Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis, 2003. 9

  42. [42]

    Deep visual foresight for planning robot motion

    Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017. 57

  43. [43]

    FLUX.1: Image generation, 2024

    FLUX. FLUX.1: Image generation, 2024. URL https://huggingface.co/black-forest-labs/FLUX.1-dev. 12, 57

  44. [44]

    Dreamsim: Learning new dimensions of human visual similarity using synthetic data

    Stephanie Fu, Netanel Yakir Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In NeurIPS, 2023. 39

  45. [45]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In NeurIPS, 2024. 10

  46. [46]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In ECCV, 2022. 57

  47. [47]

    A new algorithm for data compression

    Philip Gage. A new algorithm for data compression. The C Users Journal, 1994. 28

  48. [48]

    Diffusion meets flow matching: Two sides of the same coin

    Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin P. Murphy, and Tim Salimans. Diffusion meets flow matching: Two sides of the same coin, 2024. URL https://diffusionflow.github.io/. 19

  49. [49]

    Magicdrivedit: High-resolution long video generation for autonomous driving with adaptive control

    Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrivedit: High-resolution long video generation for autonomous driving with adaptive control. arXiv preprint arXiv:2411.13807, 2024. 57

  50. [50]

    Magicdrive: Street view generation with diverse 3d geometry control

    Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. In ICLR, 2024. 57

  51. [51]

    Vista: A generalizable driving world model with high fidelity and versatile controllability

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. In NeurIPS, 2024. 57

  52. [52]

    Image style transfer using convolutional neural networks

    Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016. 15

  53. [53]

    Long video generation with time-agnostic vqgan and time-sensitive transformer

    Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. In ECCV, 2022. 57

  54. [54]

    Preserve your own correlation: A noise prior for video diffusion models

    Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In ICCV, 2023. 56, 57

  55. [55]

    Visual fact checker: Enabling high-fidelity detailed caption generation

    Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, and Yin Cui. Visual fact checker: Enabling high-fidelity detailed caption generation. In CVPR, 2024. 10

  56. [56]

    AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts

    Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024. 53, 54

  57. [57]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In CVPR, 2023. 8

  58. [58]

    Emu video: Factorizing text-to-video generation by explicit image conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. In ECCV, 2024. 56, 57

  59. [59]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In CVPR, 2024. 16

  60. [60]

    Photorealistic video generation with diffusion models

    Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. InECCV, 2024. 21

  61. [61]

    Pre-trained text-to-image diffusion models are versatile representation learners for control

    Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, and Tim GJ Rudner. Pre-trained text-to-image diffusion models are versatile representation learners for control. In ICLR Workshops, 2024. 57

  62. [62]

    SPACE: Speech-driven Portrait Animation with Controllable Expression

    Siddharth Gururani, Arun Mallya, Ting-Chun Wang, Rafael Valle, and Ming-Yu Liu. SPACE: Speech-driven Portrait Animation with Controllable Expression. In ICCV, 2023. 56

  63. [63]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018. 55

  64. [64]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019. 55

  65. [65]

    Mastering atari with discrete world models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In ICLR, 2021. 55, 56

  66. [66]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. 55

  67. [67]

    Td-mpc2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In ICLR, 2024. 55

  68. [68]

    Multiple view geometry in computer vision

    Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003. 36, 51

  69. [69]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024. 57

  70. [70]

    Learning an actionable discrete diffusion policy via large-scale actionless video pre-training

    Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, and Xuelong Li. Learning an actionable discrete diffusion policy via large-scale actionless video pre-training. In NeurIPS, 2024. 57

  71. [71]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. 57

  72. [72]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017. 17, 41, 50

  73. [73]

    The "wake-sleep" algorithm for unsupervised neural networks

    Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 1995. 57

  74. [74]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  75. [75]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 19, 57

  76. [76]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In NeurIPS, 2022. 56

  77. [77]

    Cogvideo: Large-scale pretraining for text-to-video generation via transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In ICLR, 2023. 57

  78. [78]

    Simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023. 22

  79. [79]

    Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion

    Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. arXiv preprint arXiv:2410.19324, 2024.

  80. [80]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. 49, 55, 56, 57

Showing first 80 references.