super hub Canonical reference

Cosmos World Foundation Model Platform for Physical AI

Arslan Ali, Erik Barker, Maciej Bala, NVIDIA: Niket Agarwal, Tiffany Cai, Yogesh Balaji · 2025 · cs.CV · arXiv 2501.03575

Canonical reference. 79% of citing Pith papers cite this work as background.

172 Pith papers citing it

Background 79% of classified citations

open full Pith review browse 172 citing papers more from Arslan Ali arXiv PDF

abstract

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 39 method 6 baseline 3

citation-polarity summary

background 38 use method 6 baseline 3 unclear 1

claims ledger

abstract Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models,

authors

Arslan Ali Erik Barker Maciej Bala NVIDIA: Niket Agarwal Tiffany Cai Yogesh Balaji

co-cited works

representative citing papers

Imperfect World Models are Exploitable

cs.AI · 2026-05-15 · unverdicted · novelty 8.0

A formal theory proves model exploitation is essentially unavoidable on large policy sets in RL, generalizes reward hacking results, and derives a safe horizon for a relaxed version of exploitation.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

Do generative video models understand physical principles?

cs.CV · 2025-01-14 · unverdicted · novelty 8.0

Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.

DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

DVG-WM disentangles dynamics learning and visual synthesis in video world models using flow matching and latent degradation to achieve faster inference up to 3.97 times with improved quality on LIBERO and real-world robotic platforms.

SENSE-VAD: Sentient and Semantic Video Anomaly Detection for Autonomous Driving

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

SENSE-VAD introduces the first synthetic benchmark dataset with per-frame labels for socially complex anomalies in autonomous driving scenes and shows existing video anomaly detectors fail on them.

Intrinsic decomposition and editing of 3D Gaussian splats

cs.GR · 2026-06-30 · unverdicted · novelty 7.0

A method to decompose 3D Gaussian splats into independent albedo and shading components for consistent texture editing in radiance fields.

Targeting World Models to Compromise Robot Learning Pipelines

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

Benchmarking Single-Factor Physical Video-to-Audio Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

cs.SD · 2026-05-15 · unverdicted · novelty 7.0

BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.

Probing into Camera Control of Video Models

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

cs.CV · 2026-05-14 · conditional · novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.

GenAI Powered Dynamic Causal Inference with Unstructured Data

stat.ME · 2026-05-08 · unverdicted · novelty 7.0

A GenAI-based method extracts representations from unstructured data and uses a neural network to fit marginal structural models that recover causal effects of treatment feature sequences including their positions.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

Do Joint Audio-Video Generation Models Understand Physics?

cs.SD · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

AV-Phys Bench shows that current joint audio-video models lack robust physical commonsense, with major drops on transitions and deliberate anti-physics prompts.

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

eess.IV · 2026-05-07 · unverdicted · novelty 7.0

LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constrained sensors across modalities.

EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

cs.RO · 2026-05-02 · unverdicted · novelty 7.0

Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

citing papers explorer

Showing 6 of 6 citing papers after filters.

MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 16 · internal anchor
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
Lyra 2.0: Explorable Generative 3D Worlds cs.CV · 2026-04-14 · unverdicted · none · ref 13 · internal anchor
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 1 · internal anchor
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models cs.CV · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
Qwen-Image-2.0 Technical Report cs.CV · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond cs.AI · 2026-04-24 · unreviewed · ref 2 · internal anchor

Cosmos World Foundation Model Platform for Physical AI

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer