Cosmos World Foundation Model Platform for Physical AI

Canonical reference. 80% of citing Pith papers cite this work as background.

247 Pith papers citing it
Background 80% of classified citations
abstract

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.

citation-role summary

background 39 · method 6 · baseline 3 · dataset 1
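
The role counts above appear consistent with the 80% background figure quoted near the top of the page. The following is a minimal sketch of that arithmetic, assuming the percentage is the background count divided by the sum of the four listed role counts. The page does not state how the figure is computed, and `role_counts` and `role_shares` are illustrative names, not part of any Pith tooling.

```python
# Role counts copied from the citation-role summary on this page.
role_counts = {"background": 39, "method": 6, "baseline": 3, "dataset": 1}


def role_shares(counts):
    """Return each role's share of all classified citations, in percent."""
    total = sum(counts.values())
    return {role: 100.0 * n / total for role, n in counts.items()}


# Assumption: "classified citations" means the sum of the four role counts.
classified_total = sum(role_counts.values())
shares = role_shares(role_counts)

for role, pct in shares.items():
    print(f"{role}: {pct:.1f}%")
```

Under this assumption there are 49 classified citations, and the background share is 39/49, which rounds to 80%.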

representative citing papers

Imperfect World Models are Exploitable

cs.AI · 2026-05-15 · unverdicted · novelty 8.0

A formal theory proves model exploitation is essentially unavoidable on large policy sets in RL, generalizes reward hacking results, and derives a safe horizon for a relaxed version of exploitation.

Stealthy World Model Manipulation via Data Poisoning

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning transitions.

M*: A Modular, Extensible, Serving System for Multimodal Models

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

M* introduces the Walk Graph abstraction to serve arbitrary compositions of multimodal model components and reports latency and throughput gains over vLLM-Omni and other baselines on text-to-image, text-to-speech, and robotic planning workloads.

Targeting World Models to Compromise Robot Learning Pipelines

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.

Benchmarking Single-Factor Physical Video-to-Audio Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA

Probing into Camera Control of Video Models

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

cs.CV · 2026-05-14 · conditional · novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.

citing papers explorer

Showing 16 of 16 citing papers after filters.

  • Imperfect World Models are Exploitable cs.AI · 2026-05-15 · unverdicted · none · ref 10 · internal anchor

    A formal theory proves model exploitation is essentially unavoidable on large policy sets in RL, generalizes reward hacking results, and derives a safe horizon for a relaxed version of exploitation.

  • SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning cs.AI · 2026-05-10 · accept · none · ref 56 · 2 links · internal anchor

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  • MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models cs.AI · 2026-05-28 · unverdicted · none · ref 29 · internal anchor

    MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi

  • EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting cs.AI · 2026-06-25 · unverdicted · none · ref 1 · internal anchor

    EO-WM is a diffusion transformer that adds physically separated baseline-anomaly and cumulative-stress conditioning to probabilistic EO forecasting and validates it on two new weather-response benchmarks, reporting 5.63% and 7.80% relative gains on NDVI decline metrics.

  • BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs cs.AI · 2026-05-29 · unverdicted · none · ref 1 · internal anchor

    BilliardPhys-Bench is a new procedural benchmark that evaluates multimodal LLMs on ball-to-ball collisions, wall bounces, and final positions in simulated billiards, revealing performance drops with time and complexity plus a 'stasis bias' toward predicting no interaction.

  • Enhancing Table Reasoning with Deterministic Table-State Rewards cs.AI · 2026-01-30 · unverdicted · none · ref 11 · internal anchor

    RE-TAB uses a deterministic LCS-based table-state reward for stepwise guidance and test-time scaling, raising LLM table-reasoning accuracy by 26.7 pp on average across six backbones and three benchmarks.

  • Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning cs.AI · 2026-01-22 · conditional · none · ref 26 · internal anchor

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 1 · internal anchor

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.

  • Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning cs.AI · 2025-03-18 · conditional · none · ref 38 · internal anchor

    Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.

  • Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations cs.AI · 2026-05-25 · unverdicted · none · ref 1 · internal anchor

    TC-WM converts foundation-model visual embeddings into parsimonious task-sufficient world model latents via linear projection, contrastive physical-state alignment, and embedding reconstruction, with a theoretical identification guarantee.

  • What Drives Success in Physical Planning with Joint-Embedding Predictive World Models? cs.AI · 2025-12-30 · unverdicted · none · ref 1 · internal anchor

    An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.

  • Business World Model cs.AI · 2026-06-08 · unverdicted · none · ref 18 · internal anchor

    This paper introduces the Business World Model, a conceptual architecture that encodes business states, dynamics, and actions using semantic representations to support autonomous planning.

  • WorldString: Actionable World Representation cs.AI · 2026-05-18 · unverdicted · none · ref 1 · 2 links · internal anchor

    Proposes WorldString, a differentiable neural model for the state manifold of actionable physical objects learned directly from 3D or video data as a building block for world models.

  • Coding Agent Is Good As World Simulator cs.AI · 2026-05-14 · unverdicted · none · ref 7 · 2 links · internal anchor

    An agentic framework generates executable physics simulation code from text prompts via coordinated planning, coding, visual, and physics agents that iterate to satisfy both prompt fidelity and physical constraints.

  • Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond cs.AI · 2026-04-24 · conditional · none · ref 2 · 2 links · internal anchor

    A survey proposing a three-level capability taxonomy (L1 Predictor, L2 Simulator, L3 Evolver) for world models across physical, digital, social, and scientific domains.

  • A Tutorial on World Models and Physical AI cs.AI · 2026-06-11 · unverdicted · none · ref 2 · internal anchor

    A tutorial that unifies explicit and implicit world models through shared predictive structure for applications in physical AI such as robotics.