Cosmos World Foundation Model Platform for Physical AI
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.
representative citing papers
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.
SENSE-VAD introduces the first synthetic benchmark dataset with per-frame labels for socially complex anomalies in autonomous driving scenes and shows existing video anomaly detectors fail on them.
A method to decompose 3D Gaussian splats into independent albedo and shading components for consistent texture editing in radiance fields.
JudgeFit produces per-VLM physical video evaluation taxonomies that improve held-out accuracy by a mean 32% relative to a single global schema across 16 models from eight families.
SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning transitions.
ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.
M* introduces the Walk Graph abstraction to serve arbitrary compositions of multimodal model components and reports latency and throughput gains over vLLM-Omni and other baselines on text-to-image, text-to-speech, and robotic planning workloads.
Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.
SVI-Bench provides 35K hours of sports video with 9 tasks across four cognitive levels, revealing models drop from ~74% on action QA to 5% on agentic evidence integration.
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.
MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.
citing papers explorer
- Imperfect World Models are Exploitable
A formal theory proves model exploitation is essentially unavoidable on large policy sets in RL, generalizes reward hacking results, and derives a safe horizon for a relaxed version of exploitation.
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
- MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi
- EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting
EO-WM is a diffusion transformer that adds physically separated baseline-anomaly and cumulative-stress conditioning to probabilistic EO forecasting and validates it on two new weather-response benchmarks, reporting 5.63% and 7.80% relative gains on NDVI decline metrics.
- BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs
BilliardPhys-Bench is a new procedural benchmark that evaluates multimodal LLMs on ball-to-ball collisions, wall bounces, and final positions in simulated billiards, revealing performance drops with time and complexity plus a 'stasis bias' toward predicting no interaction.
- Enhancing Table Reasoning with Deterministic Table-State Rewards
RE-TAB uses a deterministic LCS-based table-state reward for stepwise guidance and test-time scaling, raising LLM table-reasoning accuracy by 26.7 pp on average across six backbones and three benchmarks.
- Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.
- Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations
TC-WM converts foundation-model visual embeddings into parsimonious task-sufficient world model latents via linear projection, contrastive physical-state alignment, and embedding reconstruction, with a theoretical identification guarantee.
- What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.
- Business World Model
This paper introduces the Business World Model, a conceptual architecture that encodes business states, dynamics, and actions using semantic representations to support autonomous planning.
- WorldString: Actionable World Representation
Proposes WorldString, a differentiable neural model for the state manifold of actionable physical objects learned directly from 3D or video data as a building block for world models.
- Coding Agent Is Good As World Simulator
An agentic framework generates executable physics simulation code from text prompts via coordinated planning, coding, visual, and physics agents that iterate to satisfy both prompt fidelity and physical constraints.
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
A survey proposing a three-level capability taxonomy (L1 Predictor, L2 Simulator, L3 Evolver) for world models across physical, digital, social, and scientific domains.
- A Tutorial on World Models and Physical AI
A tutorial that unifies explicit and implicit world models through shared predictive structure for applications in physical AI such as robotics.