Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
30 Pith papers cite this work. Polarity classification is still indexing.
abstract
Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/
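To make the action-and-value-as-latent-frames idea concrete, below is a minimal PyTorch sketch of one way it could look: action chunks and scalar values are projected into the same shape as video latent frames, denoised jointly with the future-frame latents, and candidate action trajectories are ranked by their decoded value at test time. Every class name, shape, and the denoiser interface here is an illustrative assumption, not the released Cosmos Policy code.

```python
# Sketch only: actions and values packed as extra "latent frames" next to the
# video latents, denoised jointly, and candidates ranked by decoded value.
# All module names, shapes, and the denoiser interface are assumptions.
import torch
import torch.nn as nn


class LatentFrameCodec(nn.Module):
    """Maps low-dimensional actions/values into the video latent frame shape and back."""

    def __init__(self, action_dim: int, latent_shape=(16, 8, 8)):
        super().__init__()
        self.latent_shape = latent_shape
        flat = int(torch.tensor(latent_shape).prod())
        self.encode = nn.Linear(action_dim, flat)   # action chunk -> pseudo latent frame
        self.decode = nn.Linear(flat, action_dim)   # pseudo latent frame -> action chunk

    def to_latent(self, x):                         # x: (B, action_dim)
        return self.encode(x).view(x.shape[0], *self.latent_shape)

    def from_latent(self, z):                       # z: (B, *latent_shape)
        return self.decode(z.flatten(1))


@torch.no_grad()
def plan_with_values(denoiser, obs_latents, codec, value_codec,
                     num_candidates=8, num_steps=10, device="cpu"):
    """Sample several joint (future-frame, action, value) latents and keep the
    candidate whose decoded value (expected return) is highest."""
    B = num_candidates
    # Assumed sequence layout: [observed frames | 2 future frames | action frame | value frame]
    obs = obs_latents.expand(B, *obs_latents.shape[1:])           # conditioning frames
    gen = torch.randn(B, 3, *codec.latent_shape, device=device)   # 2 future + 1 action frame
    val = torch.randn(B, 1, *codec.latent_shape, device=device)   # value frame

    for step in range(num_steps):
        t = torch.full((B,), 1.0 - step / num_steps, device=device)
        x = torch.cat([obs, gen, val], dim=1)
        eps = denoiser(x, t)                        # predicts noise for all latent frames
        # Simple Euler-style update on the generated frames only (sketch, not exact DDIM/flow)
        gen = gen - eps[:, obs.shape[1]:obs.shape[1] + 3] / num_steps
        val = val - eps[:, -1:] / num_steps

    actions = codec.from_latent(gen[:, -1])         # decode the action latent frame
    values = value_codec.from_latent(val[:, 0])     # decode predicted return per candidate
    best = values.squeeze(-1).argmax()
    return actions[best], values[best]


if __name__ == "__main__":
    # Dummy denoiser, purely to show the call pattern.
    codec, value_codec = LatentFrameCodec(action_dim=14), LatentFrameCodec(action_dim=1)
    dummy_denoiser = lambda x, t: torch.zeros_like(x)
    obs = torch.randn(1, 2, 16, 8, 8)
    action, value = plan_with_values(dummy_denoiser, obs, codec, value_codec)
    print(action.shape, value)
```

The point the sketch tries to convey is that no separate action head is required: actions and values ride through the same diffusion process as image latents, and test-time planning reduces to sampling several joint completions and keeping the one with the highest predicted return.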
hub tools
citation-role summary
citation-polarity summary
years: 2026 (30)
verdicts: UNVERDICTED (30)
roles: background (1)
polarities: background (1)
citing papers explorer
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
-
Action Images: End-to-End Policy Learning via Multiview Video Generation
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
-
JailWAM: Jailbreaking World Action Models in Robot Control
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
-
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
Data-Asymmetric Latent Imagination and Reranking for 3D Robotic Imitation Learning
DALI-R boosts 3D imitation learning success rates by 6.8% on average from suboptimal trajectories via latent imagination and reranking, with under 0.7x inference cost.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
-
MotuBrain: An Advanced World Action Model for Robot Control
MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new robots with 50-100 trajectories.
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
-
XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios
XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations
GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.
-
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
Gated Memory Policy
GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
-
Causal World Modeling for Robot Control
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.