
arXiv: 2606.02800 · v4 · pith:B43Y2PK5new · submitted 2026-06-01 · 💻 cs.CV · cs.AI · cs.LG · cs.MM · cs.RO

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

Pith reviewed 2026-06-28 15:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MM · cs.RO
keywords omnimodal world models · Physical AI · mixture-of-transformers · embodied agents · multimodal generation · state-of-the-art · video generation · action models

The pith

A single mixture-of-transformers model jointly processes and generates language, images, video, audio, and actions for Physical AI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Cosmos 3 as a family of models that handle five modalities together in one architecture rather than separate specialized systems. It argues this unified approach subsumes vision-language models, video generators, world simulators, and action models into a flexible framework. Evaluations across understanding and generation tasks show state-of-the-art results, with post-trained versions ranking highest in open-source text-to-image, image-to-video, and policy benchmarks. The work positions these omnimodal models as scalable backbones for embodied agents. Code, checkpoints, and datasets are released to support further Physical AI research.

Core claim

Cosmos 3 establishes a unified mixture-of-transformers architecture that jointly processes and generates sequences across language, image, video, audio, and action modalities, achieving new state-of-the-art performance on diverse tasks and serving as general-purpose backbones for embodied agents.

What carries the argument

A mixture-of-transformers architecture that supports highly flexible input-output configurations across modalities.

If this is right

  • Vision-language models, video generators, and world simulators become interchangeable components of one system.
  • Embodied agents can use the same backbone for both perception and action planning without switching models.
  • Open release of the models and synthetic datasets enables direct replication and extension by other researchers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the no-trade-off claim holds, training pipelines for robotics could shift from assembling multiple models to fine-tuning one omnimodal base.
  • Real-world deployment would still require separate validation that simulated action sequences transfer to physical hardware.

Load-bearing premise

One shared architecture can reach top performance in every modality without substantial trade-offs in any single one.

What would settle it

A direct comparison where adding audio or action generation to the model produces a clear drop in text-to-image or image-to-video quality relative to specialized single-modality models.

Figures

Figures reproduced from arXiv: 2606.02800.

Figure 1: Cosmos 3 serves as a general-purpose backbone for Physical AI. By jointly modeling language, image, video, audio, and action for both understanding and generation, Cosmos 3 unifies a wide range of model classes within a single network architecture, including vision-language models, image generation models, audio-visual generation models, policy or world-action models, forward dynamics models, and inverse d…
Figure 2: Cosmos 3 offers a strong starting point for training Physical AI agents. Cosmos 3 can be post-trained on target data for distinct applications without architectural modifications. In this paper, we demonstrate how we post-train Cosmos 3 for better synthetic data generation (Sec. 4.2.3 and Sec. 4.2.4) and better robot policy (Sec. 4.2.5). In the future, we expect Cosmos 3 to play an essential role in genera…
Figure 3: Unified action representation. We map heterogeneous embodiment controls into compact action vectors built from shared geometric components. Ego and effector motions are encoded as relative-pose pseudo-actions using 3D translation and 6D rotation (an over-parameterized rotation representation by Zhou et al. (2019), since rotation has only 3 degrees of freedom), while grasp states directly encode the curre…
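The 6D rotation mentioned in the Figure 3 caption can be illustrated with a small sketch. Following Zhou et al. (2019), a rotation is stored as the first two columns of its 3×3 matrix and recovered by Gram-Schmidt orthogonalization. The helper names (`matrix_to_rot6d`, `rot6d_to_matrix`, `pose_action`), the column-wise flattening order, and the 9-dimensional translation-plus-rotation layout are assumptions made for illustration; the caption does not give the exact vector layout used in Cosmos 3.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def matrix_to_rot6d(R):
    """Keep the first two columns of a 3x3 rotation matrix (nested row lists)."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


def rot6d_to_matrix(r6):
    """Recover a valid rotation matrix from 6 numbers via Gram-Schmidt."""
    a1, a2 = r6[:3], r6[3:]
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    b2 = _normalize([x - proj * y for x, y in zip(a2, b1)])
    b3 = _cross(b1, b2)  # third column completes the right-handed frame
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def pose_action(translation, R):
    """Hypothetical 9-D relative-pose action: 3-D translation + 6-D rotation."""
    return list(translation) + matrix_to_rot6d(R)


# A 90-degree rotation about the z axis.
R_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
action = pose_action([0.1, 0.0, 0.0], R_z90)
R_back = rot6d_to_matrix(action[3:])
```

The representation is redundant (6 numbers for 3 degrees of freedom), but any 6 numbers with two non-parallel halves map back to a valid rotation, which is the property Zhou et al. argue makes it easier for networks to regress.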
Figure 4: Action sequence configurations. For a video-action data sample, Cosmos 3 constructs different training modes by varying which tokens are clean and which are noisy. The diagram shows a local temporal window in which action tokens lie between adjacent video tokens: a_t connects v_{t−1} to v_t, and a_{t+1} connects v_t to v_{t+1}. Forward dynamics mode denoises vision tokens conditioned on clean action tokens; inverse dy…
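The clean-versus-noisy idea in the Figure 4 caption can be sketched as a per-token flag over an interleaved sequence. Only the forward-dynamics row below comes from the caption. The `inverse_dynamics` and `policy` rows are assumptions inferred from the task names listed for Figure 8 (Video-to-Action and Image-to-(Action+Video)), and treating the first frame as an always-clean condition is also an assumption. The function name `build_sequence` is hypothetical.

```python
def build_sequence(num_frames, mode):
    """Return [(token_name, is_noisy), ...] for one local temporal window.

    Action a_t sits between v_{t-1} and v_t, as described in the caption.
    """
    # Which modality is noised (i.e. generated) in each mode.
    noisy = {
        "forward_dynamics": {"v": True, "a": False},  # from the caption
        "inverse_dynamics": {"v": False, "a": True},  # assumed
        "policy": {"v": True, "a": True},             # assumed
    }[mode]

    seq = [("v0", False)]  # assumed: the first frame is a clean condition
    for t in range(1, num_frames):
        seq.append((f"a{t}", noisy["a"]))
        seq.append((f"v{t}", noisy["v"]))
    return seq


fd = build_sequence(3, "forward_dynamics")
```

In this sketch, switching modes changes only the flags, not the token order, which matches the caption's description of one data sample yielding several training modes.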
Figure 5: Mixture-of-Transformers (MoT) architecture of Cosmos 3. Left: a single transformer operates on one token sequence comprising the autoregressive (AR) and diffusion (DM) subsequences: AR carries discrete text tokens and, optionally, ViT-encoded vision tokens, ending with <EOS> and a begin-of-generation token <BOG>, while DM carries continuous tokens from their respective encoders, noise-perturbed during trai…
Figure 6: Illustrative coordinate assignment under 3D MRoPE. Left: A packed token sequence containing language, video (two frames, 2 × 2 spatial grid each), audio, and action tokens. Each token receives a (t, h, w) triplet. Language tokens use t = h = w; video tokens vary on all three axes; action and audio tokens use temporal coordinates only (h = w = 0). A modality offset k separates the text and vision temporal r…
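A minimal sketch of the coordinate rules stated in the Figure 6 caption follows. The caption gives the per-modality pattern (text uses t = h = w, video varies on all axes, audio and action use h = w = 0) and mentions a modality offset k. How k is applied, its value, and how audio and action timestamps align with video frames are not recoverable from the truncated caption, so the choices below (`base = num_text + k`, one audio or action token per time step starting at `base`) are assumptions, as is the function name `assign_coords`.

```python
def assign_coords(num_text, num_frames, grid_h, grid_w, num_audio, num_action, k=1):
    """Return [(modality, (t, h, w)), ...] for one packed sequence."""
    coords = []

    # Language tokens: all three axes share the same index.
    for i in range(num_text):
        coords.append(("text", (i, i, i)))

    # Assumed: non-text temporal coordinates start after the text range plus k.
    base = num_text + k

    # Video tokens: t indexes the frame, (h, w) index the spatial grid.
    for f in range(num_frames):
        for y in range(grid_h):
            for x in range(grid_w):
                coords.append(("video", (base + f, y, x)))

    # Audio and action tokens: temporal coordinate only, h = w = 0.
    for j in range(num_audio):
        coords.append(("audio", (base + j, 0, 0)))
    for j in range(num_action):
        coords.append(("action", (base + j, 0, 0)))

    return coords


# Two text tokens, two 2x2 video frames, one audio and one action token.
example = assign_coords(2, 2, 2, 2, 1, 1, k=1)
```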
Figure 7: Cosmos 3 Reasoner data composition by capability category. We summarize the curated data mixture used to train Cosmos 3 Reasoner across the pre-training and supervised fine-tuning stages. The mixture contains 22.0M pre-training samples and 2.2M supervised fine-tuning samples spanning image–text, video–text, and text-only categories, with each ring showing the relative contribution of major capability strea…
Figure 8: Generator training curriculum across modalities and stages (Pre-training, Mid-training, Post-training). Tasks shown: Text-to-Image, Text-to-Video, Image-to-Video, Video-to-Video, Text-to-(Video+Audio), Image-to-(Video+Audio), (Image+Action)-to-Video, Video-to-Action, Image-to-(Action+Video), and Video Transfer. Models shown: Cosmos3-Nano, Cosmos3-Super, Cosmos3-Super-Text2Image, Cosmos3-…
Figure 9: Action data distribution. Hours are aggregated over the four main action-data pillars in the final curated action mid-training set, which contains 8.4M episodes and 61.3K hours.
Figure 10: Left: Multi-resolution training and sequence packing. The three resolution tiers (256p, 480p, 720p) differ in their maximum frame budget, eligible source material, and rectified-flow noise-shift value; variable-length sequences from different tiers are packed together to fill a fixed 74,000-token context window, maximizing GPU utilization without padding. Right: Data mixture used in generator pre-training…
Figure 11: Overview of the Cosmos 3 infrastructure stack. The platform spans four pillars. Data Infrastructure ingests raw multimodal streams and curates them into WebDataset-format training shards. Training Infrastructure consumes those shards on NVIDIA GPU clusters with efficient parallelization, data loading, and checkpointing. The resulting checkpoints feed two parallel paths (separated by the dashed divider): S…
Figure 12: Overview of the Joint Data-Loader. Stream-specific data-loaders feed local per-stream buffers on each rank. At each global iteration, a rank-synchronous selector chooses the same stream k_i across distributed ranks. Each rank then greedily packs samples from its selected local buffer into B_i^(ρ) under token and sample-count budgets, using bounded look-ahead to reduce unused token capacity.
Figure 13: Look-ahead packing in the JointDataLoader. The loader greedily scans samples from the selected stream and packs those that fit within the remaining token budget into the current mini-batch (Mint). Samples that exceed the budget are temporarily set aside in a lookaside buffer (Rose), allowing later smaller samples to fill the remaining capacity. At the end of the iteration, skipped samples are returned to …
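The look-ahead packing described in the Figure 13 caption can be sketched as a greedy scan with a lookaside list. This is an illustration under stated assumptions, not the JointDataLoader implementation: samples are represented only by their token counts, the look-ahead bound is taken to limit the number of samples scanned per iteration, and skipped samples are returned to the front of the buffer in their original order (the caption is truncated before saying where they go). The name `pack_batch` and its parameters are hypothetical.

```python
from collections import deque


def pack_batch(buffer, token_budget, max_samples, look_ahead):
    """Greedily pack token counts from `buffer` (a deque) into one mini-batch.

    Returns (batch, unused_tokens). Mutates `buffer` in place.
    """
    batch, lookaside = [], []
    remaining = token_budget
    scanned = 0

    while buffer and len(batch) < max_samples and scanned < look_ahead:
        n_tokens = buffer.popleft()
        scanned += 1
        if n_tokens <= remaining:
            batch.append(n_tokens)      # fits: goes into the mini-batch
            remaining -= n_tokens
        else:
            lookaside.append(n_tokens)  # too large: set aside for now

    # Assumed: skipped samples go back to the front, preserving their order.
    buffer.extendleft(reversed(lookaside))
    return batch, remaining


stream = deque([40, 70, 30, 20, 50])
batch, unused = pack_batch(stream, token_budget=100, max_samples=4, look_ahead=4)
```

Here the 70-token sample is skipped, the 30- and 20-token samples fill the gap, and only 10 tokens of capacity go unused, compared with 60 unused tokens if packing had stopped at the first sample that did not fit.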
Figure 14: Two-way flat attention. Each pathway is implemented as a single variable-length SDPA call. (a) The Reasoner pathway uses a standard causal varlen call on the packed Reasoner tokens, producing a block-diagonal causal mask. (b) The Generator pathway packs Generator queries separately from the interleaved key/value stream [R_0, G_0, R_1, G_1, ..., R_n, G_n]. The resulting mask is block-diagonal but rectangular …
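The two mask shapes named in the Figure 14 caption can be made concrete with explicit boolean matrices. A real varlen SDPA kernel takes cumulative sequence lengths rather than a dense mask, so this is only a picture of which query-key pairs are allowed. The assumption that each Generator query attends to every Reasoner and Generator key of its own sample (full, non-causal attention inside the block) is mine; the caption is truncated before describing it. Function names are hypothetical.

```python
def reasoner_mask(r_lens):
    """Block-diagonal causal mask over packed Reasoner tokens."""
    n = sum(r_lens)
    mask = [[False] * n for _ in range(n)]
    start = 0
    for length in r_lens:
        for q in range(length):
            for k in range(q + 1):  # causal: attend to self and earlier tokens
                mask[start + q][start + k] = True
        start += length
    return mask


def generator_mask(r_lens, g_lens):
    """Rectangular block-diagonal mask: Generator queries vs. [R_i, G_i] keys."""
    n_q = sum(g_lens)
    n_k = sum(r_lens) + sum(g_lens)
    mask = [[False] * n_k for _ in range(n_q)]
    q0 = k0 = 0
    for r, g in zip(r_lens, g_lens):
        for q in range(q0, q0 + g):
            for k in range(k0, k0 + r + g):  # assumed: full attention in-block
                mask[q][k] = True
        q0 += g
        k0 += r + g
    return mask


# Two packed samples: (R=2, G=1) and (R=1, G=2).
rm = reasoner_mask([2, 1])
gm = generator_mask([2, 1], [1, 2])
```

The Generator mask has 3 rows and 6 columns here, which is the "rectangular" property: there are fewer queries than keys because Reasoner tokens appear only on the key/value side.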
Figure 15: Sharded AOT compilation of the Wan2.2 tokenizer. The 45 static-shape graphs arising from {3 resolutions} × {5 aspect ratios} × {3 tokenizer call modes} are partitioned across ranks; each rank performs compilation on its assigned graph(s), writes the compiled artifact to a shared filesystem, and loads the full set of artifacts before training begins. Warm-up time drops from ∼15 min (serial) to <1 min (shar…
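The partitioning step in the Figure 15 caption can be sketched as follows. The caption gives only the counts (3 × 5 × 3 = 45 graphs); the resolution labels are borrowed from the tiers named in the Figure 10 caption as an assumption, the aspect-ratio and call-mode labels are placeholders, and round-robin assignment by rank is one plausible scheme rather than the documented one. Compilation, artifact writing, and loading are omitted.

```python
from itertools import product

RESOLUTIONS = ["256p", "480p", "720p"]            # assumed labels
ASPECT_RATIOS = [f"aspect_{i}" for i in range(5)]  # placeholder labels
CALL_MODES = [f"mode_{i}" for i in range(3)]       # placeholder labels

ALL_GRAPHS = list(product(RESOLUTIONS, ASPECT_RATIOS, CALL_MODES))


def graphs_for_rank(rank, world_size):
    """Round-robin shard: rank r compiles graphs r, r + world_size, ..."""
    return ALL_GRAPHS[rank::world_size]


# With 8 ranks, each rank compiles 5 or 6 graphs instead of all 45.
shards = [graphs_for_rank(r, 8) for r in range(8)]
```

Because every rank later loads the full set of artifacts, the shards only need to be disjoint and cover all 45 graphs; the specific assignment does not affect correctness.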
Figure 16: Cosmos 3 serving performance. (a) Cosmos3-Nano 720p T2V 1-GPU latency on H100 NVL and B200, to observe performance on different hardware backends. (b) Cosmos3-Nano 720p T2I 1-GPU latency on H100 NVL and B200, to observe performance on different hardware backends. (c) 720p T2V latency scaling on B200 from 1 to 8 GPUs for Cosmos3-Nano and Cosmos3-Super. Lower is better throughout.
Figure 17: Example images generated by Cosmos3-Super-Text2Image. Our model generates images that are both physically plausible and photorealistic, exhibiting coherent object geometry, consistent object-environment interactions, etc. All images are generated from single-shot upsampled JSON prompts using shift=3.0, guidance=4.0, and 50 diffusion steps, as also described in …
Figure 18: Cosmos3-Super-Text2Image is the #1 open-weight model on crowdsourced arena rankings. Cosmos3-Super-Text2Image ranked #1 among open-weight models (#4 including proprietary models) on the Artificial Analysis Text to Image Leaderboard (Date: 2026-05-28).
Figure 19: Cosmos3-Super-Image2Video is the best open-weight model on crowdsourced arena rankings. Cosmos3-Super-Image2Video ranked #1 among open-weight models (#22 including proprietary models) on the Artificial Analysis Image to Video Leaderboard (No Audio) (Date: 2026-05-28).
Figure 20: Audio-video event alignment. Selected frames from a Cosmos3-Nano generation are paired with the spectrogram of the generated audio. Colored frames denote hammer-strike moments, and their temporal markers coincide with sharp spectral transients. The gray frame shows a non-contact moment between strikes, where no comparable acoustic transient is observed. This contrast provides qualitative evidence that the…
Figure 21: Driving scene Video Transfer results. Cosmos3-Nano generates frames (bottom) from the corresponding 720p control video (top). The control video encodes HD map elements (lanes, road markings, poles, and traffic lights with or without state), which together represent complex road topologies (including overpasses), as well as actors represented as cuboids. Each cuboid is color-coded by a coarse class ontology…
Figure 22: Comparison for autonomous vehicle inverse dynamics. We qualitatively compare ego-vehicle trajectories estimated from input videos by different methods, with the red trajectory representing the ground truth. Cosmos3-Nano (MT-init) demonstrates the ability to estimate accurate, metric-scale ego poses.
Figure 23: Camera forward dynamics comparison. Given complex realistic trajectories, Cosmos3-Nano (MT-init) faithfully reproduces the same camera motion in the generated video. For each motion example, the first row shows frames near the start of the sequence and the second row shows frames near the end. The downward arrow indicates temporal progression from start to finish, while the text beside it specifies the co…
Figure 24: Qualitative comparison for robotics forward dynamics. The generated frames closely follow the action commands, and the interactions between the robot arm and the fabric are more realistic than those from the baseline. Green boxes highlight regions with visible distortions or artifacts in the baseline outputs, and the corresponding regions in Cosmos3-Nano (MT-init), where these artifacts are absent.
Figure 25: Real-world evaluation of Cosmos3-Nano-Policy-DROID. We show snapshot frames from physical-robot rollouts. These examples demonstrate successful real-world deployment of our policy for language-conditioned manipulation tasks on physical robots.
Figure 26: Cosmos3-Nano-Policy-DROID held the top position on RoboArena. Cosmos3-Nano-Policy-DROID ranked #1 on the RoboArena real-world benchmark leaderboard (Date: 2026-05-30).
Figure 27: Cosmos3-Nano-Policy-DROID ranked #1 on MolmoSpaces. Cosmos3-Nano-Policy-DROID was the top-ranked model on the MolmoSpaces simulation benchmark leaderboard (Date: 2026-06-20).
Figure 28: Synergy study across ego-motion and robot manipulation domains. Rows denote evaluation domains; columns denote the added co-training domain. Diagonal cells are single-domain baselines, while off-diagonal cells use a 50/50 row–column mixture with matched row-domain training exposure. Each cell reports the score and delta from the row diagonal. Green indicates positive transfer, with signs adjusted so that …
Figure 29: Egocentric motion as a robot-adaptation prior. (a) The matrix measures pairwise transfer between AgiBot robot manipulation and Egocentric motion. (b) The curve compares AgiBot adaptation from an Egocentric-warmed checkpoint against direct adaptation from the PT-init checkpoint.
Figure 30: Multiview action prompt formatting. When multiple viewpoints are available, we concatenate them into a single canvas and attach view-layout metadata in the structured JSON prompt so the model can associate each pixel region with its camera stream.
Figure 31: SDG-PhyxSim. A single frame of the wrecking_ball scene at the moment of impact (Corner camera). From left to right: RGB, center-of-mass displacement, cumulative rotation, linear velocity, and angular velocity. Overview. SDG-PhyxSim (PhysicsAI-WorldModel-Synthetic-Physical-Interaction-Scenes) is a large-scale synthetic video dataset of physically simulated multi-object interaction scenes, designed to expo…
Figure 32
Figure 33: SDG-DriveSim dataset. SDG-DriveSim is built to cover long-tail, rare scenarios that are hard to capture in the real world. Eight representative driving scenarios are provided from the dataset. Dataset statistics. The current SDG-DriveSim release contains 264,000 clips totaling approximately 1,467 hours of video, rendered at 4K (3840×2160) and 24 fps with per-clip durations of approximately 20 s, corresponding…
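The dataset statistics quoted in the Figure 33 caption are internally consistent, which a short calculation confirms. The clip count, clip duration, frame rate, and resolution come from the caption; the derived frames-per-clip and pixels-per-frame values are computed here for illustration and are not figures stated by the source.

```python
clips = 264_000
seconds_per_clip = 20   # "approximately 20 s" in the caption
fps = 24
width, height = 3840, 2160

total_hours = clips * seconds_per_clip / 3600   # should be close to 1,467
frames_per_clip = seconds_per_clip * fps         # derived, not from the source
pixels_per_frame = width * height                # derived, not from the source

print(f"{total_hours:.1f} hours, {frames_per_clip} frames per clip")
```

Since the per-clip duration is approximate, the match with the reported hour total is a sanity check rather than an exact identity.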
Figure 34: SDG-SynHuman samples. From left to right: RGB, depth; exterior and interior views.
Figure 35: SDG-Warehouse dataset. Sample view of the four scenarios and the annotations (RGB clips, metric depth, instance segmentation, shaded segmentation, and Canny edge).
Figure 36: Joint embedding geometry of pre-train and SDG. PCA→UMAP projection of 20,000 pre-training cluster centroids (gray) and 200 randomly sampled clips from each SDG source. Each SDG source forms a distinct, tightly clustered region that overlaps only narrowly with the bulk pre-training distribution.
Figure 37: Predicted video vs. simulator rollout on RoboLab. For the wrist and left cameras, Sim Env shows the video recorded by executing the predicted action chunk in the RoboLab simulator from the same initial state, while Pred shows the video predicted by Cosmos3-Nano-Policy-DROID jointly with that action chunk. We observe that the predicted video closely matches the simulator rollout.
Original abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Cosmos 3, a family of omnimodal world models based on a mixture-of-transformers architecture that jointly processes and generates across language, image, video, audio, and action modalities. It claims to unify vision-language models, video generators, world simulators, and world-action models into a single framework, establishes new state-of-the-art performance on a diverse suite of understanding and generation tasks for Physical AI, and reports top rankings from external evaluations (Artificial Analysis for T2I/I2V and RoboArena for policy models). The work releases code, checkpoints, synthetic datasets, and benchmarks under the OpenMDW-1.1 license.

Significance. If the empirical SOTA claims hold under independent verification, the work would be significant as a demonstration that a single scalable architecture can serve as a general-purpose backbone for embodied agents without modality-specific trade-offs. The open release of code, models, and benchmarks is a clear strength that directly enables reproducibility and falsification of the no-trade-off assumption.

major comments (1)
  1. Abstract: the central claim that Cosmos 3 'establishes a new state-of-the-art across a diverse suite of understanding and generation tasks' is unsupported by any metrics, baselines, evaluation protocols, or error analysis in the provided text, which is load-bearing for the primary contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for stronger support of the central SOTA claim. We address the point below and indicate planned revisions.

Point-by-point responses
  1. Referee: [—] Abstract: the central claim that Cosmos 3 'establishes a new state-of-the-art across a diverse suite of understanding and generation tasks' is unsupported by any metrics, baselines, evaluation protocols, or error analysis in the provided text, which is load-bearing for the primary contribution.

    Authors: The abstract is a concise summary; the full manuscript substantiates the claim with detailed metrics, baselines, protocols, and analyses in Sections 4 (omnimodal understanding benchmarks) and 5 (generation and simulation tasks), plus the external Artificial Analysis and RoboArena rankings. We agree the abstract would be stronger with explicit quantitative anchors or section pointers and will revise it to include key results (e.g., top scores on representative tasks) while retaining brevity.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is an empirical model release describing an omnimodal architecture and its benchmark results. No equations, derivations, or first-principles claims appear in the abstract or described content. Central assertions rest on external third-party rankings and released code/checkpoints that enable independent verification rather than any self-referential fitting or self-citation chain. The work therefore contains no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; typical large-scale multimodal models rest on numerous unfixed hyperparameters, training data curation choices, and architectural scaling decisions that cannot be enumerated here.

pith-pipeline@v0.9.1-grok · 7063 in / 1130 out tokens · 25612 ms · 2026-06-28T15:06:31.363952+00:00 · methodology


Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

    cs.CV 2026-06 unverdicted novelty 6.0

    Causal-rCM unifies teacher-forcing and self-forcing distillation for autoregressive video diffusion, delivering a 2-step model with VBench-T2V score 84.63 and enabling interactive world models on Cosmos 3 using only s...

  2. DiffusionBench: On Holistic Evaluation of Diffusion Transformers

    cs.CV 2026-06 conditional novelty 6.0

    NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.

  3. ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

    cs.CV 2026-06 unverdicted novelty 6.0

    ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.

  4. SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

    cs.RO 2026-06 unverdicted novelty 6.0

    SC3-Eval enforces three consistency constraints on video world models to evaluate robot manipulation policies, achieving 0.929 Pearson correlation with real-world rollouts across seven policies.

  5. SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

    cs.RO 2026-06 unverdicted novelty 6.0

    SC3-Eval enforces three consistencies on a video model to produce policy rollouts that correlate 0.929 with real-world performance across seven vision-language-action policies and reproduce observed failure modes.

  6. ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

    cs.CV 2026-06 unverdicted novelty 6.0

    ActWorld extends navigation-centric world models to support mid-rollout object interactions via chunk-autoregressive generation, action-aware memory routing, and a persistent memory bank, backed by a 100K annotated in...

  7. PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation

    cs.CV 2026-06 unverdicted novelty 5.0

    PhysisForcing applies trajectory and relational alignment losses to DiT features in video models, improving physical plausibility on R-Bench, PAI-Bench, and EZS-Bench while raising closed-loop robotic success rates fr...

  8. Learning Action Priors for Cross-embodiment Robot Manipulation

    cs.RO 2026-06 unverdicted novelty 5.0

    A two-stage framework pretrains an action module with temporal motion priors from unconditioned trajectories using flow-matching, then transfers it to VLA training via decoder reuse and distillation, yielding better p...

  9. Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

    cs.CV 2026-06 unverdicted novelty 5.0

    Sol Video Inference Engine uses parallel skill agents to optimize cache, sparse attention, token pruning, quantization, and kernel fusion, delivering over 2x end-to-end acceleration with near-lossless quality on three...

  10. Physics-IQ Verified

    cs.CV 2026-06 unverdicted novelty 5.0

    Physics-IQ Verified refines 57.6% of samples and 34.8% of prompts from the original benchmark and produces moderate ranking shifts (Kendall's τ = 0.46) across six image-to-video models.

  11. PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

    cs.RO 2026-06 unverdicted novelty 5.0

    PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.

  12. What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

    cs.AI 2026-06 unverdicted novelty 5.0

    Geometry-led weighting outperforms blended memory recall for spatial queries, and a DDA-based visibility predicate correctly flags occluded targets while recall remains occlusion-blind.

  13. Critique of Agent Model

    cs.AI 2026-06 unverdicted novelty 4.0

    Distinguishes agentic (externally scaffolded) from agentive (internally structured) AI systems and proposes the Goal-Identity-Configurator architecture for endogenous autonomy.
