{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:L35TTSYHKB5XDMBT6B6DTWQZFP","short_pith_number":"pith:L35TTSYH","schema_version":"1.0","canonical_sha256":"5efb39cb07507b71b033f07c39da192bec8d1652a1d07700b132fffe02c2cbe5","source":{"kind":"arxiv","id":"2503.15558","version":3},"attestation_state":"computed","paper":{"title":"Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Cosmos-Reason1 models understand the physical world and generate embodied decisions through long chain-of-thought reasoning in natural language.","cross_cats":["cs.CV","cs.LG","cs.RO"],"primary_cat":"cs.AI","authors_text":"Alice Luo, Andrew Mathau, Andrew Z. Wang, Boxin Wang, Brendan Johnson, David W. Romero, Dinghao Yang, Elena Lantz, Fangyin Wei, Francesco Ferroni, George Kurian, Hannah Brandon, Haoxiang Wang, Huayu Chen, Imad El Hanafi, Jacob Huffman, Jenna Diamond, Jiashu Xu, Jiaxin Cao, Jingxu Zhang, Jingyi Jin, Jinju Chu, Jinwei Gu, Junjie Bai, Liang Feng, Lindsey Pavao, Lyne Tchapmi, Maosheng Liao, Ming-Yu Liu, Misha Smelyanskiy, Nayeon Lee, NVIDIA: Alisson Azzolini, Prithvijit Chattopadhyay, Rama Govindaraju, Rizwan Khan, Shuran Song, Siddharth Gururani, Tsung-Yi Lin, Wei Ping, Xiangyu Lu, Xiaodong Yang, Xiaohui Zeng, Xuan Li, Yao Xu, Yen-Chen Lin, Yifan Ding, Yin Cui, Yun Ni, Zekun Hao, Zhaoshuo Li, Zhe Zhang, Zhuolin Yang","submitted_at":"2025-03-18T22:06:58Z","abstract_excerpt":"Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For em"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2503.15558","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.AI","submitted_at":"2025-03-18T22:06:58Z","cross_cats_sorted":["cs.CV","cs.LG","cs.RO"],"title_canon_sha256":"271076f7bff7127ea066e33b2f91610943489f69dfae2f3f4b9424a2736cae51","abstract_canon_sha256":"1f0b0573feb963ffca5fb9df6524f92e43065373bc101230dfd17063462a2f83"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.833334Z","signature_b64":"Qx5VCOpiVZCElbBUhqy1C60AwyAZPJnYHtyboBiiHaACfdR2l1vrghGUMBRgebHPVSranCE2tUDpnx/CBcRoAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"5efb39cb07507b71b033f07c39da192bec8d1652a1d07700b132fffe02c2cbe5","last_reissued_at":"2026-05-17T23:38:47.832540Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.832540Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Cosmos-Reason1 models understand the physical world and generate embodied decisions through long chain-of-thought reasoning in natural language.","cross_cats":["cs.CV","cs.LG","cs.RO"],"primary_cat":"cs.AI","authors_text":"Alice Luo, Andrew Mathau, Andrew Z. Wang, Boxin Wang, Brendan Johnson, David W. Romero, Dinghao Yang, Elena Lantz, Fangyin Wei, Francesco Ferroni, George Kurian, Hannah Brandon, Haoxiang Wang, Huayu Chen, Imad El Hanafi, Jacob Huffman, Jenna Diamond, Jiashu Xu, Jiaxin Cao, Jingxu Zhang, Jingyi Jin, Jinju Chu, Jinwei Gu, Junjie Bai, Liang Feng, Lindsey Pavao, Lyne Tchapmi, Maosheng Liao, Ming-Yu Liu, Misha Smelyanskiy, Nayeon Lee, NVIDIA: Alisson Azzolini, Prithvijit Chattopadhyay, Rama Govindaraju, Rizwan Khan, Shuran Song, Siddharth Gururani, Tsung-Yi Lin, Wei Ping, Xiangyu Lu, Xiaodong Yang, Xiaohui Zeng, Xuan Li, Yao Xu, Yen-Chen Lin, Yifan Ding, Yin Cui, Yun Ni, Zekun Hao, Zhaoshuo Li, Zhe Zhang, Zhuolin Yang","submitted_at":"2025-03-18T22:06:58Z","abstract_excerpt":"Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For em"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Cosmos-Reason1 models can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The hierarchical ontology for physical common sense and the two-dimensional ontology for embodied reasoning sufficiently capture the knowledge needed to generalize across physical tasks and embodiments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Cosmos-Reason1 models understand the physical world and generate embodied decisions through long chain-of-thought reasoning in natural language.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"024d77bbcd9335b6bb5423d451418a5f6f6ffee30c36b5cfd54e643077b5241b"},"source":{"id":"2503.15558","kind":"arxiv","version":3},"verdict":{"id":"a2a69306-7128-4b3f-b9f3-744aad9489b3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T12:43:31.071002Z","strongest_claim":"Cosmos-Reason1 models can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes.","one_line_summary":"Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The hierarchical ontology for physical common sense and the two-dimensional ontology for embodied reasoning sufficiently capture the knowledge needed to generalize across physical tasks and embodiments.","pith_extraction_headline":"Cosmos-Reason1 models understand the physical world and generate embodied decisions through long chain-of-thought reasoning in natural language."},"references":{"count":60,"sample":[{"doi":"","year":2024,"title":"Agibot world colosseum.https://github.com/OpenDriveLab/AgiBot-World, 2024","work_id":"b6e28cd0-4ca2-4a47-b4a5-fef58e17bdbe","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Do as i can, not as i say: Grounding language in robotic affordances","work_id":"a85fc3ea-bb91-47f3-b6a2-caa967931f52","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning","work_id":"bcc10fec-5e16-45c2-8157-9a14a84708ac","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Covla: Comprehensive vision-language-action dataset for autonomous driving","work_id":"534db3f4-3bf9-47ee-b3b6-02f30b09e417","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":5,"cited_arxiv_id":"2502.13923","is_internal_anchor":true}],"resolved_work":60,"snapshot_sha256":"59481525b03e77d420b38c4ff26892a4bbc90a552e2d100d01539931ade94e9f","internal_anchors":20},"formal_canon":{"evidence_count":2,"snapshot_sha256":"53f7af9ac5daa00f92b11232a6246454e2fe0b73fd940279e55de02190f80d77"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2503.15558","created_at":"2026-05-17T23:38:47.832676+00:00"},{"alias_kind":"arxiv_version","alias_value":"2503.15558v3","created_at":"2026-05-17T23:38:47.832676+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2503.15558","created_at":"2026-05-17T23:38:47.832676+00:00"},{"alias_kind":"pith_short_12","alias_value":"L35TTSYHKB5X","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"L35TTSYHKB5XDMBT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"L35TTSYH","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":30,"internal_anchor_count":30,"sample":[{"citing_arxiv_id":"2605.21917","citing_title":"MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23678","citing_title":"Grounded Reinforcement Learning for Visual Reasoning","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08167","citing_title":"Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2603.17305","citing_title":"Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17077","citing_title":"How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2505.21996","citing_title":"Learning World Models for Interactive Video Generation","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2507.16815","citing_title":"ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2511.00088","citing_title":"Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2511.16518","citing_title":"MiMo-Embodied: X-Embodied Foundation Model Technical Report","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01925","citing_title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","ref_index":280,"is_internal_anchor":true},{"citing_arxiv_id":"2511.23230","citing_title":"Action-guided generation of 3D functionality segmentation data","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2512.13609","citing_title":"Do-Undo Bench: Reversibility for Action Understanding in Image Generation","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2603.03944","citing_title":"SCP: Spatial Causal Prediction in Video","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2511.00062","citing_title":"World Simulation with Video Foundation Models for Physical AI","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27472","citing_title":"PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2503.09567","citing_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18486","citing_title":"Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08975","citing_title":"Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09613","citing_title":"SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09146","citing_title":"Beyond Thinking: Imagining in 360$^\\circ$ for Humanoid Visual Search","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21931","citing_title":"Seeing Fast and Slow: Learning the Flow of Time in Videos","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07774","citing_title":"RoboAgent: Chaining Basic Capabilities for Embodied Task Planning","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13654","citing_title":"Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17887","citing_title":"StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18839","citing_title":"One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models","ref_index":148,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/L35TTSYHKB5XDMBT6B6DTWQZFP","json":"https://pith.science/pith/L35TTSYHKB5XDMBT6B6DTWQZFP.json","graph_json":"https://pith.science/api/pith-number/L35TTSYHKB5XDMBT6B6DTWQZFP/graph.json","events_json":"https://pith.science/api/pith-number/L35TTSYHKB5XDMBT6B6DTWQZFP/events.json","paper":"https://pith.science/paper/L35TTSYH"},"agent_actions":{"view_html":"https://pith.science/pith/L35TTSYHKB5XDMBT6B6DTWQZFP","download_json":"https://pith.science/pith/L35TTSYHKB5XDMBT6B6DTWQZFP.json","view_paper":"https://pith.science/paper/L35TTSYH","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2503.15558&json=true","fetch_graph":"https://pith.science/api/pith-number/L35TTSYHKB5XDMBT6B6DTWQZFP/graph.json","fetch_events":"https://pith.science/api/pith-number/L35TTSYHKB5XDMBT6B6DTWQZFP/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/L35TTSYHKB5XDMBT6B6DTWQZFP/action/timestamp_anchor","attest_storage":"https://pith.science/pith/L35TTSYHKB5XDMBT6B6DTWQZFP/action/storage_attestation","attest_author":"https://pith.science/pith/L35TTSYHKB5XDMBT6B6DTWQZFP/action/author_attestation","sign_citation":"https://pith.science/pith/L35TTSYHKB5XDMBT6B6DTWQZFP/action/citation_signature","submit_replication":"https://pith.science/pith/L35TTSYHKB5XDMBT6B6DTWQZFP/action/replication_record"}},"created_at":"2026-05-17T23:38:47.832676+00:00","updated_at":"2026-05-17T23:38:47.832676+00:00"}