OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.
hub Canonical reference
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Canonical reference. 77% of citing Pith papers cite this work as background.
abstract
Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Introduces pause-and-think-T dataset and pause-and-think-B benchmark; fine-tunes 4B VLM to 58% accuracy matching 235B model while generalizing out-of-distribution.
MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.
SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
SynthFun3D generates synthetic 3D functionality segmentation data from action descriptions via object retrieval and scene arrangement, yielding consistent gains of +2.2 mAP, +6.3 mAR, and +5.7 mIoU when augmenting real data for VLM training.
SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.
MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.
Dream-Tac unifies visual and tactile signals in a world action model using contact-gated fusion and attention bias, reporting 31.7% average action accuracy gains on six manipulation tasks.
Introduces CausalPhys benchmark with causal graphs and CRFT fine-tuning to improve VLMs' causal physical reasoning accuracy and interpretability.
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
Introduces a new diagnostic benchmark and million-scale reasoning corpus showing that training on explicit causal traces improves next-state prediction in embodied planning, with reported gains from data scaling.
nuReasoning is a new real-world dataset and benchmark extending nuScenes/nuPlan with 20k clips and multi-type reasoning annotations to evaluate and improve reasoning in long-tail autonomous driving.
DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.
R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
Alpamayo-R1 introduces a VLA model with a Chain of Causation dataset and multi-stage SFT-plus-RL training that reports 12% better planning accuracy and 35% fewer close encounters versus trajectory-only baselines in driving tasks.
citing papers explorer
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.