citation dossier

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA · 2025 · arXiv 2503.15558

16 Pith papers citing it

17 reference links

cs.CV · top field · 6 papers

UNVERDICTED · top verdict bucket · 16 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.


why this work matters in Pith

Pith has found this work in 16 reviewed papers. Its strongest current cluster is cs.CV (6 papers). The largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

cs.RO · 2026-05-10 · unverdicted · novelty 7.0

SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.

Beyond Thinking: Imagining in 360° for Humanoid Visual Search

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

Seeing Fast and Slow: Learning the Flow of Time in Videos

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

Re²MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.

Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

cs.AI · 2026-05-09 · unverdicted · novelty 5.0

Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

cs.RO · 2026-04-22 · unverdicted · novelty 5.0

PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.

StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

cs.RO · 2026-04-20 · unverdicted · novelty 5.0

StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.

RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

cs.RO · 2026-04-09 · unverdicted · novelty 5.0

RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

cs.RO · 2026-04-22 · unverdicted · novelty 4.0

JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

cs.RO · 2026-04-15 · unverdicted · novelty 4.0

A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.

World Simulation with Video Foundation Models for Physical AI

cs.CV · 2025-10-28 · unverdicted · novelty 4.0

Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

citing papers explorer

Showing 16 of 16 citing papers.

SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation · cs.RO · 2026-05-10 · unverdicted · none · ref 18
Beyond Thinking: Imagining in 360° for Humanoid Visual Search · cs.CV · 2026-05-09 · unverdicted · none · ref 61
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations · cs.AI · 2026-04-30 · unverdicted · none · ref 23
Seeing Fast and Slow: Learning the Flow of Time in Videos · cs.CV · 2026-04-23 · unverdicted · none · ref 1
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models · cs.LG · 2026-04-20 · unverdicted · none · ref 148
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation · cs.CV · 2026-04-20 · unverdicted · none · ref 3 · 2 links
Re²MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement · cs.CV · 2026-04-20 · unverdicted · none · ref 1
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation · cs.AI · 2026-05-09 · unverdicted · none · ref 32
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance · cs.RO · 2026-04-22 · unverdicted · none · ref 45
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement · cs.RO · 2026-04-20 · unverdicted · none · ref 2
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning · cs.RO · 2026-04-09 · unverdicted · none · ref 4
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models · cs.AI · 2025-03-12 · unverdicted · none · ref 27
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy · cs.RO · 2026-04-22 · unverdicted · none · ref 1
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments · cs.CV · 2026-04-20 · unverdicted · none · ref 2
Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap · cs.RO · 2026-04-15 · unverdicted · none · ref 26
World Simulation with Video Foundation Models for Physical AI · cs.CV · 2025-10-28 · unverdicted · none · ref 55
