arxiv: 2510.09036 · v2 · pith:65AGAO2Hnew · submitted 2025-10-10 · 💻 cs.RO

RoDyn: Taming Interactive Robot-Dynamic 2.5D World Model for Robotic Manipulation

classification 💻 cs.RO
keywords: model, robot-dynamic, rodyn, world, learning, manipulation, mask, models

Learned world models hold significant potential as neural simulators for robotic manipulation. However, prevalent 2D video-based models inherently lack the spatial and kinematic reasoning crucial for physical interactions. We introduce RoDyn, a novel Robot-Dynamic 2.5D World Model that formulates environmental dynamics within a highly efficient, geometry-aware latent space. Through the proposed Robot-Dynamic Tokenizer, we explicitly couple semantic visual appearances with spatial and agent-centric priors via an RGB-dominated cross-attention mechanism and dynamic mask guidance. Furthermore, by injecting these mask priors directly into sequence transitions, our Mask-guided Autoregressive architecture drives the model to focus on active robot-object interaction regions. Extensive experiments demonstrate that RoDyn establishes SOTA generation fidelity across large-scale datasets. Crucially, it translates these predictive capabilities into substantial downstream gains, accelerating model-based reinforcement learning and achieving a 42% improvement in real-world imitation learning success rates over pure 2D baselines.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

    cs.RO 2026-06 unverdicted novelty 7.0

    DVG-WM disentangles dynamics learning and visual synthesis in video world models using flow matching and latent degradation, achieving up to 3.97 times faster inference with improved quality on LIBERO and real-world r...

  2. GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.

  3. GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

    cs.CV 2026-05 unverdicted novelty 5.0

    GEM-4D is a video world model that injects 4D correspondence supervision to improve geometric consistency and robot manipulation success from 61% to 81%.