RoDyn: Taming Interactive Robot-Dynamic 2.5D World Model for Robotic Manipulation

Chuanrui Zhang; Guanxing Lu; Yansong Tang; Zhengxian Wu; Ziwei Wang

arxiv: 2510.09036 · v2 · pith:65AGAO2Hnew · submitted 2025-10-10 · 💻 cs.RO

classification 💻 cs.RO

keywords model, robot-dynamic, rodyn, world, learning, manipulation, mask, models

Learned world models hold significant potential as neural simulators for robotic manipulation. However, prevalent 2D video-based models inherently lack the spatial and kinematic reasoning crucial for physical interactions. We introduce RoDyn, a novel Robot-Dynamic 2.5D World Model that formulates environmental dynamics within a highly efficient, geometry-aware latent space. Through the proposed Robot-Dynamic Tokenizer, we explicitly couple semantic visual appearances with spatial and agent-centric priors via an RGB-dominated cross-attention mechanism and dynamic mask guidance. Furthermore, by injecting these mask priors directly into sequence transitions, our Mask-guided Autoregressive architecture drives the model to focus on active robot-object interaction regions. Extensive experiments demonstrate that RoDyn establishes SOTA generation fidelity across large-scale datasets. Crucially, it translates these predictive capabilities into substantial downstream gains, accelerating model-based reinforcement learning and achieving a 42% improvement in real-world imitation learning success rates over pure 2D baselines.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation
cs.RO 2026-06 unverdicted novelty 7.0

DVG-WM disentangles dynamics learning from visual synthesis in video world models using flow matching and latent degradation, achieving up to 3.97 times faster inference with improved quality on LIBERO and real-world r...

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
cs.CV 2026-05 unverdicted novelty 6.0

GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
cs.CV 2026-05 unverdicted novelty 5.0

GEM-4D is a video world model that injects 4D correspondence supervision to improve geometric consistency and robot manipulation success from 61% to 81%.