Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation

Guanghui Ren; Hao Dong; Jiadong Xu; Juan Zhu; Liang Heng; Muhe Cai; Xiaoqi Li; Yan Shen; Yiwen Wang

arxiv: 2509.17125 · v2 · pith:AEJQGOOKnew · submitted 2025-09-21 · 💻 cs.RO

keywords geometric · imagine2act · semantic · imagined · object · objects · tasks · capture

Relational object rearrangement (ROR) tasks (e.g., inserting a flower into a vase) require a robot to manipulate objects with precise semantic and geometric reasoning. Existing approaches either rely on pre-collected demonstrations, which struggle to capture complex geometric constraints, or generate goal-state observations to capture semantic and geometric knowledge but fail to explicitly couple object transformation with action prediction, which leads to errors caused by generative noise. To address these limitations, we propose Imagine2Act, a 3D imitation-learning framework that incorporates semantic and geometric constraints of objects into policy learning to tackle high-precision manipulation tasks. We first generate imagined goal images conditioned on language instructions and reconstruct the corresponding 3D point clouds to provide robust semantic and geometric priors. These imagined goal point clouds serve as additional inputs to the policy model, while an object-action consistency strategy with soft pose supervision explicitly aligns the predicted end-effector motion with the generated object transformation. This design enables Imagine2Act to reason about semantic and geometric relationships between objects and to predict accurate actions across diverse tasks. Experiments in both simulation and the real world demonstrate that Imagine2Act outperforms previous state-of-the-art policies. More visualizations can be found at https://sites.google.com/view/imagine2act.
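
The object-action consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes known point correspondences between the current and imagined goal point clouds, a rigidly grasped object (so the world-frame object transform should match the world-frame end-effector motion), and a hypothetical loss form (translation distance plus weighted geodesic rotation angle). The names `kabsch`, `consistency_loss`, and `rot_weight` are illustrative only.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    Assumes src[i] corresponds to dst[i] (an assumption of this sketch).
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def rotation_angle(R):
    """Geodesic angle (radians) of a rotation matrix."""
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(cos))

def consistency_loss(R_obj, t_obj, R_ee, t_ee, rot_weight=1.0):
    """Hypothetical penalty on disagreement between the object transform
    (current -> imagined goal) and the predicted end-effector motion."""
    rot_err = rotation_angle(R_obj.T @ R_ee)
    trans_err = float(np.linalg.norm(t_obj - t_ee))
    return trans_err + rot_weight * rot_err

# Toy example: the imagined goal is the object rotated 30 degrees and shifted.
rng = np.random.default_rng(0)
current_pts = rng.normal(size=(50, 3))
R_true, t_true = rot_z(np.pi / 6), np.array([0.1, -0.2, 0.3])
goal_pts = current_pts @ R_true.T + t_true

R_obj, t_obj = kabsch(current_pts, goal_pts)
loss_consistent = consistency_loss(R_obj, t_obj, R_true, t_true)
loss_static = consistency_loss(R_obj, t_obj, np.eye(3), np.zeros(3))
```

In this toy setup, a predicted motion that matches the object transform gives a loss near zero, while a prediction of "no motion" is penalized. A "soft" variant in the sense of the abstract would presumably down-weight this term to tolerate noise in the generated goal, but the exact formulation is not given in the text above.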

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
cs.CV 2026-05 unverdicted novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...

World Models for Robotic Manipulation: A Survey
cs.RO 2026-05 accept novelty 5.0

Survey organizing world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, while reviewing 34 datasets and e...