pith. sign in

arxiv: 2606.16917 · v3 · pith:ST27CDNCnew · submitted 2026-06-15 · 💻 cs.RO

Unified Motion-Action Modeling for Heterogeneous Robot Learning

Pith reviewed 2026-06-27 04:05 UTC · model grok-4.3

classification 💻 cs.RO
keywords robot learningvisuomotor controldynamics modelingpretrainingheterogeneous datamasked generative model3D motion trajectories
0
0 comments X

The pith

Pretrained model uses 3D motion trajectories to unify control, dynamics, and adaptation across data types

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a single set of model parameters can be pretrained on a mixture of robot demonstrations, human videos, and simulated data to handle multiple robotic tasks at deployment. It does this by using 3D object motion trajectories as the common representation that links actions and motions. The mask pattern in the generative model decides whether the model is predicting motions or actions during training and testing. This removes the need for task labels and lets the model switch modes without retraining. A sympathetic reader would care because it suggests fewer specialized models are needed for robot learning.

Core claim

UMA treats object motion and robot actions as co-evolving variables under a masked generative objective. The mask pattern determines the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions.

What carries the argument

The masked generative objective on co-evolving object motion trajectories and robot actions, where the mask pattern sets both training supervision and deployment inference mode.

If this is right

  • The pretrained model supports motion-conditioned visuomotor control.
  • It supports motion-based dynamics modeling.
  • It enables task adaptation from few-shot demonstrations.
  • It outperforms state-of-the-art baselines specialized for each inference mode when pretrained on mixed data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could allow direct use of unlabeled internet videos for robot skill learning.
  • The trajectory interface might apply to other embodied AI domains like navigation or manipulation in new environments.
  • Adding more diverse simulation data could further boost performance on real robot tasks.

Load-bearing premise

That 3D object motion trajectories provide a sufficient shared interface to bridge visuomotor control and dynamics modeling across heterogeneous data sources without requiring manually annotated task instructions.

What would settle it

Testing the model on a dataset where accurate 3D object trajectories cannot be obtained from the input videos and checking if performance on control tasks drops below that of a robot-only baseline.

Figures

Figures reproduced from arXiv: 2606.16917 by Andrew Owens, Chao Feng, Kuan Fang, Meryl Zhang, Shitong Liu, Xuanchen Lu, Yunhao Cao.

Figure 1
Figure 1. Figure 1: Unified Motion-Action (UMA) Model. UMA uses object motion as a shared interface for heterogeneous robot learning. Pretraining effectively combines action-free videos, real robot data, and simulated robot data by representing task intent, observations, object motion, and robot actions as tokens under a masked generative objective. The same pretrained parameters then flexibly support visuomotor control, dyna… view at source ↗
Figure 2
Figure 2. Figure 2: Pre-Training of UMA. Left: UMA is trained with a flow matching objective to predict randomly masked object motion and robot actions, conditioned on a task latent and visual observation. Right: We encode the reference motion and initial observation into task tokens, using both flow￾matching and contrastive objectives to ensure semantic consistency of the learned task representation. 27, 28, 29, 30]. These f… view at source ↗
Figure 3
Figure 3. Figure 3: Zero-shot evaluation. Left: real-world evaluation tasks used throughout our experiments. Right: success rates for motion-conditioned visuomotor control without task-specific finetuning. Method MSE ↓ PointWorld [9] 0.054 UMA w/o Sim 0.208 UMA w/o Human 0.044 UMA (Ours) 0.042 [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Few-shot adaptation. Success rates for adapting to new tasks from 25 target demonstrations under action supervision and motion supervision. Grasping Failures Execution Failures 18.33% 81.67% [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study. We evaluate the average suc￾cess rates over simulation. Model design ablation. To address Q3, we evaluate three architec￾ture variants trained on simulated robot data across three simulated tasks of 100 episodes each ( [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Masked DiT block. Each block applies adaptive layer normalization (adaLN) with two independent sets of modulation parameters, one for target (masked) tokens and one for given (unmasked) tokens, both conditioned on the diffusion timestep embedding t. The target branch learns denoising-appropriate scale and shift, while the given branch preserves clean conditioning signals with minimal distortion. layer norm… view at source ↗
Figure 8
Figure 8. Figure 8: Data pipeline. We extract 3D keypoint trajectory supervision from monocular RGB videos by estimating camera motion and depth, aligning depth to metric scale, segmenting and sampling task-relevant object points, and tracking the resulting 3D keypoints over time. temporal attention linking the same keypoint across timesteps, and context attention connecting all target tokens to observation and task tokens. T… view at source ↗
Figure 9
Figure 9. Figure 9: Multimodal task conditioning. Left: instruction following replaces the motion-derived task latent with tokens from a text encoder. Right: goal reaching uses a user-provided object description, SAM 3 segmentation, and RoMaV2-based 2D point matching to convert a goal image into a sparse start-to-end reference motion. Start End Put the cable on the table into the container Goal Image Language Instruction Lang… view at source ↗
Figure 10
Figure 10. Figure 10: Task execution under alternative inference modes. The same pretrained UMA check￾point performs instruction following (top rows) and goal reaching (bottom rows) without retraining. In instruction-following mode, a text instruction replaces the reference motion and the language encoder produces the task latent. In goal-reaching mode, a goal image is converted into a sparse two-timestep reference motion via … view at source ↗
Figure 11
Figure 11. Figure 11: Data and model scaling analysis. We report the average success rate across the three simulation tasks for six configurations shown as line charts. Both data scale and model parameter scale contribute to performance, with full model achieving the strongest result. D.2 Goal Reaching goal reaching specifies the task through a goal image og depicting the desired final configuration of the scene, together with… view at source ↗
Figure 12
Figure 12. Figure 12: Task execution. We show representative rollouts of UMA on the three real world evaluation tasks. The same pretrained model executes rigid object insertion, tool use, and deformable folding by conditioning on task motion and replanning from the current observation. sweeping, and deformable folding, matching the real world tasks used in the quantitative evaluation. These rollouts are intended to illustrate … view at source ↗
read the original abstract

We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents the Unified Motion-Action (UMA) Model, which uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling across heterogeneous sources. UMA models object motion and robot actions as co-evolving variables under a masked generative objective, where mask patterns control both pretraining supervision and deployment inference mode. Hindsight-relabeled motion contexts and a contrastive objective disentangle task intent from scene geometry, enabling multi-task pretraining on robot demonstrations, human videos, and simulated data without task annotations. The same parameters then support motion-conditioned visuomotor control, dynamics modeling, and few-shot task adaptation. The abstract claims consistent outperformance over mode-specific baselines.

Significance. If validated, the approach would offer a parameter-efficient unification of control and dynamics modeling via a geometry-based interface, potentially reducing the need for task-specific annotations and enabling cross-domain transfer. The masked generative formulation and contrastive disentanglement represent a coherent technical contribution if the 3D trajectory interface proves robust. However, the absence of any quantitative results, baselines, or ablation details in the abstract prevents assessment of whether these elements deliver measurable gains over existing multi-modal or trajectory-based robot learning methods.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'UMA consistently outperforms state-of-the-art baselines specialized for each inference mode' is asserted without any metrics, baselines, datasets, error bars, or experimental protocol. This absence makes the outperformance claim unverifiable and load-bearing for the unification thesis.
  2. [Abstract] Abstract: The manuscript relies on 3D object motion trajectories extracted from human videos as a low-noise shared interface for cross-source bridging, yet provides no verification, accuracy metrics, or ablation on extraction errors, viewpoint variation, or monocular depth ambiguity. If these trajectories contain systematic biases, the hindsight relabeling and contrastive loss cannot reliably separate intent from geometry, undermining the multi-task pretraining claim.
minor comments (1)
  1. The abstract would be strengthened by a single sentence summarizing the key quantitative result (e.g., average improvement or success rate) that supports the outperformance statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these focused comments on the abstract and the 3D trajectory interface. Both points identify places where the manuscript can be strengthened for clarity and verifiability. We address each below and commit to revisions that directly respond to the concerns without altering the core technical claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'UMA consistently outperforms state-of-the-art baselines specialized for each inference mode' is asserted without any metrics, baselines, datasets, error bars, or experimental protocol. This absence makes the outperformance claim unverifiable and load-bearing for the unification thesis.

    Authors: We agree that the abstract should not make a quantitative claim without supporting detail. The full manuscript reports these results in Section 4 (Tables 1-3), including specific metrics, baselines (e.g., RT-1, R3M, dynamics models), datasets (robot demos, human videos, simulation), and error bars across seeds. To address the referee's point, we will revise the abstract to include one or two representative numbers (e.g., success rate deltas) and name the primary baselines and data sources, while keeping the length within limits. This makes the claim verifiable from the abstract alone. revision: yes

  2. Referee: [Abstract] Abstract: The manuscript relies on 3D object motion trajectories extracted from human videos as a low-noise shared interface for cross-source bridging, yet provides no verification, accuracy metrics, or ablation on extraction errors, viewpoint variation, or monocular depth ambiguity. If these trajectories contain systematic biases, the hindsight relabeling and contrastive loss cannot reliably separate intent from geometry, undermining the multi-task pretraining claim.

    Authors: The extraction pipeline is described in Section 3.2, but we acknowledge the absence of dedicated verification. We will add a new paragraph and accompanying table in Section 4.4 (or an appendix) reporting trajectory extraction accuracy against ground-truth motion capture on a held-out human video subset, plus ablations on viewpoint variation and depth estimation noise. If biases are detected, we will quantify their effect on the contrastive loss and discuss mitigation via the hindsight relabeling. This directly tests whether the interface remains reliable for disentanglement. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The provided abstract and description define UMA via distinct components: 3D trajectories as interface, masked generative objective (mask pattern sets supervision/inference), hindsight-relabeling, and contrastive disentanglement. These are presented as modeling choices leading to empirical pretraining on mixed data and mode-specific inference, with performance evaluated against external baselines. No equations, self-definitions, fitted parameters renamed as predictions, or self-citation chains are visible that reduce claims to inputs by construction. The derivation remains self-contained against external benchmarks and data sources.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond the high-level model description.

pith-pipeline@v0.9.1-grok · 5683 in / 971 out tokens · 36845 ms · 2026-06-27T04:05:00.243379+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 1 canonical work pages

  1. [1]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. Pe...

  2. [2]

    M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  3. [3]

    Black, N

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π0: A vision-language- action flow model for general robot control, 2024. URL https://arxiv.org...

  4. [4]

    Hafner, T

    D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  5. [5]

    M. Yang, Y . Du, K. Ghasemipour, J. Tompson, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2023

  6. [6]

    Agarwal, A

    N. Agarwal, A. Ali, M. Bala, Y . Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y . Chen, Y . Cui, Y . Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  7. [7]

    M. Xu, Z. Xu, Y . Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song. Flow as the cross-domain manipulation interface.arXiv preprint arXiv:2407.15208, 2024

  8. [8]

    H. Zhi, P. Chen, S. Zhou, Y . Dong, Q. Wu, L. Han, and M. Tan. 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model, June 2025

  9. [9]

    Huang, Y .-W

    W. Huang, Y .-W. Chao, A. Mousavian, M.-Y . Liu, D. Fox, K. Mo, and L. Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipulation, 2026. URL https://arxiv. org/abs/2601.03782

  10. [10]

    C. Yuan, C. Wen, T. Zhang, and Y . Gao. General flow as foundation affordance for scalable robot learning.arXiv preprint arXiv:2401.11439, 2024

  11. [11]

    Andrychowicz, F

    M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba. Hindsight experience replay.Advances in neural information processing systems, 30, 2017. 9

  12. [12]

    Y . Cao, Z. Bhaumik, J. Jia, X. He, and K. Fang. Correspondence-oriented imitation learning: Flexible visuomotor control with 3d conditioning, 2025. URL https://arxiv.org/abs/ 2512.05953

  13. [13]

    C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y . Gao, and P. Abbeel. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2023

  14. [14]

    Vecerik, C

    M. Vecerik, C. Doersch, Y . Yang, T. Davchev, Y . Aytar, G. Zhou, R. Hadsell, L. Agapito, and J. Scholz. RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation, Aug. 2023

  15. [15]

    J. Gu, S. Kirmani, P. Wohlhart, Y . Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu, P. Sundaresan, P. Xu, H. Su, K. Hausman, C. Finn, Q. H. Vuong, and T. Xiao. Rt- trajectory: Robotic task generalization via hindsight trajectory sketches.ArXiv, 2023

  16. [16]

    C. Gao, H. Zhang, Z. Xu, Z. Cai, and L. Shao. Flip: Flow-centric generative planning for general-purpose manipulation tasks.arXiv, 2024

  17. [17]

    J. Ren, P. Sundaresan, D. Sadigh, S. Choudhury, and J. Bohg. Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning, 2025

  18. [18]

    Haldar and L

    S. Haldar and L. Pinto. Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation, Feb. 2025

  19. [19]

    Dharmarajan, W

    K. Dharmarajan, W. Huang, J. Wu, L. Fei-Fei, and R. Zhang. Dream2flow: Bridging video generation and open-world manipulation with 3d object flow, 2025. URL https://arxiv. org/abs/2512.24766

  20. [20]

    C.-C. Hsu, B. Wen, J. Xu, Y . Narang, X. Wang, Y . Zhu, J. Biswas, and S. Birchfield. SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation, Nov. 2024. URL http: //arxiv.org/abs/2411.00965. arXiv:2411.00965 [cs]

  21. [21]

    Y . Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids.arXiv preprint arXiv:1810.01566, 2018

  22. [22]

    Zhang, B

    K. Zhang, B. Li, K. Hauser, and Y . Li. Particle-grid neural dynamics for learning deformable object models from rgb-d videos. InProceedings of Robotics: Science and Systems (RSS), 2025

  23. [23]

    Ghosh, H

    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

  24. [24]

    Intelligence, K

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

  25. [25]

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  26. [26]

    L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling.arXiv preprint arXiv:2106.01345, 2021

  27. [27]

    P. Wu, A. Majumdar, K. Stone, Y . Lin, I. Mordatch, P. Abbeel, and A. Rajeswaran. Masked trajectory models for prediction, representation, and control. InInternational Conference on Machine Learning, pages 37607–37623. PMLR, 2023. 10

  28. [28]

    F. Liu, H. Liu, A. Grover, and P. Abbeel. Masked autoencoding for scalable and generalizable decision making.Advances in Neural Information Processing Systems, 35:12608–12618, 2022

  29. [29]

    Radosavovic, B

    I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik. Robot learning with sensorimotor pre-training. InConference on Robot Learning, pages 683–693. PMLR, 2023

  30. [30]

    S. Li, Y . Gao, D. Sadigh, and S. Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025

  31. [31]

    Ebert, C

    F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control.arXiv preprint arXiv:1812.00568, 2018

  32. [32]

    Hafner, T

    D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019

  33. [33]

    Hafner, T

    D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

  34. [34]

    Y . Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y . Jiang, Y . Hu, J. Cai, S. Liu, J. Luo, L. Chen, S. Yan, M. Yao, and G. Ren. Genie envisioner: A unified world foundation platform for robotic manipulation, 2025. URLhttps://arxiv.org/abs/2508.05635

  35. [35]

    J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Lin, L. Magne, A. Mandlekar, A. Narayan, Y . L. Tan, G. Wang, J. Wang, Q. Wang, Y . Xu, X. Zeng, K. Zheng, R. Zheng, M.-Y . Liu, L. Zettlemoyer, D. Fox, J. Kautz, S. Reed, Y . Zhu, and L. Fan. Dreamgen: Unlocking generalization in robot learning through video world...

  36. [36]

    A. Ali, J. Bai, M. Bala, Y . Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y .-W. Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

  37. [37]

    Black, M

    K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero- shot robotic manipulation with pretrained image-editing diffusion models.arXiv preprint arXiv:2310.10639, 2023

  38. [38]

    Bardes, Q

    A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas. Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471, 2024

  39. [39]

    Assran, A

    M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

  40. [40]

    Jiang, H.-Y

    H. Jiang, H.-Y . Hsu, K. Zhang, H.-N. Yu, S. Wang, and Y . Li. Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos, 2025. URL https://arxiv. org/abs/2503.17973

  41. [41]

    C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Cou- pling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025

  42. [42]

    J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  43. [43]

    Zhang, H

    W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, F. Lu, H. Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025. 11

  44. [44]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers.arXiv preprint arXiv:2212.09748, 2022

  45. [45]

    Dasari, O

    S. Dasari, O. Mees, S. Zhao, M. K. Srirama, and S. Levine. The ingredients for robotic diffusion transformers.arXiv preprint arXiv:2410.10088, 2024

  46. [46]

    C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

  47. [47]

    Tschannen, A

    M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa, O. Hénaff, J. Harmsen, A. Steiner, and X. Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

  48. [48]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  49. [49]

    Lipman, R

    Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow Matching for Generative Modeling, Feb. 2023

  50. [50]

    T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020

  51. [51]

    Khazatsky, K

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

  52. [52]

    Y . Liu, Y . Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, June 2022

  53. [53]

    Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations, 2026

    Ropedia. Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations, 2026. Dataset

  54. [54]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  55. [55]

    Bloom, J

    S. Bloom, J. C. Brumberg, I. Fisk, R. J. Harrison, R. Hull, M. Ramasubramanian, K. V . Vliet, and J. Wing. Empire AI: A new model for provisioning AI and HPC for academic research in the public good. InPractice and Experience in Advanced Research Computing (PEARC ’25), page 4, Columbus, OH, USA, July 2025. ACM. doi:10.1145/3708035.3736070. URL https://doi...

  56. [56]

    Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V . Ye, A. Kanazawa, A. Holynski, and N. Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In 12 Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10486–10496, 2025

  57. [57]

    Piccinelli, Y .-H

    L. Piccinelli, Y .-H. Yang, C. Sakaridis, M. Segu, S. Li, L. Van Gool, and F. Yu. Unidepth: Universal monocular metric depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024

  58. [58]

    Carion, L

    N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

  59. [59]

    Zhang, L

    B. Zhang, L. Ke, A. W. Harley, and K. Fragkiadaki. Tapip3d: Tracking any point in persistent 3d geometry.arXiv preprint arXiv:2504.14717, 2025

  60. [60]

    Calli, A

    B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar. Benchmarking in manipulation research: The ycb object and model set and benchmarking protocols.arXiv preprint arXiv:1502.03143, 2015

  61. [61]

    Xiang, Y

    F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su. SAPIEN: A simulated part-based interactive environment. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  62. [62]

    Downs, A

    L. Downs, A. Francis, N. Koenig, B. Kinman, R. Hickman, K. Reymann, T. B. McHugh, and V . Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items,

  63. [63]

    URLhttps://arxiv.org/abs/2204.11918

  64. [64]

    K. Zakka. Scanned Objects MuJoCo Models, 7 2022. URL https://github.com/ kevinzakka/mujoco_scanned_objects

  65. [65]

    objects”: [“teal cup

    J. Edstedt, D. Nordström, Y . Zhang, G. Bökman, J. Astermark, V . Larsson, A. Heyden, F. Kahl, M. Wadenbäck, and M. Felsberg. RoMa v2: Harder Better Faster Denser Feature Matching. arXiv preprint arXiv:2511.15706, 2025. 13 A Implementation Details This section provides implementation details that complement the architectural overview in the main paper. We...