Cloak: Zero-Shot Cross-Embodiment Manipulation by Masking the End-Effector from the VLA

C. Karen Liu; Guy Tevet; Michael Piseno

arxiv: 2606.22836 · v1 · pith:6GGTSSSAnew · submitted 2026-06-22 · 💻 cs.RO

Pith reviewed 2026-06-26 08:50 UTC · model grok-4.3

classification 💻 cs.RO

keywords zero-shot cross-embodiment · VLA · end-effector masking · robot manipulation · vision-language-action · wrist camera · transfer learning

The pith

Masking the end-effector in wrist-camera images lets a VLA trained on one gripper control unseen robot bodies zero-shot.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that hiding the end-effector from the wrist view during training makes a Vision-Language-Action model treat visual input as independent of its own hardware. It renders the mask directly from the robot's known geometry in simulation, then augments the mask so the model learns to ignore embodiment-specific appearance. The resulting model, trained only on parallel-jaw gripper data, is applied without any new collection to a different gripper, a different arm, and a five-fingered hand. Performance on the original embodiment stays the same. This decouples collected data from the hardware that produced it.

Core claim

Cloak endows a VLA with zero-shot cross-embodiment transfer by cloaking the end-effector from its own wrist camera. The end-effector occupies a large and consistent region of the wrist view and masking it allows for embodiment-agnostic visual reasoning. Cloak renders a mask in simulation from the robot's known geometry, accurately and in real time, with no segmentation or generative models. During training the mask is augmented so the model generalizes to embodiments unseen at training time. Cloak-VLA trained on a single parallel-jaw gripper dataset transfers zero-shot to various unseen embodiments while preserving the source embodiment's performance.

What carries the argument

The Cloak mask: a real-time rendered silhouette of the end-effector generated from known robot geometry and augmented during training to force embodiment-agnostic reasoning.
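To make the rendering step concrete, here is a minimal sketch of projecting known end-effector geometry into the wrist view to obtain a binary mask. It is an illustration under stated assumptions, not the paper's renderer: the `Pinhole` camera model, the approximation of the end-effector by sphere primitives, and every name and number below are hypothetical. The paper renders the mask in simulation from the robot's geometry; this sketch swaps in a closed-form pinhole projection to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Pinhole:
    """Hypothetical pinhole intrinsics for the wrist camera (pixels)."""
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


def render_sphere_mask(cam, spheres):
    """Rasterize spheres (x, y, z, r), given in the camera frame with z
    pointing forward, into a boolean mask of shape height x width."""
    mask = [[False] * cam.width for _ in range(cam.height)]
    for x, y, z, r in spheres:
        if r <= 0 or z <= r:
            # Behind or touching the image plane: skipped in this sketch.
            continue
        # Project the sphere centre with the pinhole model.
        u = cam.fx * x / z + cam.cx
        v = cam.fy * y / z + cam.cy
        # Weak-perspective approximation of the projected radius.
        ru = cam.fx * r / z
        rv = cam.fy * r / z
        u0, u1 = max(0, int(u - ru)), min(cam.width - 1, int(u + ru) + 1)
        v0, v1 = max(0, int(v - rv)), min(cam.height - 1, int(v + rv) + 1)
        for row in range(v0, v1 + 1):
            for col in range(u0, u1 + 1):
                if ((col - u) / ru) ** 2 + ((row - v) / rv) ** 2 <= 1.0:
                    mask[row][col] = True
    return mask


if __name__ == "__main__":
    cam = Pinhole(fx=100.0, fy=100.0, cx=32.0, cy=32.0, width=64, height=64)
    # One sphere of radius 10 cm, 1 m in front of the camera (made-up values).
    mask = render_sphere_mask(cam, [(0.0, 0.0, 1.0, 0.1)])
    print(sum(sum(row) for row in mask), "masked pixels")
```

In a real pipeline the primitive positions would presumably come from forward kinematics on the robot state combined with the wrist-camera extrinsics; that step is omitted here.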

If this is right

The same model controls another gripper, another arm, and a five-fingered hand without any new data or fine-tuning.
Performance on the source parallel-jaw gripper remains unchanged after the masking procedure.
Robot datasets collected on one embodiment remain usable after the hardware is replaced or retired.
No segmentation models or generative models are required to produce the masks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same masking idea could be applied to other visible robot parts such as the arm links if they appear in the wrist view.
Large multi-embodiment datasets might be assembled more easily if each source is masked before mixing.
The approach may reduce the need to retrain policies when only the end-effector changes on an existing robot.
Testing the method on camera placements other than the wrist could reveal how much the benefit depends on the end-effector dominating the image.

Load-bearing premise

Rendering an accurate mask from the robot's known geometry and augmenting it during training is sufficient to make visual reasoning ignore embodiment details for bodies never seen in the training data.
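The augmentation half of this premise can be pictured with a small sketch. The abstract does not say how the mask is augmented, so the operations below (random dilation followed by a random pixel shift) and the parameters `max_dilate` and `max_shift` are assumptions chosen purely for illustration, not the paper's schedule.

```python
import random


def dilate(mask, radius):
    """Grow the True region by `radius` pixels using a square neighbourhood."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = True
    return out


def shift(mask, dr, dc):
    """Translate the mask by (dr, dc); pixels pushed off the image are dropped."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c] and 0 <= r + dr < h and 0 <= c + dc < w:
                out[r + dr][c + dc] = True
    return out


def augment_mask(mask, rng, max_dilate=3, max_shift=2):
    """Hypothetical augmentation: random dilation, then a random shift."""
    radius = rng.randint(0, max_dilate)
    dr = rng.randint(-max_shift, max_shift)
    dc = rng.randint(-max_shift, max_shift)
    return shift(dilate(mask, radius), dr, dc)


if __name__ == "__main__":
    base = [[False] * 8 for _ in range(8)]
    base[4][4] = True
    augmented = augment_mask(base, random.Random(0))
    for row in augmented:
        print("".join("#" if px else "." for px in row))
```

Whether perturbations of this kind cover the silhouettes of very different end-effectors is exactly what the premise leaves open.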

What would settle it

A clear drop in success rate when the trained model is tested on the five-fingered hand or another unseen arm, relative to its performance on the original parallel-jaw gripper, would show the transfer has failed.
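One way to make "a clear drop" operational is to compare confidence intervals on the two success rates. The sketch below uses the Wilson score interval and calls a drop clear when the target embodiment's upper bound falls below the source embodiment's lower bound. This criterion, the function names, and the trial counts in the usage example are all hypothetical; they are not taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return centre - half, centre + half


def clear_drop(source, target, z=1.96):
    """True if the target's upper bound is below the source's lower bound.

    `source` and `target` are (successes, trials) pairs.
    """
    source_low, _ = wilson_interval(*source, z=z)
    _, target_high = wilson_interval(*target, z=z)
    return target_high < source_low


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(clear_drop(source=(18, 20), target=(5, 20)))
    print(clear_drop(source=(18, 20), target=(16, 20)))
```

Non-overlapping intervals are a conservative test, so a real evaluation might prefer a direct two-proportion test or per-task comparisons.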

Figures

Figures reproduced from arXiv: 2606.22836 by C. Karen Liu, Guy Tevet, Michael Piseno.

**Figure 2.** Overview. Cloak renders a geometric mask of the end-effector using the robot state and wrist camera parameters, augments it during training, and uses it to compute an attention mask for the vision encoder, cloaking the end-effector from the wrist view. The resulting image patch tokens, the source-robot state, and the language prompt drive a single VLA backbone and action head. On an unseen embodiment, tip-…

**Figure 3.** Task-averaged progression rate. The original gripper does not use TP retargeting because …

**Figure 4.** Wrist camera extrinsics estimation on a representative DROID frame. From left to right: …

**Figure 5.** Example trial setups across the four tasks (two per task). The bright yellow number …

**Figure 6.** Keyframe rollouts on the unseen Sharpa hand for the pick-and-place task, prompt …

**Figure 7.** Keyframe rollouts on the unseen UMI gripper for the fold task, prompt …

**Figure 8.** Keyframe rollouts on the unseen YAM arm and gripper for the remove task, prompt …

**Figure 9.** Ablation rollouts on the unseen Sharpa hand for the move task, prompt …
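The Figure 2 caption says the rendered mask is turned into an attention mask for the vision encoder. A minimal sketch of what that conversion could look like for a patch-based encoder follows. The coverage `threshold`, the function names, and the additive-bias form are assumptions for illustration; the caption does not specify how the paper maps pixels to patch tokens.

```python
def patch_keep_mask(pixel_mask, patch, threshold=0.5):
    """Return one boolean per image patch (row-major order).

    True means the patch token stays visible; False means it is cloaked
    because at least `threshold` of its pixels lie inside the mask.
    """
    h, w = len(pixel_mask), len(pixel_mask[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    keep = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            covered = sum(
                pixel_mask[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            keep.append(covered / (patch * patch) < threshold)
    return keep


def attention_bias(keep):
    """Additive attention bias per key token: 0 for visible, -inf for cloaked."""
    return [0.0 if visible else float("-inf") for visible in keep]


if __name__ == "__main__":
    # 4x4 image whose top-left 2x2 block is covered by the end-effector mask.
    pixel_mask = [
        [True, True, False, False],
        [True, True, False, False],
        [False, False, False, False],
        [False, False, False, False],
    ]
    keep = patch_keep_mask(pixel_mask, patch=2)
    print(keep)
    print(attention_bias(keep))
```

Adding the bias to the attention logits before the softmax gives cloaked patches zero weight, so no other token can read from them.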

Original abstract

We present Cloak, a training recipe that endows a Vision-Language-Action (VLA) model with zero-shot cross-embodiment transfer by cloaking the end-effector from its own wrist camera. The end-effector occupies a large and consistent region of the wrist view and masking it allows for embodiment-agnostic visual reasoning. Cloak renders a mask in simulation from the robot's known geometry, accurately and in real time, with no segmentation or generative models. During training, we augment the mask so the model generalizes to embodiments unseen at training time. We demonstrate the recipe with Cloak-VLA, a VLA trained with Cloak on a single parallel-jaw gripper dataset. No data of new embodiments is ever collected. Cloak-VLA transfers zero-shot to various unseen embodiments, including another gripper, another arm, and a five-fingered hand, while preserving the source embodiment's performance. By decoupling the wrist view from its own embodiment, Cloak allows data to outlive the hardware it was collected on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cloak's masking recipe is a clean, simulation-driven way to decouple wrist views from specific end-effectors, but the zero-shot claims rest on unshown results and an augmentation step whose coverage is unclear.

The paper's core move is to render an end-effector mask from known robot geometry in simulation, then augment that mask during training so a VLA trained only on parallel-jaw data can run on unseen grippers, arms, and even a five-fingered hand. This is new as a targeted, low-overhead augmentation for VLAs rather than full domain randomization or new data collection.

It does one thing cleanly: it removes the most embodiment-specific visual cue (the gripper itself) without needing learned segmentation or generative models, and the simulation route keeps the mask accurate and real-time. That directly attacks the data-reuse problem in robotics.

The soft spot is the lack of visible numbers. The abstract states successful zero-shot transfer while preserving source performance, yet no baselines, success rates, or failure modes appear in what is provided. The augmentation schedule is the load-bearing piece, and without an ablation or characterization of how the mask variations are sampled, it is hard to judge whether the method truly generalizes or just works for the tested cases. The stress-test concern about the augmentation failing to cover drastically different geometries (shape, articulation, occlusion) is reasonable given the information; nothing in the description shows the distribution is broad enough.

This is for people building or deploying VLAs who already have one solid dataset and want to stretch it across hardware. A reader who cares about practical transfer tricks will find the recipe worth trying.

It deserves peer review because the idea is simple, the motivation is solid, and the method is reproducible from the description even if the current evidence is thin.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Cloak, a training recipe for Vision-Language-Action (VLA) models that enables zero-shot cross-embodiment transfer by rendering and augmenting masks of the end-effector from known geometry in simulation. A model trained only on parallel-jaw gripper data is claimed to transfer without further data collection to unseen embodiments including another gripper, another arm, and a five-fingered hand, while preserving source-embodiment performance.

Significance. If the empirical results hold under rigorous validation, the work would be significant for robotics by providing a practical, simulation-only mechanism to decouple wrist-camera visual reasoning from specific hardware, allowing datasets to outlive the robots on which they were collected. The geometry-based real-time masking without segmentation or generative models is a concrete engineering strength that could be adopted in other VLA pipelines.

major comments (1)

[Abstract] The central claim that mask augmentation during training on a single parallel-jaw dataset produces embodiment-agnostic features for a never-seen five-fingered hand is load-bearing, yet no quantitative metrics, baselines, ablation isolating the augmentation schedule, or failure-case analysis are reported; without these it is impossible to verify whether the augmentation distribution actually covers the required appearance statistics of arbitrary new end-effector geometries.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying a point where the presentation of evidence for the core claim can be strengthened. We address the comment below and commit to revisions that will make the supporting metrics, ablations, and analysis explicit.

Referee: [Abstract] The central claim that mask augmentation during training on a single parallel-jaw dataset produces embodiment-agnostic features for a never-seen five-fingered hand is load-bearing, yet no quantitative metrics, baselines, ablation isolating the augmentation schedule, or failure-case analysis are reported; without these it is impossible to verify whether the augmentation distribution actually covers the required appearance statistics of arbitrary new end-effector geometries.

Authors: We agree that the abstract states the claim concisely without the supporting numbers and that an explicit ablation isolating the augmentation schedule together with failure-case analysis would allow readers to assess coverage of new end-effector appearance statistics. The experiments section already reports success rates on the five-fingered hand and comparisons against a no-masking baseline, but these elements are not summarized in the abstract and the augmentation ablation is not isolated as a single controlled study. We will therefore revise the abstract to include the key quantitative transfer metrics, add a dedicated ablation subsection that varies only the augmentation schedule, and include a failure-case analysis (with examples of when transfer succeeds or degrades) in the main text or supplementary material.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is data augmentation with external held-out evaluation

The paper describes a masking-based data augmentation procedure applied during training on a single parallel-jaw gripper dataset, with zero-shot transfer evaluated on unseen embodiments (different grippers, arms, five-fingered hand). No equations, fitted parameters, or predictions are defined; success is measured by empirical performance on held-out hardware rather than any reduction to training inputs by construction. No self-citations or uniqueness theorems are invoked as load-bearing steps. The derivation chain consists of a rendering step from known geometry plus augmentation, both independent of the target result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the end-effector dominates the wrist view and that simulation masking plus augmentation suffices for generalization. The abstract gives no values for the one free parameter identified below and introduces no invented entities.

free parameters (1)

mask augmentation schedule
Parameters controlling how the rendered mask is varied during training to promote generalization; values not specified in abstract.

axioms (1)

Domain assumption: the end-effector occupies a large and consistent region of the wrist view.
Explicitly stated in the abstract as the justification for masking.

Reference graph

Works this paper leans on

38 extracted references · 12 canonical work pages · 5 internal anchors

[1] O. X.-E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. W... arXiv, 2023.

[2] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ... 2024.

[3] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of Robotics: Science and Systems (RSS), 2024.

[4] NVIDIA, J. Bjorck, N. C. Fernando Castañeda, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu, ... 2025.

[5] M. Xu, H. Zhang, Y. Hou, Z. Xu, L. Fan, M. Veloso, and S. Song. Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation. In Conference on Robot Learning, pages 437–459. PMLR, 2025.

[6] L. Y. Chen, C. Xu, K. Dharmarajan, M. Z. Irshad, R. Cheng, K. Keutzer, M. Tomizuka, Q. Vuong, and K. Goldberg. Rovi-aug: Robot and viewpoint augmentation for cross-embodiment robot learning. In Conference on Robot Learning (CoRL), Munich, Germany, 2024.

[7] C. Yuan, S. Joshi, S. Zhu, H. Su, H. Zhao, and Y. Gao. Roboengine: Plug-and-play robot data augmentation with semantic robot segmentation and background generation. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7622–

[8] K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, et al. $\pi_{0.5}$: a vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025.

[9] M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[10] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

[11] J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.

[12] Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.

[13] R. Doshi, H. Walke, O. Mees, S. Dasari, and S. Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. arXiv preprint arXiv:2408.11812, 2024.

[14] L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In NeurIPS, 2024.

[15] S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In International Conference on Learning Representations, volume 2025, pages 29982–30009, 2025.

[16] S. Liu, B. Li, K. Ma, L. Wu, H. Tan, X. Ouyang, H. Su, and J. Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization. arXiv preprint arXiv:2602.03310, 2026.

[17] L. Zha, A. J. Hancock, M. Zhang, T. Yin, Y. Huang, D. Shah, A. Z. Ren, and A. Majumdar. Lap: Language-action pre-training enables zero-shot cross-embodiment transfer, 2026. URL https://arxiv.org/abs/2602.10556

[18] L. Y. Chen, K. Hari, K. Dharmarajan, C. Xu, Q. Vuong, and K. Goldberg. Mirage: Cross-embodiment zero-shot policy transfer with cross-painting. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.

[19] S. Bahl, A. Gupta, and D. Pathak. Human-to-robot imitation in the wild. 2022.

[20] E. Dessalene, P. Mantripragada, M. Maynord, and Y. Aloimonos. Embodiswap for zero-shot robot imitation learning. arXiv preprint arXiv:2510.03706, 2025.

[21] M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos. In Conference on Robot Learning, pages 4545–4565. PMLR, 2025.

[22] M. Lepert, J. Fang, and J. Bohg. Masquerade: Learning from in-the-wild human videos using data-editing. In 2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026.

[23] G. Li, Y. Lyu, Z. Liu, C. Hou, Y. Xu, J. Zhang, and S. Zhang. H2r: A human-to-robot data augmentation for robot pre-training from videos. In Synthetic Data for Computer Vision Workshop @ CVPR 2025.

[24] G. Ji, H. Polavaram, L. Y. Chen, S. Bajamahal, Z. Ma, S. Adebola, C. Xu, and K. Goldberg. Oxe-auge: A large-scale robot augmentation of oxe for scaling cross-embodiment policy learning. arXiv preprint arXiv:2512.13100, 2025.

[25] P. Dan, K. Kedia, A. Chao, E. Duan, M. A. Pace, W.-C. Ma, and S. Choudhury. X-sim: Cross-embodiment learning via real-to-sim-to-real. In Conference on Robot Learning, pages 816–833. PMLR, 2025.

[26] M. Lepert, R. Doshi, and J. Bohg. Shadow: Leveraging segmentation masks for zero-shot cross-embodiment policy transfer. In Conference on Robot Learning (CoRL), Munich, Germany, 2024.

[27] R. Mirjalili, T. Jülg, F. Walter, and W. Burgard. Augmented Reality for RObots (ARRO): Pointing visuomotor policies towards visual robustness. arXiv preprint arXiv:2505.08627, 2025.

[28] A. Handa, K. Van Wyk, W. Yang, J. Liang, Y.-W. Chao, Q. Wan, S. Birchfield, N. Ratliff, and D. Fox. Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170. IEEE, 2020.

[29] Y. Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y.-W. Chao, and D. Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. In Robotics: Science and Systems, 2023.

[30] H. Yuan, B. Zhou, Y. Fu, and Z. Lu. Cross-embodiment dexterous grasping with reinforcement learning. In International Conference on Learning Representations, volume 2025, pages 81413–81434, 2025.

[31] Z. Wei, Y. Yao, and M. Ding. One hand to rule them all: Canonical representations for unified dexterous manipulation. arXiv preprint arXiv:2602.16712, 2026.

[32] E. Bauer, E. Nava, and R. K. Katzschmann. Latent action diffusion for cross-embodiment manipulation. In Dexterous Manipulation: Learning and Control with Diverse Modalities, 2025.

[33] K. Zhang, L. Xu, C. Song, J. Xu, X. Lin, Z. Jiang, and R. Xu. Dexformer: Cross-embodied dexterous manipulation via history-conditioned transformer. preprint, 2026.

[34] J. Mu, S. Yang, H. Bae, F. Jia, Q. Ben, B. Li, H. Xu, and J. Pang. One-policy-fits-all: Geometry-aware action latents for cross-embodiment manipulation. arXiv preprint arXiv:2603.14522, 2026.

[35] K. Zakka. mink: Python inverse kinematics based on MuJoCo, Jul. 2025. URL https://github.com/kevinzakka/mink

[36] Physical Intelligence. openpi. https://github.com/Physical-Intelligence/openpi,

[37] Accessed: 2026-06-04
[38] K. Pertsch. DROID with filled-in language annotations. https://huggingface.co/KarlP/droid, 2024.

Appendix A Data Processing

A.1 Camera Extrinsics Estimation

The wrist-camera extrinsics shipped with DROID are noisy and not suitable for the pixel-level mask alignment needed by Cloak. We therefore re-estimate the 6-DoF camera pose in the end-effector frame...
