
arxiv: 2606.22836 · v1 · pith:6GGTSSSAnew · submitted 2026-06-22 · 💻 cs.RO

Cloak: Zero-Shot Cross-Embodiment Manipulation by Masking the End-Effector from the VLA

Pith reviewed 2026-06-26 08:50 UTC · model grok-4.3

classification 💻 cs.RO
keywords zero-shot cross-embodiment · VLA · end-effector masking · robot manipulation · vision-language-action · wrist camera · transfer learning

The pith

Masking the end-effector in wrist-camera images lets a VLA trained on one gripper control unseen robot bodies zero-shot.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that hiding the end-effector from the wrist view during training makes a Vision-Language-Action model treat visual input as independent of its own hardware. It renders the mask directly from the robot's known geometry in simulation, then augments the mask so the model learns to ignore embodiment-specific appearance. The resulting model, trained only on parallel-jaw gripper data, is applied without any new collection to a different gripper, a different arm, and a five-fingered hand. Performance on the original embodiment stays the same. This decouples collected data from the hardware that produced it.

Core claim

Cloak endows a VLA with zero-shot cross-embodiment transfer by cloaking the end-effector from its own wrist camera. The end-effector occupies a large and consistent region of the wrist view and masking it allows for embodiment-agnostic visual reasoning. Cloak renders a mask in simulation from the robot's known geometry, accurately and in real time, with no segmentation or generative models. During training the mask is augmented so the model generalizes to embodiments unseen at training time. Cloak-VLA trained on a single parallel-jaw gripper dataset transfers zero-shot to various unseen embodiments while preserving the source embodiment's performance.

What carries the argument

The Cloak mask: a real-time rendered silhouette of the end-effector generated from known robot geometry and augmented during training to force embodiment-agnostic reasoning.
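
As a rough illustration of what "rendering a mask from known geometry" can mean, the sketch below projects points sampled on the end-effector surface into the wrist image with a pinhole camera model and splats them into a binary mask. It is not the paper's renderer, which runs in simulation. The function name `project_points_to_mask`, the point-cloud input, and the `radius` splat size are all assumptions made for this example.

```python
import numpy as np

def project_points_to_mask(points_ee, T_cam_ee, K, height, width, radius=2):
    """Rasterize end-effector surface points into a binary wrist-view mask.

    points_ee: (N, 3) points on the end-effector surface, end-effector frame.
    T_cam_ee:  (4, 4) homogeneous transform, end-effector frame -> camera frame.
    K:         (3, 3) pinhole intrinsics.
    All inputs are hypothetical stand-ins for the robot's known geometry.
    """
    points_ee = np.asarray(points_ee, dtype=float)

    # Move the points into the camera frame.
    homo = np.hstack([points_ee, np.ones((len(points_ee), 1))])
    pts_cam = (T_cam_ee @ homo.T).T[:, :3]

    # Discard points behind (or on) the image plane.
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]

    # Pinhole projection to integer pixel coordinates.
    uvw = (K @ pts_cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Splat each projected point as a small square so the silhouette is filled.
    mask = np.zeros((height, width), dtype=bool)
    for du in range(-radius, radius + 1):
        for dv in range(-radius, radius + 1):
            uu, vv = u + du, v + dv
            ok = (uu >= 0) & (uu < width) & (vv >= 0) & (vv < height)
            mask[vv[ok], uu[ok]] = True
    return mask
```

A mesh rasterizer would give a tighter silhouette than point splatting; the point-based version is used here only because it needs nothing beyond NumPy.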

If this is right

  • The same model controls another gripper, another arm, and a five-fingered hand without any new data or fine-tuning.
  • Performance on the source parallel-jaw gripper remains unchanged after the masking procedure.
  • Robot datasets collected on one embodiment remain usable after the hardware is replaced or retired.
  • No segmentation models or generative models are required to produce the masks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masking idea could be applied to other visible robot parts such as the arm links if they appear in the wrist view.
  • Large multi-embodiment datasets might be assembled more easily if each source is masked before mixing.
  • The approach may reduce the need to retrain policies when only the end-effector changes on an existing robot.
  • Testing the method on camera placements other than the wrist could reveal how much the benefit depends on the end-effector dominating the image.

Load-bearing premise

Rendering an accurate mask from the robot's known geometry and augmenting it during training is sufficient to make visual reasoning ignore embodiment details for bodies never seen in the training data.
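
The abstract does not say how the mask is augmented. One plausible family, shown here purely as an assumption, is random dilation plus a small random shift, so that the cloaked region varies in size and position and the model cannot rely on the exact outline of the source gripper. The helper names `dilate`, `shift`, and `augment_mask`, the parameter ranges, and the choice to always keep the original silhouette covered are hypothetical.

```python
import numpy as np

def dilate(mask, k):
    """Binary dilation with a (2k+1) x (2k+1) square, using array shifts only."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, k, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def shift(mask, dy, dx):
    """Translate a mask by (dy, dx) pixels without wrap-around."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    out = np.zeros_like(mask)
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = mask[src_y, src_x]
    return out

def augment_mask(mask, rng, max_dilate=8, max_shift=6):
    """Randomly grow and move the mask (assumed augmentation, not the paper's)."""
    k = int(rng.integers(0, max_dilate + 1))
    dy, dx = (int(s) for s in rng.integers(-max_shift, max_shift + 1, size=2))
    moved = shift(dilate(mask, k), dy, dx)
    # Union with the original so the true end-effector pixels stay hidden.
    return moved | np.asarray(mask, dtype=bool)
```

Whether an augmentation of this kind covers the silhouettes of a five-fingered hand is exactly the open question this premise raises; the sketch only makes the free parameters (`max_dilate`, `max_shift`) concrete.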

What would settle it

A clear drop in success rate when the trained model is tested on the five-fingered hand or another unseen arm, relative to its performance on the original parallel-jaw gripper, would show the transfer has failed.
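
Whether a drop counts as "clear" depends on how many trials back each number. A minimal sketch, assuming binary success/failure trials (Figure 3 reports a progression rate, so this is a simplification), is to compare Wilson score intervals for the source and target embodiments. The function names below are hypothetical, and no trial counts from the paper are used.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half

def clearly_lower(target, source, z=1.96):
    """True if the target interval lies entirely below the source interval.

    target and source are (successes, trials) pairs from your own evaluation.
    """
    _, target_hi = wilson_interval(*target, z=z)
    source_lo, _ = wilson_interval(*source, z=z)
    return target_hi < source_lo
```

Non-overlapping intervals are a conservative criterion; with the small trial counts typical of real-robot evaluation, only large gaps will register as a clear drop.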

Figures

Figures reproduced from arXiv: 2606.22836 by C. Karen Liu, Guy Tevet, Michael Piseno.

Figure 1: We cloak the end-effector from its own wrist camera, letting a VLA trained on a single …
Figure 2: Overview. Cloak renders a geometric mask of the end-effector using the robot state and wrist camera parameters, augments it during training, and uses it to compute an attention mask for the vision encoder, cloaking the end-effector from the wrist view. The resulting image patch tokens, the source-robot state, and the language prompt drive a single VLA backbone and action head. On an unseen embodiment, tip-…
Figure 3: Task-averaged progression rate. The original gripper does not use TP retargeting because …
Figure 4: Wrist camera extrinsics estimation on a representative DROID frame. From left to right: …
Figure 5: Example trial setups across the four tasks (two per task). The bright yellow number …
Figure 6: Keyframe rollouts on the unseen Sharpa hand for the pick-and-place task, prompt …
Figure 7: Keyframe rollouts on the unseen UMI gripper for the fold task, prompt …
Figure 8: Keyframe rollouts on the unseen YAM arm and gripper for the remove task, prompt …
Figure 9: Ablation rollouts on the unseen Sharpa hand for the move task, prompt …
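
The Figure 2 caption says the rendered mask is used to compute an attention mask for the vision encoder. How pixels map to patches is not given in the text available here, so the following is a sketch under assumptions: the image is split into square patches, a patch is hidden when more than a chosen fraction of its pixels are cloaked, and hidden patches receive an additive bias of negative infinity over attention keys. The names `patch_keep_mask` and `attention_bias` and the 0.5 threshold are hypothetical.

```python
import numpy as np

def patch_keep_mask(pixel_mask, patch=16, max_cloaked_frac=0.5):
    """Return a flat boolean vector: True for patches the encoder may attend to."""
    pixel_mask = np.asarray(pixel_mask, dtype=bool)
    h, w = pixel_mask.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    gh, gw = h // patch, w // patch
    # Fraction of cloaked pixels inside each patch, in row-major patch order.
    frac = pixel_mask.reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    return (frac <= max_cloaked_frac).reshape(-1)

def attention_bias(keep):
    """Additive bias over keys: 0 for visible patches, -inf for cloaked ones."""
    return np.where(keep, 0.0, -np.inf)[None, :]
```

The bias has shape `(1, num_patches)` so it broadcasts across all query positions when added to the attention logits before the softmax.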
Original abstract

We present Cloak, a training recipe that endows a Vision-Language-Action (VLA) model with zero-shot cross-embodiment transfer by cloaking the end-effector from its own wrist camera. The end-effector occupies a large and consistent region of the wrist view and masking it allows for embodiment-agnostic visual reasoning. Cloak renders a mask in simulation from the robot's known geometry, accurately and in real time, with no segmentation or generative models. During training, we augment the mask so the model generalizes to embodiments unseen at training time. We demonstrate the recipe with Cloak-VLA, a VLA trained with Cloak on a single parallel-jaw gripper dataset. No data of new embodiments is ever collected. Cloak-VLA transfers zero-shot to various unseen embodiments, including another gripper, another arm, and a five-fingered hand, while preserving the source embodiment's performance. By decoupling the wrist view from its own embodiment, Cloak allows data to outlive the hardware it was collected on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Cloak, a training recipe for Vision-Language-Action (VLA) models that enables zero-shot cross-embodiment transfer by rendering and augmenting masks of the end-effector from known geometry in simulation. A model trained only on parallel-jaw gripper data is claimed to transfer without further data collection to unseen embodiments including another gripper, another arm, and a five-fingered hand, while preserving source-embodiment performance.

Significance. If the empirical results hold under rigorous validation, the work would be significant for robotics by providing a practical, simulation-only mechanism to decouple wrist-camera visual reasoning from specific hardware, allowing datasets to outlive the robots on which they were collected. The geometry-based real-time masking without segmentation or generative models is a concrete engineering strength that could be adopted in other VLA pipelines.

major comments (1)
  1. [Abstract] The central claim, that mask augmentation during training on a single parallel-jaw dataset produces embodiment-agnostic features for a never-seen five-fingered hand, is load-bearing. Yet no quantitative metrics, baselines, ablation isolating the augmentation schedule, or failure-case analysis are reported. Without these it is impossible to verify whether the augmentation distribution actually covers the required appearance statistics of arbitrary new end-effector geometries.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying a point where the presentation of evidence for the core claim can be strengthened. We address the comment below and commit to revisions that will make the supporting metrics, ablations, and analysis explicit.

Point-by-point responses
  1. Referee: [Abstract] The central claim, that mask augmentation during training on a single parallel-jaw dataset produces embodiment-agnostic features for a never-seen five-fingered hand, is load-bearing. Yet no quantitative metrics, baselines, ablation isolating the augmentation schedule, or failure-case analysis are reported. Without these it is impossible to verify whether the augmentation distribution actually covers the required appearance statistics of arbitrary new end-effector geometries.

    Authors: We agree that the abstract states the claim concisely without the supporting numbers and that an explicit ablation isolating the augmentation schedule together with failure-case analysis would allow readers to assess coverage of new end-effector appearance statistics. The experiments section already reports success rates on the five-fingered hand and comparisons against a no-masking baseline, but these elements are not summarized in the abstract and the augmentation ablation is not isolated as a single controlled study. We will therefore revise the abstract to include the key quantitative transfer metrics, add a dedicated ablation subsection that varies only the augmentation schedule, and include a failure-case analysis (with examples of when transfer succeeds or degrades) in the main text or supplementary material.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is data augmentation with external held-out evaluation

full rationale

The paper describes a masking-based data augmentation procedure applied during training on a single parallel-jaw gripper dataset, with zero-shot transfer evaluated on unseen embodiments (another gripper, another arm, and a five-fingered hand). No equations, fitted parameters, or predictions are defined; success is measured by empirical performance on held-out hardware, so the result cannot reduce to the training inputs by construction. No self-citations or uniqueness theorems are invoked as load-bearing steps. The derivation chain consists of a rendering step from known geometry plus augmentation, both independent of the target result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the end-effector dominates the wrist view and that simulation masking plus augmentation suffices for generalization. The one free parameter, the mask augmentation schedule, is left unspecified in the abstract, and no invented entities are introduced.

free parameters (1)
  • mask augmentation schedule
    Parameters controlling how the rendered mask is varied during training to promote generalization; values not specified in abstract.
axioms (1)
  • domain assumption: The end-effector occupies a large and consistent region of the wrist view
    Explicitly stated in the abstract as the justification for masking.

pith-pipeline@v0.9.1-grok · 5726 in / 1215 out tokens · 16178 ms · 2026-06-26T08:50:36.129759+00:00 · methodology

