
arxiv: 2606.17256 · v1 · pith:2YMSJRINnew · submitted 2026-06-15 · 💻 cs.RO · cs.CV

Contrastive Action-Image Pre-training for Visuomotor Control

Pith reviewed 2026-06-27 03:12 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords contrastive learning · visuomotor control · egocentric video · dexterous manipulation · vision encoders · robot learning · hand keypoints · pre-training

The pith

CAIP learns vision encoders by contrasting images against 3D hand keypoints extracted from human egocentric video to serve as action proxies for robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CAIP as a pre-training method that extracts 3D hand keypoints from 32,041 hours of egocentric human video and pairs them with images in a contrastive objective. This creates a unified action-image representation intended to transfer to robot visuomotor policies. The approach uses only 88 hours of robot manipulation data alongside the human video and is evaluated on real dexterous manipulation with Dexmate Vega and Sharpa Wave hardware. The paper reports that CAIP outperforms DINOv2, SigLIP, MVP, and R3M, with gains of more than 30 percent on folding, pouring, and fine-grained manipulation tasks. The work positions this action-centric contrastive signal as a scalable alternative to purely image- or language-based pre-training for physical control.

Core claim

By treating 3D hand keypoints from large-scale egocentric human video as proxies for end-effector actions, CAIP applies a contrastive objective to learn a vision encoder that produces representations aligned with downstream robot action spaces, yielding more than 30 percent performance gains over prior encoders when deployed on real-world dexterous manipulation tasks with limited robot data.

What carries the argument

The contrastive action-image pre-training objective that aligns image features with 3D hand keypoint features extracted from human video.
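As a rough illustration of what such an objective can look like, the sketch below computes a CLIP-style symmetric InfoNCE loss over a batch of image and action embeddings. This is an assumption for illustration only: the summary does not state the exact loss CAIP uses, and the function names, the temperature value, and the use of precomputed embedding arrays (rather than a ViT and an action transformer) are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, action_emb, temperature=0.07):
    """Hypothetical contrastive loss: row i of each array is a matched pair.

    image_emb:  (B, D) image embeddings
    action_emb: (B, D) action (hand keypoint) embeddings
    temperature is an assumed value, not one taken from the paper.
    """
    img = l2_normalize(image_emb)
    act = l2_normalize(action_emb)
    logits = img @ act.T / temperature            # (B, B) similarity matrix
    idx = np.arange(logits.shape[0])
    # Image-to-action: each image should pick its own action among the batch.
    loss_i2a = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Action-to-image: each action should pick its own image among the batch.
    loss_a2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2a + loss_a2i)

# Usage: correctly paired embeddings should score a lower loss than mispaired ones.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
matched = symmetric_info_nce(image_emb, image_emb)
mismatched = symmetric_info_nce(image_emb, np.roll(image_emb, 1, axis=0))
print(matched < mismatched)
```

The in-batch negatives are what push the encoder to distinguish frames by the hand motion they accompany; whether that signal is action-relevant or merely visually correlated is the question the referee report raises.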

If this is right

  • CAIP scales pre-training by substituting abundant human video for scarce robot trajectories while still supplying an action signal.
  • The resulting encoder improves policy performance on real hardware including Dexmate Vega and Sharpa Wave hands across folding, pouring, and fine-grained tasks.
  • Gains exceed 30 percent relative to DINOv2, SigLIP, MVP, and R3M under identical downstream training conditions.
  • Only 88 hours of robot data are needed once the encoder has been pre-trained on the human video corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If hand-keypoint alignment proves robust across varied human activities, the same pre-training recipe could be applied to other embodied domains such as navigation or mobile manipulation.
  • The method implicitly suggests that future robot datasets could focus on high-quality action labels rather than attempting to match internet-scale image volume.
  • Extending the contrastive pairs to include temporal sequences of keypoints might further strengthen the learned dynamics for longer-horizon tasks.

Load-bearing premise

That 3D hand keypoints from human egocentric video provide a representation that aligns naturally with downstream robot action spaces and transfers effectively to visuomotor policies.

What would settle it

A controlled experiment in which the 3D hand keypoint signal is replaced by random vectors or unrelated features during pre-training, after which performance on the same robot tasks shows no improvement over baseline encoders.
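One way to set up the action side of such a control is sketched below. It is a hypothetical helper, not something described in the paper: the function name, the mode names, and the `(N, 21, 3)` keypoint layout (21 keypoints per hand) are assumptions. It produces either the real keypoints, the same keypoints shuffled across frames (breaking the image-action pairing while keeping the distribution), or Gaussian noise matched in mean and standard deviation.

```python
import numpy as np

def make_control_actions(keypoints, mode, seed=0):
    """Build the action signal for one arm of a hypothetical control experiment.

    keypoints: (N, K, 3) array of 3D hand keypoints, one entry per frame.
    mode:
      "real"     - unchanged keypoints (the pre-training signal under test)
      "shuffled" - real keypoints assigned to the wrong frames
      "random"   - Gaussian noise with the same overall mean and std
    """
    rng = np.random.default_rng(seed)
    if mode == "real":
        return keypoints.copy()
    if mode == "shuffled":
        return keypoints[rng.permutation(len(keypoints))]
    if mode == "random":
        return rng.normal(keypoints.mean(), keypoints.std(), size=keypoints.shape)
    raise ValueError(f"unknown mode: {mode!r}")

# Usage with synthetic data standing in for extracted keypoints.
rng = np.random.default_rng(1)
keypoints = rng.normal(size=(100, 21, 3))
controls = {m: make_control_actions(keypoints, m) for m in ("real", "shuffled", "random")}
print({m: a.shape for m, a in controls.items()})
```

Pre-training once per mode and then training identical downstream policies would show whether the gains depend on the keypoints carrying real action information.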

Figures

Figures reproduced from arXiv: 2606.17256 by Anirudh Pai, Baifeng Shi, Boning Shao, Danfei Xu, Dantong Niu, Fabio Galasso, Jing Wang, Jitendra Malik, Konstantinos Kallidromitis, Linxi "Jim" Fan, Roei Herzig, Ruijie Zheng, Stefano Saravalle, Trevor Darrell, Yuke Zhu, Yusuke Kato, Yuvan Sharma, Zekai Wang, Zhuoyang Liu.

Figure 1
(Left) We visualize which image regions each encoder emphasizes, with saliency being computed using each encoder’s natural query mechanism (see Section A.1). SigLIP captures high-level semantics and DINOv2 captures visual structure, but neither attends to action-relevant regions. Our encoder produces manipulation-centric features focused on hands and relevant objects. (Center) Hand pose actions and paired…
Figure 2
CAIP architecture. A ViT encodes N image patches and a text transformer encodes L language tokens, while an action transformer encodes a T-step action chunk into a single embedding via the [CLS] token. To form a text-conditioned image embedding, we attention-pool patch tokens using text tokens as queries, then pool the result with a learnable query. The action embedding and text-conditioned image embedding…
Figure 3
Linear probe and zero-shot action classification on the held-out dataset.
Figure 4
Saliency across vision encoders on held-out egocentric manipulation frames. Columns: input, CAIP (ours), SigLIP, DINOv2. CAIP’s text-conditioned cross-attention pool (aggregated over instruction tokens) focuses on the hands and manipulated object; SigLIP’s text-agnostic learned probe scatters across background sink patches; DINOv2’s per-image PCA segments by appearance but is instruction-unaware (colors no…
Figure 5
CAIP Hardware Setup. …ulator: the base, torso, and head joints are held static, and we drive only the two 7-DoF arms (14 joints in total). We replace the platform’s native end-effectors with Sharpa Wave hands, so each 7-DoF arm provides full SE(3) positioning of its attached dexterous hand. Arm motion is commanded as relative end-effector pose targets (Section B.3.2). B.1.2 Sharpa Wave: Each arm is equipped…
Figure 6
Progression of Fold Shorts. Pour Almonds. Language instruction: “Pour the almonds from the filled cup to the empty cup.” This bimanual task probes control of a dynamic, granular process: the policy must regulate cup orientation and pour rate to transfer free-flowing almonds without spilling or overshooting. It is our most data-constrained task, trained on only 150 demonstrations…
Figure 7
Progression of Pour Almonds. Pick Fruits. Language instruction: “Pick up the fruit on the left side using your left hand and place it in the basket. Then, pick up the fruit on the right side using your right hand and place it in the basket.” This bimanual task evaluates sequential pick-and-place over multiple objects, testing reliable grasping of irregularly shaped items and correct hand–object assignment…
Figure 8
Progression of Pick Fruits. Dispense Soap. Language instruction: “Use your left hand to pick up the soap dispenser, and then use your right hand to press the pump to dispense soap into the red bowl.” This bimanual task requires asymmetric coordination in which one hand stabilizes the dispenser while the other applies a controlled downward press, testing precise force application against a compliant mechani…
Figure 9
Progression of Dispense Soap. Turn On Lamp. Language instruction: “Using your left hand, carefully pull the lamp chain and release it to turn on the lamp.” This single-arm task targets fine-grained dexterity: the policy must grasp a thin, freely hanging chain, pull it through a short actuation stroke, and release, leaving little margin for imprecise contact.
Figure 10
Progression of Turn On Lamp. Pull Tissue. Language instruction: “Use your left hand to pick up the tissue box, and then use your right hand to pull out the tissue.” This bimanual task tests coordinated extraction of a flexible object, where one hand secures the box while the other gently pulls a single tissue free without tearing it or dislodging the box.
Figure 11
Progression of Pull Tissue.
Figure 12
(Left) The Dark condition reduces the intensity of the standard scene lighting. (Center) The original lighting used during training. (Right) The Light condition adds an overhead bulb that casts shadows across the workspace. Distractors. For the distractor setting, we add two objects to the scene: a red book and a multi-colored Hanoi toy tower. Both are placed well within the manipulation area so that they…
Figure 13
(Left) The original scene without distractors. (Right) The scene with the two distractor objects added. E Baseline Encoders: We evaluate the following baseline vision encoders, all frozen during policy training. The token strategy (CLS/pooled single token vs. all patch tokens) was selected per encoder based on which option yielded the best policy performance; see…
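The two-stage pooling described in the Figure 2 caption (text tokens query the patch tokens, then a learnable query pools the result) can be sketched in a few lines of NumPy. This is a simplified, single-head illustration under stated assumptions: it omits the learned key/value/query projections and any normalization layers a real implementation would likely have, and the function name, array sizes, and scaling are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_pool(patch_tokens, text_tokens, learned_query):
    """Pool N patch tokens into one embedding, conditioned on L text tokens.

    patch_tokens:  (N, D) image patch features
    text_tokens:   (L, D) language token features
    learned_query: (D,)   stand-in for a learnable pooling query
    """
    scale = 1.0 / np.sqrt(patch_tokens.shape[-1])
    # Stage 1: each text token attends over the patches (text tokens as queries).
    text_to_patch = softmax(text_tokens @ patch_tokens.T * scale, axis=-1)  # (L, N)
    per_token = text_to_patch @ patch_tokens                                # (L, D)
    # Stage 2: a single query attends over the L text-conditioned summaries.
    query_weights = softmax(per_token @ learned_query * scale, axis=-1)     # (L,)
    pooled = query_weights @ per_token                                      # (D,)
    return pooled, text_to_patch

# Usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))
text = rng.normal(size=(12, 64))
query = rng.normal(size=64)
embedding, attention = text_conditioned_pool(patches, text, query)
print(embedding.shape, attention.shape)
```

The stage-1 attention map is the kind of quantity the Figure 4 caption describes aggregating over instruction tokens to produce a saliency visualization.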
Original abstract

Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces CAIP (Contrastive Action-Image Pre-training), a vision encoder pre-trained via contrastive loss on paired images and 3D hand keypoints extracted from 32,041 hours of egocentric human video. These keypoints are positioned as a proxy for robot end-effector actions. The method is evaluated on real-world dexterous manipulation tasks (folding, pouring, fine-grained manipulation) using Dexmate Vega and Sharpa Wave hands, reporting >30% gains over DINOv2, SigLIP, MVP, and R3M while using only 88 hours of robotic data.

Significance. If the central claims hold after addressing the alignment issue, the work demonstrates a scalable route to action-aware visual representations by leveraging abundant human video rather than scarce robot trajectories. The independent human corpus avoids circularity with the robotic evaluation data, and the real-world dexterous setup on challenging platforms provides a strong testbed. This could meaningfully advance visuomotor pre-training if the proxy mechanism is shown to transfer without spurious correlations.

major comments (1)
  1. [Abstract] The claim that '3D hand keypoints, a representation that aligns naturally with downstream robot action spaces' lacks any described retargeting, forward-kinematics projection, or morphological mapping between human hands (typically 21-27 DOF) and the specific robot platforms (Dexmate Vega, Sharpa Wave). Without an ablation isolating whether the contrastive objective learns action-relevant features versus visual correlations, the attribution of the reported >30% gains to action-image pre-training is not substantiated.
minor comments (1)
  1. [Abstract] The quantitative performance claims would be strengthened by explicit mention of the number of tasks, exact success metrics, statistical significance, and baseline implementation details (e.g., whether encoders were frozen or fine-tuned).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract claim regarding action alignment. We address the concern point by point below and will revise the manuscript accordingly to improve clarity and substantiation.

Point-by-point responses
  1. Referee: [Abstract] The claim that '3D hand keypoints, a representation that aligns naturally with downstream robot action spaces' lacks any described retargeting, forward-kinematics projection, or morphological mapping between human hands (typically 21-27 DOF) and the specific robot platforms (Dexmate Vega, Sharpa Wave). Without an ablation isolating whether the contrastive objective learns action-relevant features versus visual correlations, the attribution of the reported >30% gains to action-image pre-training is not substantiated.

    Authors: We acknowledge that the abstract phrasing could be more precise. The manuscript positions 3D hand keypoints as a proxy for end-effector actions because they provide 3D Cartesian positions of hand joints extracted from egocentric video, which share a similar spatial representation with many robot end-effector pose controllers. No explicit retargeting, forward-kinematics projection, or morphological mapping between human (21-27 DOF) and robot hands is described or performed, as the contrastive objective operates on the keypoint coordinates directly as an action signal without requiring joint-space alignment. The visual encoder is trained to produce features predictive of these keypoints, which are then frozen and used with separate robot-specific action heads in the 88 hours of downstream data. We agree this proxy mechanism requires clearer justification to rule out spurious visual correlations. The existing comparisons to DINOv2, SigLIP, MVP, and R3M (none of which use action signals) provide indirect evidence, but we will add an ablation in the revision that contrasts against a purely visual contrastive baseline on the same human video corpus to isolate the contribution of the action-image pairing. This will be reflected in an updated abstract and methods section.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent external data

full rationale

The paper's central method extracts 3D hand keypoints from large-scale egocentric human video (32,041 hours) and applies contrastive pre-training to learn action-image representations. This corpus is independent of the downstream robotic test tasks (88 hours of manipulation data on Dexmate Vega/Sharpa Wave). No equations, fitted parameters, or predictions in the provided text reduce to self-fitted quantities from the evaluation set. The alignment claim is an assumption, not a definitional reduction or self-citation chain. Self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that human hand poses serve as valid action proxies and on the effectiveness of the contrastive objective for transfer; these are not independently evidenced in the abstract.

axioms (1)
  • domain assumption: 3D hand keypoints from human video align naturally with robot end-effector action spaces.
    The abstract states that this alignment enables a unified action-image representation for downstream control.


