Contrastive Action-Image Pre-training for Visuomotor Control

arxiv: 2606.17256 · v1 · pith:2YMSJRINnew · submitted 2026-06-15 · 💻 cs.RO · cs.CV

Yuvan Sharma, Dantong Niu, Anirudh Pai, Zekai Wang, Zhuoyang Liu, Baifeng Shi, Stefano Saravalle, Boning Shao, Ruijie Zheng, Jing Wang, Konstantinos Kallidromitis, Yusuke Kato, Fabio Galasso, Yuke Zhu, Danfei Xu, Linxi "Jim" Fan, Jitendra Malik, Trevor Darrell, Roei Herzig

Pith reviewed 2026-06-27 03:12 UTC · model grok-4.3

keywords: contrastive learning, visuomotor control, egocentric video, dexterous manipulation, vision encoders, robot learning, hand keypoints, pre-training

The pith

CAIP learns vision encoders by contrasting images against 3D hand keypoints extracted from human egocentric video to serve as action proxies for robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CAIP as a pre-training method that extracts 3D hand keypoints from 32,041 hours of egocentric human video and uses them in a contrastive objective with images. This creates a unified action-image representation intended to transfer to robot visuomotor policies. The approach requires only 88 hours of robot data and is evaluated on real dexterous manipulation with two different robot hands. CAIP is shown to outperform DINOv2, SigLIP, MVP, and R3M, with gains exceeding 30 percent on folding, pouring, and fine-grained tasks. The work positions this action-centric contrastive signal as a scalable alternative to purely image- or language-based pre-training for physical control.

Core claim

By treating 3D hand keypoints from large-scale egocentric human video as proxies for end-effector actions, CAIP applies a contrastive objective to learn a vision encoder that produces representations aligned with downstream robot action spaces, yielding more than 30 percent performance gains over prior encoders when deployed on real-world dexterous manipulation tasks with limited robot data.

What carries the argument

The contrastive action-image pre-training objective that aligns image features with 3D hand keypoint features extracted from human video.
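
A contrastive objective of this kind can be sketched as follows. This is a minimal illustration, assuming a CLIP-style symmetric InfoNCE loss with a fixed temperature; the text above does not state the exact loss form, and the function names, temperature value, and array sizes here are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, action_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, action) embeddings.

    Row i of image_emb and row i of action_emb form a positive pair; every
    other row in the batch acts as a negative. The temperature is an
    assumed value, not one reported by the paper.
    """
    img = l2_normalize(image_emb)
    act = l2_normalize(action_emb)
    logits = img @ act.T / temperature            # (B, B) similarity matrix
    diag = np.arange(logits.shape[0])
    loss_i2a = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> action
    loss_a2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # action -> image
    return 0.5 * (loss_i2a + loss_a2i)

# Toy check: correctly paired embeddings should score a lower loss than
# the same embeddings with the pairing shifted by one row.
rng = np.random.default_rng(0)
action_emb = rng.normal(size=(8, 16))
image_emb = action_emb + 0.01 * rng.normal(size=(8, 16))
aligned_loss = symmetric_info_nce(image_emb, action_emb)
mismatched_loss = symmetric_info_nce(np.roll(image_emb, 1, axis=0), action_emb)
print(aligned_loss < mismatched_loss)
```

In a real pipeline the two inputs would come from the vision encoder and from an encoder over the hand-keypoint sequence, and the loss would be minimized with a deep-learning framework rather than evaluated in NumPy.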

If this is right

CAIP scales pre-training by substituting abundant human video for scarce robot trajectories while still supplying an action signal.
The resulting encoder improves policy performance on real hardware including Dexmate Vega and Sharpa Wave hands across folding, pouring, and fine-grained tasks.
Gains exceed 30 percent relative to DINOv2, SigLIP, MVP, and R3M under identical downstream training conditions.
Only 88 hours of robot data are needed once the encoder has been pre-trained on the human video corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If hand-keypoint alignment proves robust across varied human activities, the same pre-training recipe could be applied to other embodied domains such as navigation or mobile manipulation.
The method implicitly suggests that future robot datasets could focus on high-quality action labels rather than attempting to match internet-scale image volume.
Extending the contrastive pairs to include temporal sequences of keypoints might further strengthen the learned dynamics for longer-horizon tasks.

Load-bearing premise

That 3D hand keypoints from human egocentric video provide a representation that aligns naturally with downstream robot action spaces and transfers effectively to visuomotor policies.

What would settle it

A controlled experiment in which the 3D hand keypoint signal is replaced by random vectors or unrelated features during pre-training, after which performance on the same robot tasks shows no improvement over baseline encoders.
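
The control targets for such an experiment could be built along these lines. This is a sketch under stated assumptions: the array layout `(N, K, 3)`, the choice of `K = 21` keypoints, and the helper name `make_control_targets` are illustrative and not taken from the paper.

```python
import numpy as np

def make_control_targets(keypoints, mode, rng):
    """Return action targets whose link to the paired images is destroyed.

    keypoints: array of shape (N, K, 3), one set of K 3D keypoints per image.
    mode "random":   Gaussian noise matching each coordinate's mean and std.
    mode "shuffled": real keypoints reassigned to different images.
    """
    if mode == "random":
        mean = keypoints.mean(axis=0, keepdims=True)
        std = keypoints.std(axis=0, keepdims=True)
        return mean + std * rng.standard_normal(keypoints.shape)
    if mode == "shuffled":
        return keypoints[rng.permutation(len(keypoints))]
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
keypoints = rng.normal(size=(1000, 21, 3))   # stand-in for extracted hand keypoints
random_targets = make_control_targets(keypoints, "random", rng)
shuffled_targets = make_control_targets(keypoints, "shuffled", rng)
```

Both controls keep the marginal statistics of the action signal roughly intact while removing its relationship to the image, so any remaining downstream gain could not be credited to action-image alignment.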

Figures

Figures reproduced from arXiv: 2606.17256 by Anirudh Pai, Baifeng Shi, Boning Shao, Danfei Xu, Dantong Niu, Fabio Galasso, Jing Wang, Jitendra Malik, Konstantinos Kallidromitis, Linxi "Jim" Fan, Roei Herzig, Ruijie Zheng, Stefano Saravalle, Trevor Darrell, Yuke Zhu, Yusuke Kato, Yuvan Sharma, Zekai Wang, Zhuoyang Liu.

**Figure 1.** (Left) We visualize which image regions each encoder emphasizes, with saliency computed using each encoder’s natural query mechanism (see Section A.1). SigLIP captures high-level semantics and DINOv2 captures visual structure, but neither attends to action-relevant regions. Our encoder produces manipulation-centric features focused on hands and relevant objects. (Center) Hand pose actions and paired…

**Figure 2.** CAIP architecture. A ViT encodes N image patches and a text transformer encodes L language tokens, while an action transformer encodes a T-step action chunk into a single embedding via the [CLS] token. To form a text-conditioned image embedding, we attention-pool patch tokens using text tokens as queries, then pool the result with a learnable query. The action embedding and text-conditioned image embedding…
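
The two-stage pooling described in this caption can be illustrated roughly as below. This is a simplified sketch, assuming single-head scaled dot-product attention with no learned projection matrices; the actual CAIP layers are not specified here, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head scaled dot-product attention where keys equal values."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return weights @ keys_values, weights

def text_conditioned_image_embedding(patch_tokens, text_tokens, pool_query):
    # Stage 1: each of the L text tokens queries the N patch tokens -> (L, D).
    text_conditioned, patch_weights = attend(text_tokens, patch_tokens)
    # Stage 2: one learnable query pools the L results into a single vector.
    pooled, _ = attend(pool_query[None, :], text_conditioned)
    return pooled[0], patch_weights

rng = np.random.default_rng(0)
N, L, D = 196, 6, 32                      # illustrative sizes, not from the paper
patch_tokens = rng.normal(size=(N, D))    # stand-in for ViT patch features
text_tokens = rng.normal(size=(L, D))     # stand-in for text-transformer outputs
pool_query = rng.normal(size=D)           # would be a trained parameter in practice
image_embedding, patch_weights = text_conditioned_image_embedding(
    patch_tokens, text_tokens, pool_query
)
```

The returned `patch_weights` array has one row per text token and one column per patch, which is the kind of quantity a saliency map over image regions could be built from.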

**Figure 3.** Linear probe and zero-shot action classification on the held-out dataset.

**Figure 4.** Saliency across vision encoders on held-out egocentric manipulation frames. Columns: input, CAIP (ours), SigLIP, DINOv2. CAIP’s text-conditioned cross-attention pool (aggregated over instruction tokens) focuses on the hands and manipulated object; SigLIP’s text-agnostic learned probe scatters across background sink patches; DINOv2’s per-image PCA segments by appearance but is instruction-unaware (colors no…

**Figure 5.** CAIP Hardware Setup. …ulator: the base, torso, and head joints are held static, and we drive only the two 7-DoF arms (14 joints in total). We replace the platform’s native end-effectors with Sharpa Wave hands, so each 7-DoF arm provides full SE(3) positioning of its attached dexterous hand. Arm motion is commanded as relative end-effector pose targets (Section B.3.2). B.1.2 Sharpa Wave: Each arm is equipped…

**Figure 6.** Progression of Fold Shorts. Pour Almonds. Language instruction: “Pour the almonds from the filled cup to the empty cup.” This bimanual task probes control of a dynamic, granular process: the policy must regulate cup orientation and pour rate to transfer free-flowing almonds without spilling or overshooting. It is our most data-constrained task, trained on only 150 demonstrations…

**Figure 7.** Progression of Pour Almonds. Pick Fruits. Language instruction: “Pick up the fruit on the left side using your left hand and place it in the basket. Then, pick up the fruit on the right side using your right hand and place it in the basket.” This bimanual task evaluates sequential pick-and-place over multiple objects, testing reliable grasping of irregularly shaped items and correct hand–object assignment…

**Figure 8.** Progression of Pick Fruits. Dispense Soap. Language instruction: “Use your left hand to pick up the soap dispenser, and then use your right hand to press the pump to dispense soap into the red bowl.” This bimanual task requires asymmetric coordination in which one hand stabilizes the dispenser while the other applies a controlled downward press, testing precise force application against a compliant mechani…

**Figure 9.** Progression of Dispense Soap. Turn On Lamp. Language instruction: “Using your left hand, carefully pull the lamp chain and release it to turn on the lamp.” This single-arm task targets fine-grained dexterity: the policy must grasp a thin, freely hanging chain, pull it through a short actuation stroke, and release, leaving little margin for imprecise contact.

**Figure 10.** Progression of Turn On Lamp. Pull Tissue. Language instruction: “Use your left hand to pick up the tissue box, and then use your right hand to pull out the tissue.” This bimanual task tests coordinated extraction of a flexible object, where one hand secures the box while the other gently pulls a single tissue free without tearing it or dislodging the box.

**Figure 11.** Progression of Pull Tissue.

**Figure 12.** (Left) The Dark condition reduces the intensity of the standard scene lighting. (Center) The original lighting used during training. (Right) The Light condition adds an overhead bulb that casts shadows across the workspace. Distractors. For the distractor setting, we add two objects to the scene: a red book and a multicolored Hanoi toy tower. Both are placed well within the manipulation area so that they…

**Figure 13.** (Left) The original scene without distractors. (Right) The scene with the two distractor objects added. E Baseline Encoders: We evaluate the following baseline vision encoders, all frozen during policy training. The token strategy (CLS/pooled single token vs. all patch tokens) was selected per encoder based on which option yielded the best policy performance; see…

Original abstract

Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CAIP reports real gains on dexterous robot tasks from contrastive pre-training on human hand keypoints, but the direct transfer from human poses to robot actions is the claim that needs checking.

The main thing to know is that this paper shows measurable lifts on real dexterous manipulation by pre-training a vision encoder contrastively on 32,041 hours of egocentric human video, using 3D hand keypoints as action proxies, then fine-tuning on a small robot dataset.

What is new is the explicit contrastive tie between images and these hand-pose signals extracted from human video. Earlier approaches leaned on internet images, language, or unaligned video. Here the loss directly pulls image features toward the 3D keypoints, which the authors treat as stand-ins for end-effector motion. They evaluate on folding, pouring, and fine manipulation with Dexmate Vega and Sharpa Wave hands and claim more than 30% better success rates than DINOv2, SigLIP, MVP, and R3M.

The scale of the human data and the fact that they run on physical robots are the parts that stand out. If the numbers hold after proper controls, the method gives a practical way to bring action-like signals into pre-training without needing massive robot datasets.

The soft spot is the alignment assumption. Human hands and the tested robot platforms differ in degrees of freedom and morphology. The abstract states that 3D keypoints align naturally with robot action spaces, yet there is no description of retargeting, forward kinematics, or an ablation that isolates whether the contrastive objective learns action-relevant features or just stronger visual correlations. Without that, it is hard to credit the gains specifically to the action-image objective rather than better general features. Baseline implementation details and statistical reporting would also need to be verified.

This is for groups working on visuomotor policies and scalable pre-training for robotics. A reader who cares about using human video for robot vision would find the setup and results worth examining. It deserves a serious referee because the real-robot evaluation is present and the technical idea is straightforward to test and extend.

I would send it to peer review. The core approach is worth the time, and the alignment question can be clarified in revision.

Referee Report

1 major / 1 minor

Summary. The paper introduces CAIP (Contrastive Action-Image Pre-training), a vision encoder pre-trained via contrastive loss on paired images and 3D hand keypoints extracted from 32,041 hours of egocentric human video. These keypoints are positioned as a proxy for robot end-effector actions. The method is evaluated on real-world dexterous manipulation tasks (folding, pouring, fine-grained manipulation) using Dexmate Vega and Sharpa Wave hands, reporting >30% gains over DINOv2, SigLIP, MVP, and R3M while using only 88 hours of robotic data.

Significance. If the central claims hold after addressing the alignment issue, the work demonstrates a scalable route to action-aware visual representations by leveraging abundant human video rather than scarce robot trajectories. The independent human corpus avoids circularity with the robotic evaluation data, and the real-world dexterous setup on challenging platforms provides a strong testbed. This could meaningfully advance visuomotor pre-training if the proxy mechanism is shown to transfer without spurious correlations.

major comments (1)

[Abstract] The claim that '3D hand keypoints, a representation that aligns naturally with downstream robot action spaces' lacks any described retargeting, forward-kinematics projection, or morphological mapping between human hands (typically 21-27 DOF) and the specific robot platforms (Dexmate Vega, Sharpa Wave). Without an ablation isolating whether the contrastive objective learns action-relevant features versus visual correlations, the attribution of the reported >30% gains to action-image pre-training is not substantiated.

minor comments (1)

[Abstract] The quantitative performance claims would be strengthened by explicit mention of the number of tasks, exact success metrics, statistical significance, and baseline implementation details (e.g., whether encoders were frozen or fine-tuned).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract claim regarding action alignment. We address the concern point by point below and will revise the manuscript accordingly to improve clarity and substantiation.

Referee: [Abstract] The claim that '3D hand keypoints, a representation that aligns naturally with downstream robot action spaces' lacks any described retargeting, forward-kinematics projection, or morphological mapping between human hands (typically 21-27 DOF) and the specific robot platforms (Dexmate Vega, Sharpa Wave). Without an ablation isolating whether the contrastive objective learns action-relevant features versus visual correlations, the attribution of the reported >30% gains to action-image pre-training is not substantiated.

Authors: We acknowledge that the abstract phrasing could be more precise. The manuscript positions 3D hand keypoints as a proxy for end-effector actions because they provide 3D Cartesian positions of hand joints extracted from egocentric video, which share a similar spatial representation with many robot end-effector pose controllers. No explicit retargeting, forward-kinematics projection, or morphological mapping between human (21-27 DOF) and robot hands is described or performed, because the contrastive objective operates on the keypoint coordinates directly as an action signal and does not require joint-space alignment. The visual encoder is trained to produce features predictive of these keypoints; it is then frozen and used with separate robot-specific action heads trained on the 88 hours of downstream data. We agree this proxy mechanism requires clearer justification to rule out spurious visual correlations. The existing comparisons to DINOv2, SigLIP, MVP, and R3M (none of which use action signals) provide indirect evidence, but we will add an ablation in the revision that contrasts against a purely visual contrastive baseline on the same human video corpus to isolate the contribution of the action-image pairing. This will be reflected in an updated abstract and methods section.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent external data

The paper's central method extracts 3D hand keypoints from large-scale egocentric human video (32,041 hours) and applies contrastive pre-training to learn action-image representations. This corpus is independent of the downstream robotic test tasks (88 hours of manipulation data on Dexmate Vega/Sharpa Wave). No equations, fitted parameters, or predictions in the provided text reduce to self-fitted quantities from the evaluation set. The alignment claim is an assumption, not a definitional reduction or self-citation chain. Self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that human hand poses serve as valid action proxies and on the effectiveness of the contrastive objective for transfer; these are not independently evidenced in the abstract.

axioms (1)

Domain assumption: 3D hand keypoints from human video align naturally with robot end-effector action spaces.
The abstract states that this alignment enables a unified action-image representation for downstream control.

pith-pipeline@v0.9.1-grok · 5874 in / 1181 out tokens · 46809 ms · 2026-06-27T03:12:27.082423+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 2 canonical work pages

[1]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021. URLhttps://arxiv.org/abs/2103.00020

Pith/arXiv arXiv 2021
[2]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training, 2023. URLhttps://arxiv.org/abs/2303.15343

Pith/arXiv arXiv 2023
[3]

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick. Masked autoencoders are scalable vision learners, 2021. URLhttps://arxiv.org/abs/2111.06377

Pith/arXiv arXiv 2021
[4]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haz- iza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without sup...

Pith/arXiv arXiv 2024
[5]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning, 2023. URLhttps://arxiv. org/abs/2304.08485

Pith/arXiv arXiv 2023
[6]

H. Bao, L. Dong, S. Piao, and F. Wei. Beit: Bert pre-training of image transformers, 2022. URLhttps://arxiv.org/abs/2106.08254

Pith/arXiv arXiv 2022
[7]

Alayrac, J

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Milli- can, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Mon- teiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language ...

Pith/arXiv arXiv 2022
[8]

J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. URLhttps://arxiv.org/abs/ 2301.12597

Pith/arXiv arXiv 2023
[9]

Caron, H

M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers, 2021. URLhttps://arxiv.org/abs/ 2104.14294

Pith/arXiv arXiv 2021
[10]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

Pith/arXiv arXiv 2025
[11]

Collaboration, A

E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Ir- pan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. Wahid, B....

Pith/arXiv arXiv 2025
[12]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, M. Martin, T. Nagarajan, I. Radosavovic, S. K. Ramakrishnan, F. Ryan, J. Sharma, M. Wray, M. Xu, E. Z. Xu, C. Zhao, S. Bansal, D. Batra, V . Cartillier, S. Crane, T. Do, M. Doulaty, A. Erapalli, C. Feichtenhofer, A. Fragomeni, Q. Fu, A. Gebreselas...

arXiv 2022
[13]

D. Shan, J. Geng, M. Shu, and D. Fouhey. Understanding human hands in contact at internet scale. 2020

2020
[14]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The epic-kitchens dataset. InEuropean Conference on Computer Vision (ECCV), 2018

2018
[15]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation, 2022. URLhttps://arxiv.org/abs/2203.12601. 11

Pith/arXiv arXiv 2022
[16]

T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control,
[17]

URLhttps://arxiv.org/abs/2203.06173

arXiv
[18]

Tschannen, A

M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa, O. H´enaff, J. Harmsen, A. Steiner, and X. Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025. URLhttps://arxiv.org/abs/2502.14786

Pith/arXiv arXiv 2025
[19]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URLhttps: //qwen.ai/blog?id=qwen3.5

2026
[20]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling, 2023. URLhttps://arxiv.org/abs/2210.02747

Pith/arXiv arXiv 2023
[21]

ACM Trans

J. Romero, D. Tzionas, and M. J. Black. Embodied hands: modeling and capturing hands and bodies together.ACM Transactions on Graphics, 36(6):1–17, Nov. 2017. ISSN 1557-7368. doi:10.1145/3130800.3130883. URLhttp://dx.doi.org/10.1145/3130800.3130883

work page doi:10.1145/3130800.3130883 2017
[22]

Z. Tong, Y . Song, J. Wang, and L. Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022. URLhttps://arxiv.org/abs/2203. 12602

2022
[23]

Majumdar, K

A. Majumdar, K. Yadav, S. Arnaud, Y . J. Ma, C. Chen, S. Silwal, A. Jain, V .-P. Berges, P. Abbeel, J. Malik, D. Batra, Y . Lin, O. Maksymets, A. Rajeswaran, and F. Meier. Where are we in the search for an artificial visual cortex for embodied intelligence?, 2023. URL https://arxiv.org/abs/2303.18240

arXiv 2023
[24]

Levine, C

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies,
[25]

URLhttps://arxiv.org/abs/1504.00702

Pith/arXiv arXiv
[26]

C. Finn, X. Y . Tan, Y . Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning, 2016. URLhttps://arxiv.org/abs/1509.06113

Pith/arXiv arXiv 2016
[27]

Rahmatizadeh, P

R. Rahmatizadeh, P. Abolghasemi, L. B ¨ol¨oni, and S. Levine. Vision-based multi-task ma- nipulation for inexpensive robots using end-to-end learning from demonstration, 2018. URL https://arxiv.org/abs/1707.02920

Pith/arXiv arXiv 2018
[28]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. URLhttps: //arxiv.org/abs/2010.11929

Pith/arXiv arXiv 2021
[29]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, K.-H. Lee, S. Levine, Y . Lu, U. Malla, D. Man- junath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsc...

Pith/arXiv arXiv 2023
[30]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. Pe...

Pith/arXiv arXiv 2023
[31]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control, 2026. URLhttps://arxiv. o...

Pith/arXiv arXiv 2026
[32]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...
[33]

X. Wang, I. Alabdulmohsin, D. Salz, Z. Li, K. Rong, and X. Zhai. Scaling pre-training to one hundred billion data for vision language models, 2025. URLhttps://arxiv.org/abs/ 2502.07617

Pith/arXiv arXiv 2025
[34]

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[35]

Bjorck, F

NVIDIA, :, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z...

Pith/arXiv arXiv 2025
[36]

Beyer, A

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Al- abdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Boˇsnjak, X. Chen, M. Minderer, P. V oigtlaender, I. Bica, I. Balazevic, J. Puigcer...

2024
[37]

Z. Li, G. Chen, S. Liu, S. Wang, V . VS, Y . Ji, S. Lan, H. Zhang, Y . Zhao, S. Radhakrishnan, N. Chang, K. Sapra, A. S. Deshmukh, T. Rintamaki, M. Le, I. Karmanov, L. V oegtle, P. Fischer, D.-A. Huang, T. Roman, T. Lu, J. M. Alvarez, B. Catanzaro, J. Kautz, A. Tao, G. Liu, and Z. Yu. Eagle 2: Building post-training data strategies from scratch for fronti...

arXiv 2025
[38]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100, 2020. URLhttps://arxiv.org/abs/2006.13256

arXiv 2020
[40]

Radosavovic, T

I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learn- ing with masked visual pre-training, 2022. URLhttps://arxiv.org/abs/2210.03109

arXiv 2022
[41]

M. K. Srirama, S. Dasari, S. Bahl, and A. Gupta. Hrp: Human affordances for robotic pre- training, 2024. URLhttps://arxiv.org/abs/2407.18911

arXiv 2024
[42]

S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos, 2025. URLhttps://arxiv.org/abs/2410.11758. 13

Pith/arXiv arXiv 2025
[43]

Y . Chen, Y . Ge, W. Tang, Y . Li, Y . Ge, M. Ding, Y . Shan, and X. Liu. Moto: Latent mo- tion token as the bridging language for learning robot manipulation from videos, 2025. URL https://arxiv.org/abs/2412.04445

arXiv 2025
[44]

W. Dai, K. Lan, J. Zhou, B. Zhao, X. Su, J. Tong, W. Guan, and S. Yang. Conla: Contrastive latent action learning from human videos for robotic manipulation, 2026. URLhttps:// arxiv.org/abs/2602.00557

arXiv 2026
[45]

Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions, 2025. URLhttps://arxiv.org/abs/2505. 06111

2025
[46]

Zhang, J

C. Zhang, J. Wang, Z. Gao, Y . Su, T. Dai, C. Zhou, J. Lu, and Y . Tang. Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos, 2026. URL https://arxiv.org/abs/2601.04061

arXiv 2026
[47]

Jiang, Y

G. Jiang, Y . Sun, T. Huang, H. Li, Y . Liang, and H. Xu. Robots pre-train robots: Manipulation- centric robotic representation from large-scale robot datasets, 2024. URLhttps://arxiv. org/abs/2410.22325

arXiv 2024
[48]

S.-W. Lee, X. Kang, B. Yang, and Y .-L. Kuo. Class: Contrastive learning via action sequence supervision for robot manipulation, 2025. URLhttps://arxiv.org/abs/2508.01600

arXiv 2025
[49]

T. Kim, J. Lee, M. Koo, D. Kim, K. Lee, C. Kim, Y . Seo, and J. Shin. Contrastive representation regularization for vision-language-action models, 2025. URLhttps://arxiv.org/abs/ 2510.01711

Pith/arXiv arXiv 2025
[50]

I.-C. A. Liu, K. Choromanski, S. Huang, and C. Schenck. Clamp: Contrastive learning for 3d multi-view action-conditioned robotic manipulation pretraining, 2026. URLhttps:// arxiv.org/abs/2602.00937

[51]

W. Wang, J. Li, Y. Zhu, Z. Xu, Z. Che, Y. Peng, C. Shen, D. Liu, F. Feng, and J. Tang. Visual robotic manipulation with depth-aware pretraining, 2024. URL https://arxiv.org/abs/2401.09038.

[52]

T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski. Vision transformers need registers, 2024. URL https://arxiv.org/abs/2309.16588.

[53]

J. Gu, F. Xiang, X. Li, Z. Ling, X. Liu, T. Mu, Y. Tang, S. Tao, X. Wei, Y. Yao, X. Yuan, P. Xie, Z. Huang, R. Chen, and H. Su. Maniskill2: A unified benchmark for generalizable manipulation skills. In International Conference on Learning Representations, 2023.

[54]

S. Caron, Y. De Mont-Marin, R. Budhiraja, S. H. Bang, I. Domrachev, S. Nedelchev, P. Du, A. Escande, J. Vaillant, B. Wingo, S. Patapati, D. San José Pro, and N. G. Marticorena Vidal. Pink: Python inverse kinematics based on Pinocchio, 2026. URL https://github.com/stephane-caron/pink.

[55]

J. Carpentier, G. Saurel, G. Buondonno, J. Mirabel, F. Lamiraux, O. Stasse, and N. Mansard. The Pinocchio C++ library – A fast and flexible implementation of rigid body dynamics algorithms and their analytical derivatives. In SII 2019 - International Symposium on System Integrations, Paris, France, Jan. 2019. URL https://hal.laas.fr/hal-01866228.

[56]

J. A. E. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl. CasADi – A software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1–36, 2019. doi:10.1007/s12532-018-0139-4.

[57]

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video, 2025. URL https://arxiv.org/abs/2505.11709.

[58]

Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks, 2020. URL https://arxiv.org/abs/1812.07035.

[59]

T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. URL https://arxiv.org/abs/2304.13705.

Appendix

A Additional Experiments

A.1 Saliency Visualization Details

The visualizations in Figure 1 (left) and the per-encoder comparison in Figure 4 are computed using each encoder’s native que...

