Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Beomjun Kim; Jinwoo Shin; Sanghyeok Lee; Seong Hyeon Park; Seunghoon Sim; Seungjun Moon

arXiv: 2606.10614 · v1 · pith:DDVVL5XAnew · submitted 2026-06-09 · 💻 cs.RO · cs.CV · cs.LG

Pith reviewed 2026-06-27 13:29 UTC · model grok-4.3

classification: 💻 cs.RO · cs.CV · cs.LG

keywords: dexterous manipulation, human video demonstrations, 3D keypoints, policy learning, embodiment gap, robot hands, autoregressive transformer, real robot evaluation

The pith

A unified 3D keypoint representation enables training dexterous hand policies from human videos without robot demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Dexterous Point Policy, which trains policies for multi-fingered robot hands by processing 3D keypoints from human demonstration videos. These keypoints serve as both the input observations and the output actions in an autoregressive transformer model. The central idea is that wrist and fingertip positions align sufficiently between humans and robots to allow direct policy transfer. This results in 75 percent success on real-world dexterous tasks such as pick-and-place and tool use, compared to just 1 percent for a leading vision-language-action model. The approach also generalizes to multi-object scenes and novel object categories without additional training data.

Core claim

By extracting 3D keypoints of objects and hands from raw human videos and training an autoregressive transformer to predict future keypoints, the method creates policies that transfer to robot hands. Human and robot behaviors align at the keypoint level for the wrist and fingertips, so no robot demonstrations are needed. On real-robot evaluations the policy reaches 75.0 percent success across pick-and-place and tool-use tasks while a state-of-the-art VLA baseline achieves only 1.0 percent, and it maintains performance in unseen multi-object and novel-category settings.

What carries the argument

Unified 3D keypoint representation used for both observations and actions in an autoregressive transformer trained on human videos.
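
To make the unified representation concrete, the sketch below treats hand keypoints as both the observation and the predicted action, and rolls a predictor forward autoregressively. It is a minimal illustration under stated assumptions, not the paper's implementation: `rollout`, `toy_predictor`, the array shapes, and the 10 percent step rule are all hypothetical, and the toy predictor only stands in for the trained transformer. The six hand points (wrist plus five fingertips) follow the Figure 2 caption.

```python
import numpy as np

NUM_HAND_POINTS = 6  # wrist + five fingertips, per the Figure 2 caption


def rollout(predict_next, hand_points, object_points, horizon):
    """Autoregressively predict future hand keypoints.

    predict_next is a hypothetical stand-in for the trained model: it maps
    (hand_points, object_points) to the next hand_points. Each prediction is
    fed back in as the next observation, so observations and actions share
    one representation: arrays of 3D points.
    """
    current = np.asarray(hand_points, dtype=float)
    trajectory = []
    for _ in range(horizon):
        current = predict_next(current, object_points)
        trajectory.append(current)
    return np.stack(trajectory)  # shape: (horizon, NUM_HAND_POINTS, 3)


def toy_predictor(hand_points, object_points):
    # Toy rule (assumption, not the paper's model): move every hand point
    # 10% of the way toward the centroid of the object points.
    target = object_points.mean(axis=0)
    return hand_points + 0.1 * (target - hand_points)


hand = np.zeros((NUM_HAND_POINTS, 3))  # synthetic starting hand keypoints
obj = np.ones((8, 3))                  # synthetic object keypoints
traj = rollout(toy_predictor, hand, obj, horizon=5)
```

Because the output of each step has the same shape and meaning as the input, the same predicted points could in principle be handed to a retargeting or inverse-kinematics stage for any hand that exposes a wrist and five fingertips, which is the property the transfer claim depends on.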

If this is right

Direct policy learning becomes possible for dexterous tasks without collecting robot data.
Across the pick-and-place and tool-use task suite, the policy reaches 75 percent success.
Generalization occurs to multi-object environments and novel object categories.
Keypoint alignment allows bypassing the embodiment gap that usually requires fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar keypoint bridging could apply to other manipulation tasks or different robot morphologies if alignment holds.
Reducing data collection costs might enable faster iteration on complex dexterous behaviors.
Low-dimensional keypoints may suffice for many manipulation skills, suggesting further compression of visual inputs is viable.

Load-bearing premise

Wrist and fingertip keypoint trajectories are similar enough between human demonstrations and robot executions for the learned policy to work without adjustment.

What would settle it

Finding a dexterous task where the policy fails because human and robot fingertip paths diverge significantly even when both are described by the same keypoint extraction process.

Figures

Figures reproduced from arXiv: 2606.10614 by Beomjun Kim, Jinwoo Shin, Sanghyeok Lee, Seong Hyeon Park, Seunghoon Sim, Seungjun Moon.

**Figure 1.** Dexterous Point Policy. We present a dexterous manipulation policy trained solely from human demonstration videos. Our method combines (1) a six-keypoint hand abstraction shared by human and robot, (2) internet-scale human-video pretraining and per-task fine-tuning, and (3) a fingertip contact prediction that injects force on top of the otherwise point-only representation. Together they enable a multi-fing…

**Figure 2.** Overview. (1) Point Extraction: From an egocentric frame and task description, we extract object points (via VLM segmentation and depth estimation), hand points (wrist and five fingertips from a hand tracker), and contact points (lightweight manual annotation). (2) Architecture & Training: During pretraining, an autoregressive transformer takes language, object names, object points, and the current hand po…

**Figure 3.** Real-world rollouts. Successful executions of Dexterous Point Policy across our suite of dexterous manipulation tasks, deployed on an OpenArm bimanual arm equipped with Inspire RH56F1 hands. All policies are trained from human videos alone, with no robot demonstrations.

… force is applied gradually rather than instantaneously. The resulting joint trajectory is executed by the robot controller at 20 Hz.

4 Exp…

**Figure 4.** Real-world task setup. We evaluate Dexterous Point Policy on four task categories: Pick and Place (top-left) with five objects placed on a 2 × 2 grid and a fixed target container; Open (top-right), where the robot opens a microwave door from a randomized initial pose; Brush (bottom-left), where the robot grasps a hand brush and sweeps debris to a target location; and Spray (bottom-right), where the robot g…

**Figure 5.** [PITH_FULL_IMAGE:figures/full_fig_p016_5.png]

**Figure 6.** Comparison between original and scale-consistent HaWoR. We compare inference results on a single egocentric video with each model, and visualize (a) hand scale variance, and (b) noise amplitude. Indeed, the original HaWoR shows more than 10% scale inconsistency even within a single egocentric video, while also showing higher noise amplitude along the depth axis.

F Scale-Consistent HaWoR. With the given egoc…

**Figure 7.** Simulation setup for auxiliary residual RL.

G.2 Task and Residual RL Formulation. Task, simulator, and initial states. The experiment is conducted in a MuJoCo dexterous-hand simulator configured to approximate the real-robot setup. We use a spherical-object-to-bowl manipulation task, where the dexterous hand must move a spherical object from an initial table-top position into a bowl. The object is initiali…

**Figure 8.** Anchor-balanced success rate. Residual RL improves the base-policy reference under the same anchor-balanced evaluation protocol.

[Plot panels: (a) Mean Q value vs. environment steps (0k to 400k), seeds 1 to 4; (b) Mean absolut… vs. environment steps (0k to 400k)]

**Figure 9.** Residual RL training diagnostics. Bounded Q values and TD errors indicate stable chunk-level residual RL optimization.

G.5 Interpretation and Limitations. These results suggest that, in this simulation setting, the dexterous point policy can serve as a useful prior for residual RL. The base policy already provides object-centric hand trajectories and a contact prior, allowing the residual policy to focus on…

Original abstract

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper claims zero-shot transfer of dexterous policies from human videos via 3D keypoints with 75% success, but provides no validation that the keypoint alignment actually holds across embodiments.

The main takeaway is that this work trains an autoregressive transformer on 3D keypoints extracted from human hand videos and deploys the resulting policy directly on a robot hand, reporting 75% success on pick-and-place and tool-use tasks against 1% for a VLA baseline, plus some generalization to multi-object and novel-category cases.

What stands out is the decision to use the same keypoint representation for both observations and actions, which lets them avoid any robot demonstration data. That framing is distinct from typical fine-tuning pipelines and directly targets the data bottleneck in dexterous manipulation.

The soft spot is exactly the one flagged in the stress-test note. The abstract asserts that wrist and fingertip keypoints align closely enough between human and robot to enable direct transfer, yet the supplied text contains no quantitative check—no KL divergence on trajectories, no ablation on reachable workspace, no comparison of velocity distributions. Different hand morphologies make it plausible that many predicted keypoints fall outside the robot’s joint limits or produce mismatched dynamics once inverse kinematics is applied. Without that evidence, the performance gap is difficult to attribute to the method rather than implementation details or task selection.
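
One of the simplest checks of this kind is a mean distance between matched human and robot keypoints. The sketch below is a hedged illustration only: the function name `mean_keypoint_distance`, the `(T, K, 3)` array layout, and the synthetic 2 cm offset are assumptions introduced here, and it presumes the two trajectories are already time-aligned and expressed in the same coordinate frame, which a real comparison would have to establish first.

```python
import numpy as np


def mean_keypoint_distance(human_traj, robot_traj):
    """Mean Euclidean distance between matched keypoints.

    Both inputs have shape (T, K, 3): T time-aligned steps, K keypoints
    (for example wrist plus five fingertips), and xyz coordinates in metres.
    """
    human_traj = np.asarray(human_traj, dtype=float)
    robot_traj = np.asarray(robot_traj, dtype=float)
    if human_traj.shape != robot_traj.shape:
        raise ValueError("trajectories must share the shape (T, K, 3)")
    # Distance per keypoint per timestep, shape (T, K), then one scalar mean.
    per_point = np.linalg.norm(human_traj - robot_traj, axis=-1)
    return float(per_point.mean())


# Synthetic data only: a robot trajectory offset by 2 cm along x.
human = np.zeros((10, 6, 3))
robot = human.copy()
robot[..., 0] += 0.02
gap = mean_keypoint_distance(human, robot)
```

A per-keypoint breakdown (taking the mean over time only) would show whether any mismatch concentrates in particular fingertips, which is where differences in hand morphology would be expected to appear.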

The experimental protocol is also described at too high a level to assess: no trial counts, variance numbers, or baseline implementation specifics appear. This is the kind of paper that would benefit from a referee who can examine the full methods and results sections.

It is aimed at researchers working on video imitation for complex manipulation. The idea is practical enough that it deserves peer review to test whether the transfer claim survives scrutiny.

Referee Report

2 major / 0 minor

Summary. The paper introduces Dexterous Point Policy, a framework that extracts 3D keypoints (wrist, fingertips, and task-relevant objects) from human demonstration videos, trains an autoregressive transformer policy over these keypoints for both observations and actions, and deploys the resulting policy zero-shot on a physical robot hand without any robot demonstrations. It claims 75% success on real-robot pick-and-place and tool-use tasks versus 1% for a state-of-the-art VLA baseline, plus strong generalization to multi-object scenes and novel object categories, based on the observation that human and robot behaviors align closely at the keypoint level.

Significance. If the empirical claims hold after proper validation, the work would be significant for dexterous manipulation: it offers a concrete route to bypass expensive robot data collection by leveraging a unified keypoint representation to close the embodiment gap between human videos and robot hardware.

major comments (2)

[Abstract] The central zero-shot transfer claim rests on the statement that 'at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align,' yet the supplied text provides no quantitative validation (e.g., KL divergence between human and realized robot keypoint trajectories, workspace overlap statistics, or an ablation measuring success-rate drop when the alignment assumption is violated).
[Abstract] The headline result (75.0% vs. 1.0% success) is presented without any description of trial count, statistical significance testing, task definitions, baseline implementation details, or potential confounds such as object pose variation or lighting, rendering the magnitude of the improvement impossible to assess from the given material.
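
To illustrate why the missing trial count matters, the sketch below computes a Wilson score interval for a binomial success rate. The trial counts used (15 of 20 and 150 of 200) are hypothetical and chosen only because both equal 75 percent; the page reports no actual counts, and `wilson_interval` is an illustrative helper rather than anything from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


# Hypothetical counts: the same 75% rate at two different evaluation scales.
small_lo, small_hi = wilson_interval(15, 20)
large_lo, large_hi = wilson_interval(150, 200)
```

The same headline rate yields a much wider interval at 20 trials than at 200, so a 75.0% figure cannot be weighed properly until the evaluation scale is stated.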

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that additional details on validation and experimental reporting are warranted and will revise the abstract accordingly while preserving its conciseness.

Referee: [Abstract] The central zero-shot transfer claim rests on the statement that 'at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align,' yet the supplied text provides no quantitative validation (e.g., KL divergence between human and realized robot keypoint trajectories, workspace overlap statistics, or an ablation measuring success-rate drop when the alignment assumption is violated).

Authors: We acknowledge that the abstract presents the alignment as an observation without accompanying quantitative metrics. The full manuscript supports this via successful zero-shot transfer results and qualitative trajectory visualizations in the experiments. To strengthen the presentation, we will incorporate a brief quantitative validation (e.g., mean keypoint trajectory distance or an alignment ablation) into the revised abstract and reference the corresponding analysis in the main text. Revision: yes.

Referee: [Abstract] The headline result (75.0% vs. 1.0% success) is presented without any description of trial count, statistical significance testing, task definitions, baseline implementation details, or potential confounds such as object pose variation or lighting, rendering the magnitude of the improvement impossible to assess from the given material.

Authors: The abstract is intentionally concise, but we agree that the headline numbers require supporting context for proper evaluation. The full manuscript details the evaluation protocol, including trial counts, task definitions, baseline setups, and controls for confounds. We will revise the abstract to include a short clause summarizing the evaluation scale (e.g., number of trials and tasks) and note that full details appear in Section 4. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical method validated on external robot benchmarks

The paper presents an empirical machine-learning pipeline that extracts 3D keypoints from human videos, trains an autoregressive transformer, and evaluates success directly on physical robot tasks (75% vs. 1% baseline). No equations, fitted parameters, or self-citations are shown that reduce the reported performance or generalization claims to inputs by construction. The keypoint-alignment statement is offered as an enabling observation rather than a derived result, and the work is self-contained against external real-robot benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that 3D keypoints extracted from human videos can be used to train policies that transfer directly because wrist and fingertip motions align across embodiments; this is a domain assumption with no independent verification supplied in the abstract.

axioms (1)

Domain assumption: 3D keypoints of wrists and fingertips extracted from raw human videos provide a sufficient and aligned representation for both observation and action that transfers to robot hands.
Invoked in the core-insight paragraph of the abstract as the basis for direct policy transfer without robot data.

Reference graph

Works this paper leans on

47 extracted references · 11 linked inside Pith

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[3] Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
[4] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[5] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[6] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI Technical Report, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
[7] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-video diffusion models with an expert transformer. In ICLR, 2025.
[8] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In RSS, 2023.
[9] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In RSS, 2024.
[10] Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. In ICRA, 2024.
[11] Kristen Grauman et al. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
[12] Kristen Grauman et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In CVPR, 2024.
[13] Raghav Goyal et al. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
[14] Dima Damen et al. Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV, 2018.
[15] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. In CoRL, 2022.
[16] Yecheng Jason Ma et al. VIP: Towards universal visual reward and representation via value-implicit pre-training. In ICLR, 2023.
[17] Siddharth Karamcheti et al. Language-driven representation learning for robotics. In RSS, 2023.
[18] Arjun Majumdar et al. Where are we in the search for an artificial visual cortex for embodied intelligence? arXiv preprint arXiv:2303.18240, 2023.
[19] Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In CVPR, 2023.
[20] Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation. arXiv preprint arXiv:2405.01527, 2024.
[21] Homanga Bharadhwaj et al. Gen2Act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024.
[22] Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning. In Robotics: Science and Systems (RSS), 2024.
[23] Johan Bjorck et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
[24] Physical Intelligence. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
[25] Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, Yizhong Zhang, Xi Chen, Hao Chen, Lily Sun, Dong Chen, Jiaolong Yang, and Baining Guo. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. In IEEE International Conference on Rob... 2026.
[26] Siddhant Haldar and Lerrel Pinto. Point policy: Unifying observations and actions with key points for robot manipulation. In Conference on Robot Learning (CoRL), 2025.
[27] Siddhant Haldar, Lars Johannsmeier, Lerrel Pinto, Abhishek Gupta, Dieter Fox, Yashraj Narang, and Ajay Mandlekar. Point bridge: 3D representations for cross domain policy learning. arXiv preprint arXiv:2601.16212, 2026.
[28] Lucas Manuelli, Wei Gao, Peter Florence, and Russ Tedrake. kPAM: Keypoint affordances for category-level robotic manipulation. In ISRR, 2019.
[29] Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B. Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In ICRA, 2022.
[30] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. In CoRL, 2024.
[31] Fangchen Liu, Kuan Fang, Pieter Abbeel, and Sergey Levine. MOKA: Open-world robotic manipulation through mark-based visual prompting. In RSS, 2024.
[32] Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. RoboPoint: A vision-language model for spatial affordance prediction for robotics. In CoRL, 2024.
[33] Mara Levy, Siddhant Haldar, Lerrel Pinto, and Abhinav Shrivastava. P3-PO: Prescriptive point priors for visuo-spatial generalization of robot policies. arXiv preprint arXiv:2412.06784, 2024.
[34] Priyanka Mandikal and Kristen Grauman. DexVIP: Learning dexterous grasping with human hand pose priors from video. In Conference on Robot Learning (CoRL), 2022.
[35] Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. VideoDex: Learning dexterity from internet videos. In CoRL, 2023.
[36] Matthieu Lepert et al. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025.
[37] Jinglei Zhang, Jiankang Deng, Chao Ma, and Rolandos Alexandros Potamias. HaWoR: World-space hand motion reconstruction from egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[38] Shuai Bai et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[39] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.
[40] Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647.
[41] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP, 2019.
[42] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[43] Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022.
[44] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024.
[45] Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, and Anusha Nagabandi. Residual off-policy RL for finetuning behavior cloning policies, 2025.
[46] Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[47] Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double Q-learning: Learning fast without a model. In International Conference on Learning Representations, 2021.

A Training Details. Pretraining. We pretrain the autoregressive transformer on the VITRA corpus (∼1M egocentric episodes) for 100k optimizer steps using AdamW with ...

[1] [1]

GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

Pith/arXiv arXiv 2023

[2] [2]

LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023. 1

Pith/arXiv arXiv 2023

[3] [3]

Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

Gemini Team. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

Pith/arXiv arXiv

[4] [4]

Hierarchical text-conditional image generation with CLIP latents.arXiv preprint arXiv:2204.06125, 2022

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents.arXiv preprint arXiv:2204.06125, 2022. 1

Pith/arXiv arXiv 2022

[5] [5]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 1

2022

[6] [6]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Tay- lor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI Technical Report, 2024. URL https://openai.com/research/ video-generation-models-as-world-simulators. 1

2024

[7] [7]

CogVideoX: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-video diffusion models with an expert transformer. InICLR, 2025. 1

2025

[8]

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In RSS, 2023. 1

[9]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In RSS, 2024. 1

[10]

Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. In ICRA, 2024. 1

[11]

Kristen Grauman et al. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022. 1, 2, 5

[12]

Kristen Grauman et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In CVPR, 2024. 1, 2, 5

[13]

Raghav Goyal et al. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017. 1, 2, 5

[14]

Dima Damen et al. Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV, 2018. 1, 2, 5

[15]

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. In CoRL, 2022. 1, 3

[16]

Yecheng Jason Ma et al. VIP: Towards universal visual reward and representation via value-implicit pre-training. In ICLR, 2023. 1, 3

[17]

Siddharth Karamcheti et al. Language-driven representation learning for robotics. In RSS, 2023. 1, 3

[18]

Arjun Majumdar et al. Where are we in the search for an artificial visual cortex for embodied intelligence? arXiv preprint arXiv:2303.18240, 2023. 1, 3

[19]

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In CVPR, 2023. 1, 3

[20]

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation. arXiv preprint arXiv:2405.01527, 2024. 1, 3

[21]

Homanga Bharadhwaj et al. Gen2Act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024. 1, 3

[22]

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning. In Robotics: Science and Systems (RSS), 2024. 1, 3

[23]

Johan Bjorck et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025. 1, 3

[24]

Physical Intelligence. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 1, 3

[25]

Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, Yizhong Zhang, Xi Chen, Hao Chen, Lily Sun, Dong Chen, Jiaolong Yang, and Baining Guo. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. In IEEE International Conference on Rob...

2026

[26]

Siddhant Haldar and Lerrel Pinto. Point policy: Unifying observations and actions with key points for robot manipulation. In Conference on Robot Learning (CoRL), 2025. 2, 3, 8, 14

[27]

Siddhant Haldar, Lars Johannsmeier, Lerrel Pinto, Abhishek Gupta, Dieter Fox, Yashraj Narang, and Ajay Mandlekar. Point bridge: 3D representations for cross domain policy learning. arXiv preprint arXiv:2601.16212, 2026. 2, 3

[28]

Lucas Manuelli, Wei Gao, Peter Florence, and Russ Tedrake. kPAM: Keypoint affordances for category-level robotic manipulation. In ISRR, 2019. 3

[29]

Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B. Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In ICRA, 2022. 3

[30]

Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. In CoRL, 2024. 3

[31]

Fangchen Liu, Kuan Fang, Pieter Abbeel, and Sergey Levine. MOKA: Open-world robotic manipulation through mark-based visual prompting. In RSS, 2024. 3

[32]

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. RoboPoint: A vision-language model for spatial affordance prediction for robotics. In CoRL, 2024. 3

[33]

Mara Levy, Siddhant Haldar, Lerrel Pinto, and Abhinav Shrivastava. P3-PO: Prescriptive point priors for visuo-spatial generalization of robot policies. arXiv preprint arXiv:2412.06784, 2024. 3

[34]

Priyanka Mandikal and Kristen Grauman. DexVIP: Learning dexterous grasping with human hand pose priors from video. In Conference on Robot Learning (CoRL), 2022. 3

[35]

Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. VideoDex: Learning dexterity from internet videos. In CoRL, 2023. 3

[36]

Matthieu Lepert et al. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025. 3

[37]

Jinglei Zhang, Jiankang Deng, Chao Ma, and Rolandos Alexandros Potamias. HaWoR: World-space hand motion reconstruction from egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 5, 17

[38]

Shuai Bai et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025. 5

[39]

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025. 5

[40]

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647,

[41]

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP, 2019. 6

[42]

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017. 6

[43]

Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022. 17

[44]

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024. 17

[45]

Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, and Anusha Nagabandi. Residual off-policy RL for finetuning behavior cloning policies, 2025. 18

[46]

Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 18

[47]

Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double Q-learning: Learning fast without a model. In International Conference on Learning Representations, 2021. 20

A Training Details

Pretraining. We pretrain the autoregressive transformer on the VITRA corpus (∼1M egocentric episodes) for 100k optimizer steps using AdamW with ...