Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations
Pith reviewed 2026-06-27 13:29 UTC · model grok-4.3
The pith
A unified 3D keypoint representation enables training dexterous hand policies from human videos without robot demonstrations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By extracting 3D keypoints of objects and hands from raw human videos and training an autoregressive transformer to predict future keypoints, the method creates policies that transfer to robot hands. Human and robot behaviors align at the keypoint level for the wrist and fingertips, so no robot demonstrations are needed. On real-robot evaluations the policy reaches 75.0 percent success across pick-and-place and tool-use tasks while a state-of-the-art VLA baseline achieves only 1.0 percent, and it maintains performance in unseen multi-object and novel-category settings.
What carries the argument
Unified 3D keypoint representation used for both observations and actions in an autoregressive transformer trained on human videos.
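As a rough illustration of what "the same keypoints for observations and actions" can mean in practice, the sketch below rolls a predictor forward autoregressively: each predicted frame of 3D keypoints is appended to the history and becomes part of the next input. Everything named here (`rollout`, `constant_velocity`, the `context` window) is hypothetical; `constant_velocity` is a toy stand-in for the learned transformer, not the paper's model.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Frame = List[Point]  # one 3D point per tracked keypoint (hand and object)


def rollout(history: Sequence[Frame],
            predict_next: Callable[[Sequence[Frame]], Frame],
            steps: int,
            context: int = 4) -> List[Frame]:
    """Autoregressively extend a keypoint history by `steps` frames.

    Each predicted frame is fed back as input, so the predictor's outputs
    (actions) live in the same space as its inputs (observations).
    """
    if not history:
        raise ValueError("history must contain at least one frame")
    frames = list(history)
    for _ in range(steps):
        # Only the most recent `context` frames are shown to the predictor.
        frames.append(predict_next(frames[-context:]))
    return frames[len(history):]


def constant_velocity(window: Sequence[Frame]) -> Frame:
    """Hypothetical stand-in for a learned model: linear extrapolation."""
    if len(window) < 2:
        return list(window[-1])
    prev, last = window[-2], window[-1]
    return [
        tuple(2 * cur - old for cur, old in zip(last_pt, prev_pt))
        for last_pt, prev_pt in zip(last, prev)
    ]
```

On a robot, the hand portion of each predicted frame would presumably serve as the target for the wrist and fingertips; how the paper's controller tracks those targets is not described in the material reviewed here.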
If this is right
- Direct policy learning becomes possible for dexterous tasks without collecting robot data.
- The policy reaches 75 percent success across a real-robot suite spanning pick-and-place and tool-use tasks.
- The policy generalizes to unseen multi-object environments and novel object categories.
- Keypoint alignment allows bypassing the embodiment gap that usually requires fine-tuning.
Where Pith is reading between the lines
- Similar keypoint bridging could apply to other manipulation tasks or different robot morphologies if alignment holds.
- Reducing data collection costs might enable faster iteration on complex dexterous behaviors.
- Low-dimensional keypoints may suffice for many manipulation skills, suggesting further compression of visual inputs is viable.
Load-bearing premise
Wrist and fingertip keypoint trajectories are similar enough between human demonstrations and robot executions for the learned policy to work without adjustment.
What would settle it
Finding a dexterous task where the policy fails because human and robot fingertip paths diverge significantly even when both are described by the same keypoint extraction process.
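One way to make this test concrete, sketched under assumptions: given a human demonstration and a robot execution expressed as time-aligned 3D keypoints in the same frame and units, compute the mean distance per keypoint and flag the ones that exceed a tolerance. The function names, keypoint labels, and the threshold are illustrative and do not come from the paper.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Trajectory = Dict[str, List[Point]]  # keypoint name -> per-timestep 3D positions


def mean_keypoint_distance(human: Trajectory, robot: Trajectory) -> Dict[str, float]:
    """Mean Euclidean distance per keypoint between two time-aligned trajectories."""
    result = {}
    for name, human_path in human.items():
        if name not in robot:
            raise KeyError(f"robot trajectory is missing keypoint {name!r}")
        robot_path = robot[name]
        if not human_path or len(human_path) != len(robot_path):
            raise ValueError(f"paths for {name!r} must be non-empty and equal length")
        total = sum(math.dist(a, b) for a, b in zip(human_path, robot_path))
        result[name] = total / len(human_path)
    return result


def diverging_keypoints(distances: Dict[str, float], threshold: float) -> List[str]:
    """Names of keypoints whose mean distance exceeds the (assumed) tolerance."""
    return sorted(name for name, d in distances.items() if d > threshold)
```

A task where the policy fails and `diverging_keypoints` returns fingertip names, while the wrist stays within tolerance, would be the kind of evidence this section describes.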
Original abstract
Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dexterous Point Policy, a framework that extracts 3D keypoints (wrist, fingertips, and task-relevant objects) from human demonstration videos, trains an autoregressive transformer policy over these keypoints for both observations and actions, and deploys the resulting policy zero-shot on a physical robot hand without any robot demonstrations. It claims 75% success on real-robot pick-and-place and tool-use tasks versus 1% for a state-of-the-art VLA baseline, plus strong generalization to multi-object scenes and novel object categories, based on the observation that human and robot behaviors align closely at the keypoint level.
Significance. If the empirical claims hold after proper validation, the work would be significant for dexterous manipulation: it offers a concrete route to bypass expensive robot data collection by leveraging a unified keypoint representation to close the embodiment gap between human videos and robot hardware.
major comments (2)
- [Abstract] The central zero-shot transfer claim rests on the statement that 'at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align,' yet the supplied text provides no quantitative validation (e.g., KL divergence between human and realized robot keypoint trajectories, workspace overlap statistics, or an ablation measuring success-rate drop when the alignment assumption is violated).
- [Abstract] The headline result (75.0% vs. 1.0% success) is presented without any description of trial count, statistical significance testing, task definitions, baseline implementation details, or potential confounds such as object pose variation or lighting, rendering the magnitude of the improvement impossible to assess from the given material.
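Of the validations the first comment lists, workspace overlap is perhaps the simplest to compute. A minimal sketch, assuming human and robot fingertip positions are available as 3D points in a shared frame, uses the intersection-over-union of their axis-aligned bounding boxes. This is a coarse proxy chosen for illustration; neither the referee nor the paper specifies a particular overlap statistic.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner)


def bounding_box(points: Iterable[Point]) -> Box:
    """Axis-aligned bounding box of a set of 3D points."""
    pts = list(points)
    if not pts:
        raise ValueError("need at least one point")
    lo = tuple(min(p[i] for p in pts) for i in range(3))
    hi = tuple(max(p[i] for p in pts) for i in range(3))
    return lo, hi


def volume(box: Box) -> float:
    """Box volume; an inverted (empty) box has volume zero."""
    lo, hi = box
    result = 1.0
    for a, b in zip(lo, hi):
        result *= max(0.0, b - a)
    return result


def workspace_iou(human_pts: Iterable[Point], robot_pts: Iterable[Point]) -> float:
    """Intersection-over-union of the two point sets' bounding boxes."""
    h_lo, h_hi = bounding_box(human_pts)
    r_lo, r_hi = bounding_box(robot_pts)
    inter = (tuple(max(a, b) for a, b in zip(h_lo, r_lo)),
             tuple(min(a, b) for a, b in zip(h_hi, r_hi)))
    inter_v = volume(inter)
    union_v = volume((h_lo, h_hi)) + volume((r_lo, r_hi)) - inter_v
    return inter_v / union_v if union_v > 0 else 0.0
```

A value near 1 means the two workspaces largely coincide; a value near 0 means the robot rarely visits the region the human hand occupied. Bounding boxes overstate overlap for non-convex workspaces, so this is a lower bar than a trajectory-level comparison.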
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that additional details on validation and experimental reporting are warranted and will revise the abstract accordingly while preserving its conciseness.
- Referee: [Abstract] The central zero-shot transfer claim rests on the statement that 'at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align,' yet the supplied text provides no quantitative validation (e.g., KL divergence between human and realized robot keypoint trajectories, workspace overlap statistics, or an ablation measuring success-rate drop when the alignment assumption is violated).
  Authors: We acknowledge that the abstract presents the alignment as an observation without accompanying quantitative metrics. The full manuscript supports this via successful zero-shot transfer results and qualitative trajectory visualizations in the experiments. To strengthen the presentation, we will incorporate a brief quantitative validation (e.g., mean keypoint trajectory distance or an alignment ablation) into the revised abstract and reference the corresponding analysis in the main text.
  Revision: yes
- Referee: [Abstract] The headline result (75.0% vs. 1.0% success) is presented without any description of trial count, statistical significance testing, task definitions, baseline implementation details, or potential confounds such as object pose variation or lighting, rendering the magnitude of the improvement impossible to assess from the given material.
  Authors: The abstract is intentionally concise, but we agree that the headline numbers require supporting context for proper evaluation. The full manuscript details the evaluation protocol, including trial counts, task definitions, baseline setups, and controls for confounds. We will revise the abstract to include a short clause summarizing the evaluation scale (e.g., number of trials and tasks) and note that full details appear in Section 4.
  Revision: yes
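To see why the trial count matters for a headline number like 75 percent, the sketch below computes a Wilson score interval for a binomial success rate. The trial counts in the usage note are hypothetical, since the material reviewed here does not report them.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half


def interval_width(successes: int, trials: int) -> float:
    """Convenience helper: width of the 95% Wilson interval."""
    low, high = wilson_interval(successes, trials)
    return high - low
```

With a hypothetical 15 successes in 20 trials, the 95 percent interval is roughly 0.53 to 0.89; ten times as many trials at the same rate narrows it considerably. Reporting the number of trials lets a reader tell these two situations apart.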
Circularity Check
No circularity: empirical method validated on external robot benchmarks
The paper presents an empirical machine-learning pipeline that extracts 3D keypoints from human videos, trains an autoregressive transformer, and evaluates success directly on physical robot tasks (75% vs. 1% baseline). No equations, fitted parameters, or self-citations are shown that reduce the reported performance or generalization claims to inputs by construction. The keypoint-alignment statement is offered as an enabling observation rather than a derived result, and the work is self-contained against external real-robot benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: 3D keypoints of the wrist and fingertips extracted from raw human videos provide a sufficient and aligned representation for both observation and action that transfers to robot hands.
A Training Details
Pretraining. We pretrain the autoregressive transformer on the VITRA corpus (∼1M egocentric episodes) for 100k optimizer steps using AdamW with ...