HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control
Pith reviewed 2026-07-03 15:50 UTC · model grok-4.3
The pith
Hand-controlled egocentric videos can be generated from unconstrained monocular footage by disentangling camera motion with a Plucker Hand Map.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HandsOnWorld demonstrates that by annotating 3D hands on in-the-wild egocentric videos through monocular reconstruction and filtering to create the EgoVid-Pro dataset, combined with the Plucker Hand Map as a control signal, a generator can be trained that achieves higher reconstruction fidelity and control accuracy than prior methods while generalizing to out-of-distribution everyday scenes.
What carries the argument
The Plucker Hand Map, a 3D-aware control signal that extends Plucker-ray representations from camera rays to the hand surface to disentangle camera and hand motion at the representation level.
If this is right
- Generated videos show improved fidelity in reconstructing scenes and hand movements compared to previous hand-controlled generators.
- Control accuracy for hand poses is higher, allowing more precise following of input hand trajectories.
- The approach generalizes to everyday scenes outside the laboratory datasets used by prior methods.
- Training relies on a dataset of 103K clips and about 12M frames from diverse real-world egocentric videos.
Where Pith is reading between the lines
- Such a system could support more accessible creation of personalized video content using only smartphone footage.
- Extensions might include integrating other body controls or interactive editing of generated videos.
- Testing on even more diverse environments like outdoor activities could reveal further generalization limits.
Load-bearing premise
Monocular 3D hand reconstruction followed by filtering at action-semantic, image-quality, and 3D-geometric levels yields sufficiently accurate protagonist-only hand trajectories for training a generalizable generator.
What would settle it
Observe whether videos generated with hand controls from unseen everyday scenes maintain accurate hand poses and scene consistency without the artifacts seen in prior methods limited to lab data.
Figures
read the original abstract
We present HandsOnWorld, a framework for hand-controlled egocentric video generation that forgoes multi-view and marker-based motion capture, learning instead from unconstrained monocular video. Such generality is bottlenecked by the scarcity of scalable 3D hand annotations: large egocentric corpora lack finger-level labels, whereas precise hand datasets are confined to narrow, instrumented settings, limiting prior hand-controlled generators to restricted scene distributions. We instead annotate 3D hands directly on in-the-wild egocentric video through monocular reconstruction, introducing a protagonist-centered annotation pipeline that filters the reconstructions at the action-semantic, image-quality, and 3D-geometric levels to build EgoVid-Pro, a dataset of clean, protagonist-only hand trajectories spanning 103K clips and roughly 12M frames across diverse everyday scenes. To resolve the camera-hand entanglement induced by large ego-motion, we further propose the Pl\"{u}cker Hand Map, a 3D-aware control signal that extends Pl\"{u}cker-ray representations from camera rays to the hand surface, disentangling camera and hand motion at the representation level. Experiments show that \method surpasses prior hand-controlled generators in reconstruction fidelity and control accuracy, and generalizes to out-of-distribution everyday scenes beyond the laboratory datasets on which prior methods rely.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HandsOnWorld for hand-controlled egocentric video generation from unconstrained monocular video. It constructs the EgoVid-Pro dataset (103K clips, ~12M frames) via monocular 3D hand reconstruction on in-the-wild video, using a protagonist-centered pipeline with action-semantic, image-quality, and 3D-geometric filters. It proposes the Plücker Hand Map to disentangle camera and hand motion. Experiments claim superior reconstruction fidelity, control accuracy, and OOD generalization versus prior hand-controlled generators limited to lab datasets.
Significance. If the central claims hold, the work would meaningfully expand the scope of controllable egocentric video generation beyond instrumented lab settings by leveraging scalable monocular annotations and a 3D-aware control representation. The Plücker Hand Map offers a concrete mechanism for camera-hand disentanglement at the representation level, and the scale of EgoVid-Pro (if its trajectories are sufficiently clean) could support broader generalization. These elements would be strengths if accompanied by rigorous validation.
major comments (2)
- [Abstract and dataset construction section] The central claim of superior fidelity, control accuracy, and OOD generalization rests on the quality of the EgoVid-Pro training trajectories. The protagonist-centered annotation pipeline (monocular reconstruction + multi-level filtering) is described but no quantitative validation is provided, such as MPJPE, PA-MPJPE, or comparison against mocap/multi-view ground truth on held-out clips. This leaves the training signal accuracy unverified and is load-bearing for all downstream results.
- [Experiments] Experiments section: superiority and generalization claims are asserted without reference to specific quantitative metrics, baselines, error analysis, or evaluation protocol (e.g., no reported numbers for reconstruction fidelity or control accuracy). This makes it impossible to assess whether the results support the stated claims.
minor comments (1)
- [Method] Notation for the Plücker Hand Map should be introduced with an explicit equation or diagram early in the method section to clarify how it extends standard Plücker rays to the hand surface.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger validation of the EgoVid-Pro dataset construction and clearer quantitative reporting in the experiments. We address each major comment below. Where the manuscript is missing explicit details, we will revise accordingly.
read point-by-point responses
-
Referee: [Abstract and dataset construction section] The central claim of superior fidelity, control accuracy, and OOD generalization rests on the quality of the EgoVid-Pro training trajectories. The protagonist-centered annotation pipeline (monocular reconstruction + multi-level filtering) is described but no quantitative validation is provided, such as MPJPE, PA-MPJPE, or comparison against mocap/multi-view ground truth on held-out clips. This leaves the training signal accuracy unverified and is load-bearing for all downstream results.
Authors: We agree that direct quantitative metrics such as MPJPE against mocap ground truth are not reported. Such ground truth is unavailable by design, as the pipeline targets unconstrained in-the-wild monocular video without instrumentation. The multi-level filters (action-semantic, image-quality, 3D-geometric) are intended to ensure trajectory cleanliness, but we acknowledge the absence of explicit validation leaves the claim under-supported. In revision we will add quantitative statistics on filter rejection rates, hand-pose consistency metrics across clips, and qualitative side-by-side comparisons of raw vs. filtered reconstructions to better substantiate the dataset quality. revision: partial
-
Referee: [Experiments] Experiments section: superiority and generalization claims are asserted without reference to specific quantitative metrics, baselines, error analysis, or evaluation protocol (e.g., no reported numbers for reconstruction fidelity or control accuracy). This makes it impossible to assess whether the results support the stated claims.
Authors: The referee is correct that the current draft does not present explicit numerical values, tables, or detailed evaluation protocols for the claimed improvements in reconstruction fidelity and control accuracy. While the abstract summarizes the outcomes, the experiments section relies on qualitative descriptions and figures without the supporting numbers. We will revise the experiments section to include concrete metrics, baseline comparisons, error analysis, and a clear evaluation protocol in the next version. revision: yes
- Direct MPJPE/PA-MPJPE validation of EgoVid-Pro against mocap or multi-view ground truth, which does not exist for the in-the-wild monocular videos used.
Circularity Check
No circularity: claims rest on new dataset and representation without self-referential derivations
full rationale
The abstract and provided text describe a pipeline for monocular 3D hand annotation, multi-level filtering to create EgoVid-Pro, and introduction of the Plücker Hand Map as a control signal. No equations, derivations, or fitted parameters are presented that reduce a claimed prediction or result back to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims concern empirical performance on a newly constructed dataset and a novel representation; these do not exhibit any of the enumerated circularity patterns. The derivation chain is self-contained against external benchmarks and does not rely on renaming known results or smuggling assumptions via prior self-work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Monocular 3D hand reconstruction tools can produce usable trajectories on in-the-wild egocentric video after multi-level filtering
invented entities (1)
-
Plücker Hand Map
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Recammaster: Camera-controlled generative rendering from a single video
Jianhong Bai, Menghan Xia, Xiao Wang, Ziyang Yuan, Xintao Fu, Zuozhuo Liu, Haoji Wang, Xiang Wen, Yu- jiu Zhang, Yansong Wang, Wenping Yang, and Zhipeng Wang. ReCamMaster: Camera-controlled generative render- ing from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. arXiv:2503.11647. 3
-
[2]
Whole-body conditioned ego- centric video prediction.arXiv preprint arXiv:2506.21552,
Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-body conditioned ego- centric video prediction.arXiv preprint arXiv:2506.21552,
-
[3]
HOT3D: Hand and object tracking in 3D from ego- centric multi-view videos
Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Julian Engel, and Tomas Hodan. HOT3D: Hand and object tracking in 3D from ego- centric multi-view videos. InProceedings of the IEEE/CVF Conference on Computer Vision...
-
[4]
Align your latents: High-resolution video synthesis with la- tent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22563–22575, 2023. 3
2023
-
[5]
Genie: Gener- ative interactive environments
Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker- Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder S...
-
[6]
Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, Jan Kautz, and Di- eter Fox
Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S. Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, Jan Kautz, and Di- eter Fox. DexYCB: A benchmark for capturing hand grasp- ing of objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. arXiv:2104.04631. 1, 4
-
[7]
Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, and Ot- mar Hilliges. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. arXiv:2204.13662. 1, 3, 4, 7
-
[8]
3DTrajMaster: Mastering 3D trajectory for multi-entity motion in video generation
Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, and Dahua Lin. 3DTrajMaster: Mastering 3D trajectory for multi-entity motion in video generation. InProceedings of the International Conference on Learning Representations (ICLR), 2025. arXiv:2412.07759. 3
-
[9]
Vista: A generalizable driving world model with high fidelity and versatile controllability
Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2405.17398. 1, 3
-
[10]
Ego4D: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Charade, Rohit Furuta, Anca Helm, Miao Hig- gins, Howard Ipson, Suyog Jain, et al. Ego4D: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, 2022. 1, 3, 5
2022
-
[11]
Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2311.18259. 1, 7
-
[12]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to- image diffusion models without specific tuning. InProceed- ings of the International Conference on Learning Represen- tations (ICLR), 2024. arXiv:2307.04725. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Twigg, Peizhao Zhang, Jeff Petkau, Tsz-Ho Yu, Chun-Jung Tai, Muzaffer Akbay, Zheng Wang, et al
Shangchen Han, Beibei Liu, Randi Cabezas, Christopher D. Twigg, Peizhao Zhang, Jeff Petkau, Tsz-Ho Yu, Chun-Jung Tai, Muzaffer Akbay, Zheng Wang, et al. MEgATrack: Monochrome egocentric articulated hand-tracking for virtual reality.ACM Transactions on Graphics (SIGGRAPH), 39(4),
-
[14]
UmeTrack: Unified multi- view end-to-end hand tracking for VR
Shangchen Han, Po-Chen Wu, Yubo Zhang, Beibei Liu, Lin- guang Zhang, Zheng Wang, Weiguang Si, Peizhao Zhang, Yujun Cai, Tomas Hodan, et al. UmeTrack: Unified multi- view end-to-end hand tracking for VR. InSIGGRAPH Asia 2022 Conference Papers, 2022. arXiv:2211.00099. 4
-
[15]
EgoSim: Egocentric World Simulator for Embodied Interaction Generation
Jinkun Hao, Mingda Jia, Ruiyan Wang, Hongrui Zhu, Ji- afei Cao, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, and Xudong Xu. EgoSim: Egocentric world sim- ulator for embodied interaction generation.arXiv preprint arXiv:2604.01001, 2026. 3
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[16]
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: En- abling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[18]
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denois- ing diffusion probabilistic models. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2006.11239. 4
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[19]
Yoon, Mouli Sivapu- rapu, and Jian Zhang
Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapu- rapu, and Jian Zhang. EgoDex: Learning dexterous manip- ulation from large-scale egocentric video. InProceedings of the International Conference on Learning Representations (ICLR), 2026. 1
2026
-
[20]
GAIA-1: A Generative World Model for Autonomous Driving
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gian- luca Corrado. GAIA-1: A generative world model for au- tonomous driving.arXiv preprint arXiv:2309.17080, 2023. arXiv:2309.17080. 1, 3 10
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021. arXiv:2106.09685. 7
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[22]
Animate Anyone: Consistent and controllable image-to-video synthesis for character animation
Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate Anyone: Consistent and controllable image-to-video synthesis for character animation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2311.17117. 3
-
[23]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2506.08009. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
VBench: Comprehensive benchmark suite for video generative mod- els
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative mod- els. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni...
-
[25]
VACE: All-in-One Video Creation and Editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. V ACE: All-in-one video cre- ation and editing. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), 2025. arXiv:2503.07598. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
H2O: Two hands manipulating ob- jects for first person interaction recognition
Taein Kwon, Bugra Tekin, J ¨org St ¨uckler, Abdullah Arma- gan, and Marc Pollefeys. H2O: Two hands manipulating ob- jects for first person interaction recognition. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), 2021. arXiv:2104.11181. 1
-
[28]
Modular primitives for high-performance differentiable rendering.ACM Transac- tions on Graphics, 39(6), 2020
Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering.ACM Transac- tions on Graphics, 39(6), 2020. 6
2020
-
[29]
Ground- ing image matching in 3D with MASt3R
Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3D with MASt3R. InProceedings of the European Conference on Computer Vision (ECCV),
-
[30]
Dayou Li, Lulin Liu, Bangya Liu, Shijie Zhou, Jiu Feng, Ziqi Lu, Minghui Zheng, Chenyu You, and Zhiwen Fan. Egocen- tric world model for photorealistic hand-object interaction synthesis.arXiv preprint arXiv:2603.13615, 2026. 1, 3
-
[31]
Zisu Li, Hengye Lyu, Jiaxin Shi, Yufeng Zeng, Mingming Fan, Hanwang Zhang, and Chen Liang. SpriteHand: Real- time versatile hand-object interaction with autoregressive video generation.arXiv preprint arXiv:2512.01960, 2025. 3
-
[32]
HOI4D: A 4D egocentric dataset for category-level human- object interaction
Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. HOI4D: A 4D egocentric dataset for category-level human- object interaction. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),
- [33]
-
[34]
TACO: Benchmarking gener- alizable bimanual tool-ACtion-object understanding
Yun Liu, Haolin Yang, Xu Si, Ling Liu, Zipeng Li, Yuxiang Zhang, Yebin Liu, and Li Yi. TACO: Benchmarking gener- alizable bimanual tool-ACtion-object understanding. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21740–21751, 2024. 1, 4
2024
-
[35]
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi- person linear model.ACM Transactions on Graphics, 34(6): 248:1–248:16, 2015. 4, 5
2015
-
[36]
Aria Everyday Activities Dataset.arXiv preprint arXiv:2402.13349, 2024
Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, and Richard Newcombe. Aria Everyday Activities Dataset.arXiv preprint arXiv:2402.13349, 2024. arXiv:2402.13349. 1
-
[37]
Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, and Richard Newcombe
Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, Kevin Bailey, David Soriano Fosas, C. Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, and Richard Newcombe. Nymeria: A massive col- lection of multimodal egocentric daily motion in the wild. In Proceedings of t...
-
[38]
Cosmos World Foundation Model Platform for Physical AI
NVIDIA. Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025. arXiv:2501.03575. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
AssemblyHands: Towards egocentric activity understanding via 3D hand pose esti- mation
Takehiko Ohkawa, Kun He, Fadime Sener, Tomas Hodan, Luan Tran, and Cem Keskin. AssemblyHands: Towards egocentric activity understanding via 3D hand pose esti- mation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. arXiv:2304.12301. 1
-
[40]
Sora: Creating video from text
OpenAI. Sora: Creating video from text. Technical report, OpenAI, 2024. 1, 3
2024
-
[41]
Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019. 4, 5, 1
2019
-
[42]
Reconstruct- ing hands in 3D with transformers
Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstruct- ing hands in 3D with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2312.05251. 1, 4
-
[43]
WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild
Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, and Stefanos Zafeiriou. WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), 2025. arXiv:2409.12259. 1, 4
-
[44]
Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bod- ies together.ACM Transactions on Graphics, 36(6):246:1– 246:17, 2017. 4
2017
-
[45]
SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J. Liang, Alexander Sax, Hao Tang, Weiyao Wang, 11 Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Ji- awei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Doll ´ar, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. SAM 3D: 3Dfy anything in images.arXi...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Human4DiT: 360-degree human video generation with 4D diffusion transformer
Ruizhi Shao, Youxin Pang, et al. Human4DiT: 360-degree human video generation with 4D diffusion transformer. ACM Transactions on Graphics (SIGGRAPH Asia), 2024. arXiv:2405.17405. 3
-
[47]
Free-form motion con- trol: Controlling the 6D poses of camera and objects in video generation
Xincheng Shuai, Henghui Ding, Zhenyuan Qin, Hao Luo, Xingjun Ma, and Dacheng Tao. Free-form motion con- trol: Controlling the 6D poses of camera and objects in video generation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), 2025. arXiv:2501.01425. 3, 9
-
[48]
Black, and Dim- itrios Tzionas
Omid Taheri, Nima Ghorbani, Michael J. Black, and Dim- itrios Tzionas. GRAB: A dataset of whole-body human grasping of objects. InProceedings of the European Confer- ence on Computer Vision (ECCV), 2020. arXiv:2008.11200. 1, 4
-
[49]
PlayerOne: Egocentric world simulator
Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, and Hengshuang Zhao. PlayerOne: Egocentric world simulator. arXiv preprint arXiv:2506.09995, 2025. 1, 3
-
[50]
Diffusion models are real-time game engines
Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. 1, 3
2025
-
[51]
Wan: Open and Advanced Large-Scale Video Generative Models
Wan Team. Wan: Open and advanced large-scale video gen- erative models.arXiv preprint arXiv:2503.20314, 2025. 1, 3, 4, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
VGGT: Visual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.11651. 1
-
[53]
Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2410.19115
-
[54]
DUSt3R: Geometric 3D vision made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),
-
[55]
Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Gu- osheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, and Xingang Wang. EgoVid-5M: A large-scale video-action dataset for egocentric video generation.arXiv preprint arXiv:2411.08380, 2024. 1, 5, 7
-
[56]
Hand2World: Autoregressive ego- centric interaction generation via free-space hand gestures
Yuxi Wang, Wenqi Ouyang, Tianyi Wei, Yi Dong, Zhiqi Shen, and Xingang Pan. Hand2World: Autoregressive ego- centric interaction generation via free-space hand gestures. arXiv preprint arXiv:2602.09600, 2026. 1, 3, 8
-
[57]
MotionCtrl: A unified and flexible motion controller for video genera- tion
Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying-Cong Yang. MotionCtrl: A unified and flexible motion controller for video genera- tion. InACM SIGGRAPH 2024 Conference Papers, 2024. arXiv:2312.03641. 3
-
[58]
DragAnything: Motion control for anything using entity representation
Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. DragAnything: Motion control for anything using entity representation. InProceedings of the European Conference on Computer Vision (ECCV), 2024. arXiv:2403.07420. 3
-
[59]
Sun, Ashley Neall, Tong Wu, Shengqu Cai, and Gordon Wetzstein
Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, and Gordon Wetzstein. Generated reality: Human- centric world simulation using interactive video gener- ation with hand and camera control.arXiv preprint arXiv:2602.18422, 2026. 1, 3, 8
-
[60]
CogVideoX: Text-to-video diffusion models with an expert transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. InProceedings of the International Conference on Learning Representations (ICLR), 2025. 1, 3
2025
-
[61]
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. DragNUW A: Fine-grained control in video generation by integrating text, image, and trajectory.arXiv preprint arXiv:2308.08089, 2023. arXiv:2308.08089. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[62]
Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fr ´edo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching dis- tillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2311.18828. 2
-
[63]
Dyn- HaMR: Recovering 4D interacting hand motion from a dy- namic camera
Zhengdi Yu, Stefanos Zafeiriou, and Tolga Birdal. Dyn- HaMR: Recovering 4D interacting hand motion from a dy- namic camera. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2412.12861. 1, 4
-
[64]
OakInk2: A dataset of bimanual hands-object manipulation in complex task completion
Xinyu Zhan, Lixin Yang, Yifei Zhao, Kangrui Mao, Han- lin Xu, Zenan Lin, Kailin Li, and Cewu Lu. OakInk2: A dataset of bimanual hands-object manipulation in complex task completion. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),
- [65]
-
[66]
Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
Chenyangguang Zhang, Botao Ye, Boqi Chen, Alexandros Delitzas, Fangjinhua Wang, Marc Pollefeys, and Xi Wang. Controllable egocentric video generation via occlusion- aware sparse 3D hand joints. InProceedings of the Eu- ropean Conference on Computer Vision (ECCV), 2026. arXiv:2603.11755. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[67]
HaWoR: World-space hand mo- tion reconstruction from egocentric videos
Jinglei Zhang, Jiankang Deng, Chao Ma, and Rolan- dos Alexandros Potamias. HaWoR: World-space hand mo- tion reconstruction from egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 1805–1815, 2025. 1, 4, 5, 8
2025
-
[68]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023. 6 12
2023
-
[69]
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018. arXiv:1801.03924. 8
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[70]
Tora: Trajectory-oriented diffusion transformer for video gener- ation
Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video gener- ation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2407.21705. 3
-
[71]
Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongx- uan Li, and Jun Zhu. Causal forcing: Autoregressive diffu- sion distillation done right for high-quality real-time inter- active video generation.arXiv preprint arXiv:2602.02214,
work page internal anchor Pith review Pith/arXiv arXiv
-
[72]
Champ: Controllable and consistent human image anima- tion with 3D parametric guidance
Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Qingkun Su, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image anima- tion with 3D parametric guidance. InProceedings of the European Conference on Computer Vision (ECCV), 2024. arXiv:2403.14781. 3 13 HandsOnWorld: Unconstrained Egocentric Video Generation with Ca...
-
[73]
We setλ data = 10,λ pose = 5×10 −3, andλ shape = 3×10 −2
The full objective L=λ data Lhead +L hand +λ poseLpose +λ shapeLshape (D) is jointly minimized over all per-frame variables with Adam, holding the SMPL model and the VPoser decoder fixed. We setλ data = 10,λ pose = 5×10 −3, andλ shape = 3×10 −2. Gap-filling threshold.Before the linear-interpolation step that fills frames lacking valid detections (Sec. 4.3...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.