pith. machine review for the scientific record.

arxiv: 2605.07642 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

Daehee Park, Hyeondong Kim, Jaeyoung Choi, Yujin Kim

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:08 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: egocentric hand pose forecasting · multimodal foundation model · vision-language-action · 3D hand pose · ego-motion robustness · action prediction · language-conditioned forecasting

The pith

EggHand forecasts future 3D hand poses from egocentric video by combining vision-language-action decoding with viewpoint-aware encoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

EggHand is introduced to forecast future 3D hand pose sequences from egocentric video. This is important for applications that require anticipating human hand actions, such as AR/VR assistance and human-robot interaction. The framework overcomes challenges from complex intent, dexterous articulations, and drastic viewpoint shifts by using a multimodal approach. It pairs an action decoder that models hand motion dynamics with an egocentric video-text encoder that adds contextual awareness from first-person data.

Core claim

EggHand is a multimodal foundation model for egocentric hand pose forecasting that couples an action decoder from a Vision-Language-Action model with an egocentric video-text encoder. The decoder captures structured temporal dynamics of hand motion while the encoder provides viewpoint-aware contextual information from large-scale first-person video, enabling joint reasoning over motion, context, and intent without body pose or external tracking.
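To make the coupling concrete, the sketch below shows one plausible way a pretrained egocentric video-text encoder and a VLA-style action decoder could be wired together for this task. It is a minimal reading of the abstract, not the authors' implementation: the module names, feature dimensions, 21-joint hand layout, and forecast horizon are all illustrative assumptions.

```python
# Hypothetical sketch of the encoder-decoder coupling described above.
# All module names, dimensions, and the 21-joint hand layout are assumptions
# made for illustration; the paper's actual architecture may differ.
import torch
import torch.nn as nn


class EgoHandForecaster(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=512, num_joints=21, horizon=30):
        super().__init__()
        self.num_joints = num_joints
        # Projections standing in for the pretrained egocentric video-text encoder outputs.
        self.video_proj = nn.Linear(feat_dim, hidden_dim)   # per-frame visual features
        self.text_proj = nn.Linear(feat_dim, hidden_dim)    # task-prompt embedding
        # Stand-in for the VLA action decoder that models hand motion dynamics.
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.action_decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.future_queries = nn.Parameter(torch.zeros(horizon, hidden_dim))  # one query per future frame
        self.pose_head = nn.Linear(hidden_dim, 2 * num_joints * 3)            # xyz for both hands

    def forward(self, video_feats, text_feat):
        # video_feats: (B, T_obs, feat_dim) features of the observed egocentric frames
        # text_feat:   (B, feat_dim) embedding of a language task prompt
        ctx = torch.cat(
            [self.video_proj(video_feats), self.text_proj(text_feat).unsqueeze(1)], dim=1
        )
        queries = self.future_queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        h = self.action_decoder(queries, ctx)  # future-frame queries attend to multimodal context
        b, t, _ = h.shape
        return self.pose_head(h).view(b, t, 2, self.num_joints, 3)  # (B, horizon, hands, joints, xyz)
```

Under this reading, the language-controllability claim amounts to swapping the prompt embedding (say, "repair the bike" versus "play piano") while holding the visual context fixed.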

What carries the argument

The integration of a Vision-Language-Action action decoder for temporal hand motion dynamics and an egocentric video-text encoder for viewpoint-aware context, allowing unified multimodal reasoning for pose forecasting.

If this is right

  • It achieves state-of-the-art forecasting accuracy on the EgoExo4D dataset.
  • The predictions remain robust even under severe ego-motion causing drastic viewpoint changes.
  • The model supports controllable forecasting by accepting language-based task prompts to guide the predictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coupling could extend to forecasting full-body movements or object manipulations from egocentric views.
  • Language controllability suggests direct use in interactive systems where users describe intended actions in text.
  • Lower reliance on external trackers could simplify hardware needs for real-time embodied AI deployments.

Load-bearing premise

The egocentric video-text encoder alone supplies enough viewpoint-aware information to handle ego-motion challenges without using body pose or external tracking.

What would settle it

A direct comparison on videos with more extreme head movements, or on unseen action types, would settle it: if the model loses its accuracy edge there against baselines that rely on body tracking, the robustness and independence claims are disproved.
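One way to run that test, assuming per-clip forecasting errors (e.g. an MPJPE-style metric) are available for EggHand and for a body-tracking baseline, and that each clip can be scored for ego-motion severity, is sketched below; the quantile split and variable names are illustrative choices, not the paper's protocol.

```python
# Hypothetical stratified comparison: does the accuracy edge survive severe ego-motion?
# ego_motion, err_egghand, err_baseline are assumed per-clip arrays; the 90th-percentile
# split and MPJPE-style errors are illustrative, not taken from the paper.
import numpy as np


def stratified_gap(ego_motion, err_egghand, err_baseline, quantile=0.9):
    """Return the EggHand-vs-baseline error gap on low- and high-ego-motion clips."""
    ego_motion = np.asarray(ego_motion)
    thresh = np.quantile(ego_motion, quantile)
    high = ego_motion >= thresh
    gap = np.asarray(err_baseline) - np.asarray(err_egghand)  # positive = EggHand better
    return {
        "low_ego_motion_gap": float(gap[~high].mean()),
        "high_ego_motion_gap": float(gap[high].mean()),
    }

# If the high-ego-motion gap collapses toward zero (or goes negative) while the
# low-ego-motion gap stays positive, the robustness claim is in trouble.
```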

Figures

Figures reproduced from arXiv: 2605.07642 by Daehee Park, Hyeondong Kim, Jaeyoung Choi, Yujin Kim.

Figure 1. Overview of EggHand, the proposed framework for …
Figure 2. Framework of the proposed VLA architecture. …
Figure 3. Qualitative 2D projections of future 3D hand-pose forecasting on EgoExo4D. Green: ground truth; red: our predictions. Left: observation window; right: forecasted future frames. Upper: piano playing, requiring fine-grained bimanual finger articulation on a structured object. Middle: COVID-19 rapid antigen test, involving tightly coupled bimanual hand-object interaction. Lower: bike repair with sparse observ…
Figure 4. Qualitative ablation on multimodal inputs. EggHand forecasts on a COVID-19 test kit manipulation sequence under four conditions: All Modality, Clean Vision + Dummy Text, Noisy Vision + Clean Text, and Noisy Vision + Dummy Text. Dummy text is randomized from the EgoExo4D task vocabulary; noisy vision replaces frames with Gaussian noise. Green: ground truth; red: predictions. COVID-19 test kit manipulation (…
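The corruption protocol the Figure 4 caption describes (task text randomized from the EgoExo4D task vocabulary, observed frames replaced with Gaussian noise) could be reproduced roughly as follows; the tensor shape and the placeholder vocabulary are assumptions made for illustration.

```python
# Sketch of the Figure-4-style input corruptions: randomized "dummy" task text and
# Gaussian-noise frames. Shapes and the task vocabulary are illustrative placeholders.
import random
import torch

TASK_VOCAB = ["playing piano", "covid-19 rapid antigen test", "bike repair"]  # placeholder vocabulary


def dummy_text(_original_prompt: str) -> str:
    """Replace the real task prompt with a random entry from the task vocabulary."""
    return random.choice(TASK_VOCAB)


def noisy_vision(frames: torch.Tensor) -> torch.Tensor:
    """Replace observed frames (T, C, H, W) with zero-mean, unit-variance Gaussian noise."""
    return torch.randn_like(frames)
```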
Original abstract

Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains a highly challenging problem because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent, without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces EggHand, a multimodal foundation model for forecasting future 3D hand pose sequences from egocentric video. It couples a Vision-Language-Action (VLA) action decoder for temporal motion dynamics with an egocentric video-text encoder for viewpoint-aware context, claiming to overcome generic encoder brittleness under ego-motion without body pose or external tracking. Experiments on EgoExo4D are said to establish new state-of-the-art forecasting accuracy, robustness to severe ego-motion, and controllable prediction via language task prompts.

Significance. If the experimental claims hold after verification, the work would offer a concrete demonstration of unifying VLA-style semantic reasoning with dynamic hand motion modeling in egocentric settings. This could strengthen multimodal approaches for embodied applications such as AR/VR and human-robot interaction by reducing reliance on explicit body tracking.

major comments (2)
  1. [Abstract] Abstract: the central claim that the egocentric video-text encoder supplies sufficient viewpoint-aware context to overcome generic-encoder brittleness under ego-motion (without body pose or external tracking) is load-bearing for both the SOTA forecasting result and the robustness statement. No ablation is indicated that holds the VLA decoder fixed while swapping only the visual encoder on high-ego-motion subsets of EgoExo4D; gains could instead arise from decoder architecture, pretraining scale, or dataset biases.
  2. [Experiments] Experiments section (implied by abstract claims): the abstract asserts SOTA results and robustness but supplies no quantitative numbers, error bars, ablation details, or dataset splits. The full results tables and methods must be checked to confirm whether the data actually support the stated improvements over prior methods on EgoExo4D.
minor comments (1)
  1. [Abstract] Abstract: the project page URL is given but the manuscript should explicitly state whether code, pretrained weights, or additional evaluation details are released to support reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, providing clarifications on our experimental design and indicating revisions where they will strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the egocentric video-text encoder supplies sufficient viewpoint-aware context to overcome generic-encoder brittleness under ego-motion (without body pose or external tracking) is load-bearing for both the SOTA forecasting result and the robustness statement. No ablation is indicated that holds the VLA decoder fixed while swapping only the visual encoder on high-ego-motion subsets of EgoExo4D; gains could instead arise from decoder architecture, pretraining scale, or dataset biases.

    Authors: We acknowledge that the manuscript does not present a dedicated ablation that holds the VLA decoder fixed while swapping only the visual encoder on high-ego-motion subsets. To directly support the load-bearing claim and isolate the encoder's contribution to robustness, we will add this specific ablation study in the revised manuscript. revision: yes

  2. Referee: [Experiments] Experiments section (implied by abstract claims): the abstract asserts SOTA results and robustness but supplies no quantitative numbers, error bars, ablation details, or dataset splits. The full results tables and methods must be checked to confirm whether the data actually support the stated improvements over prior methods on EgoExo4D.

    Authors: Abstracts are kept concise by design and do not typically include numerical results. The full manuscript contains detailed results tables in the Experiments section, including quantitative metrics on EgoExo4D, comparisons to prior methods, error bars, ablation studies, and dataset split information that substantiate the SOTA and robustness claims. No revision is required for this point. revision: no

Circularity Check

0 steps flagged

No circularity: claims rest on empirical SOTA results, not derivations or self-referential fits

full rationale

The paper introduces EggHand as a multimodal model combining a VLA action decoder with an egocentric video-text encoder, evaluated on EgoExo4D for hand pose forecasting. No equations, derivations, or first-principles predictions appear in the provided text. Central claims (SOTA accuracy, ego-motion robustness, language-controllable prediction) are supported by experimental comparisons to prior methods rather than any fitted parameter renamed as prediction, self-definitional loop, or load-bearing self-citation chain. The architecture description is a standard engineering unification of existing components; it does not reduce to its inputs by construction. Absence of ablations is a separate evidence-strength issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no information is given on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.0 · 5530 in / 1159 out tokens · 32263 ms · 2026-05-11T02:08:03.092359+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
