pith. machine review for the scientific record.

arxiv: 2605.07642 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

Daehee Park, Hyeondong Kim, Jaeyoung Choi, Yujin Kim

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:08 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: egocentric hand pose forecasting · multimodal foundation model · vision-language-action · 3D hand pose · ego-motion robustness · action prediction · language-conditioned forecasting

The pith

EggHand forecasts future 3D hand poses from egocentric video by combining vision-language-action decoding with viewpoint-aware encoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

EggHand is introduced to forecast future 3D hand pose sequences from egocentric video. This is important for applications that require anticipating human hand actions, such as AR/VR assistance and human-robot interaction. The framework overcomes challenges from complex intent, dexterous articulations, and drastic viewpoint shifts by using a multimodal approach. It pairs an action decoder that models hand motion dynamics with an egocentric video-text encoder that adds contextual awareness from first-person data.

Core claim

EggHand is a multimodal foundation model for egocentric hand pose forecasting that couples an action decoder from a Vision-Language-Action model with an egocentric video-text encoder. The decoder captures structured temporal dynamics of hand motion while the encoder provides viewpoint-aware contextual information from large-scale first-person video, enabling joint reasoning over motion, context, and intent without body pose or external tracking.
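To make the coupling concrete, the sketch below shows one plausible way a pretrained egocentric video-text encoder and a VLA-style action decoder could be wired together for this task. It is a minimal reading of the abstract, not the authors' implementation: the module names, feature dimensions, 21-joint hand layout, and forecast horizon are all illustrative assumptions.

```python
# Hypothetical sketch of the encoder-decoder coupling described above.
# All module names, dimensions, and the 21-joint hand layout are assumptions
# made for illustration; the paper's actual architecture may differ.
import torch
import torch.nn as nn


class EgoHandForecaster(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=512, num_joints=21, horizon=30):
        super().__init__()
        self.num_joints = num_joints
        # Projections standing in for the pretrained egocentric video-text encoder outputs.
        self.video_proj = nn.Linear(feat_dim, hidden_dim)   # per-frame visual features
        self.text_proj = nn.Linear(feat_dim, hidden_dim)    # task-prompt embedding
        # Stand-in for the VLA action decoder that models hand motion dynamics.
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.action_decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.future_queries = nn.Parameter(torch.zeros(horizon, hidden_dim))  # one query per future frame
        self.pose_head = nn.Linear(hidden_dim, 2 * num_joints * 3)            # xyz for both hands

    def forward(self, video_feats, text_feat):
        # video_feats: (B, T_obs, feat_dim) features of the observed egocentric frames
        # text_feat:   (B, feat_dim) embedding of a language task prompt
        ctx = torch.cat(
            [self.video_proj(video_feats), self.text_proj(text_feat).unsqueeze(1)], dim=1
        )
        queries = self.future_queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        h = self.action_decoder(queries, ctx)  # future-frame queries attend to multimodal context
        b, t, _ = h.shape
        return self.pose_head(h).view(b, t, 2, self.num_joints, 3)  # (B, horizon, hands, joints, xyz)
```

Under this reading, the language-controllability claim amounts to swapping the prompt embedding (say, "repair the bike" versus "play piano") while holding the visual context fixed.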

What carries the argument

The integration of a Vision-Language-Action action decoder for temporal hand motion dynamics and an egocentric video-text encoder for viewpoint-aware context, allowing unified multimodal reasoning for pose forecasting.

If this is right

  • It achieves state-of-the-art forecasting accuracy on the EgoExo4D dataset.
  • The predictions remain robust even under severe ego-motion causing drastic viewpoint changes.
  • The model supports controllable forecasting by accepting language-based task prompts to guide the predictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coupling could extend to forecasting full-body movements or object manipulations from egocentric views.
  • Language controllability suggests direct use in interactive systems where users describe intended actions in text.
  • Lower reliance on external trackers could simplify hardware needs for real-time embodied AI deployments.

Load-bearing premise

The egocentric video-text encoder alone supplies enough viewpoint-aware information to handle ego-motion challenges without using body pose or external tracking.

What would settle it

A direct comparison on videos with more extreme head movements, or on unseen action types, would settle it: if the model loses its accuracy edge there against baselines that rely on body tracking, the robustness and independence claims are disproved.
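One way to run that test, assuming per-clip forecasting errors (e.g. an MPJPE-style metric) are available for EggHand and for a body-tracking baseline, and that each clip can be scored for ego-motion severity, is sketched below; the quantile split and variable names are illustrative choices, not the paper's protocol.

```python
# Hypothetical stratified comparison: does the accuracy edge survive severe ego-motion?
# ego_motion, err_egghand, err_baseline are assumed per-clip arrays; the 90th-percentile
# split and MPJPE-style errors are illustrative, not taken from the paper.
import numpy as np


def stratified_gap(ego_motion, err_egghand, err_baseline, quantile=0.9):
    """Return the EggHand-vs-baseline error gap on low- and high-ego-motion clips."""
    ego_motion = np.asarray(ego_motion)
    thresh = np.quantile(ego_motion, quantile)
    high = ego_motion >= thresh
    gap = np.asarray(err_baseline) - np.asarray(err_egghand)  # positive = EggHand better
    return {
        "low_ego_motion_gap": float(gap[~high].mean()),
        "high_ego_motion_gap": float(gap[high].mean()),
    }

# If the high-ego-motion gap collapses toward zero (or goes negative) while the
# low-ego-motion gap stays positive, the robustness claim is in trouble.
```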

Figures

Figures reproduced from arXiv: 2605.07642 by Daehee Park, Hyeondong Kim, Jaeyoung Choi, Yujin Kim.

Figure 1. Overview of EggHand, the proposed framework for …
Figure 2. Framework of the proposed VLA architecture. …
Figure 3. Qualitative 2D projections of future 3D hand-pose forecasting on EgoExo4D. Green: ground truth; red: our predictions. Left: observation window; right: forecasted future frames. Upper: piano playing, requiring fine-grained bimanual finger articulation on a structured object. Middle: COVID-19 rapid antigen test, involving tightly coupled bimanual hand-object interaction. Lower: bike repair with sparse observ…
Figure 4. Qualitative ablation on multimodal inputs. EggHand forecasts on a COVID-19 test kit manipulation sequence under four conditions: All Modality, Clean Vision + Dummy Text, Noisy Vision + Clean Text, and Noisy Vision + Dummy Text. Dummy text is randomized from the EgoExo4D task vocabulary; noisy vision replaces frames with Gaussian noise. Green: ground truth; red: predictions. COVID-19 test kit manipulation (…
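The corruption protocol the Figure 4 caption describes (task text randomized from the EgoExo4D task vocabulary, observed frames replaced with Gaussian noise) could be reproduced roughly as follows; the tensor shape and the placeholder vocabulary are assumptions made for illustration.

```python
# Sketch of the Figure-4-style input corruptions: randomized "dummy" task text and
# Gaussian-noise frames. Shapes and the task vocabulary are illustrative placeholders.
import random
import torch

TASK_VOCAB = ["playing piano", "covid-19 rapid antigen test", "bike repair"]  # placeholder vocabulary


def dummy_text(_original_prompt: str) -> str:
    """Replace the real task prompt with a random entry from the task vocabulary."""
    return random.choice(TASK_VOCAB)


def noisy_vision(frames: torch.Tensor) -> torch.Tensor:
    """Replace observed frames (T, C, H, W) with zero-mean, unit-variance Gaussian noise."""
    return torch.randn_like(frames)
```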
Original abstract

Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains a highly challenging problem because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent, without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces EggHand, a multimodal foundation model for forecasting future 3D hand pose sequences from egocentric video. It couples a Vision-Language-Action (VLA) action decoder for temporal motion dynamics with an egocentric video-text encoder for viewpoint-aware context, claiming to overcome generic encoder brittleness under ego-motion without body pose or external tracking. Experiments on EgoExo4D are said to establish new state-of-the-art forecasting accuracy, robustness to severe ego-motion, and controllable prediction via language task prompts.

Significance. If the experimental claims hold after verification, the work would offer a concrete demonstration of unifying VLA-style semantic reasoning with dynamic hand motion modeling in egocentric settings. This could strengthen multimodal approaches for embodied applications such as AR/VR and human-robot interaction by reducing reliance on explicit body tracking.

major comments (2)
  1. [Abstract] Abstract: the central claim that the egocentric video-text encoder supplies sufficient viewpoint-aware context to overcome generic-encoder brittleness under ego-motion (without body pose or external tracking) is load-bearing for both the SOTA forecasting result and the robustness statement. No ablation is indicated that holds the VLA decoder fixed while swapping only the visual encoder on high-ego-motion subsets of EgoExo4D; gains could instead arise from decoder architecture, pretraining scale, or dataset biases.
  2. [Experiments] Experiments section (implied by abstract claims): the abstract asserts SOTA results and robustness but supplies no quantitative numbers, error bars, ablation details, or dataset splits. The full results tables and methods must be checked to confirm whether the data actually support the stated improvements over prior methods on EgoExo4D.
minor comments (1)
  1. [Abstract] Abstract: the project page URL is given but the manuscript should explicitly state whether code, pretrained weights, or additional evaluation details are released to support reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, providing clarifications on our experimental design and indicating revisions where they will strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the egocentric video-text encoder supplies sufficient viewpoint-aware context to overcome generic-encoder brittleness under ego-motion (without body pose or external tracking) is load-bearing for both the SOTA forecasting result and the robustness statement. No ablation is indicated that holds the VLA decoder fixed while swapping only the visual encoder on high-ego-motion subsets of EgoExo4D; gains could instead arise from decoder architecture, pretraining scale, or dataset biases.

    Authors: We acknowledge that the manuscript does not present a dedicated ablation that holds the VLA decoder fixed while swapping only the visual encoder on high-ego-motion subsets. To directly support the load-bearing claim and isolate the encoder's contribution to robustness, we will add this specific ablation study in the revised manuscript. revision: yes

  2. Referee: [Experiments] Experiments section (implied by abstract claims): the abstract asserts SOTA results and robustness but supplies no quantitative numbers, error bars, ablation details, or dataset splits. The full results tables and methods must be checked to confirm whether the data actually support the stated improvements over prior methods on EgoExo4D.

    Authors: Abstracts are kept concise by design and do not typically include numerical results. The full manuscript contains detailed results tables in the Experiments section, including quantitative metrics on EgoExo4D, comparisons to prior methods, error bars, ablation studies, and dataset split information that substantiate the SOTA and robustness claims. No revision is required for this point. revision: no

Circularity Check

0 steps flagged

No circularity: claims rest on empirical SOTA results, not derivations or self-referential fits

full rationale

The paper introduces EggHand as a multimodal model combining a VLA action decoder with an egocentric video-text encoder, evaluated on EgoExo4D for hand pose forecasting. No equations, derivations, or first-principles predictions appear in the provided text. Central claims (SOTA accuracy, ego-motion robustness, language-controllable prediction) are supported by experimental comparisons to prior methods rather than any fitted parameter renamed as prediction, self-definitional loop, or load-bearing self-citation chain. The architecture description is a standard engineering unification of existing components; it does not reduce to its inputs by construction. Absence of ablations is a separate evidence-strength issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no information is given on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.0 · 5530 in / 1159 out tokens · 32263 ms · 2026-05-11T02:08:03.092359+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
