EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting
Pith reviewed 2026-05-11 02:08 UTC · model grok-4.3
Recognition: 2 Lean theorem links
The pith
EggHand forecasts future 3D hand poses from egocentric video by combining vision-language-action decoding with viewpoint-aware encoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EggHand is a multimodal foundation model for egocentric hand pose forecasting that couples an action decoder from a Vision-Language-Action model with an egocentric video-text encoder. The decoder captures structured temporal dynamics of hand motion while the encoder provides viewpoint-aware contextual information from large-scale first-person video, enabling joint reasoning over motion, context, and intent without body pose or external tracking.
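The coupling described here can be sketched in miniature. Everything below is a hypothetical stand-in: the module internals, dimensions, and function names are illustrative assumptions, since the abstract specifies only the interface (an encoder producing viewpoint-aware context, an action decoder rolling poses forward conditioned on it), not the implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not specify these values.
N_JOINTS, D_FRAME, D_CTX, T_OBS, T_PRED = 21, 128, 64, 8, 4

W_ENC = rng.standard_normal((D_FRAME, D_CTX)) * 0.01
W_DEC = rng.standard_normal((D_CTX, N_JOINTS * 3)) * 0.01

def egocentric_encoder(frames):
    """Stand-in for the egocentric video-text encoder: pools T_OBS frame
    features into one viewpoint-aware context vector."""
    return frames.mean(axis=0) @ W_ENC            # (D_CTX,)

def vla_action_decoder(context, last_pose, horizon):
    """Stand-in for the VLA action decoder: rolls the last observed pose
    forward, conditioning each step's offset on the context embedding."""
    poses, pose = [], last_pose
    for _ in range(horizon):
        delta = (context @ W_DEC).reshape(N_JOINTS, 3)
        pose = pose + delta
        poses.append(pose)
    return np.stack(poses)                        # (horizon, N_JOINTS, 3)

frames = rng.standard_normal((T_OBS, D_FRAME))        # fake frame features
observed = rng.standard_normal((T_OBS, N_JOINTS, 3))  # fake observed 3D poses

context = egocentric_encoder(frames)
forecast = vla_action_decoder(context, observed[-1], T_PRED)
print(forecast.shape)  # (4, 21, 3)
```

The point of the sketch is the data flow, not the modules: the decoder never sees raw frames, only the encoder's context vector, which is what makes the encoder's viewpoint-awareness load-bearing.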
What carries the argument
The integration of a Vision-Language-Action action decoder for temporal hand motion dynamics and an egocentric video-text encoder for viewpoint-aware context, allowing unified multimodal reasoning for pose forecasting.
If this is right
- It achieves state-of-the-art forecasting accuracy on the EgoExo4D dataset.
- The predictions remain robust even under severe ego-motion causing drastic viewpoint changes.
- The model supports controllable forecasting by accepting language-based task prompts to guide the predictions.
Where Pith is reading between the lines
- The same coupling could extend to forecasting full-body movements or object manipulations from egocentric views.
- Language controllability suggests direct use in interactive systems where users describe intended actions in text.
- Lower reliance on external trackers could simplify hardware needs for real-time embodied AI deployments.
Load-bearing premise
The egocentric video-text encoder alone supplies enough viewpoint-aware information to handle ego-motion challenges without using body pose or external tracking.
What would settle it
A direct comparison on videos with more extreme head movements, or on unseen action types, would settle it: if the model there loses its accuracy edge over baselines that rely on body tracking, the robustness and independence claims fail.
Original abstract
Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains a highly challenging problem because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent, without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EggHand, a multimodal foundation model for forecasting future 3D hand pose sequences from egocentric video. It couples a Vision-Language-Action (VLA) action decoder for temporal motion dynamics with an egocentric video-text encoder for viewpoint-aware context, claiming to overcome generic encoder brittleness under ego-motion without body pose or external tracking. Experiments on EgoExo4D are said to establish new state-of-the-art forecasting accuracy, robustness to severe ego-motion, and controllable prediction via language task prompts.
Significance. If the experimental claims hold after verification, the work would offer a concrete demonstration of unifying VLA-style semantic reasoning with dynamic hand motion modeling in egocentric settings. This could strengthen multimodal approaches for embodied applications such as AR/VR and human-robot interaction by reducing reliance on explicit body tracking.
major comments (2)
- [Abstract] The central claim that the egocentric video-text encoder supplies sufficient viewpoint-aware context to overcome generic-encoder brittleness under ego-motion (without body pose or external tracking) is load-bearing for both the SOTA forecasting result and the robustness statement. No ablation is indicated that holds the VLA decoder fixed while swapping only the visual encoder on high-ego-motion subsets of EgoExo4D; gains could instead arise from decoder architecture, pretraining scale, or dataset biases.
- [Experiments] The abstract asserts SOTA results and robustness but supplies no quantitative numbers, error bars, ablation details, or dataset splits. The full results tables and methods must be checked to confirm whether the data support the stated improvements over prior methods on EgoExo4D.
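The quantitative evidence the referee asks for (a pose-error metric with error bars) can be sketched as follows. The metric shown, MPJPE with a percentile bootstrap confidence interval, is a standard choice for 3D pose tasks, not necessarily the paper's protocol; all data here are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def mpjpe(pred, gt):
    """Mean per-joint position error per sequence: Euclidean distance
    between predicted and ground-truth joints, averaged over time and joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=(1, 2))   # (N,)

def bootstrap_ci(errors, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval over per-sequence errors."""
    means = np.array([rng.choice(errors, size=errors.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return errors.mean(), lo, hi

# Synthetic stand-ins: 50 sequences x 4 future frames x 21 joints x 3D.
gt = rng.standard_normal((50, 4, 21, 3))
pred = gt + 0.02 * rng.standard_normal(gt.shape)

mean_err, lo, hi = bootstrap_ci(mpjpe(pred, gt))
print(f"MPJPE = {mean_err:.4f}  (95% CI [{lo:.4f}, {hi:.4f}])")
```

Reporting the interval alongside the mean is what would let a reader judge whether a claimed SOTA margin on EgoExo4D exceeds sampling noise.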
minor comments (1)
- [Abstract] The project page URL is given, but the manuscript should explicitly state whether code, pretrained weights, or additional evaluation details are released to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, providing clarifications on our experimental design and indicating revisions where they will strengthen the paper.
Point-by-point responses
- Referee: [Abstract] The central claim that the egocentric video-text encoder supplies sufficient viewpoint-aware context to overcome generic-encoder brittleness under ego-motion (without body pose or external tracking) is load-bearing for both the SOTA forecasting result and the robustness statement. No ablation is indicated that holds the VLA decoder fixed while swapping only the visual encoder on high-ego-motion subsets of EgoExo4D; gains could instead arise from decoder architecture, pretraining scale, or dataset biases.
Authors: We acknowledge that the manuscript does not present a dedicated ablation that holds the VLA decoder fixed while swapping only the visual encoder on high-ego-motion subsets. To directly support the load-bearing claim and isolate the encoder's contribution to robustness, we will add this specific ablation study in the revised manuscript. revision: yes
- Referee: [Experiments] The abstract asserts SOTA results and robustness but supplies no quantitative numbers, error bars, ablation details, or dataset splits. The full results tables and methods must be checked to confirm whether the data support the stated improvements over prior methods on EgoExo4D.
Authors: Abstracts are kept concise by design and do not typically include numerical results. The full manuscript contains detailed results tables in the Experiments section, including quantitative metrics on EgoExo4D, comparisons to prior methods, error bars, ablation studies, and dataset split information that substantiate the SOTA and robustness claims. No revision is required for this point. revision: no
Circularity Check
No circularity: claims rest on empirical SOTA results, not derivations or self-referential fits
full rationale
The paper introduces EggHand as a multimodal model combining a VLA action decoder with an egocentric video-text encoder, evaluated on EgoExo4D for hand pose forecasting. No equations, derivations, or first-principles predictions appear in the provided text. Central claims (SOTA accuracy, ego-motion robustness, language-controllable prediction) are supported by experimental comparisons to prior methods rather than any fitted parameter renamed as prediction, self-definitional loop, or load-bearing self-citation chain. The architecture description is a standard engineering unification of existing components; it does not reduce to its inputs by construction. Absence of ablations is a separate evidence-strength issue, not circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (match unclear): "couples an action decoder from a Vision-Language-Action (VLA) model... with an egocentric video-text encoder... geometry-aware training objective... L_total = λ_abs L_abs + λ_rel L_rel + λ_pair L_pair"
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking (match unclear): "remains robust under severe ego-motion... without relying on body pose or external tracking"
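The training objective quoted above gives only the form L_total = λ_abs L_abs + λ_rel L_rel + λ_pair L_pair. A minimal sketch of such a weighted geometry loss follows; the weight values and the definitions of the three terms (absolute joint error, wrist-relative error, pairwise-distance error) are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights: the quoted objective specifies only the form
# L_total = λ_abs·L_abs + λ_rel·L_rel + λ_pair·L_pair, not the values.
LAM_ABS, LAM_REL, LAM_PAIR = 1.0, 0.5, 0.1

def loss_terms(pred, gt):
    """pred, gt: (J, 3) joint positions for one frame. Term definitions
    below are illustrative guesses at what each loss might measure."""
    l_abs = np.mean((pred - gt) ** 2)                          # absolute positions
    l_rel = np.mean(((pred - pred[:1]) - (gt - gt[:1])) ** 2)  # wrist-relative
    d_pred = np.linalg.norm(pred[:, None] - pred[None], axis=-1)
    d_gt = np.linalg.norm(gt[:, None] - gt[None], axis=-1)
    l_pair = np.mean((d_pred - d_gt) ** 2)                     # pairwise distances
    return l_abs, l_rel, l_pair

pred = rng.standard_normal((21, 3))
gt = pred + 0.01 * rng.standard_normal((21, 3))

l_abs, l_rel, l_pair = loss_terms(pred, gt)
l_total = LAM_ABS * l_abs + LAM_REL * l_rel + LAM_PAIR * l_pair
```

The design point of such a combination is that the relative and pairwise terms are invariant to global translation, which complements the absolute term under ego-motion.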
discussion (0)