Recognition: unknown
EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation
Pith reviewed 2026-05-10 03:35 UTC · model grok-4.3
The pith
Decoupling semantic reasoning into discrete motion primitives before diffusion synthesis resolves gradient conflicts in egocentric motion generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EgoMotion is a hierarchical generative framework that decouples cognitive reasoning from kinematic modeling. In the first stage, a vision-language model projects egocentric visual observations and language instructions into a space of discrete motion primitives, forcing acquisition of goal-consistent representations. In the second stage, these primitives condition a diffusion-based generator that performs iterative denoising in continuous latent space to synthesize physically plausible and temporally coherent 3D motion trajectories.
What carries the argument
The two-stage hierarchy that maps multimodal inputs to discrete motion primitives via a vision-language model, then feeds those as conditioning to a diffusion motion generator.
If this is right
- State-of-the-art performance on egocentric vision-language motion generation tasks.
- Motion sequences that are semantically grounded in both language instructions and first-person visual observations.
- Kinematically superior trajectories compared to existing joint-optimization approaches.
- Avoidance of gradient conflicts that systematically degrade multimodal grounding and motion fidelity.
Where Pith is reading between the lines
- The discrete primitive representation could support easier adaptation to new environments or motion styles without full retraining.
- Similar staged decoupling might address optimization conflicts in other multimodal generation domains such as audio-conditioned video or text-to-3D synthesis.
- Deployment in real robotic systems could test whether the learned primitives transfer effectively from simulation to physical hardware.
- Varying the granularity of the discrete motion primitives offers a testable way to measure trade-offs between semantic fidelity and motion detail.
Load-bearing premise
Mapping multimodal inputs to discrete motion primitives reliably bridges the semantic gap and avoids information loss or new conflicts when conditioning the diffusion generator.
What would settle it
An end-to-end single-stage model that jointly optimizes the vision-language and diffusion components achieving equal or higher scores on semantic grounding and kinematic quality metrics than the two-stage EgoMotion on the same evaluation benchmarks.
Figures
read the original abstract
Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical \textit{reasoning-generation entanglement} challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework \textbf{EgoMotion}. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, A vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance, and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EgoMotion, a two-stage hierarchical framework for egocentric vision-language motion generation. The Cognitive Reasoning stage employs a vision-language model to project first-person visual observations and natural language instructions onto a space of discrete motion primitives, aiming to resolve semantic-kinematic grounding issues. The Motion Generation stage then uses these primitives as conditioning signals for a diffusion model to synthesize 3D human motion trajectories. The central claim is that this biologically-inspired decoupling avoids gradient conflicts inherent in joint optimization, yielding state-of-the-art performance with semantically grounded and kinematically superior outputs compared to prior approaches.
Significance. If the empirical claims hold, the work could meaningfully advance embodied AI by demonstrating a practical hierarchical solution to multimodal entanglement problems in motion synthesis. The explicit separation of high-level reasoning from low-level generation offers a template that might generalize to other vision-language-action tasks, particularly in dynamic first-person settings where continuous end-to-end models often struggle with fidelity.
major comments (2)
- [Abstract] Abstract: The assertion that EgoMotion 'achieves state-of-the-art performance' and resolves gradient conflicts is presented without any quantitative metrics, ablation studies, error analysis, or comparison tables. This absence is load-bearing for the central claim, as the superiority over existing approaches cannot be evaluated from the provided description alone.
- [Abstract] Cognitive Reasoning stage description: The claim that mapping to discrete motion primitives 'forces the VLM to acquire goal-consistent representations' and bridges the semantic gap without new conflicts or information loss is not supported by evidence that the discretization preserves fine-grained egocentric kinematic details (e.g., subtle velocity or viewpoint-specific pose variations). This step is load-bearing for the subsequent diffusion stage's claimed kinematic superiority.
minor comments (1)
- [Abstract] The phrase 'reasoning-generation entanglement' is introduced as a novel challenge without a formal definition, mathematical formulation, or citation to related work on gradient conflicts in multimodal generative models.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications drawn from the full manuscript and indicate the revisions made to strengthen the abstract.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion that EgoMotion 'achieves state-of-the-art performance' and resolves gradient conflicts is presented without any quantitative metrics, ablation studies, error analysis, or comparison tables. This absence is load-bearing for the central claim, as the superiority over existing approaches cannot be evaluated from the provided description alone.
Authors: The abstract is a concise summary of the work. The full manuscript contains the requested quantitative support, including comparison tables (Section 4.1), ablation studies on the hierarchical separation and gradient conflict mitigation (Section 4.3), and error analyses (Section 5). To address the referee's concern and render the abstract more self-contained, we have revised it to include key performance metrics and a brief reference to these supporting experiments. revision: yes
-
Referee: [Abstract] Cognitive Reasoning stage description: The claim that mapping to discrete motion primitives 'forces the VLM to acquire goal-consistent representations' and bridges the semantic gap without new conflicts or information loss is not supported by evidence that the discretization preserves fine-grained egocentric kinematic details (e.g., subtle velocity or viewpoint-specific pose variations). This step is load-bearing for the subsequent diffusion stage's claimed kinematic superiority.
Authors: We agree that explicit evidence for preservation of fine-grained kinematic details under discretization is important. The manuscript provides this through ablations in Section 4.3 that compare discrete primitives against continuous alternatives, reporting low reconstruction errors on velocity and egocentric pose variations along with improved downstream motion fidelity. We have revised the abstract's description of the Cognitive Reasoning stage to note this supporting evidence and the resulting kinematic advantages. revision: yes
Circularity Check
No circularity: standard hierarchical proposal with external evaluation
full rationale
The paper describes a two-stage architecture (Cognitive Reasoning via VLM to discrete primitives, followed by diffusion-based Motion Generation) to mitigate claimed gradient conflicts in egocentric VL motion synthesis. No equations, derivations, fitted parameters, or self-citations are invoked as load-bearing steps in the provided text. The SOTA performance claim rests on external evaluations rather than any reduction of outputs to inputs by construction, self-definition, or imported uniqueness theorems. This matches the default non-circular case for descriptive ML frameworks without mathematical self-reference.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption A vision-language model can be trained to project multimodal egocentric inputs into a structured space of discrete motion primitives that preserve goal-consistent semantics.
- domain assumption Iterative denoising in a continuous latent space conditioned on those primitives will produce physically plausible and temporally coherent 3D trajectories.
Reference graph
Works this paper leans on
-
[1]
Unpaired motion style transfer from video to animation.ACM Transactions on Graphics, 39(4):64–1, 2020
Kfir Aberman, Yijia Weng, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. Unpaired motion style transfer from video to animation.ACM Transactions on Graphics, 39(4):64–1, 2020
2020
-
[2]
Circle: Capture in rich contextual environments
Joao Pedro Araujo, Jiaman Li, Karthik Vetrivel, Rishi Agarwal, Jiajun Wu, Deepak Gopinath, Alexander William Clegg, and Karen Liu. Circle: Capture in rich contextual environments. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21211–21221, 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11
2023
-
[3]
Generating human motion in 3d scenes from text descriptions
Zhi Cen, Huaijin Pi, Sida Peng, Zehong Shen, Minghui Yang, Shuai Zhu, Hujun Bao, and Xiaowei Zhou. Generating human motion in 3d scenes from text descriptions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1855– 1866, 2024
2024
-
[4]
Context-aware human motion prediction
Enric Corona, Albert Pumarola, Guillem Alenya, and Francesc Moreno-Noguer. Context-aware human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6992–7001, 2020
2020
-
[5]
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologie, pages 4171–4186, 2019
2019
-
[6]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa De- hghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[7]
Duetgen: Music driven two-person dance generation via hierarchi- cal masked modeling
Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, and Chuan Guo. Duetgen: Music driven two-person dance generation via hierarchi- cal masked modeling. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025
2025
-
[8]
Egolifter: Open-world 3d segmentation for egocentric perception
Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, and Chris Sweeney. Egolifter: Open-world 3d segmentation for egocentric perception. InProceedings of the European Conference on Computer Vision, pages 382–400. Springer, 2024
2024
-
[9]
Momask: Generative masked modeling of 3d human motions
Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024
1900
-
[10]
Generating diverse and natural 3d human motions from text
Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022
2022
-
[11]
Action2motion: Conditioned generation of 3d human motions
Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. InProceedings of the ACM International Conference on Multimedia, pages 2021–2029, 2020
2021
-
[12]
Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3d human pose ambiguities with 3d scene constraints. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2282–2292, 2019
2019
-
[13]
Nemf: Neural motion fields for kinematic animation
Chengan He, Jun Saito, James Zachary, Holly Rushmeier, and Yi Zhou. Nemf: Neural motion fields for kinematic animation. Advances in Neural Information Processing Systems, 35:4244–4256, 2022
2022
-
[14]
Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, and Shiguang Shan. Motionverse: A unified multimodal framework for motion comprehension, generation and editing.arXiv preprint arXiv:2509.23635, 2025
-
[15]
Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023
Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023
2023
-
[16]
Autonomous character-scene interaction synthesis from text instruction
Nan Jiang, Zimo He, Zi Wang, Hongjie Li, Yixin Chen, Siyuan Huang, and Yixin Zhu. Autonomous character-scene interaction synthesis from text instruction. InProceedings of the ACM SIG- GRAPH Conference on Computer Graphics and Interactive Techniques Asia, pages 1–11, 2024
2024
-
[17]
Scaling up dynamic human-scene interaction modeling
Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan Huang. Scaling up dynamic human-scene interaction modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1737–1747, 2024
2024
-
[18]
Mas: Multi-view ancestral sampling for 3d motion generation using 2d diffusion
Roy Kapon, Guy Tevet, Daniel Cohen-Or, and Amit H Bermano. Mas: Multi-view ancestral sampling for 3d motion generation using 2d diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1965–1974, 2024
1965
-
[19]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[20]
Nifty: Neural object interaction fields for guided human motion synthesis
Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, and Leonidas Guibas. Nifty: Neural object interaction fields for guided human motion synthesis. 2023. arXiv preprint
2023
-
[21]
Autoregressive image generation using residual quanti- zation
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook- Shin Han. Autoregressive image generation using residual quanti- zation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022
2022
-
[22]
Locomotion-action-manipulation: Syn- thesizing human-scene interactions in complex 3d environments
Jiye Lee and Hanbyul Joo. Locomotion-action-manipulation: Syn- thesizing human-scene interactions in complex 3d environments. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9663–9674, 2023
2023
-
[23]
Karen Liu
Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, and C. Karen Liu. Controllable human-object interaction synthesis. InProceedings of the European Conference on Computer Vision, pages 54–72. Springer, 2024
2024
-
[24]
Ego-body pose estimation via ego-head pose estimation
Jiaman Li, Karen Liu, and Jiajun Wu. Ego-body pose estimation via ego-head pose estimation. InProceedings of the International Conference on Machine Learning, pages 17142–17151, 2023
2023
-
[25]
Intergen: Diffusion-based multi-human motion generation under complex interactions.International Journal of Computer Vision, 132(9):3463–3483, 2024
Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi-human motion generation under complex interactions.International Journal of Computer Vision, 132(9):3463–3483, 2024
2024
-
[26]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[27]
Investigating pose representations and motion contexts modeling for 3d motion prediction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):681– 697, 2022
Zhenguang Liu, Shuang Wu, Shuyuan Jin, Shouling Ji, Qi Liu, Shijian Lu, and Li Cheng. Investigating pose representations and motion contexts modeling for 3d motion prediction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):681– 697, 2022
2022
-
[28]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[29]
Humantomato: Text-aligned whole-body motion generation,
Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, and Heung-Yeung Shum. Humantomato: Text-aligned whole-body motion generation.arXiv preprint arXiv:2310.12978, 2023
-
[30]
Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, and Richard New- combe
Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, Kevin Bailey, David Soriano Fosas, C. Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, and Richard New- combe. Nymeria: A massive collection of multimodal egocentric daily motion in the wild. InProceedings of th...
2024
-
[31]
Ian Mason, Sebastian Starke, and Taku Komura. Real-time style modelling of human locomotion via feature-wise transformations and local motion phases.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1–18, 2022
2022
-
[32]
Lookout: Real-world humanoid egocentric navigation
Boxiao Pan, Adam W Harley, Francis Engelmann, C Karen Liu, and Leonidas J Guibas. Lookout: Real-world humanoid egocentric navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24977–24988, 2025
2025
-
[33]
Aria digital twin: A new benchmark dataset for egocentric 3d machine perception
Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20133– 20143, 2023
2023
-
[34]
Uniegomotion: A unified model for egocentric motion reconstruction, forecasting, and gen- eration
Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi, Kazuki Kozuka, Juan Carlos Niebles, and Ehsan Adeli. Uniegomotion: A unified model for egocentric motion reconstruction, forecasting, and gen- eration. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10318–10329, 2025
2025
-
[35]
Action- conditioned 3d human motion synthesis with transformer vae
Mathis Petrovich, Michael J Black, and G ¨ul Varol. Action- conditioned 3d human motion synthesis with transformer vae. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10985–10995, 2021
2021
-
[36]
Mmm: Generative masked motion model
Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, and Chen Chen. Mmm: Generative masked motion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1546–1555, 2024
2024
-
[37]
Monkey see, monkey do: Harnessing self-attention in motion diffusion for zero-shot motion transfer
Sigal Raab, Inbar Gat, Nathan Sala, Guy Tevet, Rotem Shalev- Arkushin, Ohad Fried, Amit Haim Bermano, and Daniel Cohen- Or. Monkey see, monkey do: Harnessing self-attention in motion diffusion for zero-shot motion transfer. InProceedings of the ACM SIGGRAPH Conference on Computer Graphics and Interactive Techniques, pages 1–13, 2024
2024
-
[38]
Single motion diffusion.arXiv preprint arXiv:2302.05905, 2023
Sigal Raab, Inbal Leibovitch, Guy Tevet, Moab Arar, Amit H Bermano, and Daniel Cohen-Or. Single motion diffusion.arXiv preprint arXiv:2302.05905, 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12
-
[39]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InProceedings of the International Conference on Machine Learning, pages 8748–8763. PmLR, 2021
2021
-
[40]
arXiv preprint arXiv:2303.01418 (2023) 3
Yonatan Shafir, Guy Tevet, Roy Kapon, and Amit H Bermano. Human motion diffusion as a generative prior.arXiv preprint arXiv:2303.01418, 2023
-
[41]
Bailando: 3d dance generation by actor-critic gpt with choreographic memory
Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. Bailando: 3d dance generation by actor-critic gpt with choreographic memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11050–11059, 2022
2022
-
[42]
PaliGemma 2: A Family of Versatile VLMs for Transfer
Andreas Steiner, Andre Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer.arXiv preprint arXiv:2412.03555, 2024
work page internal anchor Pith review arXiv 2024
-
[43]
Role-aware interaction gen- eration from textual description
Mikihiro Tanaka and Kent Fujiwara. Role-aware interaction gen- eration from textual description. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15999– 16009, 2023
2023
-
[44]
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Leonard Hussenot, Thomas Mesnard, Bobak Shahriari, et al. Gemma 2: Improv- ing open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024
work page internal anchor Pith review arXiv 2024
-
[45]
Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022
work page internal anchor Pith review arXiv 2022
-
[46]
Re- covering 3d human mesh from monocular images: A survey
Yating Tian, Hongwen Zhang, Yebin Liu, and Limin Wang. Re- covering 3d human mesh from monocular images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15406–15425, 2023
2023
-
[47]
Epic fields: Marrying 3d geometry and video understanding.Advances in Neural Information Processing Systems, 36:26485–26500, 2023
Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro Laina, Diane Larlus, Dima Damen, and Andrea Vedaldi. Epic fields: Marrying 3d geometry and video understanding.Advances in Neural Information Processing Systems, 36:26485–26500, 2023
2023
-
[48]
Edge: Editable dance generation from music
Jonathan Tseng, Rodrigo Castellon, and Karen Liu. Edge: Editable dance generation from music. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 448– 458, 2023
2023
-
[49]
Neural discrete representation learning.Advances in Neural Information Processing Systems, 30, 2017
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in Neural Information Processing Systems, 30, 2017
2017
-
[50]
Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017
2017
-
[51]
Ego- centric whole-body motion capture with fisheyevit and diffusion- based motion refinement
Jian Wang, Zhe Cao, Diogo Luvizon, Lingjie Liu, Kripasindhu Sarkar, Danhang Tang, Thabo Beeler, and Christian Theobalt. Ego- centric whole-body motion capture with fisheyevit and diffusion- based motion refinement. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 777–787, 2024
2024
-
[52]
Ego4o: Egocentric human motion capture and understanding from multi-modal input
Jian Wang, Rishabh Dabral, Diogo Luvizon, Zhe Cao, Lingjie Liu, Thabo Beeler, and Christian Theobalt. Ego4o: Egocentric human motion capture and understanding from multi-modal input. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22668–22679, 2025
2025
-
[53]
Synthesizing long-term 3d human motion and interaction in 3d scenes
Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, and Xiaolong Wang. Synthesizing long-term 3d human motion and interaction in 3d scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9401–9411, 2021
2021
-
[54]
Move as you say, interact as you can: Language-guided human motion generation with scene affordance
Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. Move as you say, interact as you can: Language-guided human motion generation with scene affordance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 433–444, 2024
2024
-
[55]
Humanise: Language-conditioned human motion generation in 3d scenes.Advances in Neural Information Processing Systems, 35:14959–14971, 2022
Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. Humanise: Language-conditioned human motion generation in 3d scenes.Advances in Neural Information Processing Systems, 35:14959–14971, 2022
2022
-
[56]
Egotwin: Dreaming body and view in first person.arXiv preprint arXiv:2508.13013, 2025
Jingqiao Xiu, Fangzhou Hong, Yicong Li, Mengze Li, Wentao Wang, Sirui Han, Liang Pan, and Ziwei Liu. Egotwin: Dreaming body and view in first person.arXiv preprint arXiv:2508.13013, 2025
-
[57]
Regennet: Towards human action-reaction synthesis
Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, and Wenjun Zeng. Regennet: Towards human action-reaction synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1759–1769, 2024
2024
-
[58]
3d human pose, shape and texture from low-resolution images and videos.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4490–4504, 2021
Xiangyu Xu, Hao Chen, Francesc Moreno-Noguer, Laszlo A Jeni, and Fernando De la Torre. 3d human pose, shape and texture from low-resolution images and videos.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4490–4504, 2021
2021
-
[59]
Human motion video generation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
Haiwei Xue, Xiangyang Luo, Zhanghao Hu, Xin Zhang, Xunzhi Xiang, Yuqin Dai, Jianzhuang Liu, Zhensong Zhang, Minglei Li, Jian Yang, et al. Human motion video generation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
2025
-
[60]
Estimating body and hand motion in an ego-sensed world
Brent Yi, Vickie Ye, Maya Zheng, Yunqi Li, Lea M ¨uller, Geor- gios Pavlakos, Yi Ma, Jitendra Malik, and Angjoo Kanazawa. Estimating body and hand motion in an ego-sensed world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7072–7084, 2025
2025
-
[61]
Black, Xue Bin Peng, and Davis Rempe
Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, and Davis Rempe. Generating human interaction motions in scenes with text control. InProceedings of the European Conference on Computer Vision, pages 246–263. Springer, 2024
2024
-
[62]
Hero: Human reaction generation from videos
Chengjun Yu, Wei Zhai, Yuhang Yang, Yang Cao, and Zheng-Jun Zha. Hero: Human reaction generation from videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10262–10274, 2025
2025
-
[63]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023
2023
-
[64]
Generating human motion from textual descriptions with discrete representa- tions
Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan. Generating human motion from textual descriptions with discrete representa- tions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14730–14740, 2023
2023
-
[65]
Egoreact: Egocentric video-driven 3d human reaction generation
Libo Zhang, Zekun Li, Tianyu Li, Zeyu Cao, Rui Xu, Xiaoxiao Long, Wenjia Wang, Jingbo Wang, Yuan Liu, Wenping Wang, et al. Egoreact: Egocentric video-driven 3d human reaction generation. arXiv preprint arXiv:2512.22808, 2025
-
[66]
Motiondiffuse: Text-driven human motion generation with diffusion model.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4115–4128, 2024
Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4115–4128, 2024
2024
-
[67]
Black, and Siyu Tang
Siwei Zhang, Yan Zhang, Qianli Ma, Michael J. Black, and Siyu Tang. Place: Proximity learning of articulation and contact in 3d environments. InInternational Conference on 3D Vision, 2020
2020
-
[68]
Couch: Towards controllable human-chair interactions
Xiaohan Zhang, Bharat Lal Bhatnagar, Sebastian Starke, Vladimir Guzov, and Gerard Pons-Moll. Couch: Towards controllable human-chair interactions. InProceedings of the European Conference on Computer Vision, pages 518–535. Springer, 2022
2022
-
[69]
Motiongpt: Fine- tuned llms are general-purpose motion generators
Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, and Wanli Ouyang. Motiongpt: Fine- tuned llms are general-purpose motion generators. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7368–7376, 2024
2024
-
[70]
Karen Liu, and Leonidas J
Yang Zheng, Yanchao Yang, Kaichun Mo, Jiaman Li, Tao Yu, Yebin Liu, C. Karen Liu, and Leonidas J. Guibas. Gimo: Gaze-informed human motion prediction in context. InProceedings of the European Conference on Computer Vision, pages 676–694. Springer, 2022
2022
-
[71]
Human motion generation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2430–2449, 2023
Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang. Human motion generation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2430–2449, 2023
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.