EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

Cong Huang; Haoyang Ge; Kai Chen; Kun Li; Peng Ren; Yukun Shi

arXiv: 2606.08495 · v1 · pith:IFCV5ODKnew · submitted 2026-06-07 · cs.RO · cs.CV

Pith reviewed 2026-06-27 18:35 UTC · model grok-4.3

classification: cs.RO, cs.CV

keywords: egocentric motion, humanoid control, motion generation, diffusion transformer, SMPL, task conditioning, egocentric vision, robot motion prior

The pith

A single checkpoint reconstructs, generates, and forecasts full-body SMPL motions from egocentric video and text for humanoid control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that egocentric human demonstrations can train a unified prior supporting multiple motion tasks without separate models for each. This matters because humanoid robots need whole-body behaviors that adapt to visual scenes, tasks, and high-level user intent rather than rigid trajectory tracking alone. The framework treats language as a control signal instead of a full motion script. If the approach holds, scalable video data becomes directly usable for interactive robot motion without per-task retraining.

Core claim

EgoPriMo is a unified framework that learns motion priors from egocentric human demonstrations. Given egocentric observations and a text prompt, it reconstructs, generates, and forecasts SMPL-based full-body motion. At its core is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text, with task-conditioning masks routing different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting, and the generated SMPL motions execute on a Unitree humanoid controller.

What carries the argument

Triple-stream DiT jointly modeling body dynamics, egocentric visual context, and text, with task-conditioning masks that route tasks and missing data through one checkpoint.
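
As a rough illustration of how task-conditioning masks could route several tasks through one checkpoint, the sketch below builds per-frame visibility flags for the conditioning streams and a target mask for the body stream. The three task names come from the paper. The masking rules themselves, the function name `build_task_masks`, and the parameters `n_observed`, `has_visual`, and `has_text` are assumptions made for illustration and are not taken from the authors' implementation.

```python
from typing import Dict, List

TASKS = ("reconstruction", "generation", "forecasting")


def build_task_masks(task: str, n_frames: int, n_observed: int = 1,
                     has_visual: bool = True,
                     has_text: bool = True) -> Dict[str, List[int]]:
    """Return per-frame 0/1 flags (hypothetical layout, not the paper's).

    "visual" / "text": 1 means the frame's conditioning is visible.
    "body_target":     1 means the model must produce motion for that frame.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if not 0 <= n_observed <= n_frames:
        raise ValueError("n_observed must lie in [0, n_frames]")

    # First n_observed frames visible, the rest hidden.
    prefix = [1] * n_observed + [0] * (n_frames - n_observed)

    if task == "reconstruction":
        # Assumed: egocentric video covers the whole window.
        visual = [1] * n_frames
        body_target = [1] * n_frames
    elif task == "forecasting":
        # Assumed: only past video is visible; only future motion is predicted.
        visual = list(prefix)
        body_target = [1 - v for v in prefix]
    else:  # generation
        # Assumed: an initial view plus text; the whole window is generated.
        visual = list(prefix)
        body_target = [1] * n_frames

    # Missing-modality samples reuse the same mechanism: hide the whole stream.
    if not has_visual:
        visual = [0] * n_frames
    text = [1 if has_text else 0] * n_frames

    return {"visual": visual, "text": text, "body_target": body_target}
```

Under these assumptions, a forecasting sample with six frames and two observed frames yields a visual mask of `[1, 1, 0, 0, 0, 0]`, and a sample without a caption is handled by zeroing the text mask rather than by switching models.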

If this is right

One checkpoint improves egocentric motion generation over UniEgoMotion on the tested datasets.
The same checkpoint supports reconstruction and forecasting tasks without retraining.
Generated SMPL motions transfer to execution on a physical Unitree humanoid controller.
Language serves as a high-level rather than exhaustive control signal for motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could allow robots to switch between motion tasks on the fly in response to changing user text prompts without reloading models.
Retargeting the SMPL outputs to different robot morphologies might extend the prior beyond the Unitree platform tested.
Adding more input modalities such as depth or audio could be tested by extending the existing mask mechanism rather than redesigning the architecture.

Load-bearing premise

Task-conditioning masks can route different tasks and missing-modality data through the same model checkpoint without significant degradation in performance for each individual task.

What would settle it

Train separate specialized models for reconstruction, generation, and forecasting on Nymeria and EgoExo4D, then compare each to the single EgoPriMo checkpoint; if any specialized model outperforms the unified one by a clear margin on its target task, the unified-checkpoint claim is falsified.
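
The comparison described above can be sketched as a small check. This is a minimal illustration, assuming one scalar error metric per task where lower is better by default. The function name `tasks_where_specialist_wins` and the `margin` parameter are hypothetical, and the text does not define what counts as a "clear margin", so that threshold is left to the caller.

```python
from typing import Dict, List


def tasks_where_specialist_wins(unified: Dict[str, float],
                                specialized: Dict[str, float],
                                margin: float,
                                lower_is_better: bool = True) -> List[str]:
    """List tasks on which the specialized model beats the unified one by more than `margin`."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    if unified.keys() != specialized.keys():
        raise ValueError("both models must be scored on the same tasks")

    losers = []
    for task in sorted(unified):
        # Positive gap means the unified checkpoint is worse on this task.
        gap = unified[task] - specialized[task]
        if not lower_is_better:
            gap = -gap
        if gap > margin:
            losers.append(task)
    return losers
```

An empty return value means the unified-checkpoint claim survives at the chosen margin; any listed task is one where, by the criterion above, the claim would be falsified.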

Figures

Figures reproduced from arXiv: 2606.08495 by Cong Huang, Haoyang Ge, Kai Chen, Kun Li, Peng Ren, Yukun Shi.

**Figure 2.** Overview of EgoPriMo. Egocentric observations, auxiliary pose cues, and text prompts … [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Qualitative egocentric motion generation results. Each example pairs egocentric obser… [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

**Figure 4.** Humanoid robot control with EgoPriMo. The model generates SMPL-based full-body … [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

Original abstract

Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EgoPriMo adds a Triple-stream DiT with task masks for unified egocentric motion tasks, but the abstract gives no numbers or ablations to confirm that the single-checkpoint claim holds without interference.

The main new piece is the Triple-stream DiT that processes body dynamics, egocentric visuals, and text in parallel, with task-conditioning masks to route reconstruction, generation, forecasting, and missing-modality cases through one checkpoint. It reports better generation results than UniEgoMotion on Nymeria and EgoExo4D and says the SMPL outputs run on a Unitree controller.

This setup is a concrete step toward scalable motion priors from first-person data for humanoid control. The hardware transfer mention is a practical touch if the details hold.

The soft spot is exactly the one the stress-test note flags. The abstract asserts the masks let one model handle multiple tasks without degradation, yet it supplies no per-task scores, no mask ablation, and no head-to-head with separately trained models on the same splits. Without those, the unified advantage stays untested. The datasets are public, which is good, but the lack of quantitative breakdowns or error analysis makes it hard to judge how solid the improvements actually are.

The citation pattern is standard and relevant. No obvious circularity.

This is for robotics and vision researchers who need egocentric motion models that can switch tasks. A reader working on humanoid priors or multi-task motion would get value from the architecture description.

It deserves a serious referee because the problem matters and the approach is specific enough to review in detail, even though the current evidence is light.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces EgoPriMo, a unified framework for learning egocentric motion priors from human demonstrations. It uses a Triple-stream DiT jointly modeling body dynamics, egocentric visuals, and text, with task-conditioning masks to route reconstruction, generation, forecasting, and missing-modality inputs through a single checkpoint. Experiments on Nymeria and EgoExo4D claim that this checkpoint outperforms UniEgoMotion on generation while supporting the other tasks, with generated SMPL motions executable on a Unitree humanoid controller.

Significance. If the multi-task masking mechanism holds without degradation, the work would provide a scalable, interactive prior for humanoid whole-body control from egocentric data, moving beyond separate tracking or VLA systems. The use of public datasets (Nymeria, EgoExo4D) and hardware transfer to Unitree are concrete strengths that support reproducibility and practical relevance.

major comments (3)

[Abstract and Experiments] The central claim that 'one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting' rests on task-conditioning masks enabling joint training without measurable interference. No per-task quantitative breakdown, no ablation removing the masks, and no comparison of the unified model versus separately trained task-specific models on identical data splits are reported; this leaves the 'one checkpoint' advantage unsubstantiated.
[Method, Triple-stream DiT description] The task-conditioning masks are presented as routing different tasks and missing modalities, but the manuscript provides no analysis of how mask design affects information flow across streams or any quantitative measure of specialization loss when multiple tasks share parameters.
[Hardware transfer] The claim that generated SMPL motions 'can also be executed by a Unitree humanoid controller' is load-bearing for the interactive humanoid application, yet no mapping details, success rates, or comparison to baseline controllers are supplied to confirm executability beyond qualitative assertion.

minor comments (2)

[Method] Notation for the three streams (body, visual, text) should be defined explicitly with consistent symbols when first introduced.
[Experiments] Dataset splits and exact metrics (e.g., MPJPE, FID) used for the UniEgoMotion comparison should be stated clearly in the Experiments section.
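
For reference, MPJPE (mean per-joint position error), one of the metrics the referee names as an example, is the average Euclidean distance between predicted and ground-truth joint positions over all frames and joints. The sketch below is a plain-Python version with no root or Procrustes alignment. Whether the paper uses this exact protocol is not stated in the material here, so the alignment choice and the function name `mpjpe` are assumptions.

```python
import math
from typing import Sequence

# A motion is a sequence of frames; a frame is a sequence of 3D joint positions.
Joint = Sequence[float]
Frame = Sequence[Joint]


def mpjpe(pred: Sequence[Frame], gt: Sequence[Frame]) -> float:
    """Mean Euclidean distance between matching joints, in the input's units."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")

    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        if len(pred_frame) != len(gt_frame):
            raise ValueError("frames must have the same number of joints")
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)  # Euclidean distance (Python 3.8+)
            count += 1

    if count == 0:
        raise ValueError("no joints to compare")
    return total / count
```

Stating whether positions are root-aligned before this average is taken, and in which units, would address most of the ambiguity the comment points at.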

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying the current evidence and committing to revisions where appropriate.

Referee: [Abstract and Experiments] The central claim that 'one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting' rests on task-conditioning masks enabling joint training without measurable interference. No per-task quantitative breakdown, no ablation removing the masks, and no comparison of the unified model versus separately trained task-specific models on identical data splits are reported; this leaves the 'one checkpoint' advantage unsubstantiated.

Authors: We acknowledge that the manuscript does not include per-task quantitative breakdowns, ablations removing the task-conditioning masks, or direct comparisons against separately trained task-specific models on the same data splits. The reported results demonstrate that the unified checkpoint outperforms UniEgoMotion on generation while supporting the other tasks, but these additional analyses would more rigorously substantiate the lack of interference. We will add the requested per-task metrics, mask ablations, and task-specific model comparisons in the revised manuscript.

Revision: yes

Referee: [Method, Triple-stream DiT description] The task-conditioning masks are presented as routing different tasks and missing modalities, but the manuscript provides no analysis of how mask design affects information flow across streams or any quantitative measure of specialization loss when multiple tasks share parameters.

Authors: The manuscript presents the task-conditioning masks primarily through their empirical utility in enabling multi-task and missing-modality operation within the Triple-stream DiT. No explicit analysis of information flow or specialization loss is provided. We will expand the method section with additional discussion of mask design effects and any available quantitative measures of task specialization in the revision.

Revision: yes

Referee: [Hardware transfer] The claim that generated SMPL motions 'can also be executed by a Unitree humanoid controller' is load-bearing for the interactive humanoid application, yet no mapping details, success rates, or comparison to baseline controllers are supplied to confirm executability beyond qualitative assertion.

Authors: The hardware transfer is included as a qualitative demonstration that SMPL motions from EgoPriMo are executable on the Unitree platform. We agree that mapping details, success rates, and baseline controller comparisons are needed to strengthen the claim. We will incorporate these quantitative elements and implementation specifics in the revised version.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluation

The paper introduces a Triple-stream DiT model with task-conditioning masks and asserts performance via experiments on public datasets Nymeria and EgoExo4D, including comparison to UniEgoMotion and execution on a Unitree controller. No derivation chain, equations, or first-principles results are presented that reduce to inputs by construction. No self-citations, fitted parameters renamed as predictions, or self-definitional steps appear in the abstract or described structure. The multi-task routing via masks is a modeling choice whose effectiveness is claimed through external benchmarks rather than tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the new model architecture trained on egocentric datasets, with assumptions about modality handling via masks and transfer of SMPL motions to hardware.

free parameters (1)

Model parameters: the DiT model weights and conditioning parameters are fitted to the Nymeria and EgoExo4D training data.

axioms (1)

Domain assumption: the SMPL model accurately represents human body motions for transfer to humanoid robots. The framework outputs SMPL-based motions that are assumed to be executable by the Unitree controller.

invented entities (1)

Triple-stream DiT (no independent evidence)
Purpose: jointly models body dynamics, egocentric visual context, and text using task-conditioning masks.
New architecture introduced for unified task handling.
