XmoPipe: A Pipeline for Large-Scale In-the-Wild Human Motion Dataset Construction

Alexandre Meyer; Emmanuel Dellandréa; Mathieu Lefort; Nathan Salazar

arxiv: 2606.20731 · v1 · pith:FDA7N7SSnew · submitted 2026-06-17 · 💻 cs.CV · cs.AI

Nathan Salazar, Emmanuel Dellandréa, Mathieu Lefort, Alexandre Meyer

Pith reviewed 2026-06-26 21:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords human motion · dataset construction · monocular motion capture · in-the-wild videos · motion reconstruction · motion generation · video retrieval · textual descriptions

The pith

A pipeline extracts 3D motions from online videos to create large-scale human motion datasets usable for training models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces XmoPipe, a pipeline that starts with keywords to retrieve videos, extracts 3D body and facial motion, and adds textual descriptions. This approach aims to overcome the scale and diversity limits of marker-based motion capture by using monocular methods on unconstrained videos. A sympathetic reader would care because it could enable training of more robust motion reconstruction and generation models. The authors show that models trained on data from this pipeline perform comparably to those trained on traditional datasets and generalize well across different datasets.

Core claim

XmoPipe is a scalable pipeline for constructing in-the-wild human motion datasets. From a few keywords, it retrieves videos, extracts 3D body and facial motion using monocular capture, and generates high-level textual descriptions. The pipeline is flexible for targeted collection of motions, interactions, or expressive behaviors. Its quality is demonstrated by training motion reconstruction and motion generation models that achieve performance comparable to models trained on traditional motion capture datasets with strong cross-dataset generalization.
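
As a rough illustration of the three stages named above (retrieval, motion extraction, description), the following Python sketch wires them together. It is not the authors' implementation: `MotionSample`, `build_dataset`, and the stage callables are hypothetical names, and the toy stand-ins at the bottom replace the real video search, monocular capture, and captioning models, none of which the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class MotionSample:
    """One dataset entry (hypothetical layout, not the paper's format)."""
    video_id: str
    body_motion: list   # per-frame 3D body parameters (placeholder)
    face_motion: list   # per-frame facial parameters (placeholder)
    description: str    # high-level text description


def build_dataset(
    keywords: Iterable[str],
    retrieve: Callable[[str], List[str]],
    extract_body: Callable[[str], list],
    extract_face: Callable[[str], list],
    describe: Callable[[str], str],
) -> List[MotionSample]:
    """Run retrieval, motion extraction, and captioning for each keyword."""
    samples: List[MotionSample] = []
    seen = set()
    for keyword in keywords:
        for video_id in retrieve(keyword):
            if video_id in seen:  # a video may match several keywords
                continue
            seen.add(video_id)
            samples.append(
                MotionSample(
                    video_id=video_id,
                    body_motion=extract_body(video_id),
                    face_motion=extract_face(video_id),
                    description=describe(video_id),
                )
            )
    return samples


# Toy stand-ins so the sketch runs without any video or model.
def fake_retrieve(keyword: str) -> List[str]:
    return [f"{keyword}-001", "shared-000"]


def fake_motion(video_id: str) -> list:
    return [[0.0, 0.0, 0.0]]  # a single dummy frame


def fake_describe(video_id: str) -> str:
    return f"a person moves in {video_id}"


demo = build_dataset(
    ["dance", "handshake"], fake_retrieve, fake_motion, fake_motion, fake_describe
)
print([s.video_id for s in demo])
```

Passing the stages in as callables keeps the sketch independent of any particular retrieval API or capture model; targeting a different kind of motion then only means changing the keyword list.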

What carries the argument

XmoPipe, the pipeline that retrieves videos from keywords, extracts 3D motion, and generates descriptions to build datasets.

If this is right

Models for motion reconstruction achieve performance comparable to those trained on marker-based data.
Models for motion generation likewise achieve performance comparable to those trained on marker-based data.
Strong cross-dataset generalization is observed in the trained models.
The pipeline supports targeted collection of various motion types including multi-person interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Such datasets could support training on motions from diverse real-world contexts not feasible in controlled capture settings.
Combining the extracted motions with the generated textual descriptions may enable new multimodal motion understanding tasks.
Extending the pipeline to include more video sources or refined extraction methods could further increase dataset scale and accuracy.

Load-bearing premise

The accuracy and consistency of 3D motion extracted monocularly from unconstrained online videos is sufficient for training models that perform as well as those using marker-based capture data.

What would settle it

Training the same motion reconstruction and generation models on the pipeline's data and finding substantially worse performance on standard evaluation metrics compared to models trained on traditional datasets would falsify the quality claim.

Original abstract

Large-scale human motion datasets are essential for training robust motion models for analysis, synthesis, and understanding. While marker-based motion capture provides precise data, it is costly and limited in scale and diversity. Recent advances in monocular motion capture and video-language understanding open the way to extract plausible motion from unconstrained online videos. We present a scalable pipeline for constructing in-the-wild human motion datasets. From a few keywords, the system retrieves videos, extracts 3D body and facial motion, and generates high-level textual descriptions. The pipeline is flexible, enabling targeted collection of various motions, multi-person interactions, or expressive behaviors. We demonstrate its quality by training motion reconstruction and motion generation models, showing performance comparable to models trained on traditional motion capture datasets and strong cross-dataset generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a practical keyword-driven pipeline for pulling and processing web videos into 3D motion data at scale, but the claim of mocap-comparable model performance rests on unshown accuracy checks for the monocular outputs.

The main thing here is an end-to-end system that takes a few keywords, retrieves videos, runs monocular 3D body and face extraction, and adds text descriptions. It is set up to target multi-person scenes or expressive actions, which is the part that actually expands what you can collect beyond studio limits.

The pipeline itself is a straightforward assembly of current tools, and the authors show downstream models for reconstruction and generation that reach similar numbers to mocap-trained baselines plus decent cross-dataset results. That demonstration is the part worth looking at if you need larger training sets.

The soft spot is exactly the one in the stress-test note. The abstract says the extracted motions are plausible and the models perform comparably, but there are no reported per-joint errors, temporal consistency measures, or direct comparisons against ground-truth mocap on the same videos. Without those, it is hard to tell whether the good downstream numbers come from solid data or from the models compensating for biases like depth ambiguity or foot skating. The full paper would need to close that gap.

This is for people who build motion models and want more variety than lab capture provides. A reader could pull useful implementation details from the pipeline description even if they end up adding their own validation.

I would send it to peer review because the practical problem is real and the approach is concrete, though the validation section will need work.

Referee Report

2 major / 1 minor

Summary. The paper introduces XmoPipe, a scalable pipeline for constructing large-scale in-the-wild human motion datasets from unconstrained online videos. Starting from keywords, it retrieves videos, extracts 3D body and facial motion via monocular methods, and generates high-level textual descriptions. The pipeline supports targeted collection of motions, interactions, and expressive behaviors. Quality is demonstrated by training motion reconstruction and generation models that achieve performance comparable to those trained on traditional marker-based mocap datasets, along with strong cross-dataset generalization.

Significance. If the extracted motions prove sufficiently accurate and consistent, the pipeline could enable substantially larger and more diverse motion datasets than current mocap collections, advancing analysis, synthesis, and understanding tasks. The flexibility for keyword-driven, multi-person, and expressive data collection is a practical strength. Credit is due for framing an end-to-end empirical construction method whose quality is assessed via downstream model performance rather than isolated metrics.

major comments (2)

[Abstract] The central claim that 'models trained on the constructed dataset show performance comparable to models trained on traditional motion capture datasets' is presented without any reported metrics, baselines, dataset sizes, error analysis, or validation protocol. This is load-bearing for the quality demonstration and leaves the monocular extraction accuracy unverified.
[Demonstration paragraph] No per-joint error, temporal smoothness, scale consistency, or foot-skating metrics are provided against any ground-truth mocap on the collected videos. Without these, systematic biases from depth ambiguity or expression drift cannot be ruled out as confounding the 'comparable performance' result.
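
For readers unfamiliar with the per-joint error the referee asks for, a minimal sketch of mean per-joint position error (MPJPE) follows: the average Euclidean distance between predicted and reference joints over all frames. It assumes both sequences are already expressed in the same coordinate frame and units, and the function name and data layout are illustrative, not taken from the paper.

```python
import math


def mpjpe(pred, ref):
    """Mean per-joint position error.

    pred, ref: sequences of frames; each frame is a sequence of (x, y, z)
    joints. Both must have the same number of frames and joints.
    """
    if len(pred) != len(ref):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(pred, ref):
        if len(pred_frame) != len(ref_frame):
            raise ValueError("joint counts differ")
        for p, r in zip(pred_frame, ref_frame):
            total += math.dist(p, r)  # Euclidean distance for one joint
            count += 1
    if count == 0:
        raise ValueError("empty sequence")
    return total / count


# Two frames, two joints; every predicted joint is offset by (3, 4, 0),
# so each joint contributes a distance of 5 in the reference units.
ref = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]]
pred = [[(x + 3.0, y + 4.0, z) for (x, y, z) in frame] for frame in ref]
error = mpjpe(pred, ref)
print(error)
```

A metric of this kind needs a reference sequence for the same motion, which is why it cannot be applied directly to footage that has no synchronized capture.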

minor comments (1)

[Abstract] The abstract would be clearer if it quantified the scale of the constructed dataset (e.g., total hours or number of sequences) and named the specific monocular methods used for 3D extraction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major comment below, clarifying our evaluation methodology and indicating where revisions will strengthen the manuscript.

Referee: [Abstract] The central claim that 'models trained on the constructed dataset show performance comparable to models trained on traditional motion capture datasets' is presented without any reported metrics, baselines, dataset sizes, error analysis, or validation protocol. This is load-bearing for the quality demonstration and leaves the monocular extraction accuracy unverified.

Authors: We agree that the abstract would benefit from including key quantitative indicators to support the claim. In the revised manuscript we will augment the abstract with specific performance numbers (e.g., MPJPE or FID values on reconstruction/generation tasks), dataset sizes, and a concise statement of the validation protocol. The full set of baselines, error analyses, and cross-dataset results already appear in Section 4; the abstract revision will make these results more immediately visible.

Revision: yes

Referee: [Demonstration paragraph] No per-joint error, temporal smoothness, scale consistency, or foot-skating metrics are provided against any ground-truth mocap on the collected videos. Without these, systematic biases from depth ambiguity or expression drift cannot be ruled out as confounding the 'comparable performance' result.

Authors: We acknowledge that direct per-joint, temporal, scale, or foot-contact metrics against synchronized marker-based mocap are absent. Because the source videos are unconstrained online footage, no such ground-truth mocap exists for the collected sequences; therefore these metrics cannot be computed. Our quality argument instead rests on downstream task performance and cross-dataset generalization, which provide an indirect but task-relevant measure of motion utility. We will add an explicit paragraph in the revised manuscript explaining this design choice and the inherent limitations of direct GT evaluation for in-the-wild data.

Revision: partial
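
Foot contact and temporal smoothness can also be probed without reference mocap, as plausibility proxies rather than accuracy measures. The sketch below is one such check under stated assumptions: a y-up world frame with the ground near height zero, positions in metres, and illustrative thresholds (`height_thresh`, `speed_thresh`) that are not from the paper. It is not something the authors report using.

```python
import math


def foot_skate_ratio(foot, fps, height_thresh=0.05, speed_thresh=0.1):
    """Fraction of ground-contact steps in which the foot slides horizontally.

    foot: list of (x, y, z) positions of one foot joint, y pointing up.
    A step counts as contact when the foot is below height_thresh in both
    frames; it counts as skating when horizontal speed exceeds speed_thresh.
    """
    contact = skating = 0
    for prev, cur in zip(foot, foot[1:]):
        if prev[1] < height_thresh and cur[1] < height_thresh:
            contact += 1
            speed = math.hypot(cur[0] - prev[0], cur[2] - prev[2]) * fps
            if speed > speed_thresh:
                skating += 1
    return skating / contact if contact else 0.0


def mean_acceleration(traj, fps):
    """Mean magnitude of the second finite difference, a jitter proxy."""
    accs = []
    for a, b, c in zip(traj, traj[1:], traj[2:]):
        second_diff = [a[i] - 2 * b[i] + c[i] for i in range(3)]
        accs.append(math.sqrt(sum(d * d for d in second_diff)) * fps * fps)
    return sum(accs) / len(accs) if accs else 0.0


# A planted foot never skates; a foot gliding along the ground always does.
planted = [(0.0, 0.0, 0.0)] * 4
sliding = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
planted_ratio = foot_skate_ratio(planted, fps=30)
sliding_ratio = foot_skate_ratio(sliding, fps=30)
sliding_acc = mean_acceleration(sliding, fps=30)
```

Checks like these flag implausible motion but say nothing about whether the recovered pose matches what the person actually did, so they complement rather than replace the downstream evaluation the authors rely on.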

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent downstream evaluation

The paper describes a data-construction pipeline whose central claim is an empirical demonstration: models trained on the resulting dataset achieve performance comparable to those trained on marker-based mocap data. No equations, fitted parameters, self-citations, or ansatzes are invoked to derive this result; the quality check is performed by separate training and evaluation steps whose inputs (the extracted motions) are not redefined in terms of the outputs. The absence of any load-bearing mathematical reduction or self-referential justification keeps the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are named or implied.

A. Supplementary material

Yolov11n-pose confidence threshold to filter the frames: 0.65
Vitpose-h model used to get the body keypoints for GVHMR: ViTPose huge coco 256x192
Vitpose-h confidence thre...
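
The supplementary fragment quotes a 0.65 confidence threshold for filtering frames. A minimal sketch of how such a threshold could be applied is shown below, assuming one confidence score per frame and assuming that only contiguous runs of confident frames are kept; the `min_len` parameter and the function name are hypothetical, and the source does not say how scores are aggregated per frame.

```python
FRAME_CONF_THRESHOLD = 0.65  # value quoted in the supplementary fragment


def confident_segments(frame_scores, threshold=FRAME_CONF_THRESHOLD, min_len=2):
    """Return (start, end) index pairs, end exclusive, for each run of
    consecutive frames whose score is at least `threshold`.

    Runs shorter than `min_len` frames are dropped (an assumption made
    here so that every kept segment is long enough to contain motion).
    """
    segments = []
    start = None
    for i, score in enumerate(frame_scores):
        if score >= threshold:
            if start is None:
                start = i  # a new confident run begins
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(frame_scores) - start >= min_len:
        segments.append((start, len(frame_scores)))
    return segments


scores = [0.9, 0.8, 0.3, 0.7, 0.66, 0.65, 0.1, 0.9]
segments = confident_segments(scores)
print(segments)
```

Splitting on low-confidence frames rather than simply deleting them avoids stitching together frames that were not adjacent in the source video.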
