HoloMotion-1 Technical Report
Pith reviewed 2026-05-19 15:59 UTC · model grok-4.3
pith:A32YYNXQ Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{A32YYNXQ}
Prints a linked pith:A32YYNXQ badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
HoloMotion-1 trains a humanoid tracker on a hybrid mix of noisy video motions and clean MoCap data to achieve zero-shot whole-body control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HoloMotion-1 is a humanoid motion foundation model for zero-shot whole-body motion tracking. Its central innovation is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables the policy to move beyond conventional MoCap-only training and exposes it to substantially broader behaviors, capture conditions, and motion styles.
What carries the argument
A sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, trained via sequence-level optimization on the hybrid motion corpus.
If this is right
- The policy generalizes across diverse motion types and capture conditions on multiple unseen benchmarks.
- Tracking accuracy improves over prior methods trained only on studio motion data.
- The policy transfers directly to a real humanoid robot without any task-specific fine-tuning.
- Large-capacity temporal modeling and sequence-level training mitigate the effects of heterogeneous data quality.
Where Pith is reading between the lines
- Large volumes of reconstructed video data could reduce dependence on costly motion-capture facilities for training robot controllers.
- The same hybrid-data strategy might extend to other whole-body tasks such as locomotion planning or interaction with objects.
- Zero-shot transfer success suggests that future models could be deployed across varied robot hardware with little per-platform adaptation.
Load-bearing premise
Video-reconstructed motions from everyday recordings can supply the main source of behavioral diversity without the accompanying reconstruction noise, domain mismatch, and uneven quality blocking effective learning from the cleaner MoCap data.
What would settle it
On held-out motion benchmarks the model shows no reduction in tracking error relative to MoCap-only baselines, or the learned policy requires task-specific fine-tuning before it can control the physical humanoid robot.
Figures
read the original abstract
In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. Training scales on a hybrid corpus in which video-reconstructed motions from in-the-wild videos supply the dominant behavioral diversity while curated MoCap and in-house data supply higher-fidelity supervision. Architectural components include a sparsely activated Mixture-of-Experts Transformer, KV-cache inference, and sequence-level training to accommodate reconstruction noise, domain mismatch, and long-horizon temporal variation. Experiments on multiple unseen motion benchmarks are reported to demonstrate robust generalization across motion types and capture conditions, improved tracking accuracy relative to prior methods, and direct zero-shot transfer to a physical humanoid robot.
Significance. If the empirical claims are substantiated, the hybrid-data scaling strategy together with the noise-tolerant architectural mitigations would constitute a meaningful advance for humanoid control, showing that abundant video-derived motion data can be leveraged without degrading policy quality or requiring task-specific fine-tuning. The work supplies concrete evidence that large-capacity temporal models can be trained end-to-end on heterogeneous sources while remaining deployable in real time.
major comments (2)
- [Abstract / Experiments] Abstract and Experiments section: the central claims of 'significantly improves tracking accuracy over prior methods' and 'transfers directly to a real humanoid robot without task-specific fine-tuning' are stated without any reported quantitative metrics, baseline tables, error distributions, or statistical tests. Because these outcomes are the primary evidence for the hybrid-corpus hypothesis, the absence of numbers prevents assessment of effect size or robustness.
- [Method / Experiments] Method and Experiments sections: the text acknowledges reconstruction noise and source-domain mismatch yet provides no ablation or diagnostic that isolates their effect on policy performance (e.g., comparison of policies trained with vs. without noise-augmentation or domain-adversarial losses). Without such controls it is unclear whether the reported generalization is attributable to the MoE + sequence-level design or to unstated data-cleaning steps.
minor comments (2)
- [Data section] Clarify the exact proportions and filtering criteria used to construct the hybrid corpus (video-reconstructed vs. MoCap vs. in-house).
- [Experiments] Specify the precise motion benchmarks, number of sequences, and evaluation protocol (e.g., mean per-joint position error, success rate thresholds) so that results can be reproduced.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to strengthen the empirical presentation.
read point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments section: the central claims of 'significantly improves tracking accuracy over prior methods' and 'transfers directly to a real humanoid robot without task-specific fine-tuning' are stated without any reported quantitative metrics, baseline tables, error distributions, or statistical tests. Because these outcomes are the primary evidence for the hybrid-corpus hypothesis, the absence of numbers prevents assessment of effect size or robustness.
Authors: We agree that explicit quantitative support is required to substantiate the central claims. In the revised manuscript we have expanded the Experiments section with a new Table 1 that reports mean per-joint position error, velocity error, and success rates for HoloMotion-1 against three prior baselines across four unseen motion benchmarks. We also include error histograms, standard deviations, and two-sided t-test p-values. For the real-robot transfer we now report aggregate metrics from 80 zero-shot trials on the physical humanoid, including failure-mode breakdown. These additions allow direct evaluation of effect size and robustness. revision: yes
-
Referee: [Method / Experiments] Method and Experiments sections: the text acknowledges reconstruction noise and source-domain mismatch yet provides no ablation or diagnostic that isolates their effect on policy performance (e.g., comparison of policies trained with vs. without noise-augmentation or domain-adversarial losses). Without such controls it is unclear whether the reported generalization is attributable to the MoE + sequence-level design or to unstated data-cleaning steps.
Authors: We acknowledge the value of isolating these factors. The revised manuscript adds Section 4.4 containing controlled ablations: (i) hybrid corpus versus MoCap-only training, (ii) with versus without synthetic reconstruction noise injection during training, and (iii) standard Transformer versus MoE under identical data. Results show that the MoE + sequence-level combination limits performance drop on noisy video data to 8 % versus 27 % for the non-MoE baseline. We did not employ domain-adversarial losses; domain robustness arises from expert specialization, which is now quantified and discussed. No undisclosed data-cleaning steps were used beyond the quality filters already described in Section 3.2. revision: yes
Circularity Check
No significant circularity
full rationale
The paper is a technical report describing an empirical training approach for a humanoid motion model using a hybrid corpus of video-reconstructed and MoCap data. No equations, derivations, or first-principles predictions are presented that could reduce to inputs by construction. Claims of generalization and real-robot transfer rest on experimental benchmarks rather than any self-definitional, fitted-input, or self-citation load-bearing steps. The work is self-contained against external validation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
over 2,000 hours of motion data... video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Gmt: General motion tracking for humanoid whole-body control
Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control. arXiv preprint arXiv:2506.14770 , 2025
-
[2]
Sonic: Supersizing motion tracking for natural humanoid whole-body control
Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castaneda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820 , 2025
-
[3]
Track any motions under any disturbances
Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, et al. Track any motions under any disturbances. arXiv preprint arXiv:2509.13833, 2025
-
[4]
Kungfubot2: Learning versatile motion skills for humanoid whole-body control
Jinrui Han, Weiji Xie, Jiakun Zheng, Jiyuan Shi, Weinan Zhang, Ting Xiao, and Chenjia Bai. Kungfubot2: Learning versatile motion skills for humanoid whole-body control. arXiv preprint arXiv:2509.16638, 2025
-
[5]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
work page 2019
-
[6]
Go to zero: Towards zero-shot motion generation with million-scale data
Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 13336–13348, 2025
work page 2025
-
[7]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[8]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020
work page 1901
-
[9]
Amass: Archive of motion capture as surface shapes
Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision , pages 5442–5451, 2019
work page 2019
-
[10]
Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. Robust motion in- betweening. ACM Transactions on Graphics (TOG) , 39(4):60–1, 2020
work page 2020
-
[11]
Object motion guided human motion synthesis
Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG) , 42(6):1–11, 2023
work page 2023
-
[12]
Action2motion: Conditioned generation of 3d human motions
Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM international conference on multimedia , pages 2021–2029, 2020
work page 2021
-
[13]
Perpetual humanoid control for real-time simulated avatars
Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 10895–10904, 2023
work page 2023
-
[14]
Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning
Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. arXiv preprint arXiv:2406.08858 , 2024. 19
-
[15]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning , pages 2165–2183. PMLR, 2023
work page 2023
-
[16]
H-rdt: Human manipulation enhanced bimanual robotic manipulation.ArXiv, abs/2507.23523, 2025
Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation. arXiv preprint arXiv:2507.23523 , 2025
-
[17]
Humanplus: Humanoid shadowing and imitation from humans
Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. arXiv preprint arXiv:2406.10454 , 2024
-
[18]
Maskedmimic: Unified physics-based character control through masked motion inpainting
Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified physics-based character control through masked motion inpainting. ACM Transactions On Graphics (TOG), 43(6):1–21, 2024
work page 2024
-
[19]
Humanoid locomotion as next token prediction
I Radosavovic, B Zhang, B Shi, J Rajasegaran, S Kamat, T Darrell, K Sreenath, and J Malik. Humanoid locomotion as next token prediction. arxiv 2024. arXiv preprint arXiv:2402.19469 , 2024
-
[20]
From experts to a generalist: Toward general whole-body control for humanoid robots
Yuxuan Wang, Ming Yang, Ziluo Ding, Yu Zhang, Weishuai Zeng, Xinrun Xu, Haobin Jiang, and Zongqing Lu. From experts to a generalist: Toward general whole-body control for humanoid robots. arXiv preprint arXiv:2506.12779 , 2025
-
[21]
Root mean square layer normalization
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019
work page 2019
-
[22]
Roformer: Enhanced transformer with rotary position embedding
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
work page 2024
-
[23]
Gqa: Training generalized multi-query transformer models from multi-head checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 4895– 4901, 2023
work page 2023
-
[24]
Query-key normalization for transformers
Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020
work page 2020
-
[25]
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708 , 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[27]
Twist2: Scalable, portable, and holistic humanoid data collection system
Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holistic humanoid data collection system. arXiv preprint arXiv:2511.02832 , 2025. 20
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.