pith. sign in

arxiv: 2606.10340 · v1 · pith:CA6ELMSOnew · submitted 2026-06-09 · 💻 cs.RO

OMG: Omni-Modal Motion Generation for Generalist Humanoid Control

Pith reviewed 2026-06-27 13:12 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid controlmotion generationdiffusion modelsmulti-modal conditioningwhole-body controlomni-modaldata curationfoundation models
0
0 comments X

The pith

A diffusion model conditions on language, audio and reference motions to drive generalist whole-body humanoid control from curated data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that general-purpose humanoid control needs a scalable multi-modal reasoning module placed on top of a reactive motion tracker. It introduces OMG, which uses a data curation pipeline together with a diffusion generator that accepts language, audio and human motion inputs. Experiments are presented to show state-of-the-art tracking performance, clear scaling with model size, and rapid adaptation when new modalities or data distributions appear.

Core claim

OMG consists of a meticulous data curation, filtering and labeling pipeline plus a diffusion-based motion generation backbone that conditions on language, audio and human reference motions. The architecture places this generator as a reasoning brain above a reactive cerebellum. Experiments demonstrate that the resulting controller achieves state-of-the-art whole-body performance, exhibits model scaling, and adapts efficiently to new distributions and modalities.

What carries the argument

Diffusion-based motion generation backbone conditioned on language, audio and human reference motions, supported by a data curation pipeline.

If this is right

  • The controller reaches state-of-the-art performance on whole-body motion tasks.
  • Performance improves with larger model size according to scaling laws.
  • The same model adapts quickly to new data distributions and input modalities.
  • The approach constitutes a step toward foundation models for humanoid robots.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hierarchical separation of reasoning and tracking layers could be tested on non-humanoid robots that require multi-modal commands.
  • The data curation steps may transfer to other motion-generation domains where high-quality paired examples are scarce.
  • If scaling continues, the model could eventually accept longer-horizon language instructions without additional reward engineering.

Load-bearing premise

A scalable multi-modal reasoning module placed on a reactive motion tracker is enough to reach general-purpose humanoid control.

What would settle it

A controlled test in which the model shows no performance gain when scaled in size or fails to adapt when a new input modality is added after fine-tuning.

Figures

Figures reproduced from arXiv: 2606.10340 by Dongming Qiao, Guanqi He, Hang Zhao, Kun-Ying Lee, Shaoting Zhu, Siqiao Huang, Yitang Li, Zhenyu Wang.

Figure 1
Figure 1. Figure 1: Overview. OMG decomposes humanoid whole-body control into a scalable motion gen￾eration brain and a reactive motion tracking cerebellum. Built on OMG-Data, a curation of 1000+ hours omni-modal humanoid motion data, OMG-DiT maps language, audio, human reference, and their compositions into robot-executable future motions, which are deployed on a Unitree G1 in real time, paired with a pretrained motion track… view at source ↗
Figure 2
Figure 2. Figure 2: Dataset Statistics of OMG-Data. We curate a large-scale omni-modal humanoid motion corpus by aggregating heterogeneous datasets and unifying them into the Unitree G1 motion space. Left: processed data statistics across conditioning modalities and source datasets. Right: represen￾tative conditioning modalities, including language, audio, and human reference motions. Interactive and Multi-Modal Motion Genera… view at source ↗
Figure 3
Figure 3. Figure 3: OMG-DiT learns a shared diffusion backbone while enabling conditioning with modality￾specific encoders. History motion and language are injected as global context tokens via cross￾attention, whereas frame-aligned signals (i.e., audio and human reference motions) are injected through FiLM [45] adapters. New modalities are attached non-invasively through zero-initialized adapters, and multiple conditions can… view at source ↗
Figure 4
Figure 4. Figure 4: Real-World Omni-Modal Control. OMG generates diverse Unitree G1 motions across various conditioning modalities in real time, executable in the real world. 5.1 Experiment Setup Evaluation Protocol. We evaluate OMG in two regimes: pretrained omni-modal motion gen￾eration and downstream finetuning. For pretraining, we consider language-, audio-, and human￾reference-conditioned motion generation, where the mod… view at source ↗
Figure 5
Figure 5. Figure 5: Scaling OMG-DiT. Scaling Behavior. Finally, we ask whether motion generation is a scalable objective: do larger diffu￾sion backbones yield better humanoid motion qual￾ity, given the same data and evaluation protocol? To answer this, we pretrain three OMG-DiT variants with increasing numbers of parameters. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Canonicalization. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative visualization. Unitree G1 execution sequences produced by OMG under text, audio, human-reference, and composed text-audio conditions. Frames are uniformly sampled within each sequence, and embedded prompts are preserved. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Interactive Control: Composition in the Temporal Horizon. E.3 Interactive Control with Temporal Composition We showcase our model’s capability for real-time interactive control. We feed the model time￾varying commands from different modalities over the temporal horizon. As shown in [PITH_FULL_IMAGE:figures/full_fig_p027_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Human-reference CFG sweep. The translucent reference overlay shows the target mo￾tion, and the opaque robot shows the generated motion [PITH_FULL_IMAGE:figures/full_fig_p028_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Text-audio CFG sweep. Columns show snapshots over time and rows vary the text guidance scale while keeping the audio condition fixed. Larger text guidance improves adherence to the language instruction in the composed audio-language setting. (a) Dataset. (b) Egocentric RGB input. (c) Third-person layout [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Perceptive Locomotion setup. The dataset contains 300 Kimodo-generated demonstra￾tions per target color. The policy observes only low-resolution egocentric RGB and a discrete color command. The goal is to follow the command and locomote to the corresponding color. target-hold segment so that the reference motion stops after reaching the target. To train the diffu￾sion model, we sample 2 s windows using th… view at source ↗
Figure 12
Figure 12. Figure 12: Third-person timelapse of a successful rollout by the pretrained checkpoint. The robot enters the commanded blue target. Evaluation and Results. At test time, we use online replanning from a canonical standing G1 state. Each replan samples a 2 s motion window, executes the first 0.5 s, and then replans from the updated state; the rollout budget is 210 source frames. We evaluate 30 held-out validation roll… view at source ↗
read the original abstract

Humanoid whole-body control has made significant progress in recent years, yet existing approaches remain limited to few-skill policies with heavy reward engineering, or motion trackers that are difficult to extend to new input modalities. We argue that the key to general-purpose humanoid control is to build a scalable brain, a module capable of reasoning with diverse conditioning modalities, atop a reactive motion tracking cerebellum, mirroring the hierarchical structure of biological motor systems. Two challenges arise in realizing this vision: acquiring a vast amount of high-quality data to achieve general purpose control, and equipping the generator with the capability to condition on compositional, extensible multi-modal inputs. We present OMG, which addresses these challenges with a meticulous data curation, filtering and labeling pipeline, as well as a diffusion-based motion generation backbone that conditions on language, audio, and human reference motions. Extensive experiments validate OMG as an omni-modal whole-body controller exhibiting state-of-the-art performance, model scaling behavior and efficient adaptation to new distributions and modalities, marking a concrete step toward foundation models for humanoid robots.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces OMG, a diffusion-based omni-modal motion generator for humanoid whole-body control. It builds a scalable multi-modal reasoning module (brain) atop a reactive motion tracker (cerebellum), enabled by a data curation, filtering, and labeling pipeline that supports conditioning on language, audio, and human reference motions. The authors claim that extensive experiments demonstrate state-of-the-art performance, model scaling behavior, and efficient adaptation to new distributions and modalities, advancing toward foundation models for humanoid robots.

Significance. If the experimental results hold, the work would constitute a meaningful step toward generalist humanoid controllers by addressing the extensibility limitations of few-skill policies and non-extendable trackers through hierarchical multi-modal design and curated data pipelines.

major comments (1)
  1. [Abstract] Abstract: The central claims of state-of-the-art performance, model scaling behavior, and efficient adaptation to new modalities rest entirely on unspecified experiments; no quantitative metrics, baseline comparisons, dataset sizes, error bars, or ablation results are provided to support these assertions, which are load-bearing for the paper's contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and the opportunity to clarify the presentation of our results. The primary concern raised is addressed point-by-point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims of state-of-the-art performance, model scaling behavior, and efficient adaptation to new modalities rest entirely on unspecified experiments; no quantitative metrics, baseline comparisons, dataset sizes, error bars, or ablation results are provided to support these assertions, which are load-bearing for the paper's contribution.

    Authors: We agree that the abstract, as currently written, is a high-level summary that does not enumerate specific quantitative results. The full manuscript contains these details in the Experiments section, including direct comparisons against baselines, dataset statistics from the curation pipeline, statistical error bars across runs, and ablations isolating the contributions of the multi-modal conditioning and data pipeline. To make the abstract self-contained and better support the load-bearing claims, we will revise it to include concise references to key quantitative outcomes (e.g., performance deltas and data scale) while preserving its brevity. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical system (OMG) consisting of a data curation pipeline and a diffusion-based multi-modal motion generator, with performance claims supported by experiments on SOTA results, scaling, and adaptation. No derivation chain, equations, or first-principles reductions appear in the provided abstract or description. Claims are framed as outcomes of training and evaluation rather than predictions forced by fitted parameters or self-citations. The hierarchical brain-cerebellum motif is presented as a design choice mirroring biology, not as a mathematically derived necessity. No load-bearing steps reduce to self-definition, renaming, or imported uniqueness theorems. The work is self-contained as an engineering contribution validated externally via benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training details, or explicit assumptions, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5732 in / 1143 out tokens · 46340 ms · 2026-06-27T13:12:21.669551+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

95 extracted references · 2 canonical work pages

  1. [1]

    M. Chen, K. Wang, B. Zhang, X. Ma, Z. Yang, Y . Ren, Q. Huang, Z. Zhu, Y . Wang, and Z. Su. Holomotion-1 technical report, 2026. URLhttps://arxiv.org/abs/2605.15336

  2. [2]

    Zhuang, S

    Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning.arXiv preprint arXiv:2406.10759, 2024

  3. [3]

    Radosavovic, S

    I. Radosavovic, S. Kamat, T. Darrell, and J. Malik. Learning humanoid locomotion over chal- lenging terrain.arXiv preprint arXiv:2410.03654, 2024

  4. [4]

    Zhang, Y

    Y . Zhang, Y . Yuan, P. Gurunath, I. Gupta, S. Omidshafiei, A.-a. Agha-mohammadi, M. Vazquez-Chanlatte, L. Pedersen, T. He, and G. Shi. Falcon: Learning force-adaptive hu- manoid loco-manipulation.arXiv preprint arXiv:2505.06776, 2025

  5. [5]

    Y . Li, Y . Zhang, W. Xiao, C. Pan, H. Weng, G. He, T. He, and G. Shi. Hold my beer: Learning gentle humanoid locomotion and end-effector stabilization control.arXiv preprint arXiv:2505.24198, 2025

  6. [6]

    Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

  7. [7]

    Zhang, J

    Z. Zhang, J. Guo, C. Chen, J. Wang, C. Lin, Y . Lian, H. Xue, Z. Wang, M. Liu, J. Lyu, et al. Track any motions under any disturbances.arXiv preprint arXiv:2509.13833, 2025

  8. [8]

    Q. Liao, T. E. Truong, X. Huang, Y . Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

  9. [9]

    Serifi, R

    A. Serifi, R. Grandia, E. Knoop, M. Gross, and M. Bächer. Robot motion diffusion model: Motion generation for robotic characters. InSIGGRAPH asia 2024 conference papers, pages 1–9, 2024

  10. [10]

    Tevet, S

    G. Tevet, S. Raab, S. Cohan, D. Reda, Z. Luo, X. B. Peng, A. Bermano, and M. Van de Panne. Closd: Closing the loop between simulation and diffusion for multi-task character control. InInternational Conference on Learning Representations, volume 2025, pages 46506–46520, 2025

  11. [11]

    M. Xu, Y . Shi, K. Yin, and X. B. Peng. Parc: Physics-based augmentation with reinforcement learning for character controllers. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

  12. [12]

    Zhang, K

    Z. Zhang, K. Wen, M. Xu, J. He, C. Li, T. Miki, C. Schwarke, C. Zhang, X. B. Peng, and M. Hutter. Learning whole-body humanoid locomotion via motion generation and motion tracking.arXiv preprint arXiv:2604.17335, 2026

  13. [13]

    W. Xie, J. Zheng, J. Han, J. Shi, W. Zhang, C. Bai, and X. Li. Textop: Real-time interactive text-driven humanoid robot motion generation and control.arXiv preprint arXiv:2602.07439, 2026

  14. [14]

    Schuhmann, R

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35: 25278–25294, 2022. 9

  15. [15]

    Penedo, H

    G. Penedo, H. Kydlí ˇcek, A. Lozhkov, M. Mitchell, C. Raffel, L. V on Werra, T. Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

  16. [16]

    T. He, Z. Luo, W. Xiao, C. Zhang, K. Kitani, C. Liu, and G. Shi. Learning human-to-humanoid real-time whole-body teleoperation.arXiv preprint arXiv:2403.04436, 2024

  17. [17]

    T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. Kitani, C. Liu, and G. Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024

  18. [18]

    Y . Ze, Z. Chen, J. P. Araújo, Z. ang Cao, X. B. Peng, J. Wu, and C. K. Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025

  19. [19]

    Tevet, S

    G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-Or, and A. H. Bermano. Human motion diffusion model.arXiv preprint arXiv:2209.14916, 2022

  20. [20]

    Zhang, Z

    M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu. Motiondiffuse: Text-driven human motion generation with diffusion model.arXiv preprint arXiv:2208.15001, 2022

  21. [21]

    L. Yang, Z. Zhang, Y . Song, S. Hong, R. Xu, Y . Zhao, W. Zhang, B. Cui, and M.-H. Yang. Diffusion models: A comprehensive survey of methods and applications.ACM computing surveys, 56(4):1–39, 2023

  22. [22]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  23. [23]

    H. Liu, Z. Zhu, G. Becherini, Y . Peng, M. Su, Y . Zhou, X. Zhe, N. Iwamoto, B. Zheng, and M. J. Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling, 2024. URLhttps://arxiv.org/abs/2401.00374

  24. [24]

    Zhang, Z

    J. Zhang, Z. Kang, L. Liu, J. Chang, Q. Tian, F. Gao, and Y . Wang. Opendance: Multimodal controllable 3d dance generation with large-scale internet data, 2025. URLhttps://arxiv. org/abs/2506.07565

  25. [25]

    J. Li, J. Cao, H. Zhang, D. Rempe, J. Kautz, U. Iqbal, and Y . Yuan. Genmo: A generalist model for human motion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  26. [26]

    Rempe, M

    D. Rempe, M. Petrovich, Y . Yuan, H. Zhang, X. B. Peng, Y . Jiang, T. Wang, U. Iqbal, D. Minor, M. de Ruyter, J. Li, C. Tessler, E. Lim, E. Jeong, S. Wu, E. Hassani, M. Huang, J.-B. Yu, C. Chung, L. Song, O. Dionne, J. Kautz, S. Yuen, and S. Fidler. Kimodo: Scaling controllable human motion generation.arXiv:2603.15546, 2026

  27. [27]

    K. Fan, S. Lu, M. Dai, R. Yu, L. Xiao, Z. Dou, J. Dong, L. Ma, and J. Wang. Go to zero: Towards zero-shot motion generation with million-scale data, 2025. URLhttps://arxiv. org/abs/2507.07095

  28. [28]

    Mahmood, N

    N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. Amass: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019

  29. [29]

    F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and C. Pal. Robust motion in-betweening. arXiv:2102.04942, 39(4), 2020

  30. [30]

    J. Li, J. Wu, and C. K. Liu. Object motion guided human motion synthesis.ACM Trans. Graph., 42(6), 2023. 10

  31. [31]

    Mason, S

    I. Mason, S. Starke, and T. Komura. Real-time style modelling of human locomotion via feature-wise transformations and local motion phases.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1–18, 2022

  32. [32]

    C. Guo, I. Hwang, J. Wang, and B. Zhou. Snapmogen: Human motion generation from ex- pressive texts, 2025. URLhttps://arxiv.org/abs/2507.09122

  33. [33]

    R. Li, J. Zhao, Y . Zhang, M. Su, Z. Ren, H. Zhang, Y . Tang, and X. Li. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10234–10243, 2023

  34. [34]

    R. Li, S. Yang, D. A. Ross, and A. Kanazawa. Learn to dance with aist++: Music conditioned 3d dance generation, 2021

  35. [35]

    Zhang, J

    Y . Zhang, J. Lin, A. Zeng, G. Wu, S. Lu, Y . Fu, Y . Cai, R. Zhang, H. Wang, and L. Zhang. Motion-x++: A large-scale multimodal 3d whole-body human motion dataset.arXiv preprint arXiv:2501.05098, 2025

  36. [36]

    J. Lin, A. Zeng, S. Lu, Y . Cai, R. Zhang, H. Wang, and L. Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset.Advances in Neural Information Processing Systems, 2023

  37. [37]

    B. Kim, H. I. Jeong, J. Sung, Y . Cheng, J. Lee, J. Y . Chang, S.-I. Choi, Y . Choi, S. Shin, J. Kim, and H. J. Chang. Personabooth: Personalized text-to-motion generation.arXiv preprint arXiv:2503.07390, 2025

  38. [38]

    K. Chen, Z. Tan, J. Lei, S.-H. Zhang, Y .-C. Guo, W. Zhang, and S.-M. Hu. Choreomas- ter: choreography-oriented music-driven dance synthesis.ACM Trans. Graph., 40(4), July

  39. [39]

    doi:10.1145/3450626.3459932

    ISSN 0730-0301. doi:10.1145/3450626.3459932. URLhttps://doi.org/10.1145/ 3450626.3459932

  40. [40]

    Burkanova, P

    B. Burkanova, P. J. Yazdian, C. Zhang, T. Evans, P. Tuttösí, and A. Lim. Salsa as a nonverbal embodied language – the compas3d dataset and benchmarks, 2025. URLhttps://arxiv. org/abs/2507.19684

  41. [41]

    N. Le, T. Pham, T. Do, E. Tjiputra, Q. D. Tran, and A. Nguyen. Music-driven group choreog- raphy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2023

  42. [42]

    Y . Ze, J. P. Araújo, J. Wu, and C. K. Liu. Gmr: General motion retargeting, 2025. URL https://github.com/YanjieZe/GMR. GitHub repository

  43. [43]

    J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025

  44. [44]

    B. Seed. Seed1. 8 model card: Towards generalized real-world agency.arXiv preprint arXiv:2603.20633, 2026

  45. [45]

    Todorov, T

    E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–

  46. [46]

    Todorov, T

    IEEE, 2012. doi:10.1109/IROS.2012.6386109

  47. [47]

    Perez, F

    E. Perez, F. Strub, H. De Vries, V . Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  48. [48]

    Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023. 11

  49. [49]

    Y . Du, S. Yang, P. Florence, F. Xia, A. Wahid, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenen- baum, L. Kaelbling, et al. Video language planning. InInternational Conference on Learning Representations, volume 2024, pages 31138–31155, 2024

  50. [50]

    Li and K

    T. Li and K. He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025

  51. [51]

    Y . Lu, S. Lu, Q. Sun, H. Zhao, Z. Jiang, X. Wang, T. Li, Z. Geng, and K. He. One-step latent-free image generation with pixel mean flows.arXiv preprint arXiv:2601.22158, 2026

  52. [52]

    Raffel, N

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  53. [53]

    Zhang, Y

    J. Zhang, Y . Zhang, X. Cun, Y . Zhang, H. Zhao, H. Lu, X. Shen, and Y . Shan. Generating human motion from textual descriptions with discrete representations. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14730–14740, 2023

  54. [54]

    Tseng, R

    J. Tseng, R. Castellon, and K. Liu. Edge: Editable dance generation from music. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 448–458, 2023

  55. [55]

    Jiang, P

    J. Jiang, P. Streli, H. Qiu, A. Fender, L. Laich, P. Snape, and C. Holz. Avatarposer: Articulated full-body pose tracking from sparse motion sensing. InEuropean conference on computer vision, pages 443–460. Springer, 2022

  56. [56]

    X. Yi, Y . Zhou, M. Habermann, S. Shimada, V . Golyanik, C. Theobalt, and F. Xu. Physical inertial poser (pip): Physics-aware real-time human motion tracking from sparse inertial sen- sors. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13167–13178, 2022

  57. [57]

    Cho, S.-H

    H. Cho, S.-H. Kim, J. Kang, and D. Koo. Safeflow: Real-time text-driven humanoid whole- body control via physics-guided rectified flow and selective safety gating.arXiv preprint arXiv:2603.23983, 2026

  58. [58]

    R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y . Hu, Y . Hu, T. Zhang, C. Wen, et al. Hu- manoid manipulation interface: Humanoid whole-body manipulation from robot-free demon- strations.arXiv preprint arXiv:2602.06643, 2026

  59. [59]

    Y . Wen, Q. Shuai, D. Kang, J. Li, C. Wen, Y . Qian, N. Jiao, C. Chen, W. Chen, Y . Wang, et al. Hy-motion 1.0: Scaling flow matching models for text-to-motion generation.arXiv preprint arXiv:2512.23464, 2025

  60. [60]

    R. Li, Y . Zhang, Y . Zhang, H. Zhang, J. Guo, Y . Zhang, Y . Liu, and X. Li. Lodge: A coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1524–1534, 2024

  61. [61]

    Siyao, W

    L. Siyao, W. Yu, T. Gu, C. Lin, Q. Wang, C. Qian, C. C. Loy, and Z. Liu. Bailando: 3d dance generation by actor-critic gpt with choreographic memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11050–11059, 2022

  62. [62]

    Q. Zhao, K. Yang, X. Wang, S. Zhao, Y . Lu, X. Zhang, Q. Shen, X.-X. Long, and X. Cao. Make tracking easy: Neural motion retargeting for humanoid whole-body control.arXiv preprint arXiv:2603.22201, 2026

  63. [63]

    Z. Luo, J. Cao, A. W. Winkler, K. Kitani, and W. Xu. Perpetual humanoid control for real-time simulated avatars. InInternational Conference on Computer Vision (ICCV), 2023. 12

  64. [64]

    L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco- manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025

  65. [65]

    W. Zeng, S. Lu, K. Yin, X. Niu, M. Dai, J. Wang, and J. Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

  66. [66]

    M. Yuan, T. Yu, W. Ge, X. Yao, D. Li, H. Wang, J. Chen, B. Li, W. Zhang, W. Zeng, et al. A sur- vey of behavior foundation model: Next-generation whole-body control system of humanoid robots.IEEE transactions on pattern analysis and machine intelligence, 2025

  67. [67]

    Tirinzoni, A

    A. Tirinzoni, A. Touati, J. Farebrother, M. Guzek, A. Kanervisto, Y . Xu, A. Lazaric, and M. Pirotta. Zero-shot whole-body humanoid control via behavioral foundation models.arXiv preprint arXiv:2504.11054, 2025

  68. [68]

    Y . Li, Z. Luo, T. Zhang, C. Dai, A. Kanervisto, A. Tirinzoni, H. Weng, K. Kitani, M. Guzek, A. Touati, et al. Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning.arXiv preprint arXiv:2511.04131, 2025

  69. [69]

    Z. Tao, Z. Su, P. Liu, J. Sun, W. Que, J. Ma, J. Yu, J. Cao, P. Sun, H. Liang, et al. Hera- cles: Bridging precise tracking and generative synthesis for general humanoid control.arXiv preprint arXiv:2603.27756, 2026

  70. [70]

    Loper, N

    M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi- person linear model.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 34(6):248:1– 248:16, Oct. 2015

  71. [71]

    Pavlakos, V

    G. Pavlakos, V . Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3d hands, face, and body from a single image. InProceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019

  72. [72]

    Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019

  73. [73]

    J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  74. [74]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

  75. [75]

    F. Liu, S. Zhang, X. Wang, Y . Wei, H. Qiu, Y . Zhao, Y . Zhang, Q. Ye, and F. Wan. Timestep embedding tells: It’s time to cache for video diffusion model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7353–7363, 2025

  76. [76]

    motion atomic action

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 13 A Extended Related Work Behavior Foundation Models for Humanoid Robots.Going beyond isolated skills, recent works have started to explore systems that capture broad, reusable behavioral knowledge for humanoid robots, often referred to asBehavior Fo...

  77. [77]

    [Action Requirements]

    action [Segmentation Criteria] You should decide whether to split mainly based on: - whether the motion style changes - whether a finer motion style / type can be judged to have changed - whether the main action changes If the motion remains continuous and no obvious change occurs, do not over-segment. [Action Requirements]

  78. [78]

    action must be written in English

  79. [79]

    action should summarize the concrete motion performed in this segment

  80. [80]

    action should describe the motion itself as much as possible, and should not write overly abstract content

Showing first 80 references.