OMG: Omni-Modal Motion Generation for Generalist Humanoid Control

Dongming Qiao; Guanqi He; Hang Zhao; Kun-Ying Lee; Shaoting Zhu; Siqiao Huang; Yitang Li; Zhenyu Wang

arxiv: 2606.10340 · v1 · pith:CA6ELMSOnew · submitted 2026-06-09 · 💻 cs.RO

Siqiao Huang, Kun-Ying Lee, Dongming Qiao, Guanqi He, Zhenyu Wang, Yitang Li, Shaoting Zhu, Hang Zhao

Pith reviewed 2026-06-27 13:12 UTC · model grok-4.3

classification 💻 cs.RO

keywords: humanoid control, motion generation, diffusion models, multi-modal conditioning, whole-body control, omni-modal, data curation, foundation models

The pith

A diffusion model conditions on language, audio and reference motions to drive generalist whole-body humanoid control from curated data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that general-purpose humanoid control needs a scalable multi-modal reasoning module placed on top of a reactive motion tracker. It introduces OMG, which uses a data curation pipeline together with a diffusion generator that accepts language, audio and human motion inputs. Experiments are presented to show state-of-the-art tracking performance, clear scaling with model size, and rapid adaptation when new modalities or data distributions appear.

Core claim

OMG consists of a meticulous data curation, filtering and labeling pipeline plus a diffusion-based motion generation backbone that conditions on language, audio and human reference motions. The architecture places this generator as a reasoning brain above a reactive cerebellum. Experiments demonstrate that the resulting controller achieves state-of-the-art whole-body performance, exhibits model scaling, and adapts efficiently to new distributions and modalities.
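
The brain/cerebellum split can be pictured as a receding-horizon loop: the generator plans a short motion window, the tracker executes the first part of it, and the generator replans from the updated state. The sketch below is only an illustration under stated assumptions. `generate_window` and `track_frame` are hypothetical stand-ins for the diffusion generator and the pretrained tracker, and the 30 Hz frame rate is assumed. The 2 s window and 0.5 s execution horizon are the values quoted in the text accompanying Figure 12.

```python
from typing import Callable, List

Frame = List[float]  # one robot pose (placeholder type, e.g. joint targets)


def run_hierarchy(
    generate_window: Callable[[Frame, str, int], List[Frame]],  # hypothetical "brain"
    track_frame: Callable[[Frame, Frame], Frame],               # hypothetical "cerebellum"
    state: Frame,
    command: str,
    budget_frames: int,
    fps: int = 30,           # assumed frame rate, not stated in the source
    window_s: float = 2.0,   # horizon planned at each replan
    execute_s: float = 0.5,  # portion executed before replanning
) -> List[Frame]:
    """Plan a window, execute its first part with the tracker, then replan."""
    window_frames = int(window_s * fps)
    exec_frames = max(1, int(execute_s * fps))  # always make progress
    visited: List[Frame] = []
    while len(visited) < budget_frames:
        plan = generate_window(state, command, window_frames)
        assert len(plan) == window_frames, "generator must return a full window"
        for target in plan[:exec_frames]:
            if len(visited) >= budget_frames:
                break
            state = track_frame(state, target)  # low-level tracking step
            visited.append(state)
    return visited
```

Only the executed prefix of each plan reaches the tracker. The rest of the window is discarded at the next replan, which is what lets a changed command take effect within one execution horizon.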

What carries the argument

Diffusion-based motion generation backbone conditioned on language, audio and human reference motions, supported by a data curation pipeline.
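
The Figure 3 caption states that frame-aligned signals (audio, human reference motions) enter through FiLM adapters and that new modalities are attached through zero-initialized adapters. A minimal sketch of why zero initialization is non-invasive follows. It assumes the common residual form `h * (1 + gamma) + beta`, which may differ from the paper's exact parameterization, and every class and function name here is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def linear(x: List[float], w: Matrix, b: List[float]) -> List[float]:
    """Compute W x + b for a single frame."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


class ZeroInitFiLM:
    """Per-frame FiLM adapter: h_t -> h_t * (1 + gamma_t) + beta_t.

    gamma_t and beta_t are linear functions of the frame-aligned condition c_t.
    All weights start at zero, so the adapter is the identity until trained.
    """

    def __init__(self, cond_dim: int, hidden_dim: int) -> None:
        self.w_gamma = [[0.0] * cond_dim for _ in range(hidden_dim)]
        self.b_gamma = [0.0] * hidden_dim
        self.w_beta = [[0.0] * cond_dim for _ in range(hidden_dim)]
        self.b_beta = [0.0] * hidden_dim

    def __call__(self, hidden: Matrix, cond: Matrix) -> Matrix:
        out = []
        for h_t, c_t in zip(hidden, cond):  # one feature/condition pair per frame
            gamma = linear(c_t, self.w_gamma, self.b_gamma)
            beta = linear(c_t, self.w_beta, self.b_beta)
            out.append([h * (1.0 + g) + b for h, g, b in zip(h_t, gamma, beta)])
        return out
```

At initialization the adapter returns its input unchanged, so attaching it to a pretrained backbone does not perturb existing behavior. The new condition only starts to matter once fine-tuning moves the weights away from zero.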

If this is right

The controller reaches state-of-the-art performance on whole-body motion tasks.
Performance improves as the diffusion backbone grows, at least across the three OMG-DiT sizes the paper pretrains.
The same model adapts quickly to new data distributions and input modalities.
The approach constitutes a step toward foundation models for humanoid robots.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hierarchical separation of reasoning and tracking layers could be tested on non-humanoid robots that require multi-modal commands.
The data curation steps may transfer to other motion-generation domains where high-quality paired examples are scarce.
If scaling continues, the model could eventually accept longer-horizon language instructions without additional reward engineering.

Load-bearing premise

A scalable multi-modal reasoning module placed on a reactive motion tracker is enough to reach general-purpose humanoid control.

What would settle it

A controlled test in which scaling the model size yields no performance gain, or in which the model fails to adapt after fine-tuning on a newly added input modality.
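
One way to make the scaling half of such a test concrete is to fit a line to log(error) against log(parameter count) and look at the sign of the slope. The helper below is a hedged sketch, not the paper's evaluation code. It assumes a lower-is-better metric, and `loglog_slope` is a hypothetical name.

```python
import math
from typing import Sequence


def loglog_slope(sizes: Sequence[float], errors: Sequence[float]) -> float:
    """Least-squares slope of log(error) against log(model size).

    A clearly negative slope means error falls as the model grows.
    A slope near zero means no measurable gain from scale.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

With only three model sizes, as in the paper's scaling study, the fitted slope is a coarse indicator. It should be read alongside run-to-run variance rather than treated as a scaling law.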

Figures

Figures reproduced from arXiv: 2606.10340 by Dongming Qiao, Guanqi He, Hang Zhao, Kun-Ying Lee, Shaoting Zhu, Siqiao Huang, Yitang Li, Zhenyu Wang.

**Figure 1.** Overview. OMG decomposes humanoid whole-body control into a scalable motion generation brain and a reactive motion tracking cerebellum. Built on OMG-Data, a curation of 1000+ hours of omni-modal humanoid motion data, OMG-DiT maps language, audio, human reference, and their compositions into robot-executable future motions, which are deployed on a Unitree G1 in real time, paired with a pretrained motion track…

**Figure 2.** Dataset Statistics of OMG-Data. We curate a large-scale omni-modal humanoid motion corpus by aggregating heterogeneous datasets and unifying them into the Unitree G1 motion space. Left: processed data statistics across conditioning modalities and source datasets. Right: representative conditioning modalities, including language, audio, and human reference motions.

**Figure 3.** OMG-DiT learns a shared diffusion backbone while enabling conditioning with modality-specific encoders. History motion and language are injected as global context tokens via cross-attention, whereas frame-aligned signals (i.e., audio and human reference motions) are injected through FiLM [45] adapters. New modalities are attached non-invasively through zero-initialized adapters, and multiple conditions can…

**Figure 4.** Real-World Omni-Modal Control. OMG generates diverse Unitree G1 motions across various conditioning modalities in real time, executable in the real world.

5.1 Experiment Setup. Evaluation Protocol. We evaluate OMG in two regimes: pretrained omni-modal motion generation and downstream finetuning. For pretraining, we consider language-, audio-, and human-reference-conditioned motion generation, where the mod…

**Figure 5.** Scaling OMG-DiT.

Scaling Behavior. Finally, we ask whether motion generation is a scalable objective: do larger diffusion backbones yield better humanoid motion quality, given the same data and evaluation protocol? To answer this, we pretrain three OMG-DiT variants with increasing numbers of parameters.

**Figure 6.** Canonicalization.

**Figure 7.** Qualitative visualization. Unitree G1 execution sequences produced by OMG under text, audio, human-reference, and composed text-audio conditions. Frames are uniformly sampled within each sequence, and embedded prompts are preserved.

**Figure 8.** Interactive Control: Composition in the Temporal Horizon.

E.3 Interactive Control with Temporal Composition. We showcase our model’s capability for real-time interactive control. We feed the model time-varying commands from different modalities over the temporal horizon.

**Figure 9.** Human-reference CFG sweep. The translucent reference overlay shows the target motion, and the opaque robot shows the generated motion.

**Figure 10.** Text-audio CFG sweep. Columns show snapshots over time and rows vary the text guidance scale while keeping the audio condition fixed. Larger text guidance improves adherence to the language instruction in the composed audio-language setting.

**Figure 11.** Perceptive Locomotion setup. Panels: (a) Dataset, (b) Egocentric RGB input, (c) Third-person layout. The dataset contains 300 Kimodo-generated demonstrations per target color. The policy observes only low-resolution egocentric RGB and a discrete color command. The goal is to follow the command and locomote to the corresponding color.

… target-hold segment so that the reference motion stops after reaching the target. To train the diffusion model, we sample 2 s windows using th…

**Figure 12.** Third-person timelapse of a successful rollout by the pretrained checkpoint. The robot enters the commanded blue target.

Evaluation and Results. At test time, we use online replanning from a canonical standing G1 state. Each replan samples a 2 s motion window, executes the first 0.5 s, and then replans from the updated state; the rollout budget is 210 source frames. We evaluate 30 held-out validation roll…
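
The CFG sweeps in Figures 9 and 10 vary a guidance scale per condition, for example raising the text scale while the audio condition stays fixed. One common way to compose several conditions under classifier-free guidance is to add a separately scaled correction for each one. The sketch below assumes that additive rule, which the captions do not confirm as OMG's exact formulation, and `compose_guidance` is a hypothetical name.

```python
from typing import Dict, List

Vector = List[float]


def compose_guidance(
    uncond: Vector,
    cond: Dict[str, Vector],
    scales: Dict[str, float],
) -> Vector:
    """Return uncond + sum over modalities m of scales[m] * (cond[m] - uncond).

    uncond: denoiser prediction with every condition dropped.
    cond:   denoiser prediction with only modality m enabled, keyed by name.
    scales: guidance scale per modality.
    """
    guided = list(uncond)
    for name, pred in cond.items():
        w = scales.get(name, 0.0)  # a missing scale switches that condition off
        for i, (p, u) in enumerate(zip(pred, uncond)):
            guided[i] += w * (p - u)
    return guided
```

Under this rule, sweeping `scales["text"]` while holding `scales["audio"]` fixed changes only the text correction term, which mirrors the row-wise sweep described for Figure 10.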

Original abstract

Humanoid whole-body control has made significant progress in recent years, yet existing approaches remain limited to few-skill policies with heavy reward engineering, or motion trackers that are difficult to extend to new input modalities. We argue that the key to general-purpose humanoid control is to build a scalable brain, a module capable of reasoning with diverse conditioning modalities, atop a reactive motion tracking cerebellum, mirroring the hierarchical structure of biological motor systems. Two challenges arise in realizing this vision: acquiring a vast amount of high-quality data to achieve general purpose control, and equipping the generator with the capability to condition on compositional, extensible multi-modal inputs. We present OMG, which addresses these challenges with a meticulous data curation, filtering and labeling pipeline, as well as a diffusion-based motion generation backbone that conditions on language, audio, and human reference motions. Extensive experiments validate OMG as an omni-modal whole-body controller exhibiting state-of-the-art performance, model scaling behavior and efficient adaptation to new distributions and modalities, marking a concrete step toward foundation models for humanoid robots.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OMG adds a diffusion generator conditioned on language, audio and motion references for humanoid whole-body control, backed by a data pipeline, with claims of SOTA performance and scaling.

The core of this paper is a diffusion-based motion generator that takes language, audio, and human reference motions as inputs to produce whole-body humanoid trajectories. It sits on top of a reactive tracker and is trained with a curated dataset that the authors filtered and labeled specifically for this use. The claim is that this setup overcomes the narrow scope of current few-skill policies and the rigidity of pure trackers.

What is actually new is the joint conditioning on those three modalities inside one diffusion backbone for humanoids, together with the practical data pipeline that makes the training data large enough. The experiments reportedly show scaling with model size and reasonably fast adaptation when new distributions or modalities are added. If those results hold with proper baselines and ablations, the work gives a concrete example of how to move toward more general controllers.

The soft spots are in the strength of the evidence. The abstract and summary lean on “extensive experiments” and “state-of-the-art performance,” but the details of the baselines, metrics, and statistical significance are not visible at the level needed to judge how large the gains really are. The hierarchical brain-cerebellum framing is a standard motif in motor control, so the advance rests on whether the diffusion-plus-multi-modal implementation delivers measurable improvements over existing trackers and policies. Reproducibility of the data curation steps will also matter for anyone trying to build on it.

This is for people working on humanoid locomotion and multi-modal motion models. A reader who needs ideas for conditioning mechanisms or data handling in robotics would get direct value. The paper has enough of a system and reported outcomes to deserve peer review rather than a desk reject.

Referee Report

1 major / 0 minor

Summary. The paper introduces OMG, a diffusion-based omni-modal motion generator for humanoid whole-body control. It builds a scalable multi-modal reasoning module (brain) atop a reactive motion tracker (cerebellum), enabled by a data curation, filtering, and labeling pipeline that supports conditioning on language, audio, and human reference motions. The authors claim that extensive experiments demonstrate state-of-the-art performance, model scaling behavior, and efficient adaptation to new distributions and modalities, advancing toward foundation models for humanoid robots.

Significance. If the experimental results hold, the work would constitute a meaningful step toward generalist humanoid controllers by addressing the extensibility limitations of few-skill policies and non-extendable trackers through hierarchical multi-modal design and curated data pipelines.

major comments (1)

[Abstract] The central claims of state-of-the-art performance, model scaling behavior, and efficient adaptation to new modalities rest entirely on unspecified experiments; no quantitative metrics, baseline comparisons, dataset sizes, error bars, or ablation results are provided to support these assertions, which are load-bearing for the paper's contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the opportunity to clarify the presentation of our results. The primary concern raised is addressed point-by-point below.

Referee: [Abstract] The central claims of state-of-the-art performance, model scaling behavior, and efficient adaptation to new modalities rest entirely on unspecified experiments; no quantitative metrics, baseline comparisons, dataset sizes, error bars, or ablation results are provided to support these assertions, which are load-bearing for the paper's contribution.

Authors: We agree that the abstract, as currently written, is a high-level summary that does not enumerate specific quantitative results. The full manuscript contains these details in the Experiments section, including direct comparisons against baselines, dataset statistics from the curation pipeline, statistical error bars across runs, and ablations isolating the contributions of the multi-modal conditioning and data pipeline. To make the abstract self-contained and better support the load-bearing claims, we will revise it to include concise references to key quantitative outcomes (e.g., performance deltas and data scale) while preserving its brevity.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical system (OMG) consisting of a data curation pipeline and a diffusion-based multi-modal motion generator, with performance claims supported by experiments on SOTA results, scaling, and adaptation. No derivation chain, equations, or first-principles reductions appear in the provided abstract or description. Claims are framed as outcomes of training and evaluation rather than predictions forced by fitted parameters or self-citations. The hierarchical brain-cerebellum motif is presented as a design choice mirroring biology, not as a mathematically derived necessity. No load-bearing steps reduce to self-definition, renaming, or imported uniqueness theorems. The work is self-contained as an engineering contribution validated externally via benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training details, or explicit assumptions, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5732 in / 1143 out tokens · 46340 ms · 2026-06-27T13:12:21.669551+00:00 · methodology

Reference graph

Works this paper leans on

95 extracted references · 2 canonical work pages

[1]

M. Chen, K. Wang, B. Zhang, X. Ma, Z. Yang, Y . Ren, Q. Huang, Z. Zhu, Y . Wang, and Z. Su. Holomotion-1 technical report, 2026. URLhttps://arxiv.org/abs/2605.15336

Pith/arXiv arXiv 2026
[2]

Zhuang, S

Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning.arXiv preprint arXiv:2406.10759, 2024

arXiv 2024
[3]

Radosavovic, S

I. Radosavovic, S. Kamat, T. Darrell, and J. Malik. Learning humanoid locomotion over chal- lenging terrain.arXiv preprint arXiv:2410.03654, 2024

arXiv 2024
[4]

Zhang, Y

Y . Zhang, Y . Yuan, P. Gurunath, I. Gupta, S. Omidshafiei, A.-a. Agha-mohammadi, M. Vazquez-Chanlatte, L. Pedersen, T. He, and G. Shi. Falcon: Learning force-adaptive hu- manoid loco-manipulation.arXiv preprint arXiv:2505.06776, 2025

arXiv 2025
[5]

Y . Li, Y . Zhang, W. Xiao, C. Pan, H. Weng, G. He, T. He, and G. Shi. Hold my beer: Learning gentle humanoid locomotion and end-effector stabilization control.arXiv preprint arXiv:2505.24198, 2025

arXiv 2025
[6]

Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025
[7]

Zhang, J

Z. Zhang, J. Guo, C. Chen, J. Wang, C. Lin, Y . Lian, H. Xue, Z. Wang, M. Liu, J. Lyu, et al. Track any motions under any disturbances.arXiv preprint arXiv:2509.13833, 2025

arXiv 2025
[8]

Q. Liao, T. E. Truong, X. Huang, Y . Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

Pith/arXiv arXiv 2025
[9]

Serifi, R

A. Serifi, R. Grandia, E. Knoop, M. Gross, and M. Bächer. Robot motion diffusion model: Motion generation for robotic characters. InSIGGRAPH asia 2024 conference papers, pages 1–9, 2024

2024
[10]

Tevet, S

G. Tevet, S. Raab, S. Cohan, D. Reda, Z. Luo, X. B. Peng, A. Bermano, and M. Van de Panne. Closd: Closing the loop between simulation and diffusion for multi-task character control. InInternational Conference on Learning Representations, volume 2025, pages 46506–46520, 2025

2025
[11]

M. Xu, Y . Shi, K. Yin, and X. B. Peng. Parc: Physics-based augmentation with reinforcement learning for character controllers. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

2025
[12]

Zhang, K

Z. Zhang, K. Wen, M. Xu, J. He, C. Li, T. Miki, C. Schwarke, C. Zhang, X. B. Peng, and M. Hutter. Learning whole-body humanoid locomotion via motion generation and motion tracking.arXiv preprint arXiv:2604.17335, 2026

Pith/arXiv arXiv 2026
[13]

W. Xie, J. Zheng, J. Han, J. Shi, W. Zhang, C. Bai, and X. Li. Textop: Real-time interactive text-driven humanoid robot motion generation and control.arXiv preprint arXiv:2602.07439, 2026

arXiv 2026
[14]

Schuhmann, R

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35: 25278–25294, 2022. 9

2022
[15]

Penedo, H

G. Penedo, H. Kydlí ˇcek, A. Lozhkov, M. Mitchell, C. Raffel, L. V on Werra, T. Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

2024
[16]

T. He, Z. Luo, W. Xiao, C. Zhang, K. Kitani, C. Liu, and G. Shi. Learning human-to-humanoid real-time whole-body teleoperation.arXiv preprint arXiv:2403.04436, 2024

arXiv 2024
[17]

T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. Kitani, C. Liu, and G. Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024

arXiv 2024
[18]

Y . Ze, Z. Chen, J. P. Araújo, Z. ang Cao, X. B. Peng, J. Wu, and C. K. Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025

arXiv 2025
[19]

Tevet, S

G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-Or, and A. H. Bermano. Human motion diffusion model.arXiv preprint arXiv:2209.14916, 2022

Pith/arXiv arXiv 2022
[20]

Zhang, Z

M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu. Motiondiffuse: Text-driven human motion generation with diffusion model.arXiv preprint arXiv:2208.15001, 2022

arXiv 2022
[21]

L. Yang, Z. Zhang, Y . Song, S. Hong, R. Xu, Y . Zhao, W. Zhang, B. Cui, and M.-H. Yang. Diffusion models: A comprehensive survey of methods and applications.ACM computing surveys, 56(4):1–39, 2023

2023
[22]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[23]

H. Liu, Z. Zhu, G. Becherini, Y . Peng, M. Su, Y . Zhou, X. Zhe, N. Iwamoto, B. Zheng, and M. J. Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling, 2024. URLhttps://arxiv.org/abs/2401.00374

arXiv 2024
[24]

Zhang, Z

J. Zhang, Z. Kang, L. Liu, J. Chang, Q. Tian, F. Gao, and Y . Wang. Opendance: Multimodal controllable 3d dance generation with large-scale internet data, 2025. URLhttps://arxiv. org/abs/2506.07565

arXiv 2025
[25]

J. Li, J. Cao, H. Zhang, D. Rempe, J. Kautz, U. Iqbal, and Y . Yuan. Genmo: A generalist model for human motion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[26]

Rempe, M

D. Rempe, M. Petrovich, Y . Yuan, H. Zhang, X. B. Peng, Y . Jiang, T. Wang, U. Iqbal, D. Minor, M. de Ruyter, J. Li, C. Tessler, E. Lim, E. Jeong, S. Wu, E. Hassani, M. Huang, J.-B. Yu, C. Chung, L. Song, O. Dionne, J. Kautz, S. Yuen, and S. Fidler. Kimodo: Scaling controllable human motion generation.arXiv:2603.15546, 2026

arXiv 2026
[27]

K. Fan, S. Lu, M. Dai, R. Yu, L. Xiao, Z. Dou, J. Dong, L. Ma, and J. Wang. Go to zero: Towards zero-shot motion generation with million-scale data, 2025. URLhttps://arxiv. org/abs/2507.07095

arXiv 2025
[28]

Mahmood, N

N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. Amass: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019

2019
[29]

F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and C. Pal. Robust motion in-betweening. arXiv:2102.04942, 39(4), 2020

arXiv 2020
[30]

J. Li, J. Wu, and C. K. Liu. Object motion guided human motion synthesis.ACM Trans. Graph., 42(6), 2023. 10

2023
[31]

Mason, S

I. Mason, S. Starke, and T. Komura. Real-time style modelling of human locomotion via feature-wise transformations and local motion phases.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1–18, 2022

2022
[32]

C. Guo, I. Hwang, J. Wang, and B. Zhou. Snapmogen: Human motion generation from ex- pressive texts, 2025. URLhttps://arxiv.org/abs/2507.09122

arXiv 2025
[33]

R. Li, J. Zhao, Y . Zhang, M. Su, Z. Ren, H. Zhang, Y . Tang, and X. Li. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10234–10243, 2023

2023
[34]

R. Li, S. Yang, D. A. Ross, and A. Kanazawa. Learn to dance with aist++: Music conditioned 3d dance generation, 2021

2021
[35]

Zhang, J

Y . Zhang, J. Lin, A. Zeng, G. Wu, S. Lu, Y . Fu, Y . Cai, R. Zhang, H. Wang, and L. Zhang. Motion-x++: A large-scale multimodal 3d whole-body human motion dataset.arXiv preprint arXiv:2501.05098, 2025

arXiv 2025
[36]

J. Lin, A. Zeng, S. Lu, Y . Cai, R. Zhang, H. Wang, and L. Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset.Advances in Neural Information Processing Systems, 2023

2023
[37]

B. Kim, H. I. Jeong, J. Sung, Y . Cheng, J. Lee, J. Y . Chang, S.-I. Choi, Y . Choi, S. Shin, J. Kim, and H. J. Chang. Personabooth: Personalized text-to-motion generation.arXiv preprint arXiv:2503.07390, 2025

arXiv 2025
[38]

K. Chen, Z. Tan, J. Lei, S.-H. Zhang, Y .-C. Guo, W. Zhang, and S.-M. Hu. Choreomas- ter: choreography-oriented music-driven dance synthesis.ACM Trans. Graph., 40(4), July
[39]

doi:10.1145/3450626.3459932

ISSN 0730-0301. doi:10.1145/3450626.3459932. URLhttps://doi.org/10.1145/ 3450626.3459932

work page doi:10.1145/3450626.3459932
[40]

Burkanova, P

B. Burkanova, P. J. Yazdian, C. Zhang, T. Evans, P. Tuttösí, and A. Lim. Salsa as a nonverbal embodied language – the compas3d dataset and benchmarks, 2025. URLhttps://arxiv. org/abs/2507.19684

Pith/arXiv arXiv 2025
[41]

N. Le, T. Pham, T. Do, E. Tjiputra, Q. D. Tran, and A. Nguyen. Music-driven group choreog- raphy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2023

2023
[42]

Y . Ze, J. P. Araújo, J. Wu, and C. K. Liu. Gmr: General motion retargeting, 2025. URL https://github.com/YanjieZe/GMR. GitHub repository

2025
[43]

J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025

arXiv 2025
[44]

B. Seed. Seed1. 8 model card: Towards generalized real-world agency.arXiv preprint arXiv:2603.20633, 2026

Pith/arXiv arXiv 2026
[45]

Todorov, T

E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–

2012
[46]

Todorov, T

IEEE, 2012. doi:10.1109/IROS.2012.6386109

work page doi:10.1109/iros.2012.6386109 2012
[47]

Perez, F

E. Perez, F. Strub, H. De Vries, V . Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[48]

Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023. 11

2023
[49]

Y . Du, S. Yang, P. Florence, F. Xia, A. Wahid, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenen- baum, L. Kaelbling, et al. Video language planning. InInternational Conference on Learning Representations, volume 2024, pages 31138–31155, 2024

2024
[50]

Li and K

T. Li and K. He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025

Pith/arXiv arXiv 2025
[51]

Y . Lu, S. Lu, Q. Sun, H. Zhao, Z. Jiang, X. Wang, T. Li, Z. Geng, and K. He. One-step latent-free image generation with pixel mean flows.arXiv preprint arXiv:2601.22158, 2026

Pith/arXiv arXiv 2026
[52]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

2020
[53]

Zhang, Y

J. Zhang, Y . Zhang, X. Cun, Y . Zhang, H. Zhao, H. Lu, X. Shen, and Y . Shan. Generating human motion from textual descriptions with discrete representations. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14730–14740, 2023

2023
[54]

Tseng, R

J. Tseng, R. Castellon, and K. Liu. Edge: Editable dance generation from music. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 448–458, 2023

2023
[55]

Jiang, P

J. Jiang, P. Streli, H. Qiu, A. Fender, L. Laich, P. Snape, and C. Holz. Avatarposer: Articulated full-body pose tracking from sparse motion sensing. InEuropean conference on computer vision, pages 443–460. Springer, 2022

2022
[56]

X. Yi, Y . Zhou, M. Habermann, S. Shimada, V . Golyanik, C. Theobalt, and F. Xu. Physical inertial poser (pip): Physics-aware real-time human motion tracking from sparse inertial sen- sors. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13167–13178, 2022

2022
[57]

Cho, S.-H

H. Cho, S.-H. Kim, J. Kang, and D. Koo. Safeflow: Real-time text-driven humanoid whole- body control via physics-guided rectified flow and selective safety gating.arXiv preprint arXiv:2603.23983, 2026

arXiv 2026
[58]

R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y . Hu, Y . Hu, T. Zhang, C. Wen, et al. Hu- manoid manipulation interface: Humanoid whole-body manipulation from robot-free demon- strations.arXiv preprint arXiv:2602.06643, 2026

arXiv 2026
[59]

Y . Wen, Q. Shuai, D. Kang, J. Li, C. Wen, Y . Qian, N. Jiao, C. Chen, W. Chen, Y . Wang, et al. Hy-motion 1.0: Scaling flow matching models for text-to-motion generation.arXiv preprint arXiv:2512.23464, 2025

arXiv 2025
[60]

R. Li, Y . Zhang, Y . Zhang, H. Zhang, J. Guo, Y . Zhang, Y . Liu, and X. Li. Lodge: A coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1524–1534, 2024

2024
[61]

Siyao, W

L. Siyao, W. Yu, T. Gu, C. Lin, Q. Wang, C. Qian, C. C. Loy, and Z. Liu. Bailando: 3d dance generation by actor-critic gpt with choreographic memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11050–11059, 2022

2022
[62]

Q. Zhao, K. Yang, X. Wang, S. Zhao, Y . Lu, X. Zhang, Q. Shen, X.-X. Long, and X. Cao. Make tracking easy: Neural motion retargeting for humanoid whole-body control.arXiv preprint arXiv:2603.22201, 2026

Pith/arXiv arXiv 2026
[63]

Z. Luo, J. Cao, A. W. Winkler, K. Kitani, and W. Xu. Perpetual humanoid control for real-time simulated avatars. InInternational Conference on Computer Vision (ICCV), 2023. 12

2023
[64]

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco- manipulation and scene interaction.arXiv preprint arXiv:2509.26633, 2025

Pith/arXiv arXiv 2025
[65]

W. Zeng, S. Lu, K. Yin, X. Niu, M. Dai, J. Wang, and J. Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

arXiv 2025
[66]

M. Yuan, T. Yu, W. Ge, X. Yao, D. Li, H. Wang, J. Chen, B. Li, W. Zhang, W. Zeng, et al. A sur- vey of behavior foundation model: Next-generation whole-body control system of humanoid robots.IEEE transactions on pattern analysis and machine intelligence, 2025

2025
[67]

Tirinzoni, A

A. Tirinzoni, A. Touati, J. Farebrother, M. Guzek, A. Kanervisto, Y . Xu, A. Lazaric, and M. Pirotta. Zero-shot whole-body humanoid control via behavioral foundation models.arXiv preprint arXiv:2504.11054, 2025

arXiv 2025
[68]

Y . Li, Z. Luo, T. Zhang, C. Dai, A. Kanervisto, A. Tirinzoni, H. Weng, K. Kitani, M. Guzek, A. Touati, et al. Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning.arXiv preprint arXiv:2511.04131, 2025

arXiv 2025
[69]

Z. Tao, Z. Su, P. Liu, J. Sun, W. Que, J. Ma, J. Yu, J. Cao, P. Sun, H. Liang, et al. Hera- cles: Bridging precise tracking and generative synthesis for general humanoid control.arXiv preprint arXiv:2603.27756, 2026

arXiv 2026
[70]

Loper, N

M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi- person linear model.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 34(6):248:1– 248:16, Oct. 2015

2015
[71]

Pavlakos, V

G. Pavlakos, V . Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3d hands, face, and body from a single image. InProceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019

2019
[72]

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019

2019
[73]

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024
[74]

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

Pith/arXiv arXiv 2010
[75]

F. Liu, S. Zhang, X. Wang, Y . Wei, H. Qiu, Y . Zhao, Y . Zhang, Q. Ye, and F. Wan. Timestep embedding tells: It’s time to cache for video diffusion model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7353–7363, 2025

2025
[76]

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

A Extended Related Work

Behavior Foundation Models for Humanoid Robots. Going beyond isolated skills, recent works have started to explore systems that capture broad, reusable behavioral knowledge for humanoid robots, often referred to as Behavior Fo...

[Segmentation Criteria]
You should decide whether to split mainly based on:
- whether the motion style changes
- whether a finer motion style / type can be judged to have changed
- whether the main action changes
If the motion remains continuous and no obvious change occurs, do not over-segment.

[Action Requirements]
- action must be written in English
- action should summarize the concrete motion performed in this segment
- action should describe the motion itself as much as possible, and should not write overly abstract content
