arxiv: 2605.13729 · v1 · pith:BMFR3ZQInew · submitted 2026-05-13 · 💻 cs.CV · cs.AI

Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

Pith reviewed 2026-05-14 19:53 UTC · model grok-4.3

keywords human motion generation · trajectory control · diffusion model · text-to-motion · motion inpainting · multimodal conditions · selective inpainting

The pith

CMC coordinates text and trajectory conditions via two-stage diffusion to generate accurate human motions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CMC, a decoupled two-stage framework for trajectory-controlled human motion generation. In the first stage, a diffusion model generates a simplified representation of the controlled joints under trajectory guidance. In the second stage, a text-conditioned diffusion inpainting model completes the full-body motion, treating the first-stage output as partial observations. To mitigate overfitting caused by limited inpainting training data, the inpainting model is trained with the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks. The design targets two problems in existing methods, namely conflicts between text and trajectory conditions and inconsistencies among redundant motion representations, and the paper reports better control accuracy and motion quality on HumanML3D and KIT.

Core claim

By separating trajectory control into a simplified joint generation stage and using the output as partial observations for text-guided full motion inpainting, CMC resolves condition conflicts and representation inconsistencies, achieving state-of-the-art results in both trajectory following and motion realism.

What carries the argument

A divide-and-conquer cascade: trajectory-guided diffusion for the controlled joints, followed by text-conditioned diffusion inpainting of the full body, with the Selective Inpainting Mechanism (SIM) applied during training to mitigate overfitting.
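The cascade can be pictured with a toy sampler. The sketch below is an illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for a trained network, the simplified representation is assumed to be pelvis plus one controlled joint (six values per frame), and partial observations are imposed by overwriting known entries after every denoising step, a common replacement-style inpainting scheme that the paper may or may not use.

```python
import random

def toy_denoiser(x, t, cond=None):
    # Hypothetical stand-in for a trained denoising network.
    return [[0.9 * v for v in frame] for frame in x]

def sample_with_observations(denoiser, observed, mask, steps, rng, cond=None):
    # Start from Gaussian noise with the same shape as the observations.
    x = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in observed]
    for t in reversed(range(steps)):
        x = denoiser(x, t, cond)
        # Re-impose known entries so only the unknown ones are generated.
        x = [[o if m else v for v, o, m in zip(xf, of, mf)]
             for xf, of, mf in zip(x, observed, mask)]
    return x

def two_stage_sketch(traj, text, n_features=12, steps=10, seed=0):
    # traj: one entry per frame, either an (x, y, z) target or None.
    rng = random.Random(seed)

    # Stage 1: simplified representation, assumed here to be
    # [pelvis xyz, controlled-joint xyz] per frame.
    obs1 = [[0.0] * 3 + (list(p) if p is not None else [0.0] * 3) for p in traj]
    mask1 = [[False] * 3 + [p is not None] * 3 for p in traj]
    simplified = sample_with_observations(toy_denoiser, obs1, mask1, steps, rng)

    # Stage 2: the stage-1 output is the partial observation of the full-body
    # motion (assumed to occupy the first six feature columns).
    pad = n_features - 6
    obs2 = [frame + [0.0] * pad for frame in simplified]
    mask2 = [[True] * 6 + [False] * pad for _ in simplified]
    full = sample_with_observations(toy_denoiser, obs2, mask2, steps, rng, cond=text)
    return simplified, full
```

In this toy setting the text condition is passed only to the second stage, so the trajectory constraint and the text prompt never compete inside a single denoising loop, which is the separation the review attributes to CMC.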

Load-bearing premise

The simplified controlled-joint representation supplies sufficient partial observations for generating consistent full-body motions without artifacts.

What would settle it

Observing frequent motion artifacts or trajectory deviations in the generated full-body motions when the first-stage output is used as input would disprove the effectiveness of the decoupling strategy.
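One way to make this check concrete is to measure deviation only at the frames that carry a trajectory target. The sketch below is an assumption-laden illustration rather than the paper's evaluation protocol: the function name, the input layout, and the failure threshold are all hypothetical.

```python
import math

def control_error(generated, target, threshold=0.5):
    """Mean distance and failure rate over frames that carry a target.

    generated: list of (x, y, z) positions of the controlled joint per frame.
    target: list of (x, y, z) targets, or None where the frame is unconstrained.
    threshold: distance (same unit as the inputs) above which a frame counts
        as a failure; 0.5 is an arbitrary illustrative value.
    """
    dists = [math.dist(g, t) for g, t in zip(generated, target) if t is not None]
    if not dists:
        raise ValueError("no constrained frames")
    mean_err = sum(dists) / len(dists)
    fail_rate = sum(d > threshold for d in dists) / len(dists)
    return mean_err, fail_rate
```

Running this once on first-stage outputs and once on the final full-body motion would show whether the second stage preserves or degrades the trajectory following achieved by the first.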

Figures

Figures reproduced from arXiv: 2605.13729 by Changxing Ding, Deli Cai, Haoyang Ma.

Figure 1: Comparison of frameworks between our approach and two mainstream …
Figure 2: We propose to Coordinate Multiple Conditions (CMC) for trajectory-controlled human motion generation. We visualize examples using different …
Figure 4: Comparisons of the control error for the redundant and simplified …
Figure 5: Overview of our CMC. It consists of two stages: Trajectory Control and Motion Completion. In the Trajectory Control stage, we utilize textual descriptions and spatial trajectories of the controlled joints to predict the trajectories of both the pelvis and the controlled joints within a simplified representation space. Subsequently, the Motion Completion stage takes these trajectories as partial observation…
Figure 6: The workflow of SIM to train the diffusion inpainting model. SIM …
Figure 7: Pseudo-code of our SIM implementation in Python.
Figure 8: Qualitative comparisons between our method, Omnicontrol [ …
Figure 9: Two visual comparisons (a) and (b) to qualitatively prove the existence …
Figure 10: Comparisons on FID scores with and without SIM. Each figure plots …
Figure 11: Average control error across denoising steps. Darker-colored and …
Figure 12: Statistical mean and standard deviation of the control error across …
Figure 13: Errors across all denoising steps. The solid lines denote the mean …
Figure 14: Qualitative visualizations of motions conditioned on text only.
Figure 15: Qualitative visualizations of motions conditioned on both text and …
Original abstract

Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that effectively coordinates text and trajectory conditions through a divide-and-conquer strategy. CMC follows a divide-and-conquer paradigm, comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints under trajectory guidance, based on the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on HumanML3D and KIT datasets demonstrate that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.
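The alternation that SIM performs during training can be pictured as a per-sample choice of observation mask. The sketch below is a minimal illustration and not the pseudo-code from the paper's Figure 7: the probability `p_inpaint`, the function name, and the column-based mask layout are assumptions, since the abstract does not say how the alternation is scheduled.

```python
import random

def select_training_mask(n_frames, n_features, observed_cols, p_inpaint, rng):
    """Return (task, mask) for one training sample."""
    if rng.random() < p_inpaint:
        # Inpainting task: the chosen feature columns are given as observations.
        cols = set(observed_cols)
        mask = [[c in cols for c in range(n_features)] for _ in range(n_frames)]
        return "inpainting", mask
    # Text-to-motion task: nothing is observed, everything is generated.
    mask = [[False] * n_features for _ in range(n_frames)]
    return "text_to_motion", mask
```

With an empty mask the model sees the ordinary text-to-motion objective, so a single network is exposed to both tasks rather than only to inpainting samples.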

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes CMC, a two-stage decoupled diffusion framework for trajectory-controlled human motion generation. Stage 1 trains a diffusion model to produce a simplified representation of controlled joints conditioned on input trajectories. Stage 2 uses a text-conditioned inpainting diffusion model that treats the Stage-1 output as partial observations to synthesize full-body motion, with the Selective Inpainting Mechanism (SIM) alternating between text-to-motion and inpainting tasks to reduce overfitting. Experiments on HumanML3D and KIT are reported to show state-of-the-art control accuracy and motion quality.

Significance. If the two-stage decomposition and SIM prove robust, the work would offer a practical way to resolve conflicts between textual and spatial conditions in motion synthesis while avoiding inconsistencies from redundant representations. The divide-and-conquer design and selective training strategy could generalize to other multimodal conditional generation problems.

major comments (3)
  1. [Experiments] Experiments section: The SOTA claims on control accuracy and motion quality rest on the assumption that the first-stage simplified controlled-joint representation supplies sufficient partial observations for artifact-free inpainting, yet no ablation replaces Stage-1 outputs with ground-truth partial observations or compares against an end-to-end joint model. This directly tests the core divide-and-conquer premise but is absent.
  2. [Method] Method description of SIM: The mechanism is introduced to mitigate overfitting from limited inpainting data, but the paper provides no quantitative comparison of training dynamics or final metrics with and without SIM, leaving the contribution to stability unverified.
  3. [Experiments] Results tables (HumanML3D and KIT): No error bars, standard deviations across runs, or statistical significance tests are reported for the metrics against baselines, weakening the strength of the SOTA conclusion.
minor comments (1)
  1. [Abstract] The abstract and introduction could more explicitly state the dimensionality or joint subset used in the simplified representation to clarify what information is retained versus discarded.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below, indicating planned revisions where appropriate to strengthen the validation of our divide-and-conquer approach.

  1. Referee: [Experiments] Experiments section: The SOTA claims on control accuracy and motion quality rest on the assumption that the first-stage simplified controlled-joint representation supplies sufficient partial observations for artifact-free inpainting, yet no ablation replaces Stage-1 outputs with ground-truth partial observations or compares against an end-to-end joint model. This directly tests the core divide-and-conquer premise but is absent.

    Authors: We agree that directly testing the core premise with ground-truth partial observations from Stage 1 and a comparison to an end-to-end joint model would provide stronger evidence. We will add these ablations in the revised manuscript, reporting control accuracy and motion quality metrics for both settings to quantify the benefit of the decoupled stages. revision: yes

  2. Referee: [Method] Method description of SIM: The mechanism is introduced to mitigate overfitting from limited inpainting data, but the paper provides no quantitative comparison of training dynamics or final metrics with and without SIM, leaving the contribution to stability unverified.

    Authors: We acknowledge the need for explicit verification of SIM's contribution. In the revision we will include training loss curves and final performance metrics on HumanML3D and KIT comparing the full model against the variant trained without SIM, thereby quantifying its effect on stability and final results. revision: yes

  3. Referee: [Experiments] Results tables (HumanML3D and KIT): No error bars, standard deviations across runs, or statistical significance tests are reported for the metrics against baselines, weakening the strength of the SOTA conclusion.

    Authors: We will update the results tables to report standard deviations computed over multiple independent runs with different random seeds. Where feasible we will also add statistical significance tests (e.g., paired t-tests) against the strongest baselines to better substantiate the SOTA claims. revision: yes
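The statistics promised in the third response are simple to compute once per-seed results exist. The sketch below uses only the Python standard library and is an illustration under assumptions: the function names are hypothetical, and it presumes that the proposed method and the baseline are evaluated on the same seeds so that results can be paired.

```python
import math
import statistics

def summarize_runs(values):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(ours, baseline):
    """t statistic and degrees of freedom for paired per-seed results."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

The p-value follows from the Student t distribution with the returned degrees of freedom; `scipy.stats.ttest_rel` performs the whole calculation in one call if SciPy is available.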

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper introduces an explicit new two-stage architecture (first-stage trajectory-guided diffusion on simplified controlled-joint representation, second-stage text-conditioned inpainting) plus the Selective Inpainting Mechanism, with performance measured directly against external benchmarks on HumanML3D and KIT. No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the divide-and-conquer strategy and empirical results remain independent of the inputs they are evaluated on.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework relies on the standard assumption that diffusion models can separately model partial joint trajectories and full-body completions; no new physical entities or ungrounded mathematical axioms are introduced beyond established generative modeling practices.

axioms (1)
  • domain assumption: Diffusion models can generate realistic human motions when conditioned on text or partial observations
    Invoked implicitly as the basis for both stages; standard in prior motion diffusion literature.

pith-pipeline@v0.9.0 · 5528 in / 1161 out tokens · 46895 ms · 2026-05-14T19:53:12.336054+00:00 · methodology
