Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

arxiv: 2605.13729 · v1 · pith:BMFR3ZQInew · submitted 2026-05-13 · 💻 cs.CV · cs.AI

Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

Deli Cai , Haoyang Ma , Changxing Ding This is my paper

Pith reviewed 2026-05-14 19:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords human motion generationtrajectory controldiffusion modeltext-to-motionmotion inpaintingmultimodal conditionsselective inpainting

0 comments p. Extension

pith:BMFR3ZQI Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{BMFR3ZQI}

Prints a linked pith:BMFR3ZQI badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

CMC coordinates text and trajectory conditions via two-stage diffusion to generate accurate human motions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CMC, a decoupled two-stage framework for trajectory-controlled human motion generation. The first stage uses a diffusion model to generate simplified controlled joint representations guided by trajectories. The second stage employs a text-conditioned inpainting model to complete full-body motions from these partial observations. To prevent overfitting, it incorporates Selective Inpainting Mechanism that alternates between generation and inpainting tasks. This addresses conflicts between conditions and representation inconsistencies in existing methods, leading to better control accuracy and motion quality on benchmarks.

Core claim

By separating trajectory control into a simplified joint generation stage and using the output as partial observations for text-guided full motion inpainting, CMC resolves condition conflicts and representation inconsistencies, achieving state-of-the-art results in both trajectory following and motion realism.

What carries the argument

The divide-and-conquer cascade consisting of trajectory-guided diffusion for controlled joints and text-conditioned diffusion inpainting, with Selective Inpainting Mechanism (SIM) for training stability.

Load-bearing premise

The simplified controlled-joint representation supplies sufficient partial observations for generating consistent full-body motions without artifacts.

What would settle it

Observing frequent motion artifacts or trajectory deviations in the generated full-body motions when the first-stage output is used as input would disprove the effectiveness of the decoupling strategy.

Figures

Figures reproduced from arXiv: 2605.13729 by Changxing Ding, Deli Cai, Haoyang Ma.

**Figure 2.** Figure 2: We propose to Coordinate Multiple Conditions (CMC) for trajectory-controlled human motion generation. We visualize examples using different [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 4.** Figure 4: Comparisons of the control error for the redundant and simplified [PITH_FULL_IMAGE:figures/full_fig_p002_4.png] view at source ↗

**Figure 5.** Figure 5: Overview of our CMC. It consists of two stages: Trajectory Control and Motion Completion. In the Trajectory Control stage, we utilize textual descriptions and spatial trajectories of the controlled joints to predict the trajectories of both the pelvis and the controlled joints within a simplified representation space. Subsequently, the Motion Completion stage takes these trajectories as partial observation… view at source ↗

**Figure 6.** Figure 6: The workflow of SIM to train the diffusion inpainting model. SIM [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Pseudo-code of our SIM implementation in Python. [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative comparisons between our method, Omnicontrol [ [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Two visual comparisons (a) and (b) to qualitatively prove the existence [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: Comparisons on FID scores with and without SIM. Each figure plots [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: Average control error across denoising steps. Darker-colored and [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗

**Figure 12.** Figure 12: Statistical mean and standard deviation of the control error across [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗

**Figure 13.** Figure 13: Errors across all denoising steps.The solid lines denote the mean [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗

**Figure 14.** Figure 14: Qualitative visualizations of motions conditioned on text only. [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗

**Figure 15.** Figure 15: Qualitative visualizations of motions conditioned on both text and [PITH_FULL_IMAGE:figures/full_fig_p013_15.png] view at source ↗

read the original abstract

Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that effectively coordinates text and trajectory conditions through a divide-and-conquer strategy. CMC follows a divide-and-conquer paradigm, comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints under trajectory guidance, based on the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on HumanML3D and KIT datasets demonstrate that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CMC splits trajectory control into a first-stage simplified joint model then text inpainting, which is a clean way to reduce condition conflicts, but the paper gives no ablations to show the first stage's output is actually sufficient.

read the letter

The paper's core move is to break the problem into two stages: a diffusion model that generates only the controlled joints under trajectory guidance, followed by a text-conditioned inpainting model that fills in the rest of the body using those joints as partial observations. They also add the Selective Inpainting Mechanism, which alternates between full text-to-motion and inpainting tasks during training to keep the second stage from overfitting on limited data. This directly tackles the two issues called out in the abstract—condition clashes during denoising and inconsistencies from redundant motion representations—and the architecture is explicit enough that someone could reimplement the cascade without much guesswork.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes CMC, a two-stage decoupled diffusion framework for trajectory-controlled human motion generation. Stage 1 trains a diffusion model to produce a simplified representation of controlled joints conditioned on input trajectories. Stage 2 uses a text-conditioned inpainting diffusion model that treats the Stage-1 output as partial observations to synthesize full-body motion, with the Selective Inpainting Mechanism (SIM) alternating between text-to-motion and inpainting tasks to reduce overfitting. Experiments on HumanML3D and KIT are reported to show state-of-the-art control accuracy and motion quality.

Significance. If the two-stage decomposition and SIM prove robust, the work would offer a practical way to resolve conflicts between textual and spatial conditions in motion synthesis while avoiding inconsistencies from redundant representations. The divide-and-conquer design and selective training strategy could generalize to other multimodal conditional generation problems.

major comments (3)

[Experiments] Experiments section: The SOTA claims on control accuracy and motion quality rest on the assumption that the first-stage simplified controlled-joint representation supplies sufficient partial observations for artifact-free inpainting, yet no ablation replaces Stage-1 outputs with ground-truth partial observations or compares against an end-to-end joint model. This directly tests the core divide-and-conquer premise but is absent.
[Method] Method description of SIM: The mechanism is introduced to mitigate overfitting from limited inpainting data, but the paper provides no quantitative comparison of training dynamics or final metrics with and without SIM, leaving the contribution to stability unverified.
[Experiments] Results tables (HumanML3D and KIT): No error bars, standard deviations across runs, or statistical significance tests are reported for the metrics against baselines, weakening the strength of the SOTA conclusion.

minor comments (1)

[Abstract] The abstract and introduction could more explicitly state the dimensionality or joint subset used in the simplified representation to clarify what information is retained versus discarded.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below, indicating planned revisions where appropriate to strengthen the validation of our divide-and-conquer approach.

read point-by-point responses

Referee: [Experiments] Experiments section: The SOTA claims on control accuracy and motion quality rest on the assumption that the first-stage simplified controlled-joint representation supplies sufficient partial observations for artifact-free inpainting, yet no ablation replaces Stage-1 outputs with ground-truth partial observations or compares against an end-to-end joint model. This directly tests the core divide-and-conquer premise but is absent.

Authors: We agree that directly testing the core premise with ground-truth partial observations from Stage 1 and a comparison to an end-to-end joint model would provide stronger evidence. We will add these ablations in the revised manuscript, reporting control accuracy and motion quality metrics for both settings to quantify the benefit of the decoupled stages. revision: yes
Referee: [Method] Method description of SIM: The mechanism is introduced to mitigate overfitting from limited inpainting data, but the paper provides no quantitative comparison of training dynamics or final metrics with and without SIM, leaving the contribution to stability unverified.

Authors: We acknowledge the need for explicit verification of SIM's contribution. In the revision we will include training loss curves and final performance metrics on HumanML3D and KIT comparing the full model against the variant trained without SIM, thereby quantifying its effect on stability and final results. revision: yes
Referee: [Experiments] Results tables (HumanML3D and KIT): No error bars, standard deviations across runs, or statistical significance tests are reported for the metrics against baselines, weakening the strength of the SOTA conclusion.

Authors: We will update the results tables to report standard deviations computed over multiple independent runs with different random seeds. Where feasible we will also add statistical significance tests (e.g., paired t-tests) against the strongest baselines to better substantiate the SOTA claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces an explicit new two-stage architecture (first-stage trajectory-guided diffusion on simplified controlled-joint representation, second-stage text-conditioned inpainting) plus the Selective Inpainting Mechanism, with performance measured directly against external benchmarks on HumanML3D and KIT. No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the divide-and-conquer strategy and empirical results remain independent of the inputs they are evaluated on.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework relies on the standard assumption that diffusion models can separately model partial joint trajectories and full-body completions; no new physical entities or ungrounded mathematical axioms are introduced beyond established generative modeling practices.

axioms (1)

domain assumption Diffusion models can generate realistic human motions when conditioned on text or partial observations
Invoked implicitly as the basis for both stages; standard in prior motion diffusion literature.

pith-pipeline@v0.9.0 · 5528 in / 1161 out tokens · 46895 ms · 2026-05-14T19:53:12.336054+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

[1]

Crowdmogen: Event-driven collective human motion generation.Int

Yukang Cao, Xinying Guo, Mingyuan Zhang, Haozhe Xie, Chenyang Gu, and Ziwei Liu. Crowdmogen: Event-driven collective human motion generation.Int. J. Comput. Vis., 134(1):29, 2026

work page 2026
[2]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023
[3]

Hop: Heterogeneous topology-based multimodal entanglement for co-speech gesture generation

Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, and Yanwei Fu. Hop: Heterogeneous topology-based multimodal entanglement for co-speech gesture generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

work page 2025
[4]

Interaction transformer for human reaction generation.IEEE Trans

Baptiste Chopin, Hao Tang, Naima Otberdout, Mohamed Daoudi, and Nicu Sebe. Interaction transformer for human reaction generation.IEEE Trans. Multimedia, 25:8842–8854, 2023

work page 2023
[5]

Mofusion: A framework for denoising-diffusion- based motion synthesis

Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, and Christian Theobalt. Mofusion: A framework for denoising-diffusion- based motion synthesis. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023
[6]

Motionlcm: Real-time controllable motion generation via latent consistency model

Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, and Yansong Tang. Motionlcm: Real-time controllable motion generation via latent consistency model. InEur. Conf. Comput. Vis., 2024

work page 2024
[7]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. InAdv. Neural Inform. Process. Syst., pages 8780– 8794, 2021

work page 2021
[8]

Cg-hoi: Contact-guided 3d human- object interaction generation

Christian Diller and Angela Dai. Cg-hoi: Contact-guided 3d human- object interaction generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2024

work page 2024
[9]

Imos: Intent-driven full-body motion synthesis for human-object interactions

Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. Imos: Intent-driven full-body motion synthesis for human-object interactions. InEur. Assoc. Comput. Graph., 2023

work page 2023
[10]

Bridging semantic and kinematic condi- tions with diffusion-based discrete motion tokenizer.arXiv preprint arXiv:2603.19227, 2026

Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, and Ziwei Liu. Bridging semantic and kinematic condi- tions with diffusion-based discrete motion tokenizer.arXiv preprint arXiv:2603.19227, 2026

work page arXiv 2026
[11]

Momask: Generative masked modeling of 3d human motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. InIEEE Conf. Comput. Vis. Pattern Recog., 2024

work page 2024
[12]

Generative human motion stylization in latent space

Chuan Guo, Yuxuan Mu, Xinxin Zuo, Peng Dai, Youliang Yan, Juwei Lu, and Li Cheng. Generative human motion stylization in latent space. InInt. Conf. Learn. Represent., 2024. 14

work page 2024
[13]

Generating diverse and natural 3d human motions from text

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InIEEE Conf. Comput. Vis. Pattern Recog., 2022

work page 2022
[14]

Action2motion: Conditioned generation of 3d human motions

Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. InACM Int. Conf. Multimedia, page 2021–2029, 2020

work page 2021
[15]

Semanticboost: Elevating motion generation with augmented textual cues.arXiv preprint arXiv:2310.20323, 2023

Xin He, Shaoli Huang, Xiaohang Zhan, Chao Wen, and Ying Shan. Semanticboost: Elevating motion generation with augmented textual cues.arXiv preprint arXiv:2310.20323, 2023

work page arXiv 2023
[16]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdv. Neural Inform. Process. Syst., 2020

work page 2020
[17]

Phase-functioned neural networks for character control.ACM Transactions on Graphics, 2017

Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control.ACM Transactions on Graphics, 2017

work page 2017
[18]

Avatarclip: Zero-shot text-driven generation and animation of 3d avatars.ACM Transactions on Graphics, 2022

Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars.ACM Transactions on Graphics, 2022

work page 2022
[19]

Motiongpt: Human motion as a foreign language

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language. InAdv. Neural Inform. Process. Syst., 2024

work page 2024
[20]

Local action-guided motion diffusion model for text-to-motion generation

Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, Chang Liu, Xiangyang Ji, Li Yuan, and jie Chen. Local action-guided motion diffusion model for text-to-motion generation. InEur. Conf. Comput. Vis., 2024

work page 2024
[21]

Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs

Peng Jin, Yang Wu, Yanbo Fan, Zhongqian Sun, Wei Yang, and Li Yuan. Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs. InAdv. Neural Inform. Process. Syst., 2024

work page 2024
[22]

Guided motion diffusion for controllable human motion synthesis

Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for controllable human motion synthesis. InInt. Conf. Comput. Vis., pages 2151–2162, 2023

work page 2023
[23]

Flame: Free-form language-based motion synthesis & editing

Jihoon Kim, Jiseob Kim, and Sungjoon Choi. Flame: Free-form language-based motion synthesis & editing. InAAAI, 2023

work page 2023
[24]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InEur. Conf. Comput. Vis., 2022

work page 2022
[25]

Towards variable and coordinated holistic co-speech motion generation

Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. Towards variable and coordinated holistic co-speech motion generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2024

work page 2024
[26]

Scamo: Exploring the scaling law in autoregressive motion generation model

Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, and Ruimao Zhang. Scamo: Exploring the scaling law in autoregressive motion generation model. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

work page 2025
[27]

Countering language drift with seeded iterated learning

Yuchen Lu, Soumye Singhal, Florian Strub, Olivier Pietquin, and Aaron Courville. Countering language drift with seeded iterated learning. In Int. Conf. Mach. Learn., 2020

work page 2020
[28]

Amass: Archive of motion capture as surface shapes

Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons- Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. InInt. Conf. Comput. Vis., 2019

work page 2019
[29]

Rethinking diffusion for text-driven human motion generation

Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, and Huaizu Jiang. Rethinking diffusion for text-driven human motion generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

work page 2025
[30]

Long- term motion generation for interactive humanoid robots using gan with convolutional network

Yusuke Nishimura, Yutaka Nakamura, and Hiroshi Ishiguro. Long- term motion generation for interactive humanoid robots using gan with convolutional network. InCompanion of the ACM/IEEE Int. Conf. Hum.- Robot Interact., page 375–377, 2020

work page 2020
[31]

Hoi-diff: Text-driven syn- thesis of 3d human-object interactions using diffusion mod- els

Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, and Huaizu Jiang. Hoi-diff: Text-driven synthesis of 3d human-object interactions using diffusion models.arXiv preprint arXiv:2312.06553, 2023

work page arXiv 2023
[32]

Temos: Generating diverse human motions from textual descriptions

Mathis Petrovich, Michael J Black, and G ¨ul Varol. Temos: Generating diverse human motions from textual descriptions. InEur. Conf. Comput. Vis., 2022

work page 2022
[33]

Tmr: Text-to-motion retrieval using contrastive 3d human motion synthesis

Mathis Petrovich, Michael J Black, and G ¨ul Varol. Tmr: Text-to-motion retrieval using contrastive 3d human motion synthesis. InInt. Conf. Comput. Vis., 2023

work page 2023
[34]

Maskcontrol: Spatio-temporal control for masked motion synthesis

Ekkasit Pinyoanuntapong, Muhammad Saleem, Korrawe Karun- ratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, and Sergey Tulyakov. Maskcontrol: Spatio-temporal control for masked motion synthesis. InInt. Conf. Comput. Vis., 2025

work page 2025
[35]

Mmm: Generative masked motion model

Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, and Chen Chen. Mmm: Generative masked motion model. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023
[36]

The KIT motion-language dataset.IEEE Trans

Matthias Plappert, Christian Mandery, and Tamim Asfour. The KIT motion-language dataset.IEEE Trans. Big Data, 4(4):236–252, dec 2016

work page 2016
[37]

Emotiongesture: Audio-driven diverse emotional co-speech 3d gesture generation.IEEE Trans

Xingqun Qi, Chen Liu, Lincheng Li, Jie Hou, Haoran Xin, and Xin Yu. Emotiongesture: Audio-driven diverse emotional co-speech 3d gesture generation.IEEE Trans. Multimedia, 26:10420–10430, 2024

work page 2024
[38]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023
[39]

Human motion diffusion as a generative prior

Yoni Shafir, Guy Tevet, Roy Kapon, and Amit Haim Bermano. Human motion diffusion as a generative prior. InInt. Conf. Learn. Represent., 2024

work page 2024
[40]

Multi-semantics aggrega- tion network based on the dynamic-attention mechanism for 3d human motion prediction.IEEE Trans

Junyu Shi, Jianqi Zhong, and Wenming Cao. Multi-semantics aggrega- tion network based on the dynamic-attention mechanism for 3d human motion prediction.IEEE Trans. Multimedia, 26:5194–5206, 2024

work page 2024
[41]

Mvdream: Multi-view diffusion for 3d generation

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. InInt. Conf. Learn. Represent., 2024

work page 2024
[42]

Kankanhalli, Weidong Geng, and Xiangdong Li

Guofei Sun, Yongkang Wong, Zhiyong Cheng, Mohan S. Kankanhalli, Weidong Geng, and Xiangdong Li. Deepdance: Music-to-dance motion choreography with adversarial learning.IEEE Trans. Multimedia, 23:497–509, 2021

work page 2021
[43]

Motionclip: Exposing human motion generation to clip space

Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. InEur. Conf. Comput. Vis., 2022

work page 2022
[44]

Human motion diffusion model

Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffusion model. InInt. Conf. Learn. Represent., 2023

work page 2023
[45]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InAdv. Neural Inform. Process. Syst., 2017

work page 2017
[46]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdv. Neural Inform. Process. Syst., 2017

work page 2017
[47]

Tlcontrol: Trajectory and language control for human motion synthesis

Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, and Lingjie Liu. Tlcontrol: Trajectory and language control for human motion synthesis. InEur. Conf. Comput. Vis., 2024

work page 2024
[48]

Stickmotion: Generating 3d hu- man motions by drawing a stickman.arXiv preprint arXiv:2503.04829, 2025

Tao Wang, Zhihua Wu, Qiaozhi He, Jiaming Chu, Ling Qian, Yu Cheng, Junliang Xing, Jian Zhao, and Lei Jin. Stickmotion: Generating 3d hu- man motions by drawing a stickman.arXiv preprint arXiv:2503.04829, 2025

work page arXiv 2025
[49]

Humanise: Language-conditioned human motion generation in 3d scenes

Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. Humanise: Language-conditioned human motion generation in 3d scenes. InAdv. Neural Inform. Process. Syst., 2022

work page 2022
[50]

Intercontrol: Zero-shot human interaction generation by controlling every joint

Zhenzhi Wang, Jingbo Wang, Yixuan Li, Dahua Lin, and Bo Dai. Intercontrol: Zero-shot human interaction generation by controlling every joint. InAdv. Neural Inform. Process. Syst., 2024

work page 2024
[51]

Omnicontrol: Control any joint at any time for human motion generation

Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, and Huaizu Jiang. Omnicontrol: Control any joint at any time for human motion generation. InInt. Conf. Learn. Represent., 2024

work page 2024
[52]

Implicit compositional generative network for length-variable co-speech gesture synthesis.IEEE Trans

Chenghao Xu, Jiexi Yan, Yanhua Yang, and Cheng Deng. Implicit compositional generative network for length-variable co-speech gesture synthesis.IEEE Trans. Multimedia, 26:6325–6335, 2024

work page 2024
[53]

Guiding human-object interactions with rich geometry and relations

Mengqing Xue, Yifei Liu, Ling Guo, Shaoli Huang, and Changxing Ding. Guiding human-object interactions with rich geometry and relations. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

work page 2025
[54]

Generating human interaction motions in scenes with text control

Hongwei Yi, Justus Thies, Michael J Black, Xue Bin Peng, and Davis Rempe. Generating human interaction motions in scenes with text control. InEur. Conf. Comput. Vis., 2024

work page 2024
[55]

Speech gesture generation from the trimodal context of text, audio, and speaker identity

Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. Speech gesture generation from the trimodal context of text, audio, and speaker identity. InACM SIGGRAPH Conf. Comput. Graph. Interact. Tech. Asia, 2020

work page 2020
[56]

Divdiff: A conditional diffusion model for diverse human motion pre- diction.IEEE Trans

Hua Yu, Yaqing Hou, Wenbin Pei, Yew-Soon Ong, and Qiang Zhang. Divdiff: A conditional diffusion model for diverse human motion pre- diction.IEEE Trans. Multimedia, pages 1–12, 2024

work page 2024
[57]

T2m-gpt: Generating human motion from textual descriptions with discrete representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023
[58]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InInt. Conf. Comput. Vis., pages 3836–3847, 2023

work page 2023
[59]

Motiondiffuse: Text-driven human motion generation with diffusion model.IEEE Trans

Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model.IEEE Trans. Pattern Anal. Mach. Intell., 46(6):4115–4128, 2024

work page 2024
[60]

Smoodi: Stylized motion diffusion model

Lei Zhong, Yiming Xie, Varun Jampani, Deqing Sun, and Huaizu Jiang. Smoodi: Stylized motion diffusion model. InEur. Conf. Comput. Vis., 2024

work page 2024

[1] [1]

Crowdmogen: Event-driven collective human motion generation.Int

Yukang Cao, Xinying Guo, Mingyuan Zhang, Haozhe Xie, Chenyang Gu, and Ziwei Liu. Crowdmogen: Event-driven collective human motion generation.Int. J. Comput. Vis., 134(1):29, 2026

work page 2026

[2] [2]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023

[3] [3]

Hop: Heterogeneous topology-based multimodal entanglement for co-speech gesture generation

Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, and Yanwei Fu. Hop: Heterogeneous topology-based multimodal entanglement for co-speech gesture generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

work page 2025

[4] [4]

Interaction transformer for human reaction generation.IEEE Trans

Baptiste Chopin, Hao Tang, Naima Otberdout, Mohamed Daoudi, and Nicu Sebe. Interaction transformer for human reaction generation.IEEE Trans. Multimedia, 25:8842–8854, 2023

work page 2023

[5] [5]

Mofusion: A framework for denoising-diffusion- based motion synthesis

Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, and Christian Theobalt. Mofusion: A framework for denoising-diffusion- based motion synthesis. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023

[6] [6]

Motionlcm: Real-time controllable motion generation via latent consistency model

Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, and Yansong Tang. Motionlcm: Real-time controllable motion generation via latent consistency model. InEur. Conf. Comput. Vis., 2024

work page 2024

[7] [7]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. InAdv. Neural Inform. Process. Syst., pages 8780– 8794, 2021

work page 2021

[8] [8]

Cg-hoi: Contact-guided 3d human- object interaction generation

Christian Diller and Angela Dai. Cg-hoi: Contact-guided 3d human- object interaction generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2024

work page 2024

[9] [9]

Imos: Intent-driven full-body motion synthesis for human-object interactions

Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. Imos: Intent-driven full-body motion synthesis for human-object interactions. InEur. Assoc. Comput. Graph., 2023

work page 2023

[10] [10]

Bridging semantic and kinematic condi- tions with diffusion-based discrete motion tokenizer.arXiv preprint arXiv:2603.19227, 2026

Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, and Ziwei Liu. Bridging semantic and kinematic condi- tions with diffusion-based discrete motion tokenizer.arXiv preprint arXiv:2603.19227, 2026

work page arXiv 2026

[11] [11]

Momask: Generative masked modeling of 3d human motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. InIEEE Conf. Comput. Vis. Pattern Recog., 2024

work page 2024

[12] [12]

Generative human motion stylization in latent space

Chuan Guo, Yuxuan Mu, Xinxin Zuo, Peng Dai, Youliang Yan, Juwei Lu, and Li Cheng. Generative human motion stylization in latent space. InInt. Conf. Learn. Represent., 2024. 14

work page 2024

[13] [13]

Generating diverse and natural 3d human motions from text

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InIEEE Conf. Comput. Vis. Pattern Recog., 2022

work page 2022

[14] [14]

Action2motion: Conditioned generation of 3d human motions

Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. InACM Int. Conf. Multimedia, page 2021–2029, 2020

work page 2021

[15] [15]

Semanticboost: Elevating motion generation with augmented textual cues.arXiv preprint arXiv:2310.20323, 2023

Xin He, Shaoli Huang, Xiaohang Zhan, Chao Wen, and Ying Shan. Semanticboost: Elevating motion generation with augmented textual cues.arXiv preprint arXiv:2310.20323, 2023

work page arXiv 2023

[16] [16]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdv. Neural Inform. Process. Syst., 2020

work page 2020

[17] [17]

Phase-functioned neural networks for character control.ACM Transactions on Graphics, 2017

Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control.ACM Transactions on Graphics, 2017

work page 2017

[18] [18]

Avatarclip: Zero-shot text-driven generation and animation of 3d avatars.ACM Transactions on Graphics, 2022

Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars.ACM Transactions on Graphics, 2022

work page 2022

[19] [19]

Motiongpt: Human motion as a foreign language

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language. InAdv. Neural Inform. Process. Syst., 2024

work page 2024

[20] [20]

Local action-guided motion diffusion model for text-to-motion generation

Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, Chang Liu, Xiangyang Ji, Li Yuan, and jie Chen. Local action-guided motion diffusion model for text-to-motion generation. InEur. Conf. Comput. Vis., 2024

work page 2024

[21] [21]

Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs

Peng Jin, Yang Wu, Yanbo Fan, Zhongqian Sun, Wei Yang, and Li Yuan. Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs. InAdv. Neural Inform. Process. Syst., 2024

work page 2024

[22] [22]

Guided motion diffusion for controllable human motion synthesis

Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for controllable human motion synthesis. InInt. Conf. Comput. Vis., pages 2151–2162, 2023

work page 2023

[23] [23]

Flame: Free-form language-based motion synthesis & editing

Jihoon Kim, Jiseob Kim, and Sungjoon Choi. Flame: Free-form language-based motion synthesis & editing. InAAAI, 2023

work page 2023

[24] [24]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InEur. Conf. Comput. Vis., 2022

work page 2022

[25] [25]

Towards variable and coordinated holistic co-speech motion generation

Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. Towards variable and coordinated holistic co-speech motion generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2024

work page 2024

[26] [26]

Scamo: Exploring the scaling law in autoregressive motion generation model

Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, and Ruimao Zhang. Scamo: Exploring the scaling law in autoregressive motion generation model. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

work page 2025

[27] [27]

Countering language drift with seeded iterated learning

Yuchen Lu, Soumye Singhal, Florian Strub, Olivier Pietquin, and Aaron Courville. Countering language drift with seeded iterated learning. In Int. Conf. Mach. Learn., 2020

work page 2020

[28] [28]

Amass: Archive of motion capture as surface shapes

Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons- Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. InInt. Conf. Comput. Vis., 2019

work page 2019

[29] [29]

Rethinking diffusion for text-driven human motion generation

Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, and Huaizu Jiang. Rethinking diffusion for text-driven human motion generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

work page 2025

[30] [30]

Long- term motion generation for interactive humanoid robots using gan with convolutional network

Yusuke Nishimura, Yutaka Nakamura, and Hiroshi Ishiguro. Long- term motion generation for interactive humanoid robots using gan with convolutional network. InCompanion of the ACM/IEEE Int. Conf. Hum.- Robot Interact., page 375–377, 2020

work page 2020

[31] [31]

Hoi-diff: Text-driven syn- thesis of 3d human-object interactions using diffusion mod- els

Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, and Huaizu Jiang. Hoi-diff: Text-driven synthesis of 3d human-object interactions using diffusion models.arXiv preprint arXiv:2312.06553, 2023

work page arXiv 2023

[32] [32]

Temos: Generating diverse human motions from textual descriptions

Mathis Petrovich, Michael J Black, and G ¨ul Varol. Temos: Generating diverse human motions from textual descriptions. InEur. Conf. Comput. Vis., 2022

work page 2022

[33] [33]

Tmr: Text-to-motion retrieval using contrastive 3d human motion synthesis

Mathis Petrovich, Michael J Black, and G ¨ul Varol. Tmr: Text-to-motion retrieval using contrastive 3d human motion synthesis. InInt. Conf. Comput. Vis., 2023

work page 2023

[34] [34]

Maskcontrol: Spatio-temporal control for masked motion synthesis

Ekkasit Pinyoanuntapong, Muhammad Saleem, Korrawe Karun- ratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, and Sergey Tulyakov. Maskcontrol: Spatio-temporal control for masked motion synthesis. InInt. Conf. Comput. Vis., 2025

work page 2025

[35] [35]

Mmm: Generative masked motion model

Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, and Chen Chen. Mmm: Generative masked motion model. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023

[36] [36]

The KIT motion-language dataset.IEEE Trans

Matthias Plappert, Christian Mandery, and Tamim Asfour. The KIT motion-language dataset.IEEE Trans. Big Data, 4(4):236–252, dec 2016

work page 2016

[37] [37]

Emotiongesture: Audio-driven diverse emotional co-speech 3d gesture generation.IEEE Trans

Xingqun Qi, Chen Liu, Lincheng Li, Jie Hou, Haoran Xin, and Xin Yu. Emotiongesture: Audio-driven diverse emotional co-speech 3d gesture generation.IEEE Trans. Multimedia, 26:10420–10430, 2024

work page 2024

[38] [38]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023

[39] [39]

Human motion diffusion as a generative prior

Yoni Shafir, Guy Tevet, Roy Kapon, and Amit Haim Bermano. Human motion diffusion as a generative prior. InInt. Conf. Learn. Represent., 2024

work page 2024

[40] [40]

Multi-semantics aggrega- tion network based on the dynamic-attention mechanism for 3d human motion prediction.IEEE Trans

Junyu Shi, Jianqi Zhong, and Wenming Cao. Multi-semantics aggrega- tion network based on the dynamic-attention mechanism for 3d human motion prediction.IEEE Trans. Multimedia, 26:5194–5206, 2024

work page 2024

[41] [41]

Mvdream: Multi-view diffusion for 3d generation

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. InInt. Conf. Learn. Represent., 2024

work page 2024

[42] [42]

Kankanhalli, Weidong Geng, and Xiangdong Li

Guofei Sun, Yongkang Wong, Zhiyong Cheng, Mohan S. Kankanhalli, Weidong Geng, and Xiangdong Li. Deepdance: Music-to-dance motion choreography with adversarial learning.IEEE Trans. Multimedia, 23:497–509, 2021

work page 2021

[43] [43]

Motionclip: Exposing human motion generation to clip space

Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. InEur. Conf. Comput. Vis., 2022

work page 2022

[44] [44]

Human motion diffusion model

Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffusion model. InInt. Conf. Learn. Represent., 2023

work page 2023

[45] [45]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InAdv. Neural Inform. Process. Syst., 2017

work page 2017

[46] [46]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdv. Neural Inform. Process. Syst., 2017

work page 2017

[47] [47]

Tlcontrol: Trajectory and language control for human motion synthesis

Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, and Lingjie Liu. Tlcontrol: Trajectory and language control for human motion synthesis. InEur. Conf. Comput. Vis., 2024

work page 2024

[48] [48]

Stickmotion: Generating 3d hu- man motions by drawing a stickman.arXiv preprint arXiv:2503.04829, 2025

Tao Wang, Zhihua Wu, Qiaozhi He, Jiaming Chu, Ling Qian, Yu Cheng, Junliang Xing, Jian Zhao, and Lei Jin. Stickmotion: Generating 3d hu- man motions by drawing a stickman.arXiv preprint arXiv:2503.04829, 2025

work page arXiv 2025

[49] [49]

Humanise: Language-conditioned human motion generation in 3d scenes

Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. Humanise: Language-conditioned human motion generation in 3d scenes. InAdv. Neural Inform. Process. Syst., 2022

work page 2022

[50] [50]

Intercontrol: Zero-shot human interaction generation by controlling every joint

Zhenzhi Wang, Jingbo Wang, Yixuan Li, Dahua Lin, and Bo Dai. Intercontrol: Zero-shot human interaction generation by controlling every joint. InAdv. Neural Inform. Process. Syst., 2024

work page 2024

[51] [51]

Omnicontrol: Control any joint at any time for human motion generation

Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, and Huaizu Jiang. Omnicontrol: Control any joint at any time for human motion generation. InInt. Conf. Learn. Represent., 2024

work page 2024

[52] [52]

Implicit compositional generative network for length-variable co-speech gesture synthesis.IEEE Trans

Chenghao Xu, Jiexi Yan, Yanhua Yang, and Cheng Deng. Implicit compositional generative network for length-variable co-speech gesture synthesis.IEEE Trans. Multimedia, 26:6325–6335, 2024

work page 2024

[53] [53]

Guiding human-object interactions with rich geometry and relations

Mengqing Xue, Yifei Liu, Ling Guo, Shaoli Huang, and Changxing Ding. Guiding human-object interactions with rich geometry and relations. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

work page 2025

[54] [54]

Generating human interaction motions in scenes with text control

Hongwei Yi, Justus Thies, Michael J Black, Xue Bin Peng, and Davis Rempe. Generating human interaction motions in scenes with text control. InEur. Conf. Comput. Vis., 2024

work page 2024

[55] [55]

Speech gesture generation from the trimodal context of text, audio, and speaker identity

Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. Speech gesture generation from the trimodal context of text, audio, and speaker identity. InACM SIGGRAPH Conf. Comput. Graph. Interact. Tech. Asia, 2020

work page 2020

[56] [56]

Divdiff: A conditional diffusion model for diverse human motion pre- diction.IEEE Trans

Hua Yu, Yaqing Hou, Wenbin Pei, Yew-Soon Ong, and Qiang Zhang. Divdiff: A conditional diffusion model for diverse human motion pre- diction.IEEE Trans. Multimedia, pages 1–12, 2024

work page 2024

[57] [57]

T2m-gpt: Generating human motion from textual descriptions with discrete representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. InIEEE Conf. Comput. Vis. Pattern Recog., 2023

work page 2023

[58] [58]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InInt. Conf. Comput. Vis., pages 3836–3847, 2023

work page 2023

[59] [59]

Motiondiffuse: Text-driven human motion generation with diffusion model.IEEE Trans

Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model.IEEE Trans. Pattern Anal. Mach. Intell., 46(6):4115–4128, 2024

work page 2024

[60] [60]

Smoodi: Stylized motion diffusion model

Lei Zhong, Yiming Xie, Varun Jampani, Deqing Sun, and Huaizu Jiang. Smoodi: Stylized motion diffusion model. InEur. Conf. Comput. Vis., 2024

work page 2024