DrawMotion: Generating 3D Human Motions by Freehand Drawing

Jiaming Chu; Jian Zhao; Junliang Xing; Lei Jin; Li Wang; Qiaozhi He; Shuicheng Yan; Tao Wang; Yu Cheng; Zhihua Wu

arxiv: 2605.20955 · v1 · pith:2DIHPCITnew · submitted 2026-05-20 · 💻 cs.CV


Pith reviewed 2026-05-21 05:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D human motion generation · text-to-motion · hand-drawn conditions · diffusion models · multi-condition control · spatial guidance · stickman sketches


The pith

DrawMotion generates 3D human motions from both text descriptions and freehand drawings for semantic and spatial control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DrawMotion, a diffusion-based system that creates 3D human motions using text for meaning and hand-drawn stick figures for exact positioning and paths. This combination helps users express motions more accurately than text alone can. The framework includes a way to create stickman sketches automatically from existing motion data and a module that blends the two conditions during generation. It also uses guidance during inference to better match user intent. Experiments show users finish tasks faster and with results closer to what they imagined.

Core claim

DrawMotion is a diffusion-based framework for generating 3D human motions conditioned on text and hand-drawing inputs. It develops an algorithm to automatically generate hand-drawn stickman sketches from dataset motions in various formats, proposes a Multi-Condition Module integrated into the diffusion process to handle combinations of conditions, and applies training-free guidance to align outputs with user intentions while maintaining motion quality.
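To make the idea of turning dataset joints into hand-drawn-looking strokes concrete, here is a toy sketch that draws one bone between two 2D joint positions and bends it with a single smooth random bump. It is not the paper's stickman generation algorithm: the function name `hand_drawn_stroke`, the `wobble` parameter, and the sine-shaped perturbation are all assumptions made purely for illustration.

```python
import numpy as np

def hand_drawn_stroke(p0, p1, n=20, wobble=0.03, rng=None):
    """Return n points from p0 to p1, bent by one smooth random bump.

    Hypothetical helper: `wobble` is the bump's standard deviation as a
    fraction of the bone length. The endpoints stay (numerically) fixed.
    """
    rng = np.random.default_rng() if rng is None else rng
    p0, p1 = np.asarray(p0, dtype=float), np.asarray(p1, dtype=float)
    t = np.linspace(0.0, 1.0, n)[:, None]        # shape (n, 1)
    line = (1.0 - t) * p0 + t * p1               # straight bone, shape (n, 2)
    direction = p1 - p0
    length = np.linalg.norm(direction)
    normal = np.array([-direction[1], direction[0]])
    if length > 0:
        normal = normal / length                 # unit vector perpendicular to the bone
    amplitude = rng.normal(0.0, wobble * length) # random bend strength
    offset = amplitude * np.sin(np.pi * t)       # zero at both ends, largest mid-stroke
    return line + offset * normal

# Example: one "forearm" stroke between two projected joints.
stroke = hand_drawn_stroke((0.0, 0.0), (1.0, 0.5), rng=np.random.default_rng(0))
```

A real pipeline would presumably vary stroke smoothness, proportions, and drawing style far more than this single bump does.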

What carries the argument

The Multi-Condition Module (MCM), which fuses text and drawing conditions into the diffusion model's features to enable flexible control and continuous-space updates for guidance.

If this is right

Users gain spatial precision in generated motions without needing detailed text descriptions.
The approach cuts the time required to produce intended motions by roughly 46.7 percent.
Motions can be generated from any mix of available conditions without retraining for each combination.
The system preserves motion fidelity while allowing adjustments through guidance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar sketching interfaces might improve control in related generation tasks like image or video synthesis.
Integrating this with real-time drawing tools could enable interactive motion design sessions.
Extending the stick figure representation to include more body details could capture even finer intent.

Load-bearing premise

Hand-drawn stickman sketches generated automatically from motion datasets accurately reflect the spatial details that real users intend to convey in their drawings.

What would settle it

Compare generated motions against user-drawn sketches in a blind test and measure if key spatial features like joint angles and movement paths match within a small error margin.
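One way such a check could be set up is sketched below: compute the angle at a joint from three points, then test whether the sketched and generated angles agree within a tolerance. The helper names and the 10-degree default margin are assumptions for illustration; the paper does not specify this protocol.

```python
import numpy as np

def joint_angle(parent, joint, child):
    """Angle in degrees at `joint` between the bones to `parent` and `child`.

    Assumes both bones have non-zero length.
    """
    u = np.asarray(parent, dtype=float) - np.asarray(joint, dtype=float)
    v = np.asarray(child, dtype=float) - np.asarray(joint, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def within_margin(sketch_pts, motion_pts, margin_deg=10.0):
    """True if the two (parent, joint, child) triples differ by at most margin_deg."""
    diff = abs(joint_angle(*sketch_pts) - joint_angle(*motion_pts))
    return diff <= margin_deg

# Example: an elbow drawn at a right angle versus a slightly different generated pose.
sketch_elbow = ((1.0, 0.0), (0.0, 0.0), (0.0, 1.0))
motion_elbow = ((1.0, 0.0), (0.0, 0.0), (0.1, 1.0))
ok = within_margin(sketch_elbow, motion_elbow)
```

The same comparison works for 3D joints, since only dot products and norms are used.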

Figures

Figures reproduced from arXiv: 2605.20955 by Jiaming Chu, Jian Zhao, Junliang Xing, Lei Jin, Li Wang, Qiaozhi He, Shuicheng Yan, Tao Wang, Yu Cheng, Zhihua Wu.

**Figure 1.** Pipeline of DrawMotion inference. In addition to the training-based guidance, a training-free guidance updates the intermediate feature of the model within the MD boundary to ensure that the generations meet the conditions while maintaining fidelity.

**Figure 2.** 2) Multi-Condition Fusion. Previous works [1], [9] achieve all possible combinations of two conditions via a mask operation on the condition input of the self-attention module [1], [10], but this introduces redundant computation when calculating the masked-token attention. We instead design an efficient Multi-Condition Module (MCM) to process multiple conditions, as detailed in Section III-C. 3) Trajectory ali…

**Figure 2.** Stickmen generated by the Stickman Generation Algorithm on the KIT-ML [62] and HumanML3D [38] datasets. … human joints from existing motion datasets to automatically generate hand-drawn stickmen. Considering the characteristics of human hand-drawing, we take into account the following aspects: 1) Stroke smoothness. The smoothness of strokes is influenced by force and individual preferences. Moreover, the smoothn…

**Figure 3.** The DrawMotion framework consists of the diffusion process (left) and the network structure (right). 1) The diffusion process includes a forward and a reverse process. In the forward process, original motions are augmented with Gaussian noise and fed into DrawMotion, which learns to predict the added noise based on textual descriptions and hand-drawn sketches. In the reverse process, user-provided textual …

**Figure 4.** Conceptual illustration of intermediate feature distributions. The dashed lines correspond to level sets of the probability density function. (a) Ordinary models yield discrete clusters, (b) MCM forms a relatively continuous space, and (c) VAE enforces full latent coverage. This schematic is supported by Table I. … refine the motion. Current methods ensure the fidelity of the generated motion in two ways: 1)…

**Figure 5.** 2D PCA projection onto the first two principal components of ReMoDiffuse and DrawMotion. Sample size = 80,000 and diffusion step = 299.

**Figure 6.** 2D PCA projection onto the first two principal components of different condition settings in DrawMotion. Sample size = 20,000 and diffusion step = 299. … structure. This continuity arises from the intrinsic properties of the multi-condition fusion process. Specifically, each condition (e.g., text or drawing) is encoded into a feature representation that may lie on a low-dimensional nonlinear manifold. The Mi…

**Figure 7.** Visualization of DrawMotion (see the animation on GitHub).

**Figure 8.** Visual comparison between ReMoDiffuse, StickMotion, and DrawMotion: 1) This user attempted to make the generated trajectory resemble the emblem from Naruto and specified that, at a designated position along the trajectory, the action should involve raising the left hand high. 2) This user simply wrote the letter "m", without specifying a stickman (see the animation on GitHub). TABLE XI: The time consumpti…
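The captions of Figures 5 and 6 describe a 2D PCA projection onto the first two principal components. A minimal sketch of that projection is given below, assuming each sample's intermediate feature has been flattened into one row of a matrix; the function name `pca_2d` and the random stand-in data are illustrative, not taken from the paper's code.

```python
import numpy as np

def pca_2d(features):
    """Project rows of `features` (n_samples, n_dims) onto the first two principal components."""
    x = np.asarray(features, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Stand-in data: 500 samples of 16-dimensional features with unequal variances.
rng = np.random.default_rng(0)
fake_features = rng.normal(size=(500, 16)) * np.linspace(3.0, 0.5, 16)
projected = pca_2d(fake_features)   # shape (500, 2), ready for a scatter plot
```

Scatter-plotting the two columns for each model or condition setting would give a picture of the same kind as the figures, though the actual features and sample counts come from the paper's setup.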

Original abstract

Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.
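The abstract says guidance gradients update intermediate features, and the Figure 1 caption adds that the update stays within an MD boundary. The sketch below shows one generic way to do this: take a gradient step on a feature vector, then pull the result back inside a Mahalanobis-distance radius. The function names, step size, and radius are assumptions for illustration and should not be read as the paper's actual guidance procedure.

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of x from a distribution with the given mean and inverse covariance."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def guided_update(feature, grad, mean, cov_inv, step=0.1, md_max=3.0):
    """One gradient step on `feature`, then projection back inside the MD boundary.

    `grad` is the gradient of some guidance loss with respect to the feature
    (hypothetical here); `md_max` is an assumed boundary radius.
    """
    updated = feature - step * grad              # move against the loss gradient
    md = mahalanobis(updated, mean, cov_inv)
    if md > md_max:
        # Scaling the offset from the mean scales the distance by the same
        # factor, so this lands exactly on the boundary.
        updated = mean + (updated - mean) * (md_max / md)
    return updated

# Example with a 2D feature, zero mean, and identity covariance.
mean, cov_inv = np.zeros(2), np.eye(2)
clipped = guided_update(np.zeros(2), np.array([-100.0, 0.0]), mean, cov_inv)
```

With a large gradient the step overshoots and is clipped to the boundary; with a small gradient the feature is updated unchanged by the clipping.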

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DrawMotion adds freehand drawing for spatial control in motion generation but trains on synthetic sketches that may not match real user drawings.


The main point is that DrawMotion combines text with freehand stickman drawings to control both semantics and spatial details in 3D human motion generation. They introduce a Multi-Condition Module to fuse the inputs efficiently during diffusion and use training-free guidance on continuous features to steer outputs toward user intent without extra training steps. The code and demos are released publicly, which helps others test it directly. Their experiments and user studies report about 46.7% less time spent creating motions that match what users had in mind. This is a practical step for animation pipelines where text alone often falls short on precise poses.

The automatic sketch generation from dataset motions lets them train without manual labeling, and the overall setup stays close to standard diffusion methods with no obvious circularity in the results. The soft spot is the domain gap the stress-test note flags. Real freehand drawings vary in stroke precision, body proportions, and joint clarity in ways the clean auto-generated training sketches probably do not. Because the MCM and guidance rely on feature alignment, inputs outside that distribution could weaken the spatial control in practice. The user studies show time savings and alignment but do not appear to test performance on messier, varied real-user sketches.

This paper is aimed at computer vision researchers working on conditional motion generation and at tool builders who want multi-modal controls for animators. A reader interested in new input modalities would get concrete ideas from the MCM and guidance approach. It has enough novelty and reported evidence to deserve a serious referee, even if the drawing distribution issue needs more attention in revision. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces DrawMotion, a diffusion-based framework for 3D human motion generation that combines conventional text conditioning for semantic control with a novel hand-drawing condition for spatial control. Key technical elements include an algorithm that automatically converts dataset motions into hand-drawn stickman sketches, a Multi-Condition Module (MCM) integrated into the diffusion process to handle arbitrary condition combinations, and training-free classifier guidance that operates on the continuous feature space produced by the MCM. Quantitative experiments and user studies are reported to support a 46.7% reduction in user time for producing motions aligned with user intent, with code and demos released publicly.

Significance. If the central claims hold, the work provides a practical and intuitive extension to text-to-motion generation by incorporating freehand sketches as an additional spatial prior. This could meaningfully improve controllability in applications such as animation and virtual reality. The public release of code, demos, and data is a clear strength that aids reproducibility and follow-up research. The approach builds on established diffusion techniques rather than introducing entirely new paradigms.

major comments (2)

[§3] Freehand drawing condition: The training pipeline relies on automatically generated stickman sketches derived from dataset motions, yet no ablation or out-of-distribution test evaluates performance when real user drawings (with their inherent variability in stroke thickness, proportions, and joint angles) are supplied at inference time. Because the MCM and training-free guidance depend on continuous feature-space alignment, this domain gap directly threatens the claimed reliability of spatial control.
[User-study evaluation] The reported 46.7% time savings is presented as evidence of practical utility, but the study description does not report quantitative metrics (e.g., Fréchet Motion Distance or joint-angle error) comparing motions generated from real freehand sketches versus the synthetic training distribution, leaving the alignment claim only partially supported.

minor comments (2)

[Abstract] The abstract states that the MCM 'reduces computational complexity compared to conventional approaches' without naming the baselines or providing FLOPs/latency numbers; a brief comparison table would clarify this advantage.
[§4] Notation for the MCM feature concatenation and guidance gradient computation could be made more explicit, especially for readers who may not immediately see how the continuous-space property enables classifier guidance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below, offering clarifications based on the manuscript and outlining planned revisions to strengthen the evaluation.


Referee: [§3] Freehand drawing condition: The training pipeline relies on automatically generated stickman sketches derived from dataset motions, yet no ablation or out-of-distribution test evaluates performance when real user drawings (with their inherent variability in stroke thickness, proportions, and joint angles) are supplied at inference time. Because the MCM and training-free guidance depend on continuous feature-space alignment, this domain gap directly threatens the claimed reliability of spatial control.

Authors: We appreciate this observation regarding the training distribution. The automatic sketch generation algorithm was developed specifically to create paired training data that matches the motion datasets across formats, ensuring the model learns consistent spatial mappings. The user studies in the paper did involve participants supplying their own freehand drawings at inference time, with the 46.7% time reduction reflecting real usage. The MCM's continuous feature space and training-free guidance are designed to support such inputs by allowing gradient-based alignment without retraining. To directly address the domain gap concern, the revised manuscript will include an out-of-distribution ablation using a collected set of real user drawings with natural variability, reporting metrics such as Fréchet Motion Distance to quantify robustness.
Revision: yes

Referee: [User-study evaluation] The reported 46.7% time savings is presented as evidence of practical utility, but the study description does not report quantitative metrics (e.g., Fréchet Motion Distance or joint-angle error) comparing motions generated from real freehand sketches versus the synthetic training distribution, leaving the alignment claim only partially supported.

Authors: We agree that the current user-study presentation focuses on time efficiency and subjective alignment rather than explicit quantitative motion-quality metrics for real versus synthetic inputs. This leaves room for stronger substantiation of the spatial control claims. In the revised manuscript, we will expand the evaluation section to include direct comparisons using metrics such as Fréchet Motion Distance and average joint-angle error between motions produced from real freehand sketches and those from the synthetic training distribution, while retaining the time-savings results.
Revision: yes
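The "Fréchet Motion Distance" named above is not defined in this exchange. Assuming it follows the familiar FID-style recipe (fit a Gaussian to each set of motion features, then take the Fréchet distance between the two Gaussians), a minimal sketch looks like this. The feature extractor is out of scope here, and the function names are illustrative.

```python
import numpy as np

def gaussian_stats(features):
    """Mean and covariance of a feature matrix of shape (n_samples, n_dims)."""
    x = np.asarray(features, dtype=float)
    return x.mean(axis=0), np.cov(x, rowvar=False)

def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Fréchet distance between N(mu1, cov1) and N(mu2, cov2).

    Computes ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^{1/2}).
    """
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    diff = mu1 - mu2
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of cov1 @ cov2 (real and non-negative for PSD inputs).
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

# Example: two unit-covariance Gaussians whose means differ by (3, 4).
d = frechet_distance([0.0, 0.0], np.eye(2), [3.0, 4.0], np.eye(2))
```

In the proposed ablation, one set of features would come from motions generated from real freehand sketches and the other from the synthetic sketch distribution, each summarized with `gaussian_stats`.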

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces DrawMotion as a diffusion-based framework that augments standard text-to-motion generation with a hand-drawing condition via an auto-generated stickman sketch algorithm, a Multi-Condition Module (MCM) for fusion, and training-free classifier guidance. These additions are presented as engineering extensions rather than derivations that reduce to their own inputs by construction; the hand-drawing training data is produced by a separate algorithm applied to existing motion datasets, and performance is asserted through quantitative metrics and user studies measuring time savings. No equations, self-citations, or fitted parameters are shown in the provided text to create a self-definitional loop or to rename a fitted quantity as an independent prediction. The central claims therefore remain self-contained against external benchmarks such as standard diffusion models and empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard diffusion model components and the introduced MCM.

pith-pipeline@v0.9.0 · 5814 in / 1034 out tokens · 32152 ms · 2026-05-21T05:12:26.214463+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We propose a Multi-Condition Module (MCM) ... training-free guidance method (IFG) ... Mahalanobis distance ... MD clipping

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Stickman Generation Algorithm (SGA) ... automatically produces stickman sketches ... candidate loss

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 3 internal anchors

[1] M. Zhang, X. Guo, L. Pan, Z. Cai, F. Hong, H. Li, L. Yang, and Z. Liu, “Remodiffuse: Retrieval-augmented motion diffusion model,” in ICCV, 2023, pp. 364–373.
[2] Y. Zhang, D. Huang, B. Liu, S. Tang, Y. Lu, L. Chen, L. Bai, Q. Chu, N. Yu, and W. Ouyang, “Motiongpt: Finetuned llms are general-purpose motion generators,” in AAAI, vol. 38, no. 7, 2024, pp. 7368–7376.
[3] G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “Motionclip: Exposing human motion generation to clip space,” in ECCV. Springer, 2022, pp. 358–374.
[4] J. Kim, J. Kim, and S. Choi, “Flame: Free-form language-based motion synthesis & editing,” in AAAI, vol. 37, no. 7, 2023, pp. 8255–8263.
[5] M. Zhang, H. Li, Z. Cai, J. Ren, L. Yang, and Z. Liu, “Finemogen: Fine-grained spatio-temporal motion generation and editing,” NeurIPS, vol. 36, 2024.
[6] M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,” arXiv preprint arXiv:2208.15001, 2022.
[7] P. Goel, K.-C. Wang, C. K. Liu, and K. Fatahalian, “Iterative motion editing with natural language,” in SIGGRAPH, 2024, pp. 1–9.
[8] T. Wang, Z. Wu, Q. He, J. Chu, L. Qian, Y. Cheng, J. Xing, J. Zhao, and L. Jin, “Stickmotion: Generating 3d human motions by drawing a stickman,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12 370–12 379.
[9] W. Chen, H. Hu, C. Saharia, and W. W. Cohen, “Re-imagen: Retrieval-augmented text-to-image generator,” arXiv preprint arXiv:2209.14491, 2022.
[10] A. Vaswani, “Attention is all you need,” NeurIPS, 2017.
[11] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML. PMLR, 2015, pp. 2256–2265.
[12] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020.
[13] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” NeurIPS, vol. 34, pp. 8780–8794, 2021.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
[15] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[16] H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P.-A. Heng, and S. Z. Li, “A survey on generative diffusion models,” TKDE, 2024.
[17] W. Guo, Y. Du, X. Shen, V. Lepetit, X. Alameda-Pineda, and F. Moreno-Noguer, “Back to mlp: A simple baseline for human motion prediction,” in WACV, 2023, pp. 4809–4819.
[18] L.-H. Chen, J. Zhang, Y. Li, Y. Pang, X. Xia, and T. Liu, “Humanmac: Masked motion completion for human motion prediction,” in ICCV, 2023, pp. 9544–9555.
[19] Y. Zhang, J. O. Kephart, and Q. Ji, “Incorporating physics principles for precise human motion prediction,” in WACV, 2024, pp. 6164–6174.
[20] T. Ma, Y. Nie, C. Long, Q. Zhang, and G. Li, “Progressively generating better initial guesses towards next stages for high-quality human motion prediction,” in CVPR, 2022, pp. 6437–6446.
[21] X. Wang, Q. Cui, C. Chen, and M. Liu, “Gcnext: Towards the unity of graph convolutions for human motion prediction,” in AAAI, vol. 38, no. 6, 2024, pp. 5642–5650.
[22] C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, “Action2motion: Conditioned generation of 3d human motions,” in ACM MM, 2020, pp. 2021–2029.
[23] P. Yu, Y. Zhao, C. Li, J. Yuan, and C. Chen, “Structure-aware human-action generation,” in ECCV. Springer, 2020, pp. 18–34.
[24] B. Degardin, J. Neves, V. Lopes, J. Brito, E. Yaghoubi, and H. Proença, “Generative adversarial graph convolutional networks for human action synthesis,” in WACV, 2022, pp. 1150–1159.
[25] M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” in ICCV, 2021, pp. 10 985–10 995.
[26] Q. Lu, Y. Zhang, M. Lu, and V. Roychowdhury, “Action-conditioned on-demand motion generation,” in ACM MM, 2022, pp. 2249–2257.
[27] P. Cervantes, Y. Sekikawa, I. Sato, and K. Shinoda, “Implicit neural representations for variable length human motion generation,” in ECCV. Springer, 2022, pp. 356–372.
[28] X. Gao, L. Hu, P. Zhang, B. Zhang, and L. Bo, “Dancemeld: Unraveling dance phrases with hierarchical latent codes for music-to-dance synthesis,” arXiv preprint arXiv:2401.10242, 2023.
[29] R. Huang, H. Hu, W. Wu, K. Sawada, M. Zhang, and D. Jiang, “Dance revolution: Long-term dance generation with music via curriculum learning,” arXiv preprint arXiv:2006.06119, 2020.
[30] B. Li, Y. Zhao, S. Zhelun, and L. Sheng, “Danceformer: Music conditioned 3d dance generation with parametric motion transformer,” in AAAI, vol. 36, no. 2, 2022, pp. 1272–1279.
[31] J. Tseng, R. Castellon, and K. Liu, “Edge: Editable dance generation from music,” in CVPR, 2023, pp. 448–458.
[32] T. Ao, Z. Zhang, and L. Liu, “Gesturediffuclip: Gesture diffusion model with clip latents,” TOG, vol. 42, no. 4, pp. 1–18, 2023.
[33] S. Ghorbani, Y. Ferstl, D. Holden, N. F. Troje, and M.-A. Carbonneau, “Zeroeggs: Zero-shot example-based gesture generation from speech,” in Computer Graphics Forum, vol. 42, no. 1. Wiley Online Library, 2023, pp. 206–216.
[34] T. Kucherenko, D. Hasegawa, G. E. Henter, N. Kaneko, and H. Kjellström, “Analyzing input and output representations for speech-driven gesture generation,” in Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, 2019, pp. 97–104.
[35] Y. Yoon, B. Cha, J.-H. Lee, M. Jang, J. Lee, J. Kim, and G. Lee, “Speech gesture generation from the trimodal context of text, audio, and speaker identity,” TOG, vol. 39, no. 6, pp. 1–16, 2020.
[36] C. Ahuja and L.-P. Morency, “Language2pose: Natural language grounded pose forecasting,” in 3DV. IEEE, 2019, pp. 719–728.
[37] A. Ghosh, N. Cheema, C. Oguz, C. Theobalt, and P. Slusallek, “Synthesis of compositional animations from textual descriptions,” in ICCV, 2021, pp. 1396–1406.
[38] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in CVPR, 2022, pp. 5152–5161.
[39] J. Cui, T. Liu, N. Liu, Y. Yang, Y. Zhu, and S. Huang, “Anyskill: Learning open-vocabulary physical skill for interactive agents,” in CVPR, 2024, pp. 852–862.
[40] C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions,” in CVPR, 2024, pp. 1900–1910.
[41] S. Huang, Z. Wang, P. Li, B. Jia, T. Liu, Y. Zhu, W. Liang, and S.-C. Zhu, “Diffusion-based generation, optimization, and planning in 3d scenes,” in CVPR, 2023, pp. 16 750–16 761.
[42] M. Hassan, P. Ghosh, J. Tesch, D. Tzionas, and M. J. Black, “Populating 3d scenes by learning human-scene interaction,” in CVPR, 2021, pp. 14 708–14 718.
[43] D. Lim, C. Jeong, and Y. M. Kim, “Mammos: Mapping multiple human motion with scene understanding and natural interactions,” in ICCV, 2023, pp. 4278–4287.
[44] X. Liu, H. Hou, Y. Yang, Y.-L. Li, and C. Lu, “Revisit human-scene interaction via space occupancy,” arXiv preprint arXiv:2312.02700, 2023.
[45] Z. Xiao, T. Wang, J. Wang, J. Cao, W. Zhang, B. Dai, D. Lin, and J. Pang, “Unified human-scene interaction via prompted chain-of-contacts,” arXiv preprint arXiv:2309.07918, 2023.
[46] C. Diller and A. Dai, “Cg-hoi: Contact-guided 3d human-object interaction generation,” in CVPR, 2024, pp. 19 888–19 901.
[47] S. Xu, Z. Li, Y.-X. Wang, and L.-Y. Gui, “Interdiff: Generating 3d human-object interactions with physics-informed diffusion,” in ICCV, 2023, pp. 14 928–14 940.
[48] C. Gao, S. Liu, D. Zhu, Q. Liu, J. Cao, H. He, R. He, and S. Yan, “Interactgan: Learning to generate human-object interaction,” in ACM MM, 2020, pp. 165–173.
[49] P. Lin, S. Xu, H. Yang, Y. Liu, X. Chen, J. Wang, J. Yu, and L. Xu, “Handdiffuse: Generative controllers for two-hand interactions via diffusion models,” arXiv preprint arXiv:2312.04867, 2023.
[50] Z. Cai, J. Jiang, Z. Qing, X. Guo, M. Zhang, Z. Lin, H. Mei, C. Wei, R. Wang, W. Yin et al., “Digital life project: Autonomous 3d characters with social intelligence,” in CVPR, 2024, pp. 582–592.
[51]

Bipartite graph diffusion model for human interaction generation,

B. Chopin, H. Tang, and M. Daoudi, “Bipartite graph diffusion model for human interaction generation,” inWACV, 2024, pp. 5333–5342

work page 2024
[52]

Remos: Reactive 3d motion synthesis for two-person interactions,

A. Ghosh, R. Dabral, V . Golyanik, C. Theobalt, and P. Slusallek, “Remos: Reactive 3d motion synthesis for two-person interactions,” arXiv preprint arXiv:2311.17057, 2023

work page arXiv 2023
[53]

Intergen: Diffusion- based multi-human motion generation under complex interactions,

H. Liang, W. Zhang, W. Li, J. Yu, and L. Xu, “Intergen: Diffusion- based multi-human motion generation under complex interactions,” International Journal of Computer Vision, pp. 1–21, 2024

work page 2024
[54]

Role-aware interaction generation from textual description,

M. Tanaka and K. Fujiwara, “Role-aware interaction generation from textual description,” inICCV, 2023, pp. 15 999–16 009

work page 2023
[55]

Guided motion diffusion for controllable human motion synthesis,

K. Karunratanakul, K. Preechakul, S. Suwajanakorn, and S. Tang, “Guided motion diffusion for controllable human motion synthesis,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2151–2162, 2023. [Online]. Available: https://api.semanticscholar. org/CorpusID:258833752

[56] Y. Shafir, G. Tevet, R. Kapon, and A. H. Bermano, “Human motion diffusion as a generative prior,” ArXiv, vol. abs/2303.01418, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:257279944

[58] S. Cohan, G. Tevet, D. Reda, X. B. Peng, and M. van de Panne, “Flexible motion in-betweening with diffusion models,” ACM SIGGRAPH 2024 Conference Papers, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:269922160

[59] Y. Xie, V. Jampani, L. Zhong, D. Sun, and H. Jiang, “Omnicontrol: Control any joint at any time for human motion generation,” ArXiv, vol. abs/2310.08580, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263909429

[60] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3813–3824, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:256827727

[61] G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” 2022. [Online]. Available: https://arxiv.org/abs/2209.14916

[62] K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang, “Optimizing diffusion noise can serve as universal motion priors,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1334–1345, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:266362434

[63] M. Plappert, C. Mandery, and T. Asfour, “The kit motion-language dataset,” Big Data, vol. 4, no. 4, pp. 236–252, 2016.

[64] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” ArXiv, vol. abs/2010.02502, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:222140788

[65] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML. PMLR, 2021, pp. 8748–8763.

[66] Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, “Efficient attention: Attention with linear complexities,” in WACV, 2021, pp. 3531–3539.

[67] P. Tu, Z. Yang, R. Hartley, Z. Xu, J. Zhang, D. Campbell, J. Singh, and T. Wang, “Probabilistic and semantic descriptions of image manifolds and their applications,” ArXiv, vol. abs/2307.02881, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259360837

[68] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[69] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.

[70] V. Papyan, X. Han, and D. L. Donoho, “Prevalence of neural collapse during the terminal phase of deep learning training,” Proceedings of the National Academy of Sciences of the United States of America, vol. 117, pp. 24 652–24 663, 2020.

[71] A. Rangamani, M. Lindegaard, T. Galanti, and T. A. Poggio, “Feature learning in deep classifiers through intermediate neural collapse,” in International Conference on Machine Learning, 2023.

[72] G. Andriopoulos, Z. Dong, L. Guo, Z. Zhao, and K. Ross, “The prevalence of neural collapse in neural multivariate regression,” ArXiv, vol. abs/2409.04180, 2024.

[73] P. C. Mahalanobis, “On the generalized distance in statistics.” [Online]. Available: https://api.semanticscholar.org/CorpusID:117765088

[75] J. Zhang, Y. Zhang, X. Cun, Y. Zhang, H. Zhao, H. Lu, X. Shen, and Y. Shan, “Generating human motion from textual descriptions with discrete representations,” in CVPR, 2023, pp. 14 730–14 740.

Tao Wang is currently pursuing a doctorate at Beijing University of Posts and Telecommunications (BUPT), Beijing, China. His major research areas include human p...
