pith. sign in

arxiv: 2605.14935 · v1 · pith:MRTVHWCYnew · submitted 2026-05-14 · 💻 cs.CV

Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

Pith reviewed 2026-06-30 21:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-motion generationhuman motion controlcoarse-to-fine modelingtoken guidancetest-time refinementcontrollable synthesisdiscrete token modelsmotion quality metrics
0
0 comments X

The pith

MSCoT uses multi-scale coarse-to-fine token prediction to generate text-controlled human motions with higher quality and tenfold faster inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MSCoT to create human motions from text while supporting flexible controls applied at test time. It represents motions as tokens arranged in a hierarchy of scales and generates complete sequences at each scale starting from coarse and moving to fine. A guidance method directs choices among discrete tokens to satisfy the controls, and a refiner adds small continuous adjustments to the embeddings. The design avoids repeated iterative steps required by diffusion models and does not need separate modules for different control signals. If the claims hold, controllable motion generation becomes quicker and more precise without custom tuning for each new constraint.

Core claim

MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, an efficient multi-scale token guidance strategy overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with the control objectives, producing quality motions consistent

What carries the argument

multi-scale hierarchical token representation with coarse-to-fine sequence prediction, multi-scale token guidance, and lightweight token refiner for continuous residuals

If this is right

  • Produces motions consistent with control constraints at test time without modules tailored to specific signals.
  • Achieves 48% improvement in motion quality measured by FID on HumanML3D.
  • Reduces average control error by 61% relative to existing baselines.
  • Delivers 10 times faster inference speed than diffusion-based methods on the same benchmark.
  • Enables controllable text-to-motion generation that maintains naturalness while meeting arbitrary goals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The coarse-to-fine token structure could be tested on longer motion sequences to check whether the speed advantage scales with sequence length.
  • Similar guidance and refinement steps might apply to other discrete token tasks such as music generation if the discretization challenges are comparable.
  • Real-time interactive control in animation tools becomes more practical if the inference gains hold under varying hardware conditions.
  • Combining the refiner with additional loss terms could be explored to handle conflicting control signals without retraining.

Load-bearing premise

The multi-scale token guidance can reliably steer discrete token distributions toward arbitrary control goals without post-hoc dataset-specific tuning or loss of naturalness, and the refiner's continuous residuals suffice to overcome codebook discretization limits.

What would settle it

Evaluating MSCoT on the HumanML3D benchmark and observing no 48% FID improvement, no 61% reduction in average control error, or no 10x inference speedup over baselines while still matching the stated control constraints would falsify the performance claims.

Figures

Figures reproduced from arXiv: 2605.14935 by Ajmal Mian, Anh Nguyen, Daochang Liu, Nhat Le.

Figure 1
Figure 1. Figure 1: Comparing our model to existing guidance techniques for controlling motion. (a) Diffusion model [90] directly perturbs dense motion in high-dimensional space and requires many denoising steps. (b) Next-token model [99] can only guide one token at a time and suffers from error accumulation. (c) Masked model [58] repeatedly unmasks only part of the token sequence at each step, leading to inconsistent guidanc… view at source ↗
Figure 2
Figure 2. Figure 2: Example applications of our MSCoT on test-time, training-free, human motion control: (a) Con￾trolling any joint at any time, (b) Obstacle avoidance, (c) Human-scene interaction. Best viewed in supple￾mentary video. rior of our method is close to the exact posterior, which guarantees the quality and feasibility of controlling test-time human motion. In summary, our key contributions are: (i) we introduce MS… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our multi-scale motion generation model. First, a multi-scale VQ-VAE encodes a motion sequence into K discrete token sequences z = {z1, z2, . . . , zK} at increasing temporal resolu￾tion. Next, a multi-scale autoregressive transformer takes the token sequences from all previous scales {[s], z1, z2, . . . , zk−1} and predicts the tokens for the next scale zk. The input and corresponding outputs … view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of our multi-scale guidance and refinement. At each scale, the guided token posterior distribution is updated by evaluating the control objective’s likelihood through a single decoder’s forward￾backward pass. A small token refiner then adds continuous residuals to the sampled token embeddings; at the last scale K, a test-time refinement optimization further ensures precise alignment with the c… view at source ↗
Figure 5
Figure 5. Figure 5: Time-frequency spectrogram of the intermediate motions across 10 scales. Darker region indicates lower energy of the human motion. standard unconstrained text-to-motion FID because the control alters the sample distribution, which deviates from natural motion trajectories to satisfy the joint constraints. From [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example of the coarse-to-fine motion at each scale under our guidance. The motion becomes more realistic and better follows the control path after each scale. generating motions that are both realistic and accurate with respect to the control signal. More illustrative results can be found in our demonstration video. 5 Discussions Limitations. While our method achieves encouraging results, it also has certa… view at source ↗
Figure 7
Figure 7. Figure 7: Codebook usage per scale. For joint-controlled generation, [PITH_FULL_IMAGE:figures/full_fig_p030_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of our token guidance behavior [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Trade-off between refinement iterations, control quality, and runtime [PITH_FULL_IMAGE:figures/full_fig_p032_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Failure cases where noisy pelvis trajectories (a) and implausible multi-joint targets (b) lead to unnatural motions under the text prompt "a person walks forward". Failure case. We show example failure cases in [PITH_FULL_IMAGE:figures/full_fig_p036_10.png] view at source ↗
read the original abstract

We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with the control objectives. MSCoT is able to produce quality motions, consistent with the control constraints, while offering substantially faster sampling than diffusion-based approaches. Experiments on popular benchmarks demonstrate state-of-the-art controllable text-to-motion generation performance of MSCoT over existing baselines, with better motion quality (48% FID improvement), higher control accuracy (-61% avg error), and $10 \times$ faster inference speed on HumanML3D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces MSCoT, a multi-scale coarse-to-fine model for test-time human motion synthesis and control. Motion is discretized into a hierarchical multi-scale token representation that is predicted coarse-to-fine. A multi-scale token guidance strategy steers the discrete token distribution toward control goals, and a lightweight token refiner adds continuous residuals to the embeddings to enable differentiable test-time optimization. On the HumanML3D benchmark the method is reported to achieve state-of-the-art controllable text-to-motion performance, with a 48% FID improvement, 61% reduction in average control error, and 10× faster inference relative to existing baselines.

Significance. If the reported gains are reproducible, the work would constitute a meaningful advance in efficient test-time controllable motion generation. Replacing iterative denoising with a single-pass hierarchical token prediction plus guidance yields substantial speed-ups while improving both quality and control accuracy; the combination of discrete tokens with continuous residual refinement offers a practical route around codebook discretization limits that may transfer to other discrete generative settings in computer vision.

minor comments (2)
  1. [Abstract] Abstract: quantitative claims (48% FID, -61% error, 10× speed) are presented without cross-references to the tables or sections that contain the supporting numbers, baseline descriptions, or error bars; adding such pointers would improve readability.
  2. The manuscript would benefit from an explicit statement of the exact set of baselines used for the SOTA claim and whether they were re-implemented or taken from prior reports.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary, which correctly captures the core ideas and reported results of MSCoT. The positive assessment of the potential advance is appreciated. No specific major comments appear in the provided report, so we have no individual points to rebut or revise.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description frame MSCoT as an empirical architecture for multi-scale token-based motion control, with performance claims resting on benchmark results (FID, control error, inference speed) rather than any closed-form derivation or prediction step. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the text; the method is presented as a set of design choices (hierarchical discretization, token guidance, lightweight refiner) validated externally on HumanML3D and similar datasets. The central claims are therefore falsifiable via independent replication and do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; the model introduces concepts such as multi-scale token guidance and token refiner whose implementation and fitting procedures are not specified.

pith-pipeline@v0.9.1-grok · 5748 in / 1040 out tokens · 26058 ms · 2026-06-30T21:46:21.428660+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

113 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    In: Interna- tional conference on 3D vision (3DV)

    Ahuja, C., Morency, L.P.: Language2pose: Natural language grounded pose forecasting. In: Interna- tional conference on 3D vision (3DV). pp. 719–728. IEEE (2019)

  2. [2]

    In: International Conference on 3D Vision (3DV)

    Aksan, E., Kaufmann, M., Cao, P., Hilliges, O.: A spatio-temporal transformer for 3d human motion prediction. In: International Conference on 3D Vision (3DV). pp. 565–574. IEEE (2021)

  3. [3]

    In: ICLR (2024)

    Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, R., Goldblum, M., Geiping, J., Goldstein, T.: Universal guidance for diffusion models. In: ICLR (2024)

  4. [4]

    In: CVPR

    Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: CVPR. pp. 18000–18010 (2023)

  5. [5]

    In: CVPR

    Dabral, R., Mughal, M.H., Golyanik, V., Theobalt, C.: Mofusion: A framework for denoising-diffusion- based motion synthesis. In: CVPR. pp. 9760–9770 (2023)

  6. [6]

    In: ECCV

    Dai, W., Chen, L.H., Wang, J., Liu, J., Dai, B., Tang, Y.: Motionlcm: Real-time controllable motion generation via latent consistency model. In: ECCV. pp. 390–408. Springer (2024)

  7. [7]

    In: NeurIPS (2021)

    Dhariwal, P., Nichol, A.Q.: Diffusion models beat GANs on image synthesis. In: NeurIPS (2021)

  8. [8]

    In: CVPR

    Diller, C., Dai, A.: Cg-hoi: Contact-guided 3d human-object interaction generation. In: CVPR. pp. 19888–19901 (2024)

  9. [9]

    In: CVPR

    Diomataris, M., Athanasiou, N., Taheri, O., Wang, X., Hilliges, O., Black, M.J.: Wandr: Intention- guided human motion generation. In: CVPR. pp. 927–936 (2024)

  10. [10]

    In: CVPR

    Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: CVPR. pp. 12873–12883 (2021)

  11. [11]

    In: CVPR

    Feng, Y., Lin, J., Dwivedi, S.K., Sun, Y., Patel, P., Black, M.J.: Chatpose: Chatting about 3d human pose. In: CVPR. pp. 2093–2103 (2024)

  12. [12]

    In: ICCV

    Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for human dynamics. In: ICCV. pp. 4346–4354 (2015)

  13. [13]

    arXiv preprint arXiv:2507.09122 , year=

    Guo, C., Hwang, I., Wang, J., Zhou, B.: Snapmogen: Human motion generation from expressive texts. arXiv preprint arXiv:2507.09122 (2025)

  14. [14]

    In: CVPR

    Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: Momask: Generative masked modeling of 3d human motions. In: CVPR. pp. 1900–1910 (2024)

  15. [15]

    In: ICLR (2024)

    Guo, C., Mu, Y., Zuo, X., Dai, P., Yan, Y., Lu, J., Cheng, L.: Generative human motion stylization in latent space. In: ICLR (2024)

  16. [16]

    In: CVPR

    Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: CVPR. pp. 5152–5161 (2022)

  17. [17]

    In: ECCV

    Guo, C., Zuo, X., Wang, S., Cheng, L.: Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. In: ECCV. pp. 580–597. Springer (2022)

  18. [19]

    In: ACM International Conference on Multimedia

    Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., Cheng, L.: Action2motion: Con- ditioned generation of 3d human motions. In: ACM International Conference on Multimedia. pp. 2021–2029 (2020)

  19. [20]

    In: ICCV

    Guo, Z., Hu, Z., Soh, D.W., Zhao, N.: Motionlab: Unified human motion generation and editing via the motion-condition-motion paradigm. In: ICCV. pp. 13869–13879 (2025)

  20. [21]

    In: AAAI

    Han, B., Peng, H., Dong, M., Ren, Y., Shen, Y., Xu, C.: Amd: Autoregressive motion diffusion. In: AAAI. vol. 38, pp. 2022–2030 (2024)

  21. [22]

    In: ICLR (2024)

    Han, I., Jayaram, R., Karbasi, A., Mirrokni, V., Woodruff, D., Zandieh, A.: Hyperattention: Long- context attention in near-linear time. In: ICLR (2024)

  22. [23]

    ACM Transactions on Graphics (ToG)35(4), 1–11 (2016)

    Holden, D., Saito, J., Komura, T.: A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (ToG)35(4), 1–11 (2016)

  23. [24]

    In: CVPR

    Hong, S., Kim, C., Yoon, S., Nam, J., Cha, S., Noh, J.: Salad: Skeleton-aware latent diffusion for text-driven motion generation and editing. In: CVPR. pp. 7158–7168 (2025) Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control 15

  24. [25]

    In: CVPR

    Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W., Zhu, S.C.: Diffusion-based generation, optimization, and planning in 3d scenes. In: CVPR. pp. 16750–16761 (2023)

  25. [26]

    In: ECCV

    Huang, Y., Wan, W., Yang, Y., Callison-Burch, C., Yatskar, M., Liu, L.: Como: Controllable motion generation through language guided pose code editing. In: ECCV. pp. 180–196. Springer (2024)

  26. [27]

    In: CVPR

    Ji, B., Pan, Y., Liu, Z., Tan, S., Jin, X., Yang, X.: Pomp: Physics-consistent motion generative model through phase manifolds. In: CVPR. pp. 22690–22701 (2025)

  27. [28]

    In: NeurIPS (2023)

    Jiang, B.,Chen,X.,Liu,W.,Yu,J.,Yu,G., Chen,T.:Motiongpt:Humanmotionasaforeignlanguage. In: NeurIPS (2023)

  28. [29]

    arXiv preprint arXiv:2501.19083 (2025)

    Jiang, L., Wei, Y., Ni, H.: Motionpcm: Real-time motion synthesis with phased consistency model. arXiv preprint arXiv:2501.19083 (2025)

  29. [30]

    In: CVPR

    Jiang, N., Zhang, Z., Li, H., Ma, X., Wang, Z., Chen, Y., Liu, T., Zhu, Y., Huang, S.: Scaling up dynamic human-scene interaction modeling. In: CVPR. pp. 1737–1747 (2024)

  30. [31]

    In: CVPR

    Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3d human dynamics from video. In: CVPR. pp. 5614–5623 (2019)

  31. [32]

    In: ICLR (2018)

    Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, sta- bility, and variation. In: ICLR (2018)

  32. [33]

    In: CVPR

    Karunratanakul, K., Preechakul, K., Aksan, E., Beeler, T., Suwajanakorn, S., Tang, S.: Optimizing diffusion noise can serve as universal motion priors. In: CVPR. pp. 1334–1345 (2024)

  33. [34]

    In: ICCV

    Karunratanakul, K., Preechakul, K., Suwajanakorn, S., Tang, S.: Guided motion diffusion for control- lable human motion synthesis. In: ICCV. pp. 2151–2162 (2023)

  34. [35]

    In: CVPR

    Kim, B., Jeong, H.I., Sung, J., Cheng, Y., Lee, J., Chang, J.Y., Choi, S.I., Choi, Y., Shin, S., Kim, J., et al.: Personabooth: Personalized text-to-motion generation. In: CVPR. pp. 22756–22765 (2025)

  35. [36]

    In: ICCV

    Kong, H., Gong, K., Lian, D., Mi, M.B., Wang, X.: Priority-centric human motion generation in discrete latent space. In: ICCV. pp. 14806–14816 (2023)

  36. [37]

    In: CVPR

    Kulkarni, N., Rempe, D., Genova, K., Kundu, A., Johnson, J., Fouhey, D., Guibas, L.: Nifty: Neural object interaction fields for guided human motion synthesis. In: CVPR. pp. 947–957 (2024)

  37. [38]

    In: CVPR

    Lee, D., Kim, C., Kim, S., Cho, M., Han, W.S.: Autoregressive image generation using residual quan- tization. In: CVPR. pp. 11523–11532 (2022)

  38. [39]

    In: ECCV

    Li, J., Clegg, A., Mottaghi, R., Wu, J., Puig, X., Liu, C.K.: Controllable human-object interaction synthesis. In: ECCV. pp. 54–72. Springer (2024)

  39. [40]

    In: ICCV

    Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Ai choreographer: Music conditioned 3d dance generation with aist++. In: ICCV. pp. 13401–13412 (2021)

  40. [41]

    In: ICLR (2025)

    Li, Z., Yuan, W., HE, Y., Qiu, L., Zhu, S., Gu, X., Shen, W., Dong, Y., Dong, Z., Yang, L.T.: LaMP: Language-motion pretraining for motion generation, retrieval, and captioning. In: ICLR (2025)

  41. [42]

    In: ICCV

    Li, Z., Luo, M., Hou, R., Zhao, X., Liu, H., Chang, H., Liu, Z., Li, C.: Morph: A motion-free physics optimization framework for human motion generation. In: ICCV. pp. 14580–14589 (2025)

  42. [43]

    In: CVPR

    Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. pp. 2117–2125 (2017)

  43. [44]

    In: CVPR

    Liu, H., Zhan, X., Huang, S., Mu, T.J., Shan, Y.: Programmable motion generation for open-set motion control tasks. In: CVPR. pp. 1399–1408 (2024)

  44. [45]

    In: ICLR (2019),https : / / openreview.net/forum?id=Bkg6RiCqY7

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019),https : / / openreview.net/forum?id=Bkg6RiCqY7

  45. [46]

    In: CVPR

    Lu, S., Wang, J., Lu, Z., Chen, L.H., Dai, W., Dong, J., Dou, Z., Dai, B., Zhang, R.: Scamo: Exploring the scaling law in autoregressive motion generation model. In: CVPR. pp. 27872–27882 (2025)

  46. [47]

    In: ECCV

    Lucas, T., Baradel, F., Weinzaepfel, P., Rogez, G.: Posegpt: Quantization-based 3d human motion generation and forecasting. In: ECCV. pp. 417–435. Springer (2022)

  47. [48]

    In: ICCV

    Luo, Z., Cao, J., Kitani, K., Xu, W., et al.: Perpetual humanoid control for real-time simulated avatars. In: ICCV. pp. 10895–10904 (2023)

  48. [49]

    Journal of Machine Learning Research 9(86), 2579–2605 (2008) 16 N

    van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of Machine Learning Research 9(86), 2579–2605 (2008) 16 N. Le et al

  49. [50]

    In: ICCV

    Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes. In: ICCV. pp. 5442–5451 (2019)

  50. [51]

    In: ICCV

    Mao, W., Liu, M., Salzmann, M., Li, H.: Learning trajectory dependencies for human motion predic- tion. In: ICCV. pp. 9489–9497 (2019)

  51. [52]

    In: CVPR

    Meng, Z., Xie, Y., Peng, X., Han, Z., Jiang, H.: Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression. In: CVPR. pp. 27859– 27871 (2025)

  52. [53]

    In: ICLR (2024)

    Mentzer, F., Minnen, D., Agustsson, E., Tschannen, M.: Finite scalar quantization: VQ-VAE made simple. In: ICLR (2024)

  53. [54]

    ACM Transactions On Graphics (TOG)37(4), 1–14 (2018)

    Peng, X.B., Abbeel, P., Levine, S., Van de Panne, M.: Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG)37(4), 1–14 (2018)

  54. [55]

    ACM Transactions on Graphics (ToG)40(4), 1–20 (2021)

    Peng, X.B., Ma, Z., Abbeel, P., Levine, S., Kanazawa, A.: Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG)40(4), 1–20 (2021)

  55. [56]

    In: ICCV

    Petrovich, M., Black, M.J., Varol, G.: Action-conditioned 3d human motion synthesis with transformer vae. In: ICCV. pp. 10985–10995 (2021)

  56. [57]

    In: ECCV

    Petrovich, M., Black, M.J., Varol, G.: Temos: Generating diverse human motions from textual de- scriptions. In: ECCV. pp. 480–497. Springer (2022)

  57. [58]

    In: ICCV

    Pinyoanuntapong, E., Saleem, M., Karunratanakul, K., Wang, P., Xue, H., Chen, C., Guo, C., Cao, J., Ren, J., Tulyakov, S.: Maskcontrol: Spatio-temporal control for masked motion synthesis. In: ICCV. pp. 9955–9965 (2025)

  58. [59]

    In: CVPR

    Pinyoanuntapong, E., Wang, P., Lee, M., Chen, C.: Mmm: Generative masked motion model. In: CVPR. pp. 1546–1555 (2024)

  59. [60]

    Big data4(4), 236–252 (2016)

    Plappert, M., Mandery, C., Asfour, T.: The kit motion-language dataset. Big data4(4), 236–252 (2016)

  60. [61]

    In: CVPR

    Raab, S., Leibovitch, I., Li, P., Aberman, K., Sorkine-Hornung, O., Cohen-Or, D.: Modi: Unconditional motion synthesis from diverse data. In: CVPR. pp. 13873–13883 (2023)

  61. [62]

    In: ICML

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763 (2021)

  62. [63]

    In: ICCV

    Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: Humor: 3d human motion model for robust pose estimation. In: ICCV. pp. 11488–11499 (2021)

  63. [64]

    In: CVPR

    Rempe, D., Luo, Z., Bin Peng, X., Yuan, Y., Kitani, K., Kreis, K., Fidler, S., Litany, O.: Trace and pace: Controllable pedestrian animation via guided trajectory diffusion. In: CVPR. pp. 13756–13766 (2023)

  64. [65]

    In: International Conference on Medical Image Computing and Computer-Assisted Intervention

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmenta- tion. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)

  65. [66]

    An overview of gradient descent optimization algorithms

    Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016)

  66. [67]

    In: ICLR (2024)

    Shafir, Y., Tevet, G., Kapon, R., Bermano, A.H.: Human motion diffusion as a generative prior. In: ICLR (2024)

  67. [68]

    In: ICCV

    Shi, M., Starke, S., Ye, Y., Komura, T., Won, J.: Phasemp: Robust 3d pose estimation via phase- conditioned human motion prior. In: ICCV. pp. 14725–14737 (2023)

  68. [69]

    ACM Transactions on Graphics (TOG)43(4), 1–14 (2024)

    Shi, Y., Wang, J., Jiang, X., Lin, B., Dai, B., Peng, X.B.: Interactive character control with auto- regressive motion diffusion models. ACM Transactions on Graphics (TOG)43(4), 1–14 (2024)

  69. [70]

    In: CVPR

    Siyao,L.,Yu,W.,Gu,T.,Lin,C.,Wang,Q.,Qian,C.,Loy,C.C.,Liu,Z.:Bailando:3ddancegeneration by actor-critic gpt with choreographic memory. In: CVPR. pp. 11050–11059 (2022)

  70. [71]

    In: CVPR

    Song, W., Jin, X., Li, S., Chen, C., Hao, A., Hou, X., Li, N., Qin, H.: Arbitrary motion style transfer with multi-condition motion latent diffusion model. In: CVPR. pp. 821–830 (2024)

  71. [72]

    In: CVPR

    Taheri, O., Choutas, V., Black, M.J., Tzionas, D.: Goal: Generating 4d whole-body motion for hand- object grasping. In: CVPR. pp. 13263–13273 (2022) Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control 17

  72. [73]

    In: ICLR (2025)

    Tan, W., Li, B., Jin, C., Huang, W., Wang, X., Song, R.: Think then react: Towards unconstrained action-to-reaction motion generation. In: ICLR (2025)

  73. [74]

    In: ICLR (2025)

    Tevet, G., Raab, S., Cohan, S., Reda, D., Luo, Z., Peng, X.B., Bermano, A.H., van de Panne, M.: CLoSD: Closing the loop between simulation and diffusion for multi-task character control. In: ICLR (2025)

  74. [75]

    In: ICLR (2023)

    Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. In: ICLR (2023)

  75. [76]

    NeurIPS37, 84839–84865 (2024)

    Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalable image generation via next-scale prediction. NeurIPS37, 84839–84865 (2024)

  76. [77]

    In: CVPR

    Tseng, J., Castellon, R., Liu, K.: Edge: Editable dance generation from music. In: CVPR. pp. 448–458 (2023)

  77. [78]

    In: NeurIPS (2017)

    Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. In: NeurIPS (2017)

  78. [79]

    In: NeurIPS (2017)

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)

  79. [80]

    Foundations and Trends in Machine Learning1(1-2), 1–305 (2008)

    Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning1(1-2), 1–305 (2008)

  80. [81]

    In: ECCV

    Wan, W., Dou, Z., Komura, T., Wang, W., Jayaraman, D., Liu, L.: Tlcontrol: Trajectory and language control for human motion synthesis. In: ECCV. pp. 37–54. Springer (2024)

Showing first 80 references.