pith. machine review for the scientific record.

arxiv: 2605.11704 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords: human motion generation · text-to-motion · autoregressive prediction · multi-scale tokens · skeletal hierarchy · discrete quantization · motion editing

The pith

ScaleMoGen generates human motions by autoregressively predicting discrete tokens from coarse to fine skeletal-temporal scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ScaleMoGen treats text-driven human motion generation as a coarse-to-fine process instead of standard next-token prediction. It first quantizes 3D motions into compositional discrete tokens that span multiple skeletal-temporal scales of increasing detail, then trains the model to predict the token map for each next scale in sequence. The tokenizers are built to keep the body's skeletal hierarchy intact at every level, while bitwise quantization enlarges the vocabulary and keeps training stable for detailed motions. If the approach holds, the result is higher-fidelity motions that also support direct text-based edits without any additional training. The method records an FID of 0.030 on HumanML3D and a CLIP score of 0.693 on SnapMoGen, both ahead of prior autoregressive baselines.
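As a reading aid, here is a minimal Python sketch of what binary multi-scale residual quantization of this kind could look like. The scale schedule, sign-based quantizer, and nearest-neighbor resampling are illustrative assumptions; the paper's tokenizer also partitions the skeleton per scale, which is omitted here.

```python
import numpy as np

def binary_multiscale_quantize(f, scales):
    """Decompose a latent grid f of shape (T, J, D) into binary residual
    token maps at increasing temporal resolutions (coarse -> fine)."""
    T = f.shape[0]
    residual = f.copy()
    token_maps = []
    for t_res in scales:                              # e.g. [1, 4, 16, T]
        # downsample the residual in time to this scale's resolution
        down_idx = np.linspace(0, T - 1, t_res).round().astype(int)
        coarse = residual[down_idx]                   # (t_res, J, D)
        bits = np.where(coarse >= 0, 1.0, -1.0)       # one bit per channel
        # upsample the quantized map back and subtract: the next scale
        # only sees what this scale failed to explain
        up_idx = np.linspace(0, t_res - 1, T).round().astype(int)
        residual = residual - bits[up_idx]
        token_maps.append(bits)
    return token_maps, residual                       # residual = leftover error

# usage: a 64-frame, 22-joint, 8-dim latent quantized over four scales
f = np.random.randn(64, 22, 8)
maps, err = binary_multiscale_quantize(f, [1, 4, 16, 64])
```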

Core claim

ScaleMoGen frames motion generation as autoregressive next-scale prediction: 3D motions are quantized into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, and the model learns to generate motion by predicting the next-scale token maps. Motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Bitwise quantization and prediction are used to scale up the tokenizer vocabulary while preserving motion details and stabilizing optimization.
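To sketch how that prediction loop differs from next-token decoding, here is a minimal, assumption-laden outline: the transformer is replaced by a stub that samples random bits, and the scale schedule (1, 4, 16, 64) is invented for illustration. The point is structural: generation takes one forward pass per scale, conditioned on the whole coarser-scale history, rather than one pass per token.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_next_scale(coarser_maps, text_emb, t_res, n_joints=22, n_bits=8):
    """Stand-in for the transformer: samples random +/-1 bits. The real
    model would condition on text_emb and every coarser-scale map."""
    return rng.integers(0, 2, size=(t_res, n_joints, n_bits)) * 2.0 - 1.0

def generate(text_emb, scales=(1, 4, 16, 64)):
    token_maps = []          # coarse -> fine history, grows each step
    for t_res in scales:     # one forward pass per scale, not per token
        token_maps.append(predict_next_scale(token_maps, text_emb, t_res))
    return token_maps        # the motion decoder would consume these

pyramid = generate(text_emb=np.zeros(512))
```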

What carries the argument

scale-wise autoregressive prediction of next-scale token maps from multi-scale motion tokenizers that preserve skeletal hierarchy

Load-bearing premise

Quantizing 3D motions into compositional discrete tokens across multiple skeletal-temporal scales preserves the skeletal hierarchy and motion details without loss.

What would settle it

Generate motions on a held-out set of complex actions and measure whether the skeletal joint angles or bone lengths at the finest scale deviate from ground-truth values by more than the reported baseline error.
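A sketch of how such a check could be scored, assuming decoded joint positions and a known kinematic tree. The per-bone standard deviation statistic and the toy skeleton are assumptions, not the paper's protocol:

```python
import numpy as np

def bone_length_deviation(joints, parents):
    """joints: (T, J, 3) decoded joint positions; parents[j] is the parent
    index of joint j, with -1 at the root. Returns each bone's standard
    deviation of length over time; a rigid skeleton would give ~0."""
    devs = {}
    for j, p in enumerate(parents):
        if p < 0:
            continue                                   # skip the root
        lengths = np.linalg.norm(joints[:, j] - joints[:, p], axis=-1)
        devs[j] = lengths.std()
    return devs

# usage on a toy 3-joint chain: root -> joint 1 -> joint 2
motion = np.random.randn(120, 3, 3)
print(bone_length_deviation(motion, parents=[-1, 0, 1]))
```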

Figures

Figures reproduced from arXiv: 2605.11704 by Bing Zhou, Chuan Guo, Hojun Jang, Inwoo Hwang, Jian Wang, Young Min Kim.

Figure 1
Figure 1. Overview of our skeletal-temporal multi-scale motion quantization pipeline. Given an input motion sequence $m$, the encoder $E$ maps it to a continuous skeletal-temporal latent grid $f$. The latent is decomposed into a hierarchy of residual components $\{q^v\}_{v=0}^{V}$ via binary multi-scale residual quantization, where each scale has its own temporal resolution and skeletal partition. The quantized residuals are th…
Figure 2
Figure 2. (a) Text-to-Motion Generation: Given a prompt $c_s$, we autoregressively predict the next-scale token maps $\{q_s^v\}_{v=0}^{V}$ conditioned on all coarser-scale token maps. (b) Text-Driven Motion Editing: With an additional target prompt $c_t$ and a source-token preservation mask $\{M^v\}_{v=0}^{V}$, we predict edited tokens $q_t'^{(v)}$ conditioned on the remaining source motion context and $c_t$. The target token maps $\{q_t^v\}$…
Figure 3
Figure 3. Qualitative text-to-motion generation results of ScaleMoGen. Given highly descriptive, long-form text prompts, ScaleMoGen accurately synthesizes complex sequences of actions (top-left), fine-grained body-part articulations (top-right), and timely executed motions with precise spatial constraints (bottom).
Figure 4
Figure 4. Qualitative results of text-driven motion editing. Given a source motion and a new target description, ScaleMoGen accurately synthesizes the desired semantic changes, while preserving the identity and unrelated behaviors of the original source motion.
Figure 5
Figure 5. The full-body skeleton is spatially downsampled by merging adjacent joints' information into coarser anatomical groups, as indicated by the color-coded regions. The initial full-resolution skeleton accommodates both the 22-joint HumanML3D format and the 24-joint SnapMoGen format. Through successive pooling stages, the model effectively captures skeletal complexity and represents the human body in an atomi…
Figure 6
Figure 6. Visualization of the intermediate accumulated token-map motion reconstruction. The coarse token map captures global motions, which progressively disentangle into independent, fine-grained joint movements at finer scales, restoring full motion realism.
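As a way to picture the spatial downsampling Figure 5 describes, the sketch below average-pools per-joint features into coarser anatomical groups. The five-part grouping and the SMPL-style joint indices are assumed for illustration; the paper's exact partitions are not given in this excerpt.

```python
import numpy as np

# Assumed five-part grouping of the 22 HumanML3D joints (SMPL-style indices).
GROUPS = {
    "torso":     [0, 3, 6, 9, 12, 15],
    "left_leg":  [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
    "left_arm":  [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
}

def pool_skeleton(features):
    """features: (T, 22, D) per-joint features. Average-pool each
    anatomical group into a single coarse 'joint', giving (T, 5, D)."""
    pooled = [features[:, idx].mean(axis=1) for idx in GROUPS.values()]
    return np.stack(pooled, axis=1)

coarse = pool_skeleton(np.random.randn(64, 22, 16))   # -> (64, 5, 16)
```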
read the original abstract

We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.
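The abstract's vocabulary-scaling claim is easiest to see in miniature: with bitwise quantization, b binary channels index an implicit codebook of 2^b entries without storing or learning a codebook. Below is a minimal sign-and-normalize sketch in the style of binary spherical quantization [45]; the channel count and normalization are illustrative assumptions.

```python
import numpy as np

def bitwise_quantize(z):
    """Sign-quantize each latent channel to +/-1 (one bit per channel),
    then project onto the unit hypersphere. b channels index an implicit
    vocabulary of 2**b codes with no learned codebook."""
    bits = np.where(z >= 0, 1.0, -1.0)
    return bits / np.sqrt(z.shape[-1])

z = np.random.randn(16)        # 16 channels -> 2**16 = 65,536 implicit codes
code = bitwise_quantize(z)
```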

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Motions are quantized into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity; the model autoregressively predicts next-scale token maps while using bitwise quantization to enlarge vocabulary and stabilize optimization. Tokenizers are designed to preserve skeletal hierarchy at every scale. Experiments report SOTA results (FID 0.030 vs. MoMask 0.045 on HumanML3D; CLIP Score 0.693 vs. MoMask++ 0.685 on SnapMoGen) and demonstrate training-free text-guided editing.

Significance. If the multi-scale quantization and bitwise scheme function as described, the work supplies a concrete coarse-to-fine autoregressive alternative to standard next-token motion models, with measurable metric gains and a practical editing capability that prior single-scale tokenizers lack. The explicit hierarchy-preserving design and empirical comparisons constitute the primary strengths.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the reported FID improvement (0.030 vs. 0.045) and CLIP-score gain are presented without error bars, multiple random seeds, or explicit dataset-split details; this weakens the claim that the multi-scale construction is responsible for the gains rather than training-protocol differences.
  2. [§3.2] §3.2 (Motion Tokenizer): the statement that tokens at every scale 'strictly preserve the skeletal hierarchy' is load-bearing for the central claim yet lacks an explicit equation or algorithm showing how the compositional quantization enforces this property (e.g., no definition of the per-scale skeletal constraint or proof of invariance under bitwise operations).
minor comments (2)
  1. [§3.3] §3.3: clarify the exact vocabulary sizes chosen for bitwise quantization and report an ablation on the number of skeletal-temporal scales, as these are the two free parameters listed in the design.
  2. [Figure 3 and §4.3] Figure 3 and §4.3: the qualitative editing examples would benefit from side-by-side comparison with a fine-tuned baseline to illustrate the training-free advantage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported FID improvement (0.030 vs. 0.045) and CLIP-score gain are presented without error bars, multiple random seeds, or explicit dataset-split details; this weakens the claim that the multi-scale construction is responsible for the gains rather than training-protocol differences.

    Authors: We agree that reporting error bars, multiple random seeds, and explicit dataset-split details would strengthen the statistical robustness of the results. In the revised version, we will add standard deviations computed over at least three independent runs with different seeds, clarify the exact train/validation/test splits used on HumanML3D and SnapMoGen, and include a brief discussion confirming that all baselines were re-evaluated under identical protocols. While the primary gains are attributable to the multi-scale architecture (as supported by our ablation studies), we acknowledge that these additions will better isolate the contribution of the proposed components from training variations. revision: yes

  2. Referee: [§3.2] §3.2 (Motion Tokenizer): the statement that tokens at every scale 'strictly preserve the skeletal hierarchy' is load-bearing for the central claim yet lacks an explicit equation or algorithm showing how the compositional quantization enforces this property (e.g., no definition of the per-scale skeletal constraint or proof of invariance under bitwise operations).

    Authors: We concur that an explicit mathematical formulation is needed to substantiate this key property. The preservation arises because each scale's tokenizer operates on a hierarchical skeletal graph where parent joints are quantized before children, and bitwise quantization is applied independently per scale without cross-scale mixing. In the revision, we will insert a formal definition in §3.2 (including the per-scale constraint equation and a short invariance argument under bitwise operations) together with a pseudocode outline of the quantization algorithm to make the enforcement mechanism fully transparent. revision: yes
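To make the described ordering concrete, here is a hypothetical per-scale pass: each child joint is quantized against its already-quantized parent, which is one plausible reading of the enforcement mechanism. This is a reviewer's sketch, not the paper's algorithm; the formal constraint is what the promised revision would supply.

```python
import numpy as np

def depth(j, parents):
    """Number of hops from joint j up to its root."""
    d = 0
    while parents[j] >= 0:
        j, d = parents[j], d + 1
    return d

def quantize_scale_hierarchically(latent, parents):
    """One scale's pass over a (J, D) latent grid: visit parents before
    children so each child is quantized relative to its already-quantized
    parent, keeping the kinematic chain consistent within the scale."""
    J = latent.shape[0]
    tokens = np.zeros_like(latent)
    for j in sorted(range(J), key=lambda j: depth(j, parents)):  # roots first
        ref = tokens[parents[j]] if parents[j] >= 0 else 0.0
        tokens[j] = np.where(latent[j] - ref >= 0, 1.0, -1.0)    # bitwise
    return tokens

# usage on a toy 3-joint chain: root -> joint 1 -> joint 2
toks = quantize_scale_hierarchically(np.random.randn(3, 8), parents=[-1, 0, 1])
```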

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces ScaleMoGen as a new autoregressive framework that quantizes motions into multi-scale discrete tokens and predicts next-scale maps, with performance validated through direct empirical comparisons (FID, CLIP scores) against prior published methods on HumanML3D and SnapMoGen. It presents no equations or derivations that reduce predictions to fitted parameters defined by the same inputs, no self-citations that bear the central claim, and no ansatzes smuggled in via the authors' prior work. The quantization design is described as explicitly constructed to preserve hierarchy, but this is an architectural choice evaluated externally rather than a tautological self-definition. The approach is self-contained against benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach depends on the design choice that multi-scale quantization preserves skeletal structure and on hyperparameters for the number of scales and vocabulary size.

free parameters (2)
  • number of skeletal-temporal scales
    Chosen to enable coarse-to-fine generation while preserving hierarchy.
  • bitwise quantization vocabulary size
    Scaled to preserve motion details without destabilizing training.
axioms (1)
  • domain assumption: Discrete tokens at every scale strictly preserve the skeletal hierarchy
    Stated as an explicit design requirement for the motion tokenizers and quantizers.
invented entities (1)
  • compositional discrete tokens across multiple skeletal-temporal scales (no independent evidence)
    purpose: Enable autoregressive next-scale prediction while maintaining structural integrity
    New representation introduced for the coarse-to-fine motion generation process

pith-pipeline@v0.9.0 · 5504 in / 1248 out tokens · 52953 ms · 2026-05-13T06:20:11.668434+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 3 internal anchors

  1. [1] Athanasiou, N., Ceske, A., Diomataris, M., Black, M.J., Varol, G.: MotionFix: Text-driven 3D human motion editing. In: SIGGRAPH Asia 2024 Conference Papers (2024)

  2. [2] Bae, J., Hwang, I., Lee, Y.Y., Guo, Z., Liu, J., Ben-Shabat, Y., Kim, Y.M., Kapadia, M.: Less is more: Improving motion diffusion models with sparse keyframes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11069–11078 (2025)

  3. [3] Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18000–18010 (2023)

  4. [4] Ghosh, A., Zhou, B., Dabral, R., Wang, J., Golyanik, V., Theobalt, C., Slusallek, P., Guo, C.: DuetGen: Music driven two-person dance generation via hierarchical masked modeling. In: ACM SIGGRAPH (2025)

  5. [5] Guo, C., Hwang, I., Wang, J., Zhou, B.: SnapMoGen: Human motion generation from expressive texts. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=pdE9onSn2h

  6. [6] Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: MoMask: Generative masked modeling of 3D human motions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1900–1910 (2024)

  7. [7] Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3D human motions from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5152–5161 (2022)

  8. [8] Guo, C., Zuo, X., Wang, S., Cheng, L.: TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In: European Conference on Computer Vision. pp. 580–597. Springer (2022)

  9. [9] Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., Cheng, L.: Action2Motion: Conditioned generation of 3D human motions. In: Proceedings of the 28th ACM International Conference on Multimedia. pp. 2021–2029 (2020)

  10. [10] Han, J., Liu, J., Jiang, Y., Yan, B., Zhang, Y., Yuan, Z., Peng, B., Liu, X.: Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15733–15744 (2025)

  11. [11] Han, S.H.K., et al.: BAD: Bidirectional auto-regressive diffusion for text-to-motion generation. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2025)

  12. [12] Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T.B., Dhariwal, P., Gray, S., et al.: Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701 (2020)

  13. [13] Heo, B., Park, S., Han, D., Yun, S.: Rotary position embedding for vision transformer. In: European Conference on Computer Vision (ECCV) (2024)

  14. [14] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control (2022)

  15. [15] Ho, J., Salimans, T.: Classifier-free diffusion guidance (2022), https://arxiv.org/abs/2207.12598

  16. [16] Hong, S., Kim, C., Yoon, S., Nam, J., Cha, S., Noh, J.: SALAD: Skeleton-aware latent diffusion for text-driven motion generation and editing. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 7158–7168 (2025)

  17. [17] Huang, Y., Yang, H., Luo, C., Wang, Y., Xu, S., Zhang, Z., Zhang, M., Peng, J.: StableMoFusion: Towards robust and efficient diffusion-based motion generation framework. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 224–232 (2024)

  18. [18] Hwang, I., Bae, J., Lim, D., Kim, Y.M.: Goal-driven human motion synthesis in diverse task. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops. pp. 2920–2930 (2025)

  19. [19] Hwang, I., Bae, J., Lim, D., Kim, Y.M.: Motion synthesis with sparse and flexible keyjoint control. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13203–13213 (2025)

  20. [20] Hwang, I., Zhou, B., Kim, Y.M., Wang, J., Guo, C.: SceneMI: Motion in-betweening for modeling human-scene interaction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6034–6045 (2025)

  21. [21] Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: Human motion as a foreign language. Advances in Neural Information Processing Systems 36, 20067–20079 (2023)

  22. [22] Kim, J., Kim, J., Choi, S.: FLAME: Free-form language-based motion synthesis & editing. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 8255–8263 (2023)

  23. [23] Li, P., Aberman, K., Zhang, Z., Hanocka, R., Sorkine-Hornung, O.: GANimator: Neural motion synthesis from a single sequence. ACM Transactions on Graphics (TOG) 41(4), 1–12 (2022)

  24. [24] Li, Z., Cheng, K., Ghosh, A., Bhattacharya, U., Gui, L., Bera, A.: SimMotionEdit: Text-based human motion editing with motion similarity prediction (2025)

  25. [25] Lu, S., Wang, J., Lu, Z., Chen, L.H., Dai, W., Dong, J., Dou, Z., Dai, B., Zhang, R.: ScaMo: Exploring the scaling law in autoregressive motion generation model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27872–27882 (2025)

  26. [26] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5442–5451 (2019)

  27. [27] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2022)

  28. [28] Meng, Z., Xie, Y., Peng, X., Han, Z., Jiang, H.: Rethinking diffusion for text-driven human motion generation. arXiv preprint arXiv:2411.16575 (2024)

  29. [29] Petrovich, M., Black, M.J., Varol, G.: TEMOS: Generating diverse human motions from textual descriptions. In: European Conference on Computer Vision. pp. 480–497. Springer (2022)

  30. [30] …

  31. [31] Petrovich, M., Black, M.J., Varol, G.: TMR: Text-to-motion retrieval using contrastive 3D human motion synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9488–9497 (2023)

  32. [32] Pinyoanuntapong, E., Saleem, M.U., Wang, P., Lee, M., Das, S., Chen, C.: BAMM: Bidirectional autoregressive motion model. In: Computer Vision – ECCV 2024 (2024)

  33. [33] Pinyoanuntapong, E., Wang, P., Lee, M., Chen, C.: MMM: Generative masked motion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1546–1555 (2024)

  34. [34] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), 1–67 (2020), http://jmlr.org/papers/v21/20-074.html

  35. [35] Sui, K., Ghosh, A., Hwang, I., Wang, J., Guo, C.: A survey on human interaction motion generation (2025), https://arxiv.org/abs/2503.12763

  36. [36] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)

  37. [37] Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalable image generation via next-scale prediction. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=gojL67CfS8

  38. [38] Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in Neural Information Processing Systems 30 (2017)

  39. [39] Wang, Y., Guo, L., Li, Z., Huang, J., Wang, P., Wen, B., Wang, J.: Training-free text-guided image editing with visual autoregressive model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17577–17586 (2025)

  40. [40] Yuan, W., Shen, W., He, Y., Dong, Y., Gu, X., Dong, Z., Bo, L., Huang, Q.: MoGenTS: Motion generation based on spatial-temporal joint modeling. Neural Information Processing Systems (NeurIPS) (2024)

  41. [41] Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2M-GPT: Generating human motion from textual descriptions with discrete representations. arXiv preprint arXiv:2301.06052 (2023)

  42. [42] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: MotionDiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)

  43. [43] Zhang, M., Guo, X., Pan, L., Cai, Z., Hong, F., Li, H., Yang, L., Liu, Z.: ReMoDiffuse: Retrieval-augmented motion diffusion model. arXiv preprint arXiv:2304.01116 (2023)

  44. [44] Zhang, M., Li, H., Cai, Z., Ren, J., Yang, L., Liu, Z.: FineMoGen: Fine-grained spatio-temporal motion generation and editing. Advances in Neural Information Processing Systems 36, 13981–13992 (2023)

  45. [45] Zhao, Y., Xiong, Y., Krähenbühl, P.: Image and video tokenization with binary spherical quantization. arXiv preprint arXiv:2406.07548 (2024)