pith. machine review for the scientific record.

arxiv: 2604.25164 · v1 · submitted 2026-04-28 · 💻 cs.CV

Recognition: unknown

IAM: Identity-Aware Human Motion and Shape Joint Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 16:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords human motion generation · identity-aware modeling · joint shape-motion synthesis · text-to-motion · body morphology · multimodal identity · motion realism

The pith

Body shape and identity cues must be modeled together to generate realistic human motions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing text-to-motion models assume a standard body shape and produce movements that ignore individual differences in proportions and mass. This leads to motions that look physically wrong for the described person. The paper introduces a framework that takes multimodal identity information—text descriptions plus visual references—and generates both the motion sequence and the body shape parameters at the same time. Identity information directly adjusts the dynamics so that tall or heavy bodies move appropriately. Tests on captured motions and everyday videos show higher consistency between the generated movement and the person's build.

Core claim

The proposed identity-aware framework represents identity via multimodal signals and uses a joint generation process for motion sequences and body shape parameters, enabling identity to modulate motion dynamics directly.

What carries the argument

The joint motion-shape generation paradigm that synthesizes both outputs simultaneously so identity cues influence the motion.
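
Read literally, this paradigm maps onto a single denoising network with two output heads. The sketch below is a minimal illustration under our own assumptions, not the authors' implementation: the module names, the fusion scheme, and the dimensions (a 263-dim HumanML3D-style motion feature, 10 SMPL-style shape coefficients, 512-dim identity embeddings) are all hypothetical.

```python
# Minimal sketch of joint motion-shape generation (hypothetical, not the
# authors' code): one conditioning pathway feeds identity cues into a
# denoiser that outputs a motion sequence and shape parameters together.
import torch
import torch.nn as nn

class JointMotionShapeDenoiser(nn.Module):
    def __init__(self, motion_dim=263, shape_dim=10, ident_dim=512, hidden=512):
        super().__init__()
        # Fuse text and image identity embeddings (from frozen encoders).
        self.ident_proj = nn.Linear(2 * ident_dim, hidden)
        self.motion_in = nn.Linear(motion_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.motion_out = nn.Linear(hidden, motion_dim)  # denoised motion frames
        self.shape_out = nn.Linear(hidden, shape_dim)    # SMPL-style beta

    def forward(self, noisy_motion, text_emb, image_emb):
        # noisy_motion: (B, T, motion_dim); embeddings: (B, ident_dim)
        ident = self.ident_proj(torch.cat([text_emb, image_emb], dim=-1))
        tokens = self.motion_in(noisy_motion) + ident.unsqueeze(1)  # broadcast identity over time
        h = self.backbone(tokens)
        motion = self.motion_out(h)             # (B, T, motion_dim)
        shape = self.shape_out(h.mean(dim=1))   # (B, shape_dim), pooled over time
        return motion, shape

model = JointMotionShapeDenoiser()
motion, beta = model(torch.randn(2, 60, 263), torch.randn(2, 512), torch.randn(2, 512))
print(motion.shape, beta.shape)  # torch.Size([2, 60, 263]) torch.Size([2, 10])
```

The design point the pith highlights sits in the two output heads: both read the same identity-conditioned features, so the generated shape and the generated dynamics cannot drift apart by construction.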

If this is right

  • Motions become consistent with the performer's body proportions and mass distribution.
  • Natural language and image cues suffice to control shape-specific dynamics.
  • Generated sequences maintain high quality while improving identity match.
  • Performance holds on both lab motion capture data and in-the-wild videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications in animation could personalize characters without manual retargeting.
  • Future systems might use this to correct motion capture for different actor builds.
  • Testing across age groups and body types could expose dataset biases.
  • Integration with real-time video input may enable live identity transfer in AR.

Load-bearing premise

Multimodal identity signals from language and visuals are enough to capture body morphology effects on motion without explicit measurements.

What would settle it

Compare generated motions for the same text prompt but different body shapes against real recordings of people with those shapes performing the action; the identity-aware outputs should align better in joint angles and timing.
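
One way to operationalize that test, as a hedged sketch: score each generated clip against a real recording of a person with the matching build on per-frame joint-angle error and on a dynamic-time-warping cost as a timing proxy. The metric choices and array layouts below are our assumptions, not the paper's protocol.

```python
# Hedged sketch of the settling experiment: same prompt, different body
# shapes, compared against real recordings in joint angles and timing.
import numpy as np

def joint_angle_error(gen, real):
    """Mean absolute per-frame joint-angle difference (equal-length sequences)."""
    return np.abs(gen - real).mean()

def dtw_cost(gen, real):
    """Dynamic-time-warping cost as a proxy for timing misalignment."""
    T, S = len(gen), len(real)
    D = np.full((T + 1, S + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            d = np.linalg.norm(gen[i - 1] - real[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, S] / (T + S)  # length-normalized path cost

# Toy usage: (frames, flattened joint-angle) arrays for one action.
gen = np.random.randn(60, 22 * 3)
print(joint_angle_error(gen, np.random.randn(60, 22 * 3)),
      dtw_cost(gen, np.random.randn(72, 22 * 3)))
```

Identity-aware outputs should beat a canonical-body baseline on both numbers when the reference performer's shape matches the prompt.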

Figures

Figures reproduced from arXiv: 2604.25164 by Abhay Mittal, Chengcheng Tang, Chuan Guo, James Matthew Rehg, Lezi Wang, Lingling Tao, Size An, Wenqi Jia, Zekun Li.

Figure 1: Identity-Consistent Motion Generation. Our framework enables decoupled control of action dynamics and subject morphology. Given identity cues and motion prompts, the model synthesizes diverse body shapes while producing motions that remain physically consistent with body morphology. view at source ↗

Figure 2: Overview of the proposed framework. (a) Data Processing Pipeline: We extract motion sequences M, shape parameters β, and multimodal identity descriptions (T_i, I_i) from diverse sources including in-the-wild videos and MoCap data. (b) Motion-Shape Generation: A multimodal identity conditioning framework integrates textual and visual priors through frozen encoders to jointly generate identity-consistent motions. view at source ↗

Figure 3: Overview of IdentityMotion Dataset. Left: Distribution of identity and motion attributes, including body type, age, gender, and motion category. Right: Word cloud of identity descriptions, highlighting common appearance attributes. view at source ↗

Figure 4: Qualitative Comparison on the HumanML3D Test Set. view at source ↗

Figure 5: Qualitative Results of Zero-shot Generalization on Unseen Test Set. view at source ↗

Figure 6: Zero-shot Identity-Controllable Motion Generation. view at source ↗

Figure 7: User study interface. Each trial presents two anonymous videos (A/B), the input prompt, and a frontal mesh reference. Participants select which video better matches motion, body shape, and overall motion–shape realism. view at source ↗

Figure 8: User Study Results. view at source ↗
Original abstract

Recent advances in text-driven human motion generation enable models to synthesize realistic motion sequences from natural language descriptions. However, most existing approaches assume identity-neutral motion and generate movements using a canonical body representation, ignoring the strong influence of body morphology on motion dynamics. In practice, attributes such as body proportions, mass distribution, and age significantly affect how actions are performed, and neglecting this coupling often leads to physically inconsistent motions. We propose an identity-aware motion generation framework that explicitly models the relationship between body morphology and motion dynamics. Instead of relying on explicit geometric measurements, identity is represented using multimodal signals, including natural language descriptions and visual cues. We further introduce a joint motion-shape generation paradigm that simultaneously synthesizes motion sequences and body shape parameters, allowing identity cues to directly modulate motion dynamics. Extensive experiments on motion capture datasets and large-scale in-the-wild videos demonstrate improved motion realism and motion-identity consistency while maintaining high motion quality. Project page: https://vjwq.github.io/IAM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes IAM, an identity-aware framework for joint generation of human motion sequences and body shape parameters. It represents identity via multimodal signals (natural language descriptions and visual cues) rather than explicit geometric measurements, with the goal of modeling how body morphology influences motion dynamics. A joint synthesis paradigm is introduced so that identity cues directly modulate the generated motions. Experiments on motion capture datasets and large-scale in-the-wild videos are reported to show gains in motion realism and motion-identity consistency while preserving quality.

Significance. If the central claims hold, the work would address a clear limitation in current text-driven motion generation models that rely on canonical body representations. The joint motion-shape paradigm and use of multimodal identity signals represent a potentially useful direction for producing more personalized and physically plausible animations. Strengths include the explicit focus on morphology-motion coupling and the empirical evaluation across both controlled and in-the-wild data; however, the significance depends on whether the learned mappings respect biomechanical relationships rather than spurious correlations.

major comments (3)
  1. Abstract and §3 (joint generation paradigm): the claim that multimodal signals (text + visual cues) suffice to capture and modulate precise morphology effects on dynamics (e.g., stride length, mass distribution) is load-bearing for the central contribution, yet the abstract provides no quantitative evidence that the model respects biomechanical constraints; without such checks the improvement in 'motion-identity consistency' could reflect dataset biases rather than causal modeling.
  2. §4 (experiments): the reported gains in realism and consistency are asserted but the abstract gives no numerical tables, baselines, or ablation results; this prevents verification of whether the joint paradigm actually improves over separate motion-then-shape pipelines or merely trades off quality, undermining the empirical support for the framework.
  3. §3.2 (identity embedding): the free parameters for embedding dimension and conditioning strength are acknowledged in the axiom ledger; if these are tuned post-hoc on the same evaluation sets, the 'explicit modeling' claim risks circularity and requires a clear statement of how hyper-parameters were selected without leakage.
minor comments (2)
  1. The project page URL is provided but the manuscript should include a direct link to code, pre-trained models, or evaluation scripts to support reproducibility claims.
  2. Notation for the joint loss (motion + shape terms) should be introduced earlier and kept consistent across equations to aid readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions we will make to improve clarity and empirical support.

Point-by-point responses
  1. Referee: Abstract and §3 (joint generation paradigm): the claim that multimodal signals (text + visual cues) suffice to capture and modulate precise morphology effects on dynamics (e.g., stride length, mass distribution) is load-bearing for the central contribution, yet the abstract provides no quantitative evidence that the model respects biomechanical constraints; without such checks the improvement in 'motion-identity consistency' could reflect dataset biases rather than causal modeling.

    Authors: We agree that the abstract, as a concise summary, omits specific quantitative metrics on biomechanical aspects. Section 4 of the manuscript reports quantitative evaluations on motion-capture and in-the-wild data, including metrics for motion realism and identity consistency, with ablations and cross-dataset checks intended to reduce the risk of spurious correlations. We will revise the abstract to include key numerical results (e.g., consistency score improvements and proxies for morphology-motion coupling such as stride and mass-distribution alignment) and add a short discussion in §3 clarifying how the joint paradigm encourages respect for biomechanical relationships rather than dataset artifacts. revision: yes

  2. Referee: §4 (experiments): the reported gains in realism and consistency are asserted but the abstract gives no numerical tables, baselines, or ablation results; this prevents verification of whether the joint paradigm actually improves over separate motion-then-shape pipelines or merely trades off quality, undermining the empirical support for the framework.

    Authors: The abstract is not the appropriate location for tables or full ablation details; those appear in §4, which contains baseline comparisons, joint-versus-separate pipeline ablations, and numerical tables demonstrating gains in realism and consistency without quality degradation. To improve accessibility, we will revise the abstract to summarize the primary quantitative outcomes and explicitly note the comparison against separate motion-then-shape pipelines. revision: yes

  3. Referee: §3.2 (identity embedding): the free parameters for embedding dimension and conditioning strength are acknowledged in the axiom ledger; if these are tuned post-hoc on the same evaluation sets, the 'explicit modeling' claim risks circularity and requires a clear statement of how hyper-parameters were selected without leakage.

    Authors: We acknowledge the need for explicit documentation. The embedding dimension and conditioning strength were chosen via grid search on a held-out validation split that is disjoint from both the training data and the final test sets, following standard practice. The 'axiom ledger' simply records these design choices. We will expand §3.2 with a dedicated paragraph describing the validation protocol and confirming the absence of leakage to evaluation sets. revision: yes
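
For concreteness, the selection protocol the authors describe amounts to something like the sketch below; `train_and_score` is a hypothetical stand-in for their training-plus-validation loop, and the grid values are invented.

```python
# Sketch of the leakage-free protocol from the rebuttal: hyperparameters
# (embedding dimension, conditioning strength) are picked on a validation
# split disjoint from train and test; the test set is touched only once,
# after the grid search, with the chosen configuration.
from itertools import product

def select_hyperparams(train_set, val_set, train_and_score):
    grid = {"embed_dim": [256, 512, 1024], "cond_strength": [1.0, 2.5, 5.0]}
    best_cfg, best_score = None, float("-inf")
    for embed_dim, cond in product(grid["embed_dim"], grid["cond_strength"]):
        cfg = {"embed_dim": embed_dim, "cond_strength": cond}
        # Scored on val_set only; the test split never enters this loop.
        score = train_and_score(train_set, val_set, cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg  # e.g. {"embed_dim": 512, "cond_strength": 2.5}
```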

Circularity Check

0 steps flagged

No circularity: framework claims rest on empirical validation

Full rationale

The paper introduces a new identity-aware motion generation framework using multimodal signals (language and visual cues) for joint motion-shape synthesis. No derivations, equations, or predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Central claims of improved realism and consistency are supported by experiments on motion capture datasets and in-the-wild videos, making the approach self-contained without load-bearing reductions to its own assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on standard assumptions in deep generative modeling (e.g., that neural networks can learn the mapping from multimodal identity to motion-shape pairs) plus the domain assumption that body morphology strongly influences motion dynamics in a learnable way. No new physical entities are postulated.

free parameters (1)
  • identity embedding dimension and conditioning strength
    Hyperparameters chosen to balance expressiveness and training stability; their specific values set how strongly identity modulates motion (see the guidance sketch after this ledger).
axioms (2)
  • domain assumption Multimodal signals (text + vision) are sufficient proxies for body morphology effects on dynamics
    Invoked in the problem statement and method description; no explicit geometric measurements are used.
  • ad hoc to paper Joint optimization of motion and shape parameters improves consistency without degrading quality
    Central modeling choice of the proposed paradigm.
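
The "conditioning strength" parameter most plausibly enters as a guidance weight. Since the paper cites classifier-free diffusion guidance [11], a minimal sketch of that mechanism follows; whether IAM applies it in exactly this form is an assumption, and `denoiser` is a hypothetical callable.

```python
# How a "conditioning strength" knob typically acts in a diffusion model:
# classifier-free guidance blends conditional and unconditional denoiser
# outputs (Ho & Salimans, [11]). Our assumption, not confirmed IAM code.
import torch

def guided_prediction(denoiser, noisy_motion, ident_emb, strength=2.5):
    eps_cond = denoiser(noisy_motion, ident_emb)                      # identity-conditioned
    eps_uncond = denoiser(noisy_motion, torch.zeros_like(ident_emb))  # null identity
    # strength=0 ignores identity entirely; larger values push the
    # predicted dynamics harder toward the identity cue.
    return eps_uncond + strength * (eps_cond - eps_uncond)
```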

pith-pipeline@v0.9.0 · 5490 in / 1403 out tokens · 41589 ms · 2026-05-07T16:53:54.722139+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 17 canonical work pages · 7 internal anchors

  1. Björkstrand, D., Wang, T., Bretzner, L., Sullivan, J.: Unconditional human motion and shape generation via balanced score-based diffusion. arXiv preprint arXiv:2510.12537 (2025)
  2. Chen, C., Zhang, J., Lakshmikanth, S.K., Fang, Y., Shao, R., Wetzstein, G., Fei-Fei, L., Adeli, E.: The language of motion: Unifying verbal and non-verbal language of 3d human motion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 6200–6211 (2025)
  3. Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18000–18010 (2023)
  4. Choutas, V., Müller, L., Huang, C.H.P., Tang, S., Tzionas, D., Black, M.J.: Accurate 3d body shape regression using metric and semantic attributes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2718–2728 (2022)
  5. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
  6. Fan, K., Lu, S., Dai, M., Yu, R., Xiao, L., Dou, Z., Dong, J., Ma, L., Wang, J.: Go to zero: Towards zero-shot motion generation with million-scale data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13336–13348 (2025)
  7. Fan, K., Lu, S., Dai, M., Yu, R., Xiao, L., Dou, Z., Dong, J., Ma, L., Wang, J.: Go to zero: Towards zero-shot motion generation with million-scale data (2025), https://arxiv.org/abs/2507.07095
  8. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
  9. Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: Momask: Generative masked modeling of 3d human motions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1900–1910 (2024)
  10. Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5152–5161 (June 2022)
  11. Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
  12. Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems 36, 20067–20079 (2023)
  13. Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems 36 (2024)
  14. Jiang, J., Xiao, W., Lin, Z., Zhang, H., Ren, T., Gao, Y., Lin, Z., Cai, Z., Yang, L., Liu, Z.: Solami: Social vision-language-action modeling for immersive interaction with 3d autonomous characters. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26887–26898 (2025)
  15. Jiang, Y., Van Wouwe, T., De Groote, F., Liu, C.K.: Synthesis of biologically realistic human motion using joint torque actuation. ACM Transactions on Graphics (TOG) 38(4), 1–12 (2019)
  16. Kim, B., Jeong, H.I., Sung, J., Cheng, Y., Lee, J., Chang, J.Y., Choi, S.I., Choi, Y., Shin, S., Kim, J., et al.: Personabooth: Personalized text-to-motion generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22756–22765 (2025)
  17. Lee, S., Lee, S., Lee, Y., Lee, J.: Learning a family of motor skills from a single motion clip. ACM Trans. Graph. 40(4) (2021)
  18. Li, Z., An, S., Tang, C., Guo, C., Shugurov, I., Zhang, L., Zhao, A., Sridhar, S., Tao, L., Mittal, A.: Llamo: Scaling pretrained language models for unified motion understanding and generation with continuous autoregressive tokens. arXiv preprint arXiv:2602.12370 (2026)
  19. Liao, T.H., Zhou, Y., Shen, Y., Huang, C.H.P., Mitra, S., Huang, J.B., Bhattacharya, U.: Shape my moves: Text-driven shape-aware synthesis of human motions. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1917–1928 (2025)
  20. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (Oct 2015)
  21. Lu, S., Wang, J., Lu, Z., Chen, L.H., Dai, W., Dong, J., Dou, Z., Dai, B., Zhang, R.: Scamo: Exploring the scaling law in autoregressive motion generation model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27872–27882 (2025)
  22. Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: International Conference on Computer Vision. pp. 5442–5451 (Oct 2019)
  23. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2019)
  24. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  25. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
  26. Sawdayee, H., Guo, C., Tevet, G., Zhou, B., Wang, J., Bermano, A.H.: Dance like a chicken: Low-rank stylization for human motion diffusion. arXiv preprint arXiv:2503.19557 (2025)
  27. Shen, Z., Pi, H., Xia, Y., Cen, Z., Peng, S., Hu, Z., Bao, H., Hu, R., Zhou, X.: World-grounded human motion recovery via gravity-view coordinates. In: SIGGRAPH Asia Conference Proceedings (2024)
  28. Shi, M., Liu, W., Mei, J., Tse, W., Chen, R., Chen, X., Komura, T.: Motionpersona: Characteristics-aware locomotion control. arXiv preprint arXiv:2506.00173 (2025)
  29. Streuber, S., Quiros-Ramirez, M.A., Hill, M.Q., Hahn, C.A., Zuffi, S., O'Toole, A., Black, M.J.: Body talk: Crowdshaping realistic 3d avatars with words. ACM Transactions on Graphics (TOG) 35(4), 1–14 (2016)
  30. Tevet, G., Raab, S., Cohan, S., Reda, D., Luo, Z., Peng, X.B., Bermano, A.H., van de Panne, M.: Closd: Closing the loop between simulation and diffusion for multi-task character control. arXiv preprint arXiv:2410.03441 (2024)
  31. Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)
  32. Tripathi, S., Taheri, O., Lassner, C., Black, M.J., Holden, D., Stoll, C.: HUMOS: Human motion model conditioned on body shape. In: European Conference on Computer Vision. pp. 133–152. Springer (2025)
  33. Wang, X., Xu, K., Li, F., Sheng, C., Yu, J., Mu, Y.: Generating attribute-aware human motions from textual prompt. arXiv preprint arXiv:2506.21912 (2025)
  34. Wen, Y., Shuai, Q., Kang, D., Li, J., Wen, C., Qian, Y., Jiao, N., Chen, C., Chen, W., Wang, Y., et al.: Hy-motion 1.0: Scaling flow matching models for text-to-motion generation. arXiv preprint arXiv:2512.23464 (2025)
  35. Wu, Y., Karunratanakul, K., Luo, Z., Tang, S.: Uniphys: Unified planner and controller with diffusion for flexible physics-based character control. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13214–13224 (2025)
  36. Xiao, L., Lu, S., Pi, H., Fan, K., Pan, L., Zhou, Y., Feng, Z., Zhou, X., Peng, S., Wang, J.: Motionstreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10086–10096 (2025)
  37. Xie, W., Zheng, J., Han, J., Shi, J., Zhang, W., Bai, C., Li, X.: Textop: Real-time interactive text-driven humanoid robot motion generation and control. arXiv preprint arXiv:2602.07439 (2026)
  38. Xu, P., Xie, K., Andrews, S., Kry, P.G., Neff, M., McGuire, M., Karamouzas, I., Zordan, V.: Adaptnet: Policy adaptation for physics-based character control. ACM Transactions on Graphics (TOG) 42(6), 1–17 (2023)
  39. Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2m-gpt: Generating human motion from textual descriptions with discrete representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
  40. Zhang, J., Chen, X., Yu, G., Tu, Z.: Generative motion stylization of cross-structure characters within canonical motion space. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 7018–7026 (2024)
  41. Zhang, J., Huang, S., Tu, Z., Chen, X., Zhan, X., Yu, G., Shan, Y.: Tapmo: Shape-aware motion generation of skeleton-free characters. arXiv preprint arXiv:2310.12678 (2023)
  42. Zhang, J., Weng, J., Kang, D., Zhao, F., Huang, S., Zhe, X., Bao, L., Shan, Y., Wang, J., Tu, Z.: Skinned motion retargeting with residual perception of motion semantics & geometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13864–13872 (2023)
  43. Zhang, J., Chen, C., Chen, X., Yu, H., Xiang, T., Khan, A.S., Lakshmikanth, S.K., Adeli, E.: Vibes: A conversational agent with behaviorally-intelligent 3d virtual body. arXiv preprint arXiv:2512.14234 (2025)
  44. Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)
  45. Zhong, C., Hu, L., Zhang, Z., Xia, S.: Attt2m: Text-driven human motion generation with multi-perspective attention mechanism. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 509–519 (2023)
  46. Zhong, L., Xie, Y., Jampani, V., Sun, D., Jiang, H.: Smoodi: Stylized motion diffusion model. In: European Conference on Computer Vision. pp. 405–421. Springer (2024)