pith. machine review for the scientific record.

arxiv: 2604.25164 · v1 · submitted 2026-04-28 · 💻 cs.CV

Recognition: unknown

IAM: Identity-Aware Human Motion and Shape Joint Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 16:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords human motion generation · identity-aware modeling · joint shape-motion synthesis · text-to-motion · body morphology · multimodal identity · motion realism

The pith

Body shape and identity cues must be modeled together to generate realistic human motions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing text-to-motion models assume a standard body shape and produce movements that ignore individual differences in proportions and mass. This leads to motions that look physically wrong for the described person. The paper introduces a framework that takes multimodal identity information—text descriptions plus visual references—and generates both the motion sequence and the body shape parameters at the same time. Identity information directly adjusts the dynamics so that tall or heavy bodies move appropriately. Tests on captured motions and everyday videos show higher consistency between the generated movement and the person's build.

Core claim

The proposed identity-aware framework represents identity via multimodal signals and uses a joint generation process for motion sequences and body shape parameters, enabling identity to modulate motion dynamics directly.

What carries the argument

The joint motion-shape generation paradigm that synthesizes both outputs simultaneously so identity cues influence the motion.
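
Read literally, this paradigm maps onto a single denoising network with two output heads. The sketch below is a minimal illustration under our own assumptions, not the authors' implementation: the module names, the fusion scheme, and the dimensions (a 263-dim HumanML3D-style motion feature, 10 SMPL-style shape coefficients, 512-dim identity embeddings) are all hypothetical.

```python
# Minimal sketch of joint motion-shape generation (hypothetical, not the
# authors' code): one conditioning pathway feeds identity cues into a
# denoiser that outputs a motion sequence and shape parameters together.
import torch
import torch.nn as nn

class JointMotionShapeDenoiser(nn.Module):
    def __init__(self, motion_dim=263, shape_dim=10, ident_dim=512, hidden=512):
        super().__init__()
        # Fuse text and image identity embeddings (from frozen encoders).
        self.ident_proj = nn.Linear(2 * ident_dim, hidden)
        self.motion_in = nn.Linear(motion_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.motion_out = nn.Linear(hidden, motion_dim)  # denoised motion frames
        self.shape_out = nn.Linear(hidden, shape_dim)    # SMPL-style beta

    def forward(self, noisy_motion, text_emb, image_emb):
        # noisy_motion: (B, T, motion_dim); embeddings: (B, ident_dim)
        ident = self.ident_proj(torch.cat([text_emb, image_emb], dim=-1))
        tokens = self.motion_in(noisy_motion) + ident.unsqueeze(1)  # broadcast identity over time
        h = self.backbone(tokens)
        motion = self.motion_out(h)             # (B, T, motion_dim)
        shape = self.shape_out(h.mean(dim=1))   # (B, shape_dim), pooled over time
        return motion, shape

model = JointMotionShapeDenoiser()
motion, beta = model(torch.randn(2, 60, 263), torch.randn(2, 512), torch.randn(2, 512))
print(motion.shape, beta.shape)  # torch.Size([2, 60, 263]) torch.Size([2, 10])
```

The design point the pith highlights sits in the two output heads: both read the same identity-conditioned features, so the generated shape and the generated dynamics cannot drift apart by construction.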

If this is right

  • Motions become consistent with the performer's body proportions and mass distribution.
  • Natural language and image cues suffice to control shape-specific dynamics.
  • Generated sequences maintain high quality while improving identity match.
  • Performance holds on both lab motion capture data and in-the-wild videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications in animation could personalize characters without manual retargeting.
  • Future systems might use this to correct motion capture for different actor builds.
  • Testing across age groups and body types could expose dataset biases.
  • Integration with real-time video input may enable live identity transfer in AR.

Load-bearing premise

Multimodal identity signals from language and visuals are enough to capture body morphology effects on motion without explicit measurements.

What would settle it

Compare generated motions for the same text prompt but different body shapes against real recordings of people with those shapes performing the action; the identity-aware outputs should align better in joint angles and timing.
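
One way to operationalize that test, as a hedged sketch: score each generated clip against a real recording of a person with the matching build on per-frame joint-angle error and on a dynamic-time-warping cost as a timing proxy. The metric choices and array layouts below are our assumptions, not the paper's protocol.

```python
# Hedged sketch of the settling experiment: same prompt, different body
# shapes, compared against real recordings in joint angles and timing.
import numpy as np

def joint_angle_error(gen, real):
    """Mean absolute per-frame joint-angle difference (equal-length sequences)."""
    return np.abs(gen - real).mean()

def dtw_cost(gen, real):
    """Dynamic-time-warping cost as a proxy for timing misalignment."""
    T, S = len(gen), len(real)
    D = np.full((T + 1, S + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            d = np.linalg.norm(gen[i - 1] - real[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, S] / (T + S)  # length-normalized path cost

# Toy usage: (frames, flattened joint-angle) arrays for one action.
gen = np.random.randn(60, 22 * 3)
print(joint_angle_error(gen, np.random.randn(60, 22 * 3)),
      dtw_cost(gen, np.random.randn(72, 22 * 3)))
```

Identity-aware outputs should beat a canonical-body baseline on both numbers when the reference performer's shape matches the prompt.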

Figures

Figures reproduced from arXiv: 2604.25164 by Abhay Mittal, Chengcheng Tang, Chuan Guo, James Matthew Rehg, Lezi Wang, Lingling Tao, Size An, Wenqi Jia, Zekun Li.

Figure 1: Identity-Consistent Motion Generation. Our framework enables decoupled control of action dynamics and subject morphology. Given identity cues and motion prompts, the model synthesizes diverse body shapes while producing motions that remain physically consistent with body morphology. view at source ↗

Figure 2: Overview of the proposed framework. (a) Data Processing Pipeline: We extract motion sequences M, shape parameters β, and multimodal identity descriptions (T_i, I_i) from diverse sources including in-the-wild videos and MoCap data. (b) Motion-Shape Generation: A multimodal identity conditioning framework integrates textual and visual priors through frozen encoders to jointly generate identity-consistent motions. view at source ↗

Figure 3: Overview of IdentityMotion Dataset. Left: Distribution of identity and motion attributes, including body type, age, gender, and motion category. Right: Word cloud of identity descriptions, highlighting common appearance attributes. view at source ↗

Figure 4: Qualitative Comparison on the HumanML3D Test Set. view at source ↗

Figure 5: Qualitative Results of Zero-shot Generalization on Unseen Test Set. view at source ↗

Figure 6: Zero-shot Identity-Controllable Motion Generation. view at source ↗

Figure 7: User study interface. Each trial presents two anonymous videos (A/B), the input prompt, and a frontal mesh reference. Participants select which video better matches motion, body shape, and overall motion–shape realism. view at source ↗

Figure 8: User Study Results. view at source ↗
Original abstract

Recent advances in text-driven human motion generation enable models to synthesize realistic motion sequences from natural language descriptions. However, most existing approaches assume identity-neutral motion and generate movements using a canonical body representation, ignoring the strong influence of body morphology on motion dynamics. In practice, attributes such as body proportions, mass distribution, and age significantly affect how actions are performed, and neglecting this coupling often leads to physically inconsistent motions. We propose an identity-aware motion generation framework that explicitly models the relationship between body morphology and motion dynamics. Instead of relying on explicit geometric measurements, identity is represented using multimodal signals, including natural language descriptions and visual cues. We further introduce a joint motion-shape generation paradigm that simultaneously synthesizes motion sequences and body shape parameters, allowing identity cues to directly modulate motion dynamics. Extensive experiments on motion capture datasets and large-scale in-the-wild videos demonstrate improved motion realism and motion-identity consistency while maintaining high motion quality. Project page: https://vjwq.github.io/IAM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes IAM, an identity-aware framework for joint generation of human motion sequences and body shape parameters. It represents identity via multimodal signals (natural language descriptions and visual cues) rather than explicit geometric measurements, with the goal of modeling how body morphology influences motion dynamics. A joint synthesis paradigm is introduced so that identity cues directly modulate the generated motions. Experiments on motion capture datasets and large-scale in-the-wild videos are reported to show gains in motion realism and motion-identity consistency while preserving quality.

Significance. If the central claims hold, the work would address a clear limitation in current text-driven motion generation models that rely on canonical body representations. The joint motion-shape paradigm and use of multimodal identity signals represent a potentially useful direction for producing more personalized and physically plausible animations. Strengths include the explicit focus on morphology-motion coupling and the empirical evaluation across both controlled and in-the-wild data; however, the significance depends on whether the learned mappings respect biomechanical relationships rather than spurious correlations.

major comments (3)
  1. Abstract and §3 (joint generation paradigm): the claim that multimodal signals (text + visual cues) suffice to capture and modulate precise morphology effects on dynamics (e.g., stride length, mass distribution) is load-bearing for the central contribution, yet the abstract provides no quantitative evidence that the model respects biomechanical constraints; without such checks the improvement in 'motion-identity consistency' could reflect dataset biases rather than causal modeling.
  2. §4 (experiments): the reported gains in realism and consistency are asserted but the abstract gives no numerical tables, baselines, or ablation results; this prevents verification of whether the joint paradigm actually improves over separate motion-then-shape pipelines or merely trades off quality, undermining the empirical support for the framework.
  3. §3.2 (identity embedding): the free parameters for embedding dimension and conditioning strength are acknowledged in the axiom ledger; if these are tuned post-hoc on the same evaluation sets, the 'explicit modeling' claim risks circularity and requires a clear statement of how hyper-parameters were selected without leakage.
minor comments (2)
  1. The project page URL is provided but the manuscript should include a direct link to code, pre-trained models, or evaluation scripts to support reproducibility claims.
  2. Notation for the joint loss (motion + shape terms) should be introduced earlier and kept consistent across equations to aid readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions we will make to improve clarity and empirical support.

Point-by-point responses
  1. Referee: Abstract and §3 (joint generation paradigm): the claim that multimodal signals (text + visual cues) suffice to capture and modulate precise morphology effects on dynamics (e.g., stride length, mass distribution) is load-bearing for the central contribution, yet the abstract provides no quantitative evidence that the model respects biomechanical constraints; without such checks the improvement in 'motion-identity consistency' could reflect dataset biases rather than causal modeling.

    Authors: We agree that the abstract, as a concise summary, omits specific quantitative metrics on biomechanical aspects. Section 4 of the manuscript reports quantitative evaluations on motion-capture and in-the-wild data, including metrics for motion realism and identity consistency, with ablations and cross-dataset checks intended to reduce the risk of spurious correlations. We will revise the abstract to include key numerical results (e.g., consistency score improvements and proxies for morphology-motion coupling such as stride and mass-distribution alignment) and add a short discussion in §3 clarifying how the joint paradigm encourages respect for biomechanical relationships rather than dataset artifacts. revision: yes

  2. Referee: §4 (experiments): the reported gains in realism and consistency are asserted but the abstract gives no numerical tables, baselines, or ablation results; this prevents verification of whether the joint paradigm actually improves over separate motion-then-shape pipelines or merely trades off quality, undermining the empirical support for the framework.

    Authors: The abstract is not the appropriate location for tables or full ablation details; those appear in §4, which contains baseline comparisons, joint-versus-separate pipeline ablations, and numerical tables demonstrating gains in realism and consistency without quality degradation. To improve accessibility, we will revise the abstract to summarize the primary quantitative outcomes and explicitly note the comparison against separate motion-then-shape pipelines. revision: yes

  3. Referee: §3.2 (identity embedding): the free parameters for embedding dimension and conditioning strength are acknowledged in the axiom ledger; if these are tuned post-hoc on the same evaluation sets, the 'explicit modeling' claim risks circularity and requires a clear statement of how hyper-parameters were selected without leakage.

    Authors: We acknowledge the need for explicit documentation. The embedding dimension and conditioning strength were chosen via grid search on a held-out validation split that is disjoint from both the training data and the final test sets, following standard practice. The 'axiom ledger' simply records these design choices. We will expand §3.2 with a dedicated paragraph describing the validation protocol and confirming the absence of leakage to evaluation sets. revision: yes
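
For concreteness, the selection protocol the authors describe amounts to something like the sketch below; `train_and_score` is a hypothetical stand-in for their training-plus-validation loop, and the grid values are invented.

```python
# Sketch of the leakage-free protocol from the rebuttal: hyperparameters
# (embedding dimension, conditioning strength) are picked on a validation
# split disjoint from train and test; the test set is touched only once,
# after the grid search, with the chosen configuration.
from itertools import product

def select_hyperparams(train_set, val_set, train_and_score):
    grid = {"embed_dim": [256, 512, 1024], "cond_strength": [1.0, 2.5, 5.0]}
    best_cfg, best_score = None, float("-inf")
    for embed_dim, cond in product(grid["embed_dim"], grid["cond_strength"]):
        cfg = {"embed_dim": embed_dim, "cond_strength": cond}
        # Scored on val_set only; the test split never enters this loop.
        score = train_and_score(train_set, val_set, cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg  # e.g. {"embed_dim": 512, "cond_strength": 2.5}
```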

Circularity Check

0 steps flagged

No circularity: framework claims rest on empirical validation

Full rationale

The paper introduces a new identity-aware motion generation framework using multimodal signals (language and visual cues) for joint motion-shape synthesis. No derivations, equations, or predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Central claims of improved realism and consistency are supported by experiments on motion capture datasets and in-the-wild videos, making the approach self-contained without load-bearing reductions to its own assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on standard assumptions in deep generative modeling (e.g., that neural networks can learn the mapping from multimodal identity to motion-shape pairs) plus the domain assumption that body morphology strongly influences motion dynamics in a learnable way. No new physical entities are postulated.

free parameters (1)
  • identity embedding dimension and conditioning strength
    Hyperparameters chosen to balance expressiveness and training stability; their specific values set how strongly identity modulates motion (see the guidance sketch after this ledger).
axioms (2)
  • domain assumption Multimodal signals (text + vision) are sufficient proxies for body morphology effects on dynamics
    Invoked in the problem statement and method description; no explicit geometric measurements are used.
  • ad hoc to paper Joint optimization of motion and shape parameters improves consistency without degrading quality
    Central modeling choice of the proposed paradigm.
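
The "conditioning strength" parameter most plausibly enters as a guidance weight. Since the paper cites classifier-free diffusion guidance [11], a minimal sketch of that mechanism follows; whether IAM applies it in exactly this form is an assumption, and `denoiser` is a hypothetical callable.

```python
# How a "conditioning strength" knob typically acts in a diffusion model:
# classifier-free guidance blends conditional and unconditional denoiser
# outputs (Ho & Salimans, [11]). Our assumption, not confirmed IAM code.
import torch

def guided_prediction(denoiser, noisy_motion, ident_emb, strength=2.5):
    eps_cond = denoiser(noisy_motion, ident_emb)                      # identity-conditioned
    eps_uncond = denoiser(noisy_motion, torch.zeros_like(ident_emb))  # null identity
    # strength=0 ignores identity entirely; larger values push the
    # predicted dynamics harder toward the identity cue.
    return eps_uncond + strength * (eps_cond - eps_uncond)
```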

pith-pipeline@v0.9.0 · 5490 in / 1403 out tokens · 41589 ms · 2026-05-07T16:53:54.722139+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 17 canonical work pages · 7 internal anchors

  1. Björkstrand, D., Wang, T., Bretzner, L., Sullivan, J.: Unconditional human motion and shape generation via balanced score-based diffusion. arXiv preprint arXiv:2510.12537 (2025)
  2. Chen, C., Zhang, J., Lakshmikanth, S.K., Fang, Y., Shao, R., Wetzstein, G., Fei-Fei, L., Adeli, E.: The language of motion: Unifying verbal and non-verbal language of 3d human motion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 6200–6211 (2025)
  3. Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18000–18010 (2023)
  4. Choutas, V., Müller, L., Huang, C.H.P., Tang, S., Tzionas, D., Black, M.J.: Accurate 3d body shape regression using metric and semantic attributes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2718–2728 (2022)
  5. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
  6. Fan, K., Lu, S., Dai, M., Yu, R., Xiao, L., Dou, Z., Dong, J., Ma, L., Wang, J.: Go to zero: Towards zero-shot motion generation with million-scale data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13336–13348 (2025)
  7. Fan, K., Lu, S., Dai, M., Yu, R., Xiao, L., Dou, Z., Dong, J., Ma, L., Wang, J.: Go to zero: Towards zero-shot motion generation with million-scale data (2025), https://arxiv.org/abs/2507.07095
  8. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
  9. Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: Momask: Generative masked modeling of 3d human motions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1900–1910 (2024)
  10. Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5152–5161 (June 2022)
  11. Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
  12. Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems 36, 20067–20079 (2023)
  13. Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems 36 (2024)
  14. Jiang, J., Xiao, W., Lin, Z., Zhang, H., Ren, T., Gao, Y., Lin, Z., Cai, Z., Yang, L., Liu, Z.: Solami: Social vision-language-action modeling for immersive interaction with 3d autonomous characters. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26887–26898 (2025)
  15. Jiang, Y., Van Wouwe, T., De Groote, F., Liu, C.K.: Synthesis of biologically realistic human motion using joint torque actuation. ACM Transactions on Graphics (TOG) 38(4), 1–12 (2019)
  16. Kim, B., Jeong, H.I., Sung, J., Cheng, Y., Lee, J., Chang, J.Y., Choi, S.I., Choi, Y., Shin, S., Kim, J., et al.: Personabooth: Personalized text-to-motion generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22756–22765 (2025)
  17. Lee, S., Lee, S., Lee, Y., Lee, J.: Learning a family of motor skills from a single motion clip. ACM Trans. Graph. 40(4) (2021)
  18. Li, Z., An, S., Tang, C., Guo, C., Shugurov, I., Zhang, L., Zhao, A., Sridhar, S., Tao, L., Mittal, A.: Llamo: Scaling pretrained language models for unified motion understanding and generation with continuous autoregressive tokens. arXiv preprint arXiv:2602.12370 (2026)
  19. Liao, T.H., Zhou, Y., Shen, Y., Huang, C.H.P., Mitra, S., Huang, J.B., Bhattacharya, U.: Shape my moves: Text-driven shape-aware synthesis of human motions. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1917–1928 (2025)
  20. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (Oct 2015)
  21. Lu, S., Wang, J., Lu, Z., Chen, L.H., Dai, W., Dong, J., Dou, Z., Dai, B., Zhang, R.: Scamo: Exploring the scaling law in autoregressive motion generation model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27872–27882 (2025)
  22. Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: International Conference on Computer Vision. pp. 5442–5451 (Oct 2019)
  23. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2019)
  24. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  25. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
  26. Sawdayee, H., Guo, C., Tevet, G., Zhou, B., Wang, J., Bermano, A.H.: Dance like a chicken: Low-rank stylization for human motion diffusion. arXiv preprint arXiv:2503.19557 (2025)
  27. Shen, Z., Pi, H., Xia, Y., Cen, Z., Peng, S., Hu, Z., Bao, H., Hu, R., Zhou, X.: World-grounded human motion recovery via gravity-view coordinates. In: SIGGRAPH Asia Conference Proceedings (2024)
  28. Shi, M., Liu, W., Mei, J., Tse, W., Chen, R., Chen, X., Komura, T.: Motionpersona: Characteristics-aware locomotion control. arXiv preprint arXiv:2506.00173 (2025)
  29. Streuber, S., Quiros-Ramirez, M.A., Hill, M.Q., Hahn, C.A., Zuffi, S., O'Toole, A., Black, M.J.: Body talk: Crowdshaping realistic 3d avatars with words. ACM Transactions on Graphics (TOG) 35(4), 1–14 (2016)
  30. Tevet, G., Raab, S., Cohan, S., Reda, D., Luo, Z., Peng, X.B., Bermano, A.H., van de Panne, M.: Closd: Closing the loop between simulation and diffusion for multi-task character control. arXiv preprint arXiv:2410.03441 (2024)
  31. Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)
  32. Tripathi, S., Taheri, O., Lassner, C., Black, M.J., Holden, D., Stoll, C.: HUMOS: Human motion model conditioned on body shape. In: European Conference on Computer Vision. pp. 133–152. Springer (2025)
  33. Wang, X., Xu, K., Li, F., Sheng, C., Yu, J., Mu, Y.: Generating attribute-aware human motions from textual prompt. arXiv preprint arXiv:2506.21912 (2025)
  34. Wen, Y., Shuai, Q., Kang, D., Li, J., Wen, C., Qian, Y., Jiao, N., Chen, C., Chen, W., Wang, Y., et al.: Hy-motion 1.0: Scaling flow matching models for text-to-motion generation. arXiv preprint arXiv:2512.23464 (2025)
  35. Wu, Y., Karunratanakul, K., Luo, Z., Tang, S.: Uniphys: Unified planner and controller with diffusion for flexible physics-based character control. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13214–13224 (2025)
  36. Xiao, L., Lu, S., Pi, H., Fan, K., Pan, L., Zhou, Y., Feng, Z., Zhou, X., Peng, S., Wang, J.: Motionstreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10086–10096 (2025)
  37. Xie, W., Zheng, J., Han, J., Shi, J., Zhang, W., Bai, C., Li, X.: Textop: Real-time interactive text-driven humanoid robot motion generation and control. arXiv preprint arXiv:2602.07439 (2026)
  38. Xu, P., Xie, K., Andrews, S., Kry, P.G., Neff, M., McGuire, M., Karamouzas, I., Zordan, V.: Adaptnet: Policy adaptation for physics-based character control. ACM Transactions on Graphics (TOG) 42(6), 1–17 (2023)
  39. Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2m-gpt: Generating human motion from textual descriptions with discrete representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
  40. Zhang, J., Chen, X., Yu, G., Tu, Z.: Generative motion stylization of cross-structure characters within canonical motion space. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 7018–7026 (2024)
  41. Zhang, J., Huang, S., Tu, Z., Chen, X., Zhan, X., Yu, G., Shan, Y.: Tapmo: Shape-aware motion generation of skeleton-free characters. arXiv preprint arXiv:2310.12678 (2023)
  42. Zhang, J., Weng, J., Kang, D., Zhao, F., Huang, S., Zhe, X., Bao, L., Shan, Y., Wang, J., Tu, Z.: Skinned motion retargeting with residual perception of motion semantics & geometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13864–13872 (2023)
  43. Zhang, J., Chen, C., Chen, X., Yu, H., Xiang, T., Khan, A.S., Lakshmikanth, S.K., Adeli, E.: Vibes: A conversational agent with behaviorally-intelligent 3d virtual body. arXiv preprint arXiv:2512.14234 (2025)
  44. Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)
  45. Zhong, C., Hu, L., Zhang, Z., Xia, S.: Attt2m: Text-driven human motion generation with multi-perspective attention mechanism. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 509–519 (2023)
  46. Zhong, L., Xie, Y., Jampani, V., Sun, D., Jiang, H.: Smoodi: Stylized motion diffusion model. In: European Conference on Computer Vision. pp. 405–421. Springer (2024)