pith. sign in

arxiv: 2605.14731 · v1 · pith:6ERWEOVFnew · submitted 2026-05-14 · 💻 cs.GR · cs.CV· cs.SD

UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

Pith reviewed 2026-06-30 19:37 UTC · model grok-4.3

classification 💻 cs.GR cs.CVcs.SD
keywords co-speech avatarsspeech-driven animationreal-time motion generationmixture of expertsunified multimodal modelingkeyframe animationfacial and gesture synthesissparse motion modeling
0
0 comments X

The pith

UMo unifies text, audio and motion tokens with sparse experts to generate real-time co-speech avatar animations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UMo as a single architecture that handles text, audio and motion inputs together for driving both facial expressions and body gestures. It relies on a spatially sparse Mixture-of-Experts layer paired with a temporally sparse keyframe approach to reconstruct dense motion sequences without processing every frame at full cost. A multi-stage training process with audio augmentation is used to keep speech and motion aligned even when latency is tightly constrained. If the design works as described, it removes the need to choose between quality and speed in interactive avatar systems. The evaluations focus on showing measurable gains in output quality while meeting real-time requirements.

Core claim

UMo processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. A multi-stage training strategy with targeted audio augmentation enhances acoustic diversity and semantic consistency, preserving fine-grained speech-motion alignment even under strict latency constraints.

What carries the argument

Unified sparse motion modeling architecture that combines a spatially sparse Mixture-of-Experts framework with a temporally sparse keyframe-centric design for processing text, audio and motion tokens.

If this is right

  • Real-time dense reconstruction of facial and gesture motion becomes feasible from multimodal inputs.
  • Multi-stage training with audio augmentation improves semantic consistency between speech and generated motion.
  • Spatially and temporally sparse components together reduce computational load while maintaining coherence.
  • The same architecture supports both facial expressions and full-body gestures without separate pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may allow single-model deployment across different avatar platforms that vary in frame rate or compute budget.
  • Keyframe sparsity could be adjusted at inference time to trade quality for even lower latency in specific applications.
  • Because the model ingests text alongside audio, it may support scripted or edited speech tracks without retraining.
  • The sparse design might reduce memory footprint enough to run on edge devices for mobile or AR use cases.

Load-bearing premise

A spatially sparse Mixture-of-Experts combined with a temporally sparse keyframe-centric design can preserve fine-grained speech-motion alignment and temporal coherence without full dense per-frame modeling or extra post-processing.

What would settle it

Side-by-side comparison at fixed low latency showing that UMo outputs lose measurable alignment with audio or temporal smoothness compared with a dense baseline model on the same test sequences.

Figures

Figures reproduced from arXiv: 2605.14731 by Chenghao Yang, Dongjie Fu, Hansung Kim, Jie Guo, Pengcheng Fang, Tengjiao Sun, Xiaohao Cai, Xiaohong Zhang, Xiaoyu Zhan, Xinyu Fu, Yanwen Guo, Yuanqi Li.

Figure 1
Figure 1. Figure 1: UMo is a real-time co-speech motion generation architecture achieved through unified sparse motion modeling. Abstract Speech-driven gestures and facial animations are fundamen￾tal to expressive digital avatars in games, virtual produc￾tion, and interactive media. However, existing methods are either limited to a single modality for audio motion align￾ment, failing to fully utilize the potential of massive … view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline overview. UMo is a unified sparse architecture for real-time co-speech generation. We achieve both high￾fidelity motion generation and low-latency response using an LLM-based backbone through an autoregressive framework (Sec. ), combined with spatial sparsity (Sec. ) and temporal sparsity (Sec. ). generation as follows. Let a1:t denote observed speech sig￾nals up to time t (audio with optional tex… view at source ↗
Figure 3
Figure 3. Figure 3: MoE details. The dense FFN layers within both the encoder and decoder are substituted with MoE FFNs. During inference, each token selectively activates a single expert from the four. face, hand, upper-body, and lower-body streams. A potential drawback of this design lies in the unique distributional pat￾terns of motion features across different body parts. There￾fore, utilizing shared parameters for infere… view at source ↗
Figure 4
Figure 4. Figure 4: Interpolation network. Our interpolation network employs global bidirectional attention for temporal feature extraction and part-aware attention within frames to encode spatial features among parts. in the minibatch. We define ℓ (ℓ) moe = E X E e=1 f (ℓ) e π¯ (ℓ) e , f (ℓ) e = 1 N X N t=1 D (ℓ) t,e , π¯ (ℓ) e = 1 N X N t=1 π (ℓ) t,e . (11) Under our k = 1 setting D (ℓ) t,e becomes a one-hot indicator and f… view at source ↗
Figure 5
Figure 5. Figure 5: Training recipe. UMo integrates paired data of various formats and training objectives into a unified frame￾work via a three-stage training recipe, enabling joint opti￾mization for the co-speech task. where λmoe is stage-dependent and explicitly set to 0 dur￾ing pre-training. The MoE loss Lmoe is the average of ℓ (ℓ) moe (described in Eq. 11) over all upcycled MoE layers. Pre-training. In the pre-training … view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative Comparisons. We visualize the motion sequence of LOM (Chen et al. 2025), MambaTalk (Xu et al. 2024), GestureLSM (Liu et al. 2025) and UMo with the same audio input. Our results are more vivid and reasonable compared to other methods. Aug. here means audio augmentation. identical skeleton for fair visual comparison. Across these examples, MambaTalk and GestureLSM produce relatively smooth motion… view at source ↗
Figure 7
Figure 7. Figure 7: Visualization Results of UMo. The model is trained on BEATV2 females and the SMPLX motion sequences are retargeted to the avatar. We mainly showcase the rhythm and expressiveness of the movements, but the expressions are not synchronized. vored by users. With audio augmentation, UMo significantly outperforms other methods. Ablation Studies In this section, we conduct ablation studies on each mod￾ule of UMo… view at source ↗
read the original abstract

Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper presents UMo, a unified sparse motion modeling architecture for real-time co-speech avatars. It processes text, audio, and motion tokens within a single formulation, employing a spatially sparse Mixture-of-Experts framework together with a temporally sparse keyframe-centric design to perform real-time dense reconstruction of facial expressions and gestures. A multi-stage training strategy incorporating targeted audio augmentation is used to improve acoustic diversity and semantic consistency. The central claim is that this approach preserves fine-grained speech-motion alignment under strict latency constraints and delivers superior output quality relative to prior methods while satisfying real-time performance requirements, as supported by extensive quantitative and qualitative evaluations.

Significance. If the empirical claims are substantiated by the reported evaluations, the work would offer a practical advance for high-fidelity real-time co-speech avatars in games, virtual production, and interactive media. The combination of spatial and temporal sparsity to achieve dense, coherent animation without full per-frame modeling or post-processing addresses a key efficiency-quality trade-off in multimodal motion generation.

minor comments (2)
  1. [Abstract] Abstract: the claim of 'better output quality under low latency and real-time performance constraints' is stated without any numerical results, baselines, or protocol details; including at least the primary quantitative metrics (e.g., FID, beat alignment error, latency in ms) would strengthen immediate readability.
  2. The multi-stage training procedure and audio-augmentation strategy are described at a high level; a concise diagram or pseudocode listing the stages and loss terms would clarify how semantic consistency is enforced.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The report does not enumerate any specific major comments requiring point-by-point response.

Circularity Check

0 steps flagged

No significant circularity; empirical architecture with no derivation chain

full rationale

The paper describes a neural architecture (spatially sparse MoE + temporally sparse keyframes), multi-stage training, and empirical evaluations. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear in the provided text. The central claims rest on quantitative/qualitative results rather than any step that reduces to its own inputs by construction. This is the expected outcome for an applied ML systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5778 in / 1007 out tokens · 28831 ms · 2026-06-30T19:37:11.810841+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MoGeFlow: Flowing Through Motion Codebook Geometry for Text-to-Motion Generation

    cs.GR 2026-06 unverdicted novelty 6.0

    MoGeFlow learns text-conditioned flows over PartVQ group-specific code embeddings to generate motions, achieving SOTA R-Precision on HumanML3D and KIT-ML while preserving discrete token validity.

Reference graph

Works this paper leans on

78 extracted references · 15 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Alexanderson, S.; Nagy, R.; Beskow, J.; and Henter, G. E. 2023. Listen, denoise, action! audio-driven motion synthesis with diffusion models. ACM Transactions on Graphics (TOG), 42(4): 1--20

  4. [4]

    Ao, T.; Gao, Q.; Lou, Y.; Chen, B.; and Liu, L. 2022. Rhythmic gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Transactions on Graphics (TOG), 41(6): 1--19

  5. [5]

    Ao, T.; Zhang, Z.; and Liu, L. 2023. Gesturediffuclip: Gesture diffusion model with clip latents. ACM Transactions on Graphics (TOG), 42(4): 1--18

  6. [6]

    Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609

  7. [7]

    Cai, Y.; Chu, X.; Gao, X.; Gong, S.; Huang, Y.; Kang, C.; Li, K.; Liu, H.; Liu, R.; Liu, Y.; et al. 2025. Towards Interactive Intelligence for Digital Humans. arXiv preprint arXiv:2512.13674

  8. [8]

    Chen, B.; Li, Y.; Ding, Y.-X.; Shao, T.; and Zhou, K. 2024 a . Enabling synergistic full-body control in prompt-based co-speech motion generation. In Proceedings of the 32nd ACM International Conference on Multimedia, 6774--6783

  9. [9]

    K.; Fang, Y.; Shao, R.; Wetzstein, G.; Li, F.-F.; and Adeli, E

    Chen, C.; Zhang, J.; Lakshmikanth, S. K.; Fang, Y.; Shao, R.; Wetzstein, G.; Li, F.-F.; and Adeli, E. 2025. The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR )

  10. [10]

    Chen, J.; Liu, Y.; Wang, J.; Zeng, A.; Li, Y.; and Chen, Q. 2024 b . Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 7352--7361

  11. [11]

    Chen, R.; Shi, M.; Huang, S.; Tan, P.; Komura, T.; and Chen, X. 2024 c . Taming diffusion probabilistic models for character control. In ACM SIGGRAPH 2024 Conference Papers, 1--10

  12. [12]

    Chen, X.; Jiang, B.; Liu, W.; Huang, Z.; Fu, B.; Chen, T.; and Yu, G. 2023. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 18000--18010

  13. [13]

    Chern, E.; Hu, Z.; Tang, B.; Su, J.; Chern, S.; Deng, Z.; and Liu, P. 2025. LiveTalk : Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation. arXiv preprint arXiv:2512.23576

  14. [14]

    Chi, S.; Chi, H.-g.; Ma, H.; Agarwal, N.; Siddiqui, F.; Ramani, K.; and Lee, K. 2024. M2d2m: Multi-motion generation from text with discrete diffusion models. In Proceedings of the European Conference on Computer Vision ( ECCV ) , 18--36

  15. [15]

    D \'e fossez, A.; Mazar \'e , L.; Orsini, M.; Royer, A.; P \'e rez, P.; J \'e gou, H.; Grave, E.; and Zeghidour, N. 2024. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037

  16. [16]

    Fu, D.; Sun, T.; Fang, P.; Cai, X.; and Kim, H. 2026. MOGO: Residual Quantized Hierarchical Causal Transformer for Real-Time and Infinite-Length 3D Human Motion Generation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 5, 3994--4002

  17. [17]

    Ginosar, S.; Bar, A.; Kohavi, G.; Chan, C.; Owens, A.; and Malik, J. 2019. Learning individual styles of conversational gesture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 3497--3506

  18. [18]

    Gu, C.; Zhang, M.; Xie, H.; Cai, Z.; Yang, L.; and Liu, Z. 2026. Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer. arXiv preprint arXiv:2603.19227

  19. [19]

    G.; Wang, S.; and Cheng, L

    Guo, C.; Mu, Y.; Javed, M. G.; Wang, S.; and Cheng, L. 2024. Momask: Generative masked modeling of 3d human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 1900--1910

  20. [20]

    Guo, C.; Zou, S.; Zuo, X.; Wang, S.; Ji, W.; Li, X.; and Cheng, L. 2022 a . Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 5152--5161

  21. [21]

    Guo, C.; Zuo, X.; Wang, S.; and Cheng, L. 2022 b . Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. In Proceedings of the European Conference on Computer Vision ( ECCV ) , 580--597

  22. [22]

    Habibie, I.; Xu, W.; Mehta, D.; Liu, L.; Seidel, H.-P.; Pons-Moll, G.; Elgharib, M.; and Theobalt, C. 2021. Learning speech-driven 3d conversational gestures from video. In Proceedings of the 21st ACM international conference on intelligent virtual agents, 101--108

  23. [23]

    Han, Z.; Teye, M.; Yadgaroff, D.; and B \"u tepage, J. 2025. Tiny is not small enough: High quality, low-resource facial animation models through hybrid knowledge distillation. ACM Transactions on Graphics (TOG), 44(4): 1--18

  24. [24]

    Hou, R.; Luo, M.; Pan, H.; Chang, H.; and Shan, S. 2025. Motionverse: A unified multimodal framework for motion comprehension, generation and editing. arXiv preprint arXiv:2509.23635

  25. [25]

    H.; Lakhotia, K.; Salakhutdinov, R.; and Mohamed, A

    Hsu, W.-N.; Bolte, B.; Tsai, Y.-H. H.; Lakhotia, K.; Salakhutdinov, R.; and Mohamed, A. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29: 3451--3460

  26. [26]

    Jiang, B.; Chen, X.; Liu, W.; Yu, J.; Yu, G.; and Chen, T. 2023. Motiongpt: Human motion as a foreign language. In Advances in Neural Information Processing Systems ( NeurIPS ) , volume 36, 20067--20079

  27. [27]

    Karunratanakul, K.; Preechakul, K.; Suwajanakorn, S.; and Tang, S. 2023. Guided motion diffusion for controllable human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision ( ICCV ) , 2151--2162

  28. [28]

    A.; and Kanazawa, A

    Li, R.; Yang, S.; Ross, D. A.; and Kanazawa, A. 2021. Ai choreographer: Music conditioned 3d dance generation with aist++. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 13401--13412

  29. [29]

    Li, Z.; An, S.; Tang, C.; Guo, C.; Shugurov, I.; Zhang, L.; Zhao, A.; Sridhar, S.; Tao, L.; and Mittal, A. 2026. LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR )

  30. [30]

    Lin, J.; Zeng, A.; Lu, S.; Cai, Y.; Zhang, R.; Wang, H.; and Zhang, L. 2023. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36: 25268--25280

  31. [31]

    Ling, Z.; Han, B.; Li, S.; Cheng, J.; Shen, H.; and Zou, C. 2024. VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension. arXiv preprint arXiv:2411.17335

  32. [32]

    Liu, H.; Iwamoto, N.; Zhu, Z.; Li, Z.; Zhou, Y.; Bozkurt, E.; and Zheng, B. 2022 a . Disco: Disentangled implicit content and rhythm learning for diverse co-speech gestures synthesis. In Proceedings of the 30th ACM international conference on multimedia, 3764--3773

  33. [33]

    Liu, H.; Zhu, Z.; Becherini, G.; Peng, Y.; Su, M.; Zhou, Y.; Zhe, X.; Iwamoto, N.; Zheng, B.; and Black, M. J. 2024. EMAGE : Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 1144--1154

  34. [34]

    Liu, H.; Zhu, Z.; Iwamoto, N.; Peng, Y.; Li, Z.; Zhou, Y.; Bozkurt, E.; and Zheng, B. 2022 b . Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In Proceedings of the European Conference on Computer Vision ( ECCV ) , 612--630

  35. [35]

    Liu, P.; Song, L.; Huang, J.; and Xu, C. 2025. GestureLSM : Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision ( ICCV )

  36. [36]

    Liu, X.; Wu, Q.; Zhou, H.; Du, Y.; Wu, W.; Lin, D.; and Liu, Z. 2022 c . Audio-driven co-speech gesture video generation. In Advances in Neural Information Processing Systems ( NeurIPS ) , volume 35, 21386--21399

  37. [37]

    Liu, X.; Wu, Q.; Zhou, H.; Xu, Y.; Qian, R.; Lin, X.; Zhou, X.; Wu, W.; Dai, B.; and Zhou, B. 2022 d . Learning hierarchical cross-modal association for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 10462--10472

  38. [38]

    Lu, S.; Wang, J.; Lu, Z.; Chen, L.-H.; Dai, W.; Dong, J.; Dou, Z.; Dai, B.; and Zhang, R. 2025. Scamo: Exploring the scaling law in autoregressive motion generation model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 27872--27882

  39. [39]

    F.; Pons-Moll, G.; and Black, M

    Mahmood, N.; Ghorbani, N.; Troje, N. F.; Pons-Moll, G.; and Black, M. J. 2019. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5442--5451

  40. [40]

    MiniMax . 2025. MiniMax Speech 2.6

  41. [41]

    H.; Dabral, R.; Demberg, V.; and Theobalt, C

    Mughal, M. H.; Dabral, R.; Demberg, V.; and Theobalt, C. 2026. MIBURI : Towards Expressive Interactive Gesture Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR )

  42. [42]

    H.; Dabral, R.; Scholman, M

    Mughal, M. H.; Dabral, R.; Scholman, M. C.; Demberg, V.; and Theobalt, C. 2025. Retrieving semantics from the deep: an rag solution for gesture synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 16578--16588

  43. [43]

    E.; and Pons-Moll, G

    Nazarenus, E.; Li, C.; He, Y.; Xie, X.; Lenssen, J. E.; and Pons-Moll, G. 2026. ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning. arXiv preprint arXiv:2603.13500

  44. [44]

    Pan, Y.; Singh, K.; and Hafemann, L. G. 2025. Model See Model Do: Speech-Driven Facial Animation with Style Control. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, 1--10

  45. [45]

    J.; and Varol, G

    Petrovich, M.; Black, M. J.; and Varol, G. 2022. Temos: Generating diverse human motions from textual descriptions. In Proceedings of the European Conference on Computer Vision ( ECCV ) , 480--497

  46. [46]

    Pinyoanuntapong, E.; Wang, P.; Lee, M.; and Chen, C. 2024. Mmm: Generative masked motion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 1546--1555

  47. [47]

    R.; Chandrasekaran, A.; Athanasiou, N.; Quiros-Ramirez, A.; and Black, M

    Punnakkal, A. R.; Chandrasekaran, A.; Athanasiou, N.; Quiros-Ramirez, A.; and Black, M. J. 2021. BABEL: Bodies, action and behavior with english labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 722--731

  48. [48]

    Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1--67

  49. [49]

    B.; Jiang, Y.; Wang, T.; Iqbal, U.; Minor, D.; de Ruyter, M.; et al

    Rempe, D.; Petrovich, M.; Yuan, Y.; Zhang, H.; Peng, X. B.; Jiang, Y.; Wang, T.; Iqbal, U.; Minor, D.; de Ruyter, M.; et al. 2026. Kimodo: Scaling Controllable Human Motion Generation. arXiv preprint arXiv:2603.15546

  50. [50]

    Sun, Z.; Lv, T.; Ye, S.; Lin, M.; Sheng, J.; Wen, Y.-H.; Yu, M.; and Liu, Y.-j. 2024. Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models. ACM Transactions on Graphics (ToG), 43(4): 1--9

  51. [51]

    H.; and Cohen-Or, D

    Tevet, G.; Gordon, B.; Hertz, A.; Bermano, A. H.; and Cohen-Or, D. 2022. Motionclip: Exposing human motion generation to clip space. In Proceedings of the European Conference on Computer Vision ( ECCV ) , 358--374

  52. [52]

    B.; Bermano, A

    Tevet, G.; Raab, S.; Cohan, S.; Reda, D.; Luo, Z.; Peng, X. B.; Bermano, A. H.; and van de Panne, M. 2024. Closd: Closing the loop between simulation and diffusion for multi-task character control. arXiv preprint arXiv:2410.03441

  53. [53]

    Tevet, G.; Raab, S.; Gordon, B.; Shafir, Y.; Cohen-Or, D.; and Bermano, A. H. 2023. Human Motion Diffusion Model. In International Conference on Learning Representations ( ICLR )

  54. [54]

    Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozi \`e re, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  55. [55]

    Wu, Q.; Zhao, Y.; Wang, Y.; Liu, X.; Tai, Y.-W.; and Tang, C.-K. 2025. Motion-Agent : A Conversational Framework for Human Motion Generation with LLM s. In International Conference on Learning Representations ( ICLR )

  56. [56]

    Xiao, L.; Lu, S.; Pi, H.; Fan, K.; Pan, L.; Zhou, Y.; Feng, Z.; Zhou, X.; Peng, S.; and Wang, J. 2025. Motionstreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10086--10096

  57. [57]

    Xie, Y.; Jampani, V.; Zhong, L.; Sun, D.; and Jiang, H. 2024. Omnicontrol: Control any joint at any time for human motion generation. In International Conference on Learning Representations ( ICLR )

  58. [58]

    Xing, J.; Xia, M.; Zhang, Y.; Cun, X.; Wang, J.; and Wong, T.-T. 2023. Codetalker: Speech-driven 3d facial animation with discrete motion prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 12780--12790

  59. [59]

    Xu, Z.; Lin, Y.; Han, H.; Yang, S.; Li, R.; Zhang, Y.; and Li, X. 2024. Mambatalk: Efficient holistic gesture synthesis with selective state space models. In Advances in Neural Information Processing Systems ( NeurIPS ) , volume 37, 20055--20080

  60. [60]

    Yi, H.; Liang, H.; Liu, Y.; Cao, Q.; Wen, Y.; Bolkart, T.; Tao, D.; and Black, M. J. 2023. Generating holistic 3d human motion from speech. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 469--480

  61. [61]

    C.; and Adeli, E

    Yu, H.; Zhang, J.; Chen, C.; Xiang, T.; Fang, Y.; Niebles, J. C.; and Adeli, E. 2025. Socialgen: Modeling multi-human social interaction with language models. arXiv preprint arXiv:2503.22906

  62. [62]

    Yuan, Y.; Song, J.; Iqbal, U.; Vahdat, A.; and Kautz, J. 2023. Physdiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision ( ICCV ) , 16010--16021

  63. [63]

    Zeng, L.-A.; Huang, G.; Wu, G.; and Zheng, W.-S. 2025. Light-t2m: A lightweight and fast model for text-to-motion generation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 9, 9797--9805

  64. [64]

    Zhang, J.; Fan, H.; and Yang, Y. 2025. Energymogen: Compositional human motion generation with energy-based diffusion model in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 17592--17602

  65. [65]

    Zhang, J.; Zhang, Y.; Cun, X.; Zhang, Y.; Zhao, H.; Lu, H.; Shen, X.; and Shan, Y. 2023 a . Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 14730--14740

  66. [66]

    Zhang, M.; Cai, Z.; Pan, L.; Hong, F.; Guo, X.; Yang, L.; and Liu, Z. 2024 a . Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6): 4115--4128

  67. [67]

    Zhang, M.; Guo, X.; Pan, L.; Cai, Z.; Hong, F.; Li, H.; Yang, L.; and Liu, Z. 2023 b . Remodiffuse: Retrieval-augmented motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision ( ICCV ) , 364--373

  68. [68]

    Zhang, M.; Li, H.; Cai, Z.; Ren, J.; Yang, L.; and Liu, Z. 2023 c . Finemogen: Fine-grained spatio-temporal motion generation and editing. In Advances in Neural Information Processing Systems ( NeurIPS ) , volume 36, 13981--13992

  69. [69]

    W.; Xu, M.; Wang, Q.; Wen, Z.; He, X.; Zhao, W.; Gong, K.; and Zhang, M

    Zhang, N.; Li, Z.; Loh, K. W.; Xu, M.; Wang, Q.; Wen, Z.; He, X.; Zhao, W.; Gong, K.; and Zhang, M. 2026. DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding. arXiv preprint arXiv:2602.04188

  70. [70]

    Zhang, Y.; Feng, Y.; Cseke, A.; Saini, N.; Bajandas, N.; Heron, N.; and Black, M. J. 2025 a . PRIMAL : Physically Reactive and Interactive Motor Model for Avatar Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision ( ICCV ) , 12725--12736

  71. [71]

    Zhang, Y.; Huang, D.; Liu, B.; Tang, S.; Lu, Y.; Chen, L.; Bai, L.; Chu, Q.; Yu, N.; and Ouyang, W. 2024 b . MotionGPT : Finetuned LLM s Are General-Purpose Motion Generators. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , 7, 7368--7376

  72. [72]

    Zhang, Y.-Y.; Sun, T.; Fang, P.; Wang, D.-B.; Cai, X.; Zhang, M.-L.; and Kim, H. 2025 b . MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning. arXiv preprint arXiv:2511.18209

  73. [73]

    Zhang, Z.; Ao, T.; Zhang, Y.; Gao, Q.; Lin, C.; Chen, B.; and Liu, L. 2024 c . Semantic gesticulator: Semantics-aware co-speech gesture synthesis. ACM Transactions on Graphics (TOG), 43(4): 1--17

  74. [74]

    Zhao, K.; Li, G.; and Tang, S. 2025. DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control. In International Conference on Learning Representations ( ICLR )

  75. [75]

    Zhao, Q.; Long, P.; Zhang, Q.; Qin, D.; Liang, H.; Zhang, L.; Zhang, Y.; Yu, J.; and Xu, L. 2024. Media2face: Co-speech facial animation generation with multi-modality guidance. In ACM SIGGRAPH 2024 conference papers, 1--13

  76. [76]

    Zheng, B.; Chen, K.; Yao, Y.; Zeng, Z.; Jiang, X.; Wang, H.; Lasenby, J.; and Jin, X. 2025. Autokeyframe: Autoregressive keyframe generation for human motion synthesis and editing. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, 1--12

  77. [77]

    Zhou, W.; Dou, Z.; Cao, Z.; Liao, Z.; Wang, J.; Wang, W.; Liu, Y.; Komura, T.; Wang, W.; and Liu, L. 2024. Emdm: Efficient motion diffusion model for fast and high-quality motion generation. In Proceedings of the European Conference on Computer Vision ( ECCV ) , 18--38

  78. [78]

    Zhu, B.; Jiang, B.; Wang, S.; Tang, S.; Chen, T.; Luo, L.; Zheng, Y.; and Chen, X. 2025. Motiongpt3: Human motion as a second modality. arXiv preprint arXiv:2506.24086