pith · machine review for the scientific record

arxiv: 2604.10927 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

LiveGesture Streamable Co-Speech Gesture Generation Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords co-speech gesture generation · streamable motion model · autoregressive transformer · full-body gesture synthesis · real-time gesture generation · causal motion tokenizer · BEAT2 dataset

The pith

LiveGesture generates full-body co-speech gestures in real time with zero look-ahead while matching offline state-of-the-art performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LiveGesture as a complete framework for producing gestures across the entire body directly from ongoing speech, without access to any future audio frames and without limits on sequence length. Existing approaches require the full speech recording upfront and often process body parts separately or as one entangled unit, which prevents live use. LiveGesture instead builds causal components from the start: a streamable tokenizer turns each body region's motion into discrete tokens that decode incrementally, while hierarchical autoregressive transformers model each region and then fuse their correlated dynamics using only past and present audio. Special masking during training forces the system to recover from the kinds of partial histories and early errors that arise in actual streaming. If the approach holds, gesture generation becomes feasible for live settings such as virtual avatars responding instantly to a speaker.
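
Concretely, the claimed pipeline reduces to a loop: a chunk of audio arrives, a causal encoder updates its running state, region-wise autoregressive predictors emit the next motion token per body region, and the tokenizer decodes one more pose frame. The sketch below is a minimal toy rendering of that loop; the chunk size, the four-way region split, and all three stub functions are illustrative assumptions, not LiveGesture's actual interface.

```python
# Minimal sketch of a zero look-ahead streaming loop (all names assumed).
import numpy as np

CHUNK = 1600                                   # e.g. 100 ms of 16 kHz audio (assumed)
REGIONS = ["upper", "hands", "lower", "face"]  # assumed body-region split

def encode_audio(chunk, state):
    """Stand-in causal audio encoder: its state carries past context only."""
    state = 0.9 * state + 0.1 * float(np.abs(chunk).mean())
    return np.full(8, state), state

def predict_tokens(feat, history):
    """Stand-in region-expert AR step: next token per region from the past."""
    return {r: int(abs(feat.sum()) * 10 + len(history[r])) % 512 for r in REGIONS}

def decode_pose(tokens):
    """Stand-in incremental tokenizer decode: one pose frame per step."""
    return np.array([tokens[r] / 512.0 for r in REGIONS])

rng = np.random.default_rng(0)
state, history = 0.0, {r: [] for r in REGIONS}
for step in range(5):                    # live loop: one chunk in, one frame out
    chunk = rng.standard_normal(CHUNK)   # arrives in real time; no future access
    feat, state = encode_audio(chunk, state)
    tokens = predict_tokens(feat, history)
    for r in REGIONS:
        history[r].append(tokens[r])     # history grows freely: arbitrary length
    print(step, decode_pose(tokens))
```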

Core claim

LiveGesture is the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. It consists of a Streamable Vector Quantized Motion Tokenizer that converts each body region's motion sequence into causal discrete tokens, a Hierarchical Autoregressive Transformer that uses region-expert autoregressive modules plus a causal spatio-temporal fusion layer conditioned on continuously arriving audio, and autoregressive masking training that applies uncertainty-guided token masking and random region masking to build robustness against imperfect histories. On the BEAT2 dataset this produces coherent, diverse, beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero look-ahead conditions.

What carries the argument

The Streamable Vector Quantized Motion Tokenizer (SVQ) that produces causal discrete motion tokens per body region, paired with the Hierarchical Autoregressive Transformer (HAR) that runs region-expert xAR transformers and a causal xAR Fusion module, all conditioned on a streamable causal audio encoder.
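
The zero look-ahead guarantee in both the region experts and the fusion stage comes down to causal attention: position t may attend to positions 0 through t and nothing later. Below is a minimal sketch of that constraint, assuming generic scaled dot-product attention with a lower-triangular mask; it illustrates the mechanism only and is not the paper's xAR Fusion implementation.

```python
# Generic causal attention sketch: frame t sees frames 0..t only.
import numpy as np

def causal_mask(t: int) -> np.ndarray:
    """Position i may attend to positions j <= i and nothing later."""
    return np.tril(np.ones((t, t), dtype=bool))

def masked_attention(q, k, v, mask):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)          # block future timesteps
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

T, D = 6, 8
x = np.random.default_rng(1).standard_normal((T, D))  # toy fused region features
out = masked_attention(x, x, x, causal_mask(T))
assert out.shape == (T, D)  # each frame was computed without future frames
```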

If this is right

  • Full-body gesture sequences of any length can be produced causally from continuous speech input without fixed windows or future frames.
  • Region-specific motion dynamics are modeled separately yet coordinated through causal fusion, preserving both fine-grained detail and inter-region consistency.
  • Performance on the BEAT2 benchmark reaches or exceeds offline methods even when the model receives only past and present audio under true zero look-ahead conditions.
  • The system remains robust to streaming noise because training explicitly exposes it to partially erroneous token histories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The causal architecture could be adapted to other incremental generation tasks such as live facial animation or sign-language synthesis from speech.
  • Deployment in real-time applications like virtual meetings would require only the addition of an audio buffer and motion renderer, since the model already supports arbitrary lengths.
  • If the masking strategy generalizes, similar uncertainty-guided training could improve robustness in other autoregressive streaming models for motion or video.

Load-bearing premise

The autoregressive masking training with uncertainty-guided token masking and random region masking sufficiently prepares the model for the kinds of prediction errors and incomplete histories that occur during actual live streaming inference.
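
For concreteness, here is a hedged sketch of what such a masking regime could look like when applied to ground-truth token histories. The entropy-based uncertainty proxy, the vocabulary size, and the replace-with-random-token corruption are all assumptions standing in for details the review does not specify.

```python
# Hedged sketch: uncertainty-guided token masking + random region masking.
import numpy as np

rng = np.random.default_rng(2)
R, T, V = 4, 12, 512                     # regions, timesteps, vocab (assumed)
gt_tokens = rng.integers(0, V, size=(R, T))
probs = rng.dirichlet(np.ones(V), size=(R, T))      # stand-in model posteriors

entropy = -(probs * np.log(probs + 1e-9)).sum(-1)   # per-token uncertainty proxy
k = 3
uncertain = np.argsort(entropy, axis=1)[:, -k:]     # k most uncertain positions

noisy = gt_tokens.copy()
for r in range(R):
    noisy[r, uncertain[r]] = rng.integers(0, V, size=k)   # token masking
drop = int(rng.integers(0, R))
noisy[drop] = rng.integers(0, V, size=T)                  # random region masking

# `noisy` is the imperfect history the model would be trained to recover
# from, approximating the error patterns of live inference.
print((noisy != gt_tokens).sum(), "corrupted tokens")
```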

What would settle it

Running the trained model in a true end-to-end streaming pipeline on long unseen speech sequences from BEAT2 or a similar dataset and measuring whether gesture beat synchronization, diversity, and naturalness scores remain comparable to offline baselines when evaluated against ground-truth motion.
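
The beat-synchronization half of that test can be made concrete. Following the description quoted in reference [52] (motion beats detected as local minima of upper-body joint velocity, scored against prosodic beats), the sketch below computes a Beat Constancy-style alignment score; the local-minimum detector and the Gaussian tolerance sigma are illustrative choices, not the paper's exact formula.

```python
# Beat-synchrony sketch in the spirit of Beat Constancy (illustrative only).
import numpy as np

def motion_beats(joint_speed: np.ndarray) -> np.ndarray:
    """Indices where speed is a local minimum (gesture 'strikes')."""
    s = joint_speed
    return np.where((s[1:-1] < s[:-2]) & (s[1:-1] < s[2:]))[0] + 1

def beat_align(motion_b, audio_b, sigma=3.0):
    """Mean Gaussian proximity of each audio beat to its nearest motion beat."""
    if len(motion_b) == 0 or len(audio_b) == 0:
        return 0.0
    d = np.abs(audio_b[:, None] - motion_b[None, :]).min(axis=1)
    return float(np.exp(-(d ** 2) / (2 * sigma ** 2)).mean())

rng = np.random.default_rng(3)
speed = np.abs(rng.standard_normal(120)).cumsum() % 5.0   # toy joint-speed trace
audio_beats = np.arange(10, 120, 15)                      # toy prosodic beats
print(beat_align(motion_beats(speed), audio_beats))
```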

Figures

Figures reproduced from arXiv:2604.10927 by Ahmed Helmy, Chen Chen, Ekkasit Pinyoanuntapong, Hongfei Xue, Li Yang, Mayur Jagdishbhai Patel, Muhammad Usama Saleem, Pu Wang, Zhongxing Qin.

Figure 1: LiveGesture overview. Given live audio chunks, our framework generates full-body SMPL-X motion online with zero look-ahead.
Figure 2: Overview of the Streamable Asymmetric Motion Tokenizer.
Figure 3: Overview of the Hierarchical Autoregressive Model ...
Figure 4: Qualitative comparison with state-of-the-art methods on BEAT2.
Figure 5: Interactive human–avatar conversation enabled by LiveGesture.
Original abstract

We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods, which are designed for offline generation and either treat body regions independently or entangle all joints within a single model, LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-expert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero look-ahead conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes LiveGesture, the first fully streamable speech-driven full-body gesture generation framework with zero look-ahead and arbitrary sequence length. It introduces a Streamable Vector Quantized Motion Tokenizer (SVQ) for causal discrete motion tokens per body region and a Hierarchical Autoregressive Transformer (HAR) using region-expert autoregressive (xAR) transformers plus a causal xAR Fusion module, both conditioned on a streamable causal audio encoder. Training incorporates autoregressive masking (uncertainty-guided token masking and random region masking) to build robustness to imperfect histories. Experiments on BEAT2 are reported to show coherent, diverse, beat-synchronous gestures matching or surpassing offline SOTA under true streaming conditions.

Significance. If the streaming performance claims hold with rigorous quantitative support, the work would be significant for enabling practical real-time applications in virtual agents, VR, and HCI by closing the gap between high-quality offline co-speech gesture models and causal online requirements. The causal architecture and masking strategy represent a targeted engineering contribution, though its robustness to closed-loop error accumulation remains to be fully validated.

major comments (1)
  1. [Description of autoregressive masking training and inference procedure] The autoregressive masking training (uncertainty-guided token masking and random region masking) is presented as the mechanism to handle prediction errors in live streaming. However, these strategies are applied to ground-truth sequences with artificial masks rather than to the model's own accumulated predictions in closed-loop inference. This leaves open whether the training distribution matches the error patterns arising from the xAR and xAR Fusion modules when feeding generated tokens back as history, which is load-bearing for the central claim of matching offline SOTA under true zero look-ahead streaming. A toy illustration of this teacher-forced versus closed-loop gap is sketched after the minor comments below.
minor comments (2)
  1. [Abstract] The abstract states that experiments 'demonstrate' matching or surpassing SOTA but provides no numerical metrics, ablation results, or error analysis; these quantitative details should be summarized in the abstract or introduction for immediate assessment of the claims.
  2. [Hierarchical Autoregressive Transformer (HAR) section] Notation for the xAR Fusion module and its conditioning on live audio should be clarified with explicit equations showing causality constraints, as the current description leaves the exact integration of region correlations ambiguous.
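
The gap named in the major comment is easy to exhibit with a toy autoregressive predictor: under teacher forcing every step conditions on ground truth, while under closed-loop rollout the model conditions on its own, possibly wrong, outputs and errors compound. A minimal sketch follows; the 80%-accurate stand-in model is an arbitrary assumption, and the comparison it prints is what a scheduled-sampling-style ablation would quantify.

```python
# Toy contrast: teacher-forced history vs. closed-loop rollout.
import numpy as np

rng = np.random.default_rng(4)
V, T = 16, 20
gt = np.arange(T) % V          # toy ground truth: token increments by 1

def toy_model(prev: int) -> int:
    """Stand-in AR predictor: right 80% of the time, else a random token."""
    return (prev + 1) % V if rng.random() < 0.8 else int(rng.integers(0, V))

teacher_forced = [toy_model(int(gt[t - 1])) for t in range(1, T)]  # history = ground truth
closed_loop, prev = [], int(gt[0])
for _ in range(1, T):
    prev = toy_model(prev)       # history = model's own (possibly wrong) output
    closed_loop.append(prev)

# Errors compound only in the closed-loop run; matching these two error
# distributions is exactly what the referee asks the training to justify.
print(sum(p != g for p, g in zip(teacher_forced, gt[1:])),
      sum(p != g for p, g in zip(closed_loop, gt[1:])))
```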

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the insightful comment on our training procedure. We address the concern directly below and propose revisions to improve clarity.

Point-by-point responses
  1. Referee: The autoregressive masking training (uncertainty-guided token masking and random region masking) is presented as the mechanism to handle prediction errors in live streaming. However, these strategies are applied to ground-truth sequences with artificial masks rather than to the model's own accumulated predictions in closed-loop inference. This leaves open whether the training distribution matches the error patterns arising from the xAR and xAR Fusion modules when feeding generated tokens back as history, which is load-bearing for the central claim of matching offline SOTA under true zero look-ahead streaming.

    Authors: We agree that the masking is performed on ground-truth sequences with artificial perturbations rather than on the model's own closed-loop predictions. This is a standard teacher-forcing approximation chosen to expose the model to imperfect histories while avoiding the severe instability and slow convergence that full closed-loop training often produces in hierarchical autoregressive setups. The uncertainty-guided token masking and random region masking are calibrated to approximate the kinds of local and cross-region errors observed at inference. Our BEAT2 streaming results (matching offline SOTA under true zero look-ahead conditions) provide empirical support that the strategy transfers effectively, but we acknowledge the distribution mismatch remains a valid concern. In the revised manuscript we will (1) expand the method section with an explicit discussion of this training-inference gap and the rationale for the chosen masking schedule, (2) add an ablation that reports error statistics under both artificial masking and a limited closed-loop simulation on validation data, and (3) include a short limitations paragraph noting that full closed-loop robustness validation is left for future work. These changes will make the load-bearing claim more transparent without altering the core architecture or reported numbers.

    revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims rest on external empirical evaluation

Full rationale

The paper introduces LiveGesture with SVQ tokenizer, HAR (xAR + xAR Fusion), and autoregressive masking training as architectural and training choices. These are then evaluated empirically on the external BEAT2 dataset under streaming conditions. No equations, parameters, or claims reduce by construction to the inputs; the masking strategy is a proposed robustness technique whose effectiveness is tested rather than assumed. No self-citations, uniqueness theorems, or fitted inputs are invoked as load-bearing for the core claims. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 2 invented entities

The central claim depends on the empirical effectiveness of the newly introduced SVQ and HAR modules plus the masking training strategy; these are validated on BEAT2 rather than derived from first principles or external benchmarks.

free parameters (1)
  • model architecture hyperparameters
    Standard neural network design choices (layer counts, token vocabulary size, masking probabilities) that are fitted or selected during development.
invented entities (2)
  • Streamable Vector Quantized Motion Tokenizer (SVQ) · no independent evidence
    purpose: Converts per-region motion sequences into causal discrete tokens for real-time decoding.
    New component introduced to enable streamable tokenization.
  • Hierarchical Autoregressive Transformer (HAR) with xAR Fusion · no independent evidence
    purpose: Models fine-grained per-region dynamics and integrates cross-region correlations under live audio conditioning.
    New hierarchical structure proposed for coordinated streaming motion.

pith-pipeline@v0.9.0 · 5598 in / 1216 out tokens · 35909 ms · 2026-05-10T16:43:25.573626+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz ...
  2. [2] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody Dance Now. In ICCV, 2019.
  3. [3] Bohong Chen, Yumeng Li, Yao-Xiang Ding, Tianjia Shao, and Kun Zhou. Enabling synergistic full-body control in prompt-based co-speech motion generation. In Proceedings of the 32nd ACM International Conference on Multimedia, New York, NY, USA, 2024. ACM.
  4. [4] Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, and Qifeng Chen. DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation, 2024.
  5. [5] Anna Deichler, Shivam Mehta, Simon Alexanderson, and Jonas Beskow. Diffusion-based co-speech gesture generation using joint text and audio representation. In International Conference on Multimodal Interaction. ACM, 2023.
  6. [6] Luca Della Libera, Cem Subakan, and Mirco Ravanelli. FocalCodec-Stream: Streaming low-bitrate speech coding via causal distillation. arXiv preprint arXiv:2509.16195, 2025.
  7. [7] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12868–12878, 2020.
  8. [8] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
  9. [9] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik. Learning Individual Styles of Conversational Gesture. In CVPR. IEEE, 2019.
  10. [10] Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. MoMask: Generative masked modeling of 3D human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024.
  11. [11] Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed Elgharib, and Christian Theobalt. Learning speech-driven 3D conversational gestures from video. arXiv preprint arXiv:2102.06837, 2021.
  12. [12] Chao Huang, Dejan Markovic, Chenliang Xu, and Alexander Richard. Modeling and driving human body soundfields through acoustic primitives, 2024.
  13. [13] Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. MotionGPT: Human motion as a foreign language. ArXiv, abs/2306.14795, 2023.
  14. [14] Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He, and Linchao Bao. Audio2Gestures: Generating diverse gestures from speech audio with conditional variational autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11293–11302, 2021.
  15. [15] Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI Choreographer: Music Conditioned 3D Dance Generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13401–13412, 2021.
  16. [16] Haiyang Liu, Naoya Iwamoto, Zihao Zhu, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. DisCo: Disentangled Implicit Content and Rhythm Learning for Diverse Co-Speech Gestures Synthesis. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3764–3773, 2022.
  17. [17] Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis. arXiv preprint arXiv:2203.05297, 2022.
  18. [18] Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Naoya Iwamoto, Bo Zheng, and Michael J. Black. EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Masked Audio Gesture Modeling. arXiv preprint arXiv:2401.00374, 2023.
  19. [19] Haiyang Liu, Xingchao Yang, Tomoya Akiyama, Yuantian Huang, Qiaoge Li, Shigeru Kuriyama, and Takafumi Taketomi. TANGO: Co-speech gesture video reenactment with hierarchical audio motion embedding and diffusion interpolation. arXiv preprint arXiv:2410.04221, 2024.
  20. [20] Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, and Zerrin Yumak. SemGes: Semantics-aware co-speech gesture generation using semantic coherence and relevance learning, 2025.
  21. [21] Pinxin Liu, Haiyang Liu, Luchuan Song, and Chenliang Xu. Intentional gesture: Deliver your intentions with gestures for speech, 2025.
  22. [22] Pinxin Liu, Luchuan Song, Junhua Huang, and Chenliang Xu. GestureLSM: Latent shortcut based co-speech gesture generation with spatial-temporal modeling. In IEEE/CVF International Conference on Computer Vision, 2025.
  23. [23] Pinxin Liu, Pengfei Zhang, Hyeongwoo Kim, Pablo Garrido, Ari Shapiro, and Kyle Olszewski. Contextual gesture: Co-speech gesture video generation through context-aware gesture representation, 2025.
  24. [24] Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and Bolei Zhou. Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation. In CVPR, pages 10462–10472, 2022.
  25. [25] Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. Towards variable and coordinated holistic co-speech motion generation. arXiv preprint arXiv:2404.00368, 2024.
  26. [26] Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, et al. VITA-Audio: Fast interleaved cross-modal token generation for efficient large speech-language model. arXiv preprint arXiv:2505.03739, 2025.
  27. [27] M. Hamza Mughal, Rishabh Dabral, Merel C. J. Scholman, Vera Demberg, and Christian Theobalt. Retrieving semantics from the deep: an RAG solution for gesture synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16578–16588, 2025.
  28. [28] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019.
  29. [29] Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Pu Wang, Minwoo Lee, Srijan Das, and Chen Chen. BAMM: Bidirectional autoregressive motion model. In Computer Vision – ECCV 2024, 2024.
  30. [30] Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, and Chen Chen. MMM: Generative masked motion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  31. [31] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. ArXiv, abs/2102.12092, 2021.
  32. [32] Luchuan Song, Bin Liu, and Nenghai Yu. Talking face video generation with editable expression. In Image and Graphics: 11th International Conference, ICIG 2021, Haikou, China, August 6–8, 2021, Proceedings, Part III, pages 753–764. Springer, 2021.
  33. [33] Luchuan Song, Guojun Yin, Bin Liu, Yuhui Zhang, and Nenghai Yu. FSFT-Net: Face transfer video generation with few-shot views. In 2021 IEEE International Conference on Image Processing (ICIP), pages 3582–3586. IEEE, 2021.
  34. [34] Luchuan Song, Lele Chen, Celong Liu, Pinxin Liu, and Chenliang Xu. TextToon: Real-time text toonify head avatar from single video. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.
  35. [35] Luchuan Song, Pinxin Liu, Lele Chen, Guojun Yin, and Chenliang Xu. Tri²-plane: Thinking head avatar via feature pyramid. In European Conference on Computer Vision, pages 1–20. Springer, 2024.
  36. [36] Luchuan Song, Pinxin Liu, Guojun Yin, and Chenliang Xu. Adaptive super resolution for one-shot talking-head generation. In ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4115–4119, 2024.
  37. [37] Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, et al. MagiCodec: Simple masked Gaussian-injected codec for high-fidelity reconstruction and generation. arXiv preprint arXiv:2506.00385, 2025.
  38. [38] Yunlong Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Yunzhong Xiao, Chao Huang, Luchuan Song, Susan Liang, et al. Generative AI for cel-animation: A survey. arXiv preprint arXiv:2501.06250, 2025.
  39. [39] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
  40. [40] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In CVPR, 2018.
  41. [41] Will Williams, Sam Ringer, Tom Ash, John Hughes, David Macleod, and Jamie Dougherty. Hierarchical quantized autoencoders. ArXiv, abs/2002.08111, 2020.
  42. [42] Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, and Tien-Tsin Wong. CodeTalker: Speech-driven 3D facial animation with discrete motion prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12780–12790, 2023.
  43. [43] Zunnan Xu, Yachao Zhang, Sicheng Yang, Ronghui Li, and Xiu Li. Chain of generation: Multi-modal gesture synthesis via cascaded conditional control, 2023.
  44. [44] Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, and Xiu Li. MambaTalk: Efficient holistic gesture synthesis with selective state space models, 2024.
  45. [45] Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and Michael J. Black. Generating Holistic 3D Human Motion from Speech. In CVPR, 2023.
  46. [46] Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity. ACM TOG, 39(6), 2020.
  47. [47] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. ArXiv, abs/2110.04627, 2021.
  48. [48] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xiaodong Shen. Generating human motion from textual descriptions with discrete representations. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14730–14740, 2023.
  49. [49] Pengfei Zhang, Pinxin Liu, Hyeongwoo Kim, Pablo Garrido, and Bindita Chaudhuri. KinMo: Kinematic-aware human motion understanding and generation, 2024.
  50. [50] Chongyang Zhong, Lei Hu, Zihao Zhang, and Shihong Xia. AttT2M: Text-driven human motion generation with multi-perspective attention mechanism. ArXiv, abs/2309.00796, 2023.
  51. [51] Supplementary Material A. Overview. The supplementary material is organized as follows: Section B: Implementation Details of the Streamable Vector-Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR); Section C: Evaluation Metrics; Section D: SOTA Comparison: Quantitative Results Without Face Module; Section E: SOTA C...
  52. [52] (2) Global translation of the SMPL-X body is removed prior to evaluation. Higher values indicate more expressive and varied motion under causal token-by-token prediction. Beat Constancy (BC). Beat Constancy [15] evaluates the synchrony between motion beats and prosodic beats in the audio. Motion beats are detected from local minima of upper-body joint ve...
  53. [53] causal self-attention. (4) This complements body-motion metrics by evaluating fine-grained facial deformation fidelity. D. SOTA Comparison: Quantitative Results Without Face Module. Table 1. Comparison with state-of-the-art methods on BEAT2 without the facial motion module. LiveGesture remains the only zero-look-ahead streaming model while achieving competitive or superior perfor...