pith. machine review for the scientific record.

arxiv: 2605.01197 · v1 · submitted 2026-05-02 · 💻 cs.SD · cs.MM

Recognition: unknown

MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation

Kaimin Wang, Kaixing Yang, Ke Qiu, Tianzhi Jia, Xiaole Yang, Yawen Qin

Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3

classification 💻 cs.SD cs.MM
keywords music-driven gesture generation · transformer model · 3D motion synthesis · conducting gestures · SMPL body model · cross-modal learning · autoregressive prediction · alignment loss

The pith

A Transformer framework generates 3D conducting gestures from music while maintaining beat synchronization and long-range musical structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops TransConductor to create expressive 3D conductor motions driven by music input. It constructs a new dataset called ConductorMotion by recovering detailed body poses from conducting videos using SMPL parameters. The model employs temporal Transformer encoders and decoders to predict pose parameters autoregressively from acoustic features and initial poses. An alignment loss helps ensure the gestures correspond artistically to the music. A retrieval-based evaluator embeds both modalities to compute metrics that better reflect synchronization and correspondence than traditional ones. If the approach holds, it advances cross-modal motion synthesis for professional conducting performances.

Core claim

TransConductor is a Transformer-based framework for music-driven 3D conducting gesture generation that uses a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder to autoregressively predict SMPL pose parameters from audio descriptors and an initial pose. The system is supported by the ConductorMotion dataset built from conducting videos and evaluated with a new retrieval-based model that measures music-gesture alignment through shared embeddings, yielding metrics including FID, modality distance, multi-modality distance, and diversity. Experiments demonstrate that it outperforms dance-generation and conducting-generation baselines, with ablations confirming the value of the Transformer backbone and the proposed alignment loss.

What carries the argument

The Trans-Temporal Music Encoder paired with the Trans-Temporal Conducting Gesture Decoder, which together handle temporal dependencies to enforce long-range musical structure and beat-level synchronization in the generated SMPL poses.
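
The abstract does not spell out the internal design of these modules, so the following is only a minimal sketch of a music-conditioned autoregressive Transformer of the kind described. The feature widths, layer counts, and the 24-joint × 6D rotation pose dimension are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of a music-conditioned autoregressive Transformer.
# All sizes (music_dim, pose_dim, d_model, layer counts) are assumptions,
# not the configuration of the Trans-Temporal modules in the paper.
import torch
import torch.nn as nn

class MusicToGestureSketch(nn.Module):
    def __init__(self, music_dim=128, pose_dim=24 * 6, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.music_proj = nn.Linear(music_dim, d_model)   # acoustic descriptors -> model width
        self.pose_proj = nn.Linear(pose_dim, d_model)     # SMPL pose parameters -> model width
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.head = nn.Linear(d_model, pose_dim)

    def forward(self, music_feats, pose_prefix):
        # music_feats: (B, T_music, music_dim); pose_prefix: (B, T_pose, pose_dim)
        memory = self.encoder(self.music_proj(music_feats))
        tgt = self.pose_proj(pose_prefix)
        t = tgt.size(1)
        # Causal mask so each frame attends only to earlier gesture frames.
        causal = torch.triu(torch.full((t, t), float("-inf"), device=tgt.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                             # per-step predicted pose parameters

@torch.no_grad()
def generate(model, music_feats, init_pose, steps):
    """Roll gestures out autoregressively from an initial pose, one frame at a time."""
    poses = init_pose                                      # (B, 1, pose_dim)
    for _ in range(steps):
        next_pose = model(music_feats, poses)[:, -1:, :]
        poses = torch.cat([poses, next_pose], dim=1)
    return poses
```

On this reading, beat-level synchronization has to come from the decoder's cross-attention over the encoded music memory, which is why the encoder/decoder pair is the load-bearing component named above.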

If this is right

  • Outperforms existing dance and conducting generation methods on standard and custom metrics.
  • Benefits from the Transformer architecture for handling sequence data in motion synthesis.
  • The alignment loss improves music-gesture correspondence (one plausible form is sketched after this list).
  • The ConductorMotion dataset provides a targeted resource for professional conducting gesture research.
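
The paper only names an alignment loss; its exact form is not given in the abstract. A common way to realize such a loss is a symmetric contrastive (InfoNCE-style) objective over paired music and gesture embeddings, sketched below as an assumption rather than the authors' definition.

```python
# Hypothetical alignment loss: the paper names an "alignment loss" but does not
# define it in the abstract; this symmetric InfoNCE-style contrastive form is one
# common instantiation, not the authors' formulation.
import torch
import torch.nn.functional as F

def alignment_loss(music_emb, gesture_emb, temperature=0.07):
    """music_emb, gesture_emb: (B, D) embeddings of paired music/gesture windows."""
    m = F.normalize(music_emb, dim=-1)
    g = F.normalize(gesture_emb, dim=-1)
    logits = m @ g.t() / temperature                     # (B, B) pairwise similarities
    targets = torch.arange(m.size(0), device=m.device)   # matched pairs lie on the diagonal
    # Pull matched pairs together and push mismatched pairs apart, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```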

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar Transformer designs could extend to other music-driven body motions like dance or instrument playing.
  • Integration into virtual reality systems might enable real-time conducting avatars for remote performances.
  • Testing on diverse music genres could reveal limitations in generalization beyond the training data.

Load-bearing premise

The retrieval-based evaluation model that embeds music and gestures into a shared space accurately measures artistic correspondence and synchronization beyond what standard metrics capture.
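
The abstract lists the metrics but not their formulas. Assuming the standard definitions over embeddings in the shared space, they could be computed roughly as below; the Fréchet distance for FID and the mean embedding distances for modality distance and diversity are stand-ins, not the authors' exact protocol.

```python
# Sketch of how the reported metrics could be read off a shared music-gesture
# embedding space. Definitions are assumed standard forms, not the paper's own.
import numpy as np
from scipy import linalg

def frechet_distance(real_emb, gen_emb):
    """FID-style distance between real and generated gesture embeddings, shape (N, D)."""
    mu_r, mu_g = real_emb.mean(0), gen_emb.mean(0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)

def modality_distance(music_emb, gesture_emb):
    """Mean distance between paired music and gesture embeddings (lower = better aligned)."""
    return np.linalg.norm(music_emb - gesture_emb, axis=1).mean()

def diversity(gesture_emb, pairs=200, seed=0):
    """Mean distance between randomly sampled generated-gesture embeddings."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(gesture_emb), pairs)
    j = rng.integers(0, len(gesture_emb), pairs)
    return np.linalg.norm(gesture_emb[i] - gesture_emb[j], axis=1).mean()
```

If the embedding space itself is unvalidated, every one of these numbers inherits its biases, which is exactly what the referee report below presses on.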

What would settle it

A controlled user study with conducting experts comparing the synchronization and expressiveness of TransConductor outputs against baselines on unseen music pieces; failure would occur if no significant preference or alignment difference is found.

Figures

Figures reproduced from arXiv: 2605.01197 by Kaimin Wang, Kaixing Yang, Ke Qiu, Tianzhi Jia, Xiaole Yang, Yawen Qin.

Figure 1. Given a piece of music, MG-Former generates an audio-visually synchronized 3D conducting gesture sequence. The task requires …
Figure 2. Overview of the CG-Data construction pipeline. Videos are collected and cleaned, then processed by pose estimation, anomaly …
Figure 3. Overview of MG-Former. Music descriptors are encoded by the Trans-Temporal Music Encoder. The gesture decoder autore…
Figure 4. Retrieval model used for evaluation. A music encoder …
Figure 5. Generated conducting gestures across diverse musical emotions, including passionate, solemn, joyful, lyrical, and sorrowful …
Figure 6. Generated gestures for choral and solo conducting scenarios. The visualizations show that MG-Former can adapt gesture range …
read the original abstract

Generating expressive conducting gestures from music is a challenging cross-modal motion synthesis problem: the output must follow long-range musical structure, preserve beat-level synchronization, and remain plausible as a fine-grained 3D human performance. Existing conducting-motion studies are often limited by sparse pose representations, small-scale data, or evaluation protocols that do not directly measure whether music and gesture are mutually aligned. This paper presents TransConductor, a Transformer-based framework for music-driven conducting gesture generation. We introduce ConductorMotion, a SMPL-parameter data construction pipeline that recovers detailed body motion from conducting videos and forms a dataset targeted at professional conducting gestures. Given acoustic descriptors extracted from audio and an initial pose, TransConductor uses a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder to autoregressively predict SMPL pose parameters. To better assess artistic correspondence, we further build a retrieval-based evaluation model that embeds music and gestures into a shared space and yields FID, modality distance, multi-modality distance, and diversity metrics. Experiments show that TransConductor outperforms dance-generation and conducting-generation baselines, while ablations verify the benefits of the Transformer backbone and the proposed alignment loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces TransConductor, a Transformer-based autoregressive framework for music-driven 3D conducting gesture generation using SMPL pose parameters. It presents the ConductorMotion dataset constructed via a pipeline that recovers detailed body motion from conducting videos. The model employs a Trans-Temporal Music Encoder and Trans-Temporal Conducting Gesture Decoder. To evaluate artistic correspondence and synchronization, the authors propose a retrieval-based model that embeds music and gestures into a shared space, from which they derive FID, modality distance, multi-modality distance, and diversity metrics. Experiments claim outperformance over dance-generation and conducting-generation baselines, with ablations supporting the Transformer backbone and proposed alignment loss.

Significance. If the central empirical claims hold and the retrieval-based evaluation is shown to reliably capture artistic correspondence beyond dataset biases, the work would advance cross-modal motion synthesis by targeting the specialized domain of conducting gestures with long-range structure and beat-level synchronization. The ConductorMotion dataset and SMPL-parameter pipeline constitute a concrete, reusable contribution for future research in fine-grained 3D human performance generation.

major comments (1)
  1. [Experiments] Experiments section: The outperformance claims rest entirely on FID, modality distance, multi-modality distance, and diversity metrics produced by the custom retrieval-based evaluation model that embeds music and gestures into a shared space. The manuscript provides no validation that this embedding ranks plausible conducting pairs higher than implausible ones, reports no retrieval accuracy on held-out pairs, and shows no correlation with human expert ratings of musicality or synchronization. Without such evidence, the metrics may simply reflect the alignment loss or dataset artifacts rather than genuine improvement in conducting quality.
minor comments (2)
  1. [Title and Abstract] The abstract refers to the proposed framework as TransConductor while the title uses MG-Former; the relationship between these names should be clarified for consistency.
  2. [Abstract] The abstract states that experiments show outperformance and ablation benefits but supplies no numerical values, data-split details, or error analysis; including at least summary statistics would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our evaluation approach. We address the concern regarding validation of the retrieval-based metrics below and will incorporate additional supporting analyses in the revised manuscript.

read point-by-point responses
  1. Referee: The outperformance claims rest entirely on FID, modality distance, multi-modality distance, and diversity metrics produced by the custom retrieval-based evaluation model that embeds music and gestures into a shared space. The manuscript provides no validation that this embedding ranks plausible conducting pairs higher than implausible ones, reports no retrieval accuracy on held-out pairs, and shows no correlation with human expert ratings of musicality or synchronization. Without such evidence, the metrics may simply reflect the alignment loss or dataset artifacts rather than genuine improvement in conducting quality.

    Authors: We appreciate the referee raising this critical point about metric validation. The retrieval-based evaluation model is trained independently using a contrastive loss on paired music-gesture samples from ConductorMotion to learn a shared embedding space, with the FID and distance metrics derived from distances in this space. While the original manuscript does not report explicit retrieval accuracy (such as top-k recall on held-out matching vs. negative pairs) or correlation with human ratings, we agree these would strengthen the claims. In the revision we will add retrieval accuracy results on a held-out test split to demonstrate that the embedding prioritizes plausible pairs. We acknowledge that a new human expert study correlating metrics with ratings of musicality and synchronization is not present and would require additional data collection; we will therefore note this as a limitation and future direction rather than claiming such correlation. These additions should help confirm that improvements are not solely due to the alignment loss or dataset biases. revision: partial
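
If the retrieval check promised in the rebuttal were added, one straightforward report would be recall@k on held-out matched pairs in the shared embedding space. The protocol below is an assumed illustration of that check, not something the authors report.

```python
# Hypothetical retrieval-accuracy check for the evaluation embedding:
# given a held-out music clip, does the true paired gesture rank in the top k?
import numpy as np

def recall_at_k(music_emb, gesture_emb, k=5):
    """music_emb[i] and gesture_emb[i] form a held-out matched pair, shape (N, D)."""
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    g = gesture_emb / np.linalg.norm(gesture_emb, axis=1, keepdims=True)
    sims = m @ g.T                              # (N, N) cosine similarities
    ranks = np.argsort(-sims, axis=1)           # most similar gestures first
    hits = (ranks[:, :k] == np.arange(len(m))[:, None]).any(axis=1)
    return hits.mean()                          # fraction of clips whose true gesture is in top k
```

A high recall@k on pairs never seen by the generator would support, though not prove, that the metric space measures genuine music-gesture correspondence rather than artifacts of the alignment loss.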

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical claims are self-contained

full rationale

The manuscript presents an empirical ML framework (TransConductor) with a Transformer encoder-decoder, a new dataset pipeline, an alignment loss, and a separate retrieval-based evaluation model that produces standard metrics (FID, modality distances, diversity). No equations, derivations, or parameter-fitting steps are described that reduce by construction to the model's own outputs or inputs. The outperformance claims rest on comparisons against baselines and ablations, which are externally falsifiable and do not invoke self-citation chains or self-definitional loops. The evaluation embedding is introduced as an additional tool rather than a fitted component whose results are then renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review yields insufficient detail to enumerate specific free parameters, axioms, or invented entities beyond the high-level components named; SMPL model and Transformer architecture are treated as standard background.

invented entities (2)
  • TransConductor framework (no independent evidence)
    purpose: Music-to-3D-conducting-gesture generation
    Core proposed system combining the Trans-Temporal Music Encoder and Gesture Decoder
  • ConductorMotion dataset (no independent evidence)
    purpose: Targeted professional conducting gesture data from SMPL parameters
    New data construction pipeline from conducting videos

pith-pipeline@v0.9.0 · 5520 in / 1217 out tokens · 26438 ms · 2026-05-10T15:31:38.956453+00:00 · methodology

discussion (0)

