MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

arXiv: 2605.18956 · v1 · pith:M5XYYPEJ · submitted 2026-05-18 · 💻 cs.CV

Bizhu Wu, Jinheng Xie, Wenting Chen, Zhe Kong, Jianfeng Ren, Linlin Shen, Ruibin Bai, Rong Qu

Pith reviewed 2026-05-20 10:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords human motion · motion editing · fine-grained control · language model · motion reasoning · pre-training · chain-of-thought · dataset

The pith

MotionMERGE lets a single language model control and reason about human motions at the level of specific body parts and time steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current motion-language models stay coarse and cannot isolate changes to one limb or one moment, which blocks detailed animation and interaction work. The paper claims that explicitly modeling motion at part and temporal levels inside one LLM, then training with joint supervision on alignment, grounding, coherency, and motion-grounded chain-of-thought reasoning, gives the model the priors needed for precise control. The authors back this with a new large dataset of fine-grained corrective instructions and reasoning traces. If the claim holds, text instructions can now drive localized edits without disturbing the rest of the motion sequence.

Core claim

MotionMERGE bridges the granularity gap by explicitly modeling motion at part and temporal levels within a single LLM and applying Reasoning-Aware Granularity-Synergy (RAGS) pre-training. This pre-training supplies joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. The work also releases the MotionFineEdit dataset of 837K atomic and 144K complex triplets that carry fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations. Experiments show the resulting model produces more precise generation, understanding, and editing while generalizing zero-shot to other complex motion tasks.

What carries the argument

Reasoning-Aware Granularity-Synergy pre-training that jointly supervises cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought reasoning.

If this is right

The model performs more precise motion generation, understanding, and editing at fine granularity.
It exhibits compelling zero-shot generalization to other complex motion tasks.
A new benchmark is created for fine-grained text-driven motion editing and motion-grounded reasoning.
The model acquires fine-grained motion-language alignment and explicit reasoning ability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Animators could issue natural-language instructions that change only a chosen limb or moment without rewriting the full sequence.
The same multi-level structure might support interactive editing loops where a user refines a motion step by step.
Robotics pipelines that plan human-like actions could adopt similar part-and-time supervision to improve safety and precision.

Load-bearing premise

The assumption that explicitly modeling motion at part and temporal levels inside one LLM plus joint supervision across granularities and reasoning tasks will create robust priors for precise localized control.

What would settle it

A controlled test set of instructions that ask the model to edit only one named body part or time interval; if the output motion changes other parts or times at rates comparable to coarse baselines, the fine-grained claim fails.
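One way such a test could be scored is sketched below. It assumes motions are stored as lists of per-frame dictionaries that map a body-part name to a feature vector, and that the source and edited motions have the same number of frames. The function name `edit_leakage`, the part names, and the mean-absolute-difference measure are illustrative choices, not metrics taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: one dict per frame, body-part name -> feature vector.
Frame = Dict[str, Sequence[float]]


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def average(values: List[float]) -> float:
    """Arithmetic mean, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def edit_leakage(
    source: List[Frame],
    edited: List[Frame],
    target_part: str,
    target_span: Tuple[int, int],
) -> Dict[str, float]:
    """Average change inside vs. outside the requested edit region.

    target_span is a half-open frame interval [start, end).
    """
    if len(source) != len(edited):
        raise ValueError("this sketch assumes equal-length motions")
    start, end = target_span
    inside: List[float] = []
    outside: List[float] = []
    for t, (src, out) in enumerate(zip(source, edited)):
        for part in src:
            change = mean_abs_diff(src[part], out[part])
            in_region = part == target_part and start <= t < end
            (inside if in_region else outside).append(change)
    return {"inside": average(inside), "outside": average(outside)}


# Toy example: raise the left arm during frames 1 and 2 only.
source = [{"left_arm": [0.0, 0.0], "right_leg": [1.0, 1.0]} for _ in range(4)]
edited = [dict(frame) for frame in source]
for t in (1, 2):
    edited[t]["left_arm"] = [0.5, 0.5]

scores = edit_leakage(source, edited, "left_arm", (1, 3))
```

Under this reading, a well-localized edit shows a large `inside` value and an `outside` value near zero; an `outside` value comparable to that of a coarse baseline would count against the fine-grained claim.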

Figures

Figures reproduced from arXiv: 2605.18956 by Bizhu Wu, Jianfeng Ren, Jinheng Xie, Linlin Shen, Rong Qu, Ruibin Bai, Wenting Chen, Zhe Kong.

**Figure 2.** Overview of MotionMERGE. Motions are converted into special text, allowing all tasks to be formulated as conditional text generation. The framework comprises a motion VQ-VAE that transforms continuous motion into discrete tokens, and a T5-based language model that processes interleaved text and motion tokens. It explicitly handles diverse motion-language tasks (e.g., generation, editing) at both global and…
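The tokenization step named in this caption can be illustrated with a small sketch. It shows only the nearest-codebook lookup at the heart of vector quantization, applied directly to toy per-frame features; a real VQ-VAE would first pass the motion through a learned encoder. The codebook values, the `<motion_k>` token format, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each frame feature to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def to_motion_text(indices: Sequence[int]) -> str:
    """Render code indices as special text tokens (hypothetical format)."""
    return "".join(f"<motion_{k}>" for k in indices)


# Toy 2-D codebook and features; real codebooks are learned and far larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.1, 0.8]]

indices = quantize(frames, codebook)
prompt = "Edit the motion: raise the left arm. " + to_motion_text(indices)
```

Once motion is expressed as tokens in the same string as the instruction, generation and editing can both be posed as text-to-text problems, which is the formulation the caption describes.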

**Figure 3.** Construction pipeline of the MotionFineEdit dataset. The pipeline consists of an atomic triplet stage, a quality control stage, and an enrichment and…

**Figure 4.** Examples of text-driven fine-grained human motion editing from the MotionFineEdit dataset. Top: Atomic edits targeting spatial or temporal dimensions. Bottom: Complex edit with chain-of-thought annotations, illustrating intermediate states across both dimensions. Colored text marks temporal intervals and body parts; circles indicate edited body parts, rectangles edited time segments. Motions are sampled at…

**Figure 5.** Motion statistics of MotionFineEdit. Left: Distribution of cosine similarity between source and target motions. MotionFineEdit pairs exhibit higher similarity than MotionFix, confirming more localized, fine-grained edits. Middle: Distribution of temporal length differences within pairs, demonstrating support for flexible duration changes. Right: Distribution of step counts in complex (CoT) edits, reflecti…
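The similarity statistic in the left panel can be made concrete with a short sketch. It assumes each motion is a list of equal-length per-frame feature vectors and that both motions have the same number of frames, so they can be flattened and compared directly; the feature space the paper actually uses, and how it handles pairs of different length, are not stated in this caption.

```python
import math
from typing import List, Sequence


def flatten(motion: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-frame feature vectors into one long vector."""
    return [value for frame in motion for value in frame]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy pair: the target differs from the source in a single coordinate.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.0, 0.0], [0.0, 0.5]]

similarity = cosine_similarity(flatten(source), flatten(target))
```

A pair that differs only in a small region stays close to 1, which is why a distribution shifted toward high similarity is read as evidence of localized edits.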

**Figure 6.** Qualitative results of text-driven fine-grained human motion editing on MotionFineEdit. Rows illustrate atomic (top: spatial, middle: temporal) and complex (bottom: combined) editing tasks. Motions are sampled every 0.5 seconds.

… model maintains a strong lead in snippet-level metrics (e.g., 41.81% vs. 7.16% R@1 on atomic edits), proving its edits are locally accurate. Its advantage is even more pronounced o…

**Figure 7.** Qualitative text-to-motion results. MotionMERGE generates motions that accurately match textual descriptions, including complex multi-action sequences…

**Figure 8.** Novel motion-language tasks. Examples include chain-of-thought motion generation (top), a zero-shot emergent reasoning capability, fine-grained captioning of partial sequences (middle), and motion localization via textual description (bottom).

… validation that our framework achieves not only semantic but also temporal precision. E. More Results of Novel Applications. Our RAGS pre-training enables zero-shot …

**Figure 10.** Impact of chain-of-thought on complex editing. Lower values are better. Ins.: directly instruction-tuned. Pre.: RAGS pre-trained.

… variant without temporal tasks across all fine-grained evaluations, with gains most pronounced in tasks requiring precise temporal control: fine-grained generation ((T+DT)2M, R-Top3: +3.92) and detailed captioning (M2DT, Bleu@4: +4.68). Improvements are smaller for atomic edi…

**Figure 11.** Our annotation platform.

… side by side, allowing annotators to directly compare motion realism and consistency. Video snapshots are also provided, with the start and end frames of the edited temporal interval explicitly highlighted to facilitate temporal comparison. Annotators are required to judge whether each pair is acceptable (i.e., good pair) or unacceptable (i.e., bad pair). A pair is considered acc…

**Figure 12.** Visualizations of the 200 most frequent words in our textual descriptions. From left to right, the figure presents the word clouds for basic corrective instructions and their rewritten counterparts for the atomic editing, followed by those for the complex editing.

Table XIII. Statistics of the MotionFineEdit textual data. Columns: Atomic Editing (basic, rewritten, all) and Complex Editing (basic, rewritten, all). First row: Total #texts 9…

**Figure 13.** Qualitative reasoning processes and results for complex fine-grained text-driven human motion editing. Dashed shapes (circles or rectangles) denote deletion operations in the spatial or temporal dimensions, while solid shapes (curly brackets or rectangles) indicate addition or repetition operations. The results show that MotionMERGE can precisely decompose complex fine-grained corrective instructions into…

Abstract

Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can't focus on motion's localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MotionMERGE brings a new fine-grained motion dataset and tries multi-level modeling in one LLM, but the abstract gives no numbers so the gains are hard to judge.

This paper's main contribution is a new large-scale dataset for fine-grained text-driven motion editing, along with a framework that models human motion at part and temporal levels inside a single LLM. They also introduce a pre-training approach that combines several alignment and reasoning objectives. The dataset, MotionFineEdit, has 837K atomic and 144K complex triplets with spatio-temporal corrective instructions and motion-grounded CoT annotations. That fills a gap in the literature for detailed supervision that previous work didn't have. The idea of explicitly handling localized patterns and temporal grounding in one model is a reasonable step toward more precise control in motion generation and editing.

What the work does well is identifying the granularity problem in current motion-language models and trying to address both the model architecture and the data side at once. Curating that much annotated data is no small task, and the zero-shot generalization claims suggest they tested on tasks beyond the main ones.

The soft spots are in the evidence presented. The abstract talks about extensive experiments and compelling results but doesn't give any quantitative metrics, baseline numbers, or details on how the pre-training impacts specific outputs. Without those, it's hard to see if the joint supervision for cross-granularity alignment, temporal grounding, localized alignment, coherency, and CoT actually produces synergy or runs into interference between objectives. The assumption that all these together will endow robust priors for precise control needs more testing, like ablations to show where the gains come from. The stress-test concern about negative transfer is plausible here because multi-objective training on continuous sequences can conflict.

This paper is for researchers in computer vision working on human motion synthesis, editing, and language-conditioned generation. A reader looking for new benchmarks or ideas on multi-granular modeling would find value in the dataset and framework description, even if the results need more backing.

It deserves a serious referee. The problem is relevant, the dataset is novel, and the approach is worth discussing in review. I would recommend sending it to peer review so the authors can strengthen the experimental section with numbers and comparisons.

Referee Report

2 major / 2 minor

Summary. The manuscript presents MotionMERGE, a unified multi-granular framework for human motion tasks including editing, reasoning, generation, and explanation. It explicitly models motion at part and temporal levels in a single LLM and introduces Reasoning-Aware Granularity-Synergy pre-training with joint supervision across cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought reasoning. A new dataset, MotionFineEdit, is curated with 837K atomic and 144K complex triplets featuring fine-grained spatio-temporal corrective instructions and CoT annotations. The authors report that extensive experiments validate the framework's ability for precise motion generation, understanding, and editing, along with strong zero-shot generalization to complex tasks.

Significance. If substantiated, this work could significantly advance the field by addressing the granularity gap in motion-language models, enabling finer control over body parts and temporal aspects for applications in animation and human-computer interaction. The curation of a specialized dataset and the multi-objective pre-training approach are notable contributions that could serve as benchmarks for future research in fine-grained motion modeling. The emphasis on motion-grounded reasoning adds a valuable dimension to LLM-based motion systems.

major comments (2)

[Abstract] The abstract asserts 'extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization' but supplies no quantitative metrics, error bars, baseline comparisons, or details on how the pre-training affects specific outputs, so the data-to-claim link cannot be verified.
[Reasoning-Aware Granularity-Synergy pre-training] The central claim requires that explicitly modeling part- and temporal-level motion inside one LLM plus the five-way joint supervision will produce robust priors for fine-grained control and motion-grounded reasoning, yet no ablation isolates whether gains come from the joint schedule versus the new dataset or base LLM capacity, leaving the synergy assumption untested at the level needed to underwrite the headline results on precise editing and zero-shot generalization.

minor comments (2)

The manuscript would benefit from additional details on the exact architecture for part- and temporal-level modeling within the LLM to support reproducibility.
[Dataset] Clarify how the MotionFineEdit dataset triplets were generated and validated for annotation quality, particularly the motion-grounded CoT annotations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating revisions where we agree changes are warranted to strengthen the presentation.

Referee: [Abstract] The abstract asserts 'extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization' but supplies no quantitative metrics, error bars, baseline comparisons, or details on how the pre-training affects specific outputs, so the data-to-claim link cannot be verified.

Authors: We agree that the abstract would be strengthened by including concrete quantitative highlights to make the claims more verifiable at a glance. In the revised version we will update the abstract to briefly report key metrics from our experiments, such as relative improvements in fine-grained editing accuracy and zero-shot generalization performance over strong baselines, while preserving the abstract's concise style.

Revision: yes

Referee: [Reasoning-Aware Granularity-Synergy pre-training] The central claim requires that explicitly modeling part- and temporal-level motion inside one LLM plus the five-way joint supervision will produce robust priors for fine-grained control and motion-grounded reasoning, yet no ablation isolates whether gains come from the joint schedule versus the new dataset or base LLM capacity, leaving the synergy assumption untested at the level needed to underwrite the headline results on precise editing and zero-shot generalization.

Authors: The referee correctly identifies that our current set of experiments, while demonstrating overall gains from the multi-granular modeling and pre-training objectives, does not include a fully isolated ablation that disentangles the joint supervision schedule from the contributions of the MotionFineEdit dataset or the base LLM capacity. We will add targeted ablation studies in the revision, training controlled variants that disable subsets of the five supervision signals while keeping the dataset and base model fixed, to more rigorously substantiate the synergy effects.

Revision: yes
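The ablation design promised in this response can be laid out as a simple grid. The sketch below enumerates training variants that disable up to a chosen number of the five supervision signals while everything else is held fixed; the objective identifiers and the `ablation_variants` helper are hypothetical names for illustration, and nothing here reflects the authors' actual training code.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# The five supervision signals named in the paper, as illustrative identifiers.
OBJECTIVES: Tuple[str, ...] = (
    "cross_granularity_alignment",
    "temporal_grounding",
    "localized_alignment",
    "motion_coherency",
    "motion_grounded_cot",
)


def ablation_variants(
    objectives: Sequence[str], max_disabled: int = 1
) -> List[Dict[str, Tuple[str, ...]]]:
    """List variants that switch off 0..max_disabled objectives.

    The first entry disables nothing and serves as the full-model reference.
    """
    variants = []
    for n_disabled in range(max_disabled + 1):
        for disabled in combinations(objectives, n_disabled):
            enabled = tuple(o for o in objectives if o not in disabled)
            variants.append({"disabled": disabled, "enabled": enabled})
    return variants


variants = ablation_variants(OBJECTIVES)
for variant in variants:
    label = ", ".join(variant["disabled"]) or "none (full model)"
    print(f"disable: {label}")
```

With `max_disabled=1` this gives the full model plus five leave-one-out runs; raising it to 2 adds the ten pairwise removals, which is where interference between objectives would be easiest to spot.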

Circularity Check

0 steps flagged

No significant circularity in MotionMERGE derivation chain

The paper introduces a new multi-granular LLM framework, a custom Reasoning-Aware Granularity-Synergy pre-training strategy with joint supervision objectives, and a newly curated MotionFineEdit dataset containing fine-grained triplets and CoT annotations. Performance claims on precise editing, understanding, and zero-shot generalization are presented as outcomes of extensive experiments on this benchmark rather than quantities derived by construction from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are shown reducing the central synergy assumption or control priors to tautological inputs; the design choices remain independent empirical hypotheses validated externally to the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the domain assumption that current LLMs lack focus on localized motion patterns and that the proposed joint supervision strategy plus new dataset will supply the missing fine-grained alignment and reasoning ability.

axioms (2)

Domain assumption: LLMs can acquire robust priors for precise localized motion control when motion is explicitly modeled at part and temporal levels inside a single model.
Invoked to justify endowing the model with fine-grained control capabilities.
Domain assumption: Joint supervision across cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded CoT reasoning produces cross-granularity synergy and explicit reasoning ability.
Central to the Reasoning-Aware Granularity-Synergy pre-training description.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 5 internal anchors

[1]

MotionLLM: Multimodal motion-language learning with large language models,

Q. Wu, Y . Zhao, Y . Wang, Y .-W. Tai, and C.-K. Tang, “MotionLLM: Multimodal motion-language learning with large language models,” arXiv preprint arXiv:2405.17013, 2024

work page arXiv 2024
[2]

Human motion generation: A survey,

W. Zhu, X. Ma, D. Ro, H. Ci, J. Zhang, J. Shi, F. Gao, Q. Tian, and Y . Wang, “Human motion generation: A survey,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, pp. 2430–2449, 2024

work page 2024
[3]

MotionGPT: Human motion as a foreign language,

B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen, “MotionGPT: Human motion as a foreign language,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 20 067–20 079, 2023

work page 2023
[4]

TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts,

C. Guo, X. Zuo, S. Wang, and L. Cheng, “TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts,” inEur. Conf. Comput. Vis., 2022, pp. 580–597

work page 2022
[5]

Human motion diffusion model,

G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” inInt. Conf. Learn. Rep- resent., 2023

work page 2023
[6]

Generating human motion from textual descriptions with discrete representations,

J. Zhang, Y . Zhang, X. Cun, Y . Zhang, H. Zhao, H. Lu, X. Shen, and Y . Shan, “Generating human motion from textual descriptions with discrete representations,” inIEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 14 730–14 740

work page 2023
[7]

Generating diverse and natural 3d human motions from text,

C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” inIEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 5152–5161

work page 2022
[8]

MoMask: Generative masked modeling of 3d human motions,

C. Guo, Y . Mu, M. G. Javed, S. Wang, and L. Cheng, “MoMask: Generative masked modeling of 3d human motions,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 1900–1910

work page 2024
[9]

MotionFix: Text-driven 3D human motion editing,

N. Athanasiou, A. Cseke, M. Diomataris, M. J. Black, and G. Varol, “MotionFix: Text-driven 3D human motion editing,” inSIGGRAPH Asia, 2024, pp. 1–11

work page 2024
[10]

FLAME: Free-form language-based motion synthesis & editing,

J. Kim, J. Kim, and S. Choi, “FLAME: Free-form language-based motion synthesis & editing,” inAAAI Conf. Artif. Intell., vol. 37, 2023, pp. 8255–8263

work page 2023
[11]

MotionCLIP: Exposing human motion generation to clip space,

G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “MotionCLIP: Exposing human motion generation to clip space,” in Eur. Conf. Comput. Vis., 2022, pp. 358–374

work page 2022
[12]

EDGE: Editable dance generation from music,

J. Tseng, R. Castellon, and K. Liu, “EDGE: Editable dance generation from music,” inIEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 448–458

work page 2023
[13]

MotionLab: Unified hu- man motion generation and editing via the motion-condition-motion paradigm,

Z. Guo, Z. Hu, D. W. Soh, and N. Zhao, “MotionLab: Unified hu- man motion generation and editing via the motion-condition-motion paradigm,” inInt. Conf. Comput. Vis., 2025, pp. 13 869–13 879

work page 2025
[14]

M3GPT: An advanced multimodal, multitask framework for motion comprehension and generation,

M. Luo, R. Hou, Z. Li, H. Chang, Z. Liu, Y . Wang, and S. Shan, “M3GPT: An advanced multimodal, multitask framework for motion comprehension and generation,” inAdv. Neural Inform. Process. Syst., vol. 37, 2024, pp. 28 051–28 077

work page 2024
[15]

MotionGPT: Finetuned LLMs are general- purpose motion generators,

Y . Zhang, D. Huang, B. Liu, S. Tang, Y . Lu, L. Chen, L. Bai, Q. Chu, N. Yu, and W. Ouyang, “MotionGPT: Finetuned LLMs are general- purpose motion generators,” inAAAI Conf. Artif. Intell., vol. 38, 2024, pp. 7368–7376

work page 2024
[16]

MG-MotionLLM: A unified framework for motion comprehension and generation across multiple granularities,

B. Wu, J. Xie, K. Shen, Z. Kong, J. Ren, R. Bai, R. Qu, and L. Shen, “MG-MotionLLM: A unified framework for motion comprehension and generation across multiple granularities,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 27 849–27 858

work page 2025
[17]

Dynamic motion blending for versatile motion editing,

N. Jiang, H. Li, Z. Yuan, Z. He, Y . Chen, T. Liu, Y . Zhu, and S. Huang, “Dynamic motion blending for versatile motion editing,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 22 735–22 745

work page 2025
[18]

Action-conditioned 3d human motion synthesis with transformer vae,

M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” inInt. Conf. Comput. Vis., 2021, pp. 10 985–10 995

work page 2021
[19]

MoDi: Unconditional motion synthesis from diverse data,

S. Raab, I. Leibovitch, P. Li, K. Aberman, O. Sorkine-Hornung, and D. Cohen-Or, “MoDi: Unconditional motion synthesis from diverse data,” inIEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 13 873– 13 883

work page 2023
[20]

DanceFormer: Music conditioned 3d dance generation with parametric motion transformer,

B. Li, Y . Zhao, S. Zhelun, and L. Sheng, “DanceFormer: Music conditioned 3d dance generation with parametric motion transformer,” inAAAI Conf. Artif. Intell., vol. 36, 2022, pp. 1272–1279

work page 2022
[21]

Lodge++: High-quality and long dance generation with robust choreography patterns,

R. Li, H. Zhang, Y . Zhang, Y . Zhang, Y . Zhang, J. Guo, Y . Zhang, X. Li, and Y . Liu, “Lodge++: High-quality and long dance generation with robust choreography patterns,”IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–15, 2025

work page 2025
[22]

Lagrangian motion fields for long-term motion generation,

Y . Yang, Z. Huang, C. Xu, and S. He, “Lagrangian motion fields for long-term motion generation,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, pp. 1171–1184, 2026

work page 2026
[23]

MNET++: Music-driven plural- istic dancing toward multiple dance genre synthesis,

J. Kim, B. Kwon, J. Kim, and S. Lee, “MNET++: Music-driven plural- istic dancing toward multiple dance genre synthesis,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 15 036– 15 050, 2023

work page 2023
[24]

Combo: Co-speech holistic 3d human motion generation and efficient customizable adaptation in harmony,

C. Xu, M. Sun, Z.-Q. Cheng, F. Wang, Y . Liu, B. Sun, R. Huang, and A. Hauptmann, “Combo: Co-speech holistic 3d human motion generation and efficient customizable adaptation in harmony,”IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–18, 2025

work page 2025
[25]

Audio2Gestures: Generating diverse gestures from audio,

J. Li, D. Kang, W. Pei, X. Zhe, Y . Zhang, L. Bao, and Z. He, “Audio2Gestures: Generating diverse gestures from audio,”IEEE Trans. Vis. Comput. Graph., vol. 30, pp. 4752–4766, 2023

work page 2023
[26]

From audio to photoreal embodiment: Synthesizing humans in conversations,

E. Ng, J. Romero, T. Bagautdinov, S. Bai, T. Darrell, A. Kanazawa, and A. Richard, “From audio to photoreal embodiment: Synthesizing humans in conversations,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 1001–1010

work page 2024
[27]

TEMOS: Generating diverse human motions from textual descriptions,

M. Petrovich, M. J. Black, and G. Varol, “TEMOS: Generating diverse human motions from textual descriptions,” inEur. Conf. Comput. Vis., 2022, pp. 480–497

work page 2022
[28]

DrawMotion: Generating 3d human motions by freehand drawing,

T. Wang, L. Jin, Z. Wu, Q. He, J. Chu, Y . Cheng, J. Xing, J. Zhao, S. Yan, and L. Wang, “DrawMotion: Generating 3d human motions by freehand drawing,”IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–17, 2026

work page 2026
[29]

Seamless human motion composition with blended positional encodings,

G. Barquero, S. Escalera, and C. Palmero, “Seamless human motion composition with blended positional encodings,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 457–469

work page 2024
[30]

CoMo: Controllable motion generation through language guided pose code editing,

Y . Huang, W. Wan, Y . Yang, C. Callison-Burch, M. Yatskar, and L. Liu, “CoMo: Controllable motion generation through language guided pose code editing,” inEur. Conf. Comput. Vis., 2024, p. 180–196

work page 2024
[31]

Multi-track timeline control for text-driven 3d human motion generation,

M. Petrovich, O. Litany, U. Iqbal, M. J. Black, G. Varol, X. Bin Peng, and D. Rempe, “Multi-track timeline control for text-driven 3d human motion generation,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 1911–1921

work page 2024
[32]

Fg-T2M: Fine- grained text-driven human motion generation via diffusion model,

Y . Wang, Z. Leng, F. W. Li, S.-C. Wu, and X. Liang, “Fg-T2M: Fine- grained text-driven human motion generation via diffusion model,” in Int. Conf. Comput. Vis., 2023, pp. 22 035–22 044

work page 2023
[33]

LaMP: Language-motion pretraining for motion generation, retrieval, and captioning,

Z. Li, W. Yuan, Y . He, L. Qiu, S. Zhu, X. Gu, W. Shen, Y . Dong, Z. Dong, and L. T. Yang, “LaMP: Language-motion pretraining for motion generation, retrieval, and captioning,” inInt. Conf. Learn. Represent., 2025

work page 2025
[34]

ScaMo: Exploring the scaling law in autoregressive motion generation model,

S. Lu, J. Wang, Z. Lu, L.-H. Chen, W. Dai, J. Dong, Z. Dou, B. Dai, and R. Zhang, “ScaMo: Exploring the scaling law in autoregressive motion generation model,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 27 872–27 882

work page 2025
[35]

The language of motion: Unifying verbal and IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 15 non-verbal language of 3d human motion,

C. Chen, J. Zhang, S. K. Lakshmikanth, Y . Fang, R. Shao, G. Wetzstein, L. Fei-Fei, and E. Adeli, “The language of motion: Unifying verbal and IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 15 non-verbal language of 3d human motion,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 6200–6211

work page 2025
[36]

ParCo: Part-coordinating text-to-motion synthesis,

Q. Zou, S. Yuan, S. Du, Y . Wang, C. Liu, Y . Xu, J. Chen, and X. Ji, “ParCo: Part-coordinating text-to-motion synthesis,” inEur. Conf. Comput. Vis., 2024, pp. 126–143

work page 2024
[37]

MotionDiffuse: Text-driven human motion generation with diffusion model,

M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “MotionDiffuse: Text-driven human motion generation with diffusion model,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, pp. 4115– 4128, 2024

work page 2024
[38]

Executing your commands via motion diffusion in latent space,

X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu, “Executing your commands via motion diffusion in latent space,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 18 000–18 010

work page 2023
[39]

AMD: autoregressive motion diffusion,

B. Han, H. Peng, M. Dong, Y . Ren, Y . Shen, and C. Xu, “AMD: autoregressive motion diffusion,” inAAAI Conf. Artif. Intell., 2024, pp. 2022–2030

work page 2024
[40]

CLoSD: Closing the loop between simulation and diffusion for multi-task character control,

G. Tevet, S. Raab, S. Cohan, D. Reda, Z. Luo, X. B. Peng, A. H. Bermano, and M. van de Panne, “CLoSD: Closing the loop between simulation and diffusion for multi-task character control,” inInt. Conf. Learn. Represent., 2025

[41] L. Xiao, S. Lu, H. Pi, K. Fan, L. Pan, Y. Zhou, Z. Feng, X. Zhou, S. Peng, and J. Wang, “MotionStreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space,” in Int. Conf. Comput. Vis., 2025, pp. 10 086–10 096.

[42] M. Plappert, C. Mandery, and T. Asfour, “The kit motion-language dataset,” Big Data, vol. 4, pp. 236–252, 2016.

[43] S. S. Kalakonda, S. Maheshwari, and R. K. Sarvadevabhatla, “Action-GPT: Leveraging large-scale language models for improved and generalized action generation,” in Int. Conf. Multimedia and Expo, 2023, pp. 31–36.

[44] Y. Wang, M. Li, J. Liu, Z. Leng, F. W. B. Li, Z. Zhang, and X. Liang, “Fg-T2M++: LLMs-augmented fine-grained text driven human motion generation,” Int. J. Comput. Vis., vol. 133, pp. 4277–4293, 2025.

[45] X. He, S. Huang, X. Zhan, C. Wen, and Y. Shan, “Semanticboost: Elevating motion generation with augmented textual cues,” arXiv preprint arXiv:2310.20323, 2023.

[46] P. J. Yazdian, R. Lagasse, H. Mohammadi, E. Liu, L. Cheng, and A. Lim, “MotionScript: Natural language descriptions for expressive 3d human motions,” in IEEE Int. Conf. Intell. Robots Syst., 2025, pp. 21 574–21 581.

[47] M. Zhang, H. Li, Z. Cai, J. Ren, L. Yang, and Z. Liu, “FineMoGen: Fine-grained spatio-temporal motion generation and editing,” in Adv. Neural Inform. Process. Syst., vol. 36, 2023, pp. 13 981–13 992.

[48] B. Wu, J. Xie, M. Ding, Z. Kong, J. Ren, R. Bai, R. Qu, and L. Shen, “FineMotion: A dataset and benchmark with both spatial and temporal annotation for fine-grained motion generation and editing,” in Int. Conf. Comput. Vis., 2025, pp. 13 837–13 846.

[49] S. Xia, C. Wang, J. Chai, and J. Hodgins, “Realtime style transfer for unlabeled heterogeneous human motion,” ACM Trans. Graph., vol. 34, pp. 1–10, 2015.

[50] K. Aberman, Y. Weng, D. Lischinski, D. Cohen-Or, and B. Chen, “Unpaired motion style transfer from video to animation,” ACM Trans. Graph., vol. 39, pp. 64:1–64:12, 2020.

[51] D.-K. Jang, S. Park, and S.-H. Lee, “Motion Puzzle: Arbitrary motion style transfer by body part,” ACM Trans. Graph., vol. 41, pp. 1–16, 2022.

[52] T. Tao, X. Zhan, Z. Chen, and M. van de Panne, “Style-ERD: Responsive and coherent online motion style transfer,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 6593–6603.

[53] S. Hong, C. Kim, S. Yoon, J. Nam, S. Cha, and J. Noh, “SALAD: Skeleton-aware latent diffusion for text-driven motion generation and editing,” in IEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 7158–7168.

[54] E. Pinyoanuntapong, M. U. Saleem, K. Karunratanakul, P. Wang, H. Xue, C. Chen, C. Guo, J. Cao, J. Ren, and S. Tulyakov, “MaskControl: Spatio-temporal control for masked motion synthesis,” in Int. Conf. Comput. Vis., 2025, pp. 9955–9965.

[55] P. Goel, K. Wang, C. K. Liu, and K. Fatahalian, “Iterative motion editing with natural language,” in SIGGRAPH, 2024, pp. 1–9.

[56] Z. Li, K. Cheng, A. Ghosh, U. Bhattacharya, L. Gui, and A. Bera, “SimMotionEdit: Text-based human motion editing with motion similarity prediction,” in IEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 27 827–27 837.

[57] H. Li, J. Huang, P. Jin, G. Song, Q. Wu, and J. Chen, “Weakly-supervised 3d spatial reasoning for text-based visual question answering,” IEEE Trans. Image Process., vol. 32, pp. 3367–3382, 2023.

[58] S. Xiong, Y. Yang, A. Payani, J. C. Kerce, and F. Fekri, “TEILP: Time prediction over knowledge graphs via logical reasoning,” in AAAI Conf. Artif. Intell., vol. 38, 2024, pp. 16 112–16 119.

[59] D. Zhang, Z.-Z. Li, M.-L. Zhang, J. Zhang, Z. Liu, Y. Yao, H. Xu, J. Zheng, X. Chen, Y. Zhang, F. Yin, J. Dong, Z. Guo, L. Song, and C.-L. Liu, “From system 1 to system 2: A survey of reasoning large language models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, pp. 3335–3354, 2026.

[60] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Adv. Neural Inform. Process. Syst., vol. 35, pp. 24 824–24 837, 2022.

[61] M. Ding, J. Zhang, W. Wang, H. Zhong, X. Wang, X. Lyu, W. Chen, and L. Shen, “EAGLE: expert-guided self-enhancement for preference alignment in pathology large vision-language model,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, vol. 1, 2025, pp. 14 603–14 619.

[62] K. Xiang, Z. Liu, T. J. Zhang, Y. Huang, Y. Nie, K. Cai, Y. Yin, R. Huang, H. Li, Y. Zeng, Y.-J. Yuan, J. Han, L. Hong, H. Xu, and X. Liang, “AtomThink: Multimodal slow thinking with atomic step reasoning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, pp. 5725–5741, 2026.

[63] B. Lin, Y. Nie, K. L. Zai, Z. Wei, M. Han, R. Xu, M. Niu, J. Han, H. Zhang, L. Lin, B. Chen, C. Lu, and X. Liang, “EvolveNav: Empowering llm-based vision-language navigation via self-improving embodied reasoning,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–15, 2026.

[64] M. Endo, J. Hsu, J. Li, and J. Wu, “Motion question answering via modular motion programs,” in Int. Conf. Mach. Learn., 2023, pp. 9312–9328.

[65] C. Li, C. Sugandhika, Y. K. Ee, E. Peh, H. Zhang, H. Yang, D. Rajan, and B. Fernando, “IMoRe: Implicit program-guided reasoning for human motion q&a,” in Int. Conf. Comput. Vis., 2025, pp. 12 987–12 996.

[66] OpenAI, “ChatGPT (Mar 14 version) [Large Language Model],” https://chat.openai.com/chat/, 2023.

[67] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

[68] B. Jiang, X. Chen, C. Zhang, F. Yin, Z. Li, G. Yu, and J. Fan, “MotionChain: Conversational motion controllers via multimodal prompts,” in Eur. Conf. Comput. Vis., 2024, pp. 54–74.

[69] Z. Zhou, Y. Wan, and B. Wang, “AvatarGPT: All-in-one framework for motion understanding planning generation and beyond,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 1357–1366.

[70] J. Park, S. Choi, and S. Yun, “A unified framework for motion reasoning and generation in human interaction,” in Int. Conf. Comput. Vis., 2025, pp. 10 698–10 707.

[71] Q. Wu, Y. Zhao, Y. Wang, X. Liu, Y. Tai, and C. Tang, “Motion-Agent: A conversational framework for human motion generation with LLMs,” in Int. Conf. Learn. Represent., 2025.

[72] G. Han, S. Huang, M. Gong, and J. Tang, “HuTuMotion: Human-tuned navigation of latent motion diffusion models with minimal feedback,” in AAAI Conf. Artif. Intell., vol. 38, 2024, pp. 2031–2039.

[73] H. Wang, W. Zhu, L. Miao, Y. Xu, F. Gao, Q. Tian, and Y. Wang, “Aligning human motion generation with human perceptions,” in Int. Conf. Learn. Represent., 2025.

[74] Y. Mao, X. Liu, W. Zhou, Z. Lu, and H. Li, “Learning generalizable human motion generator with reinforcement learning,” arXiv preprint arXiv:2405.15541, 2024.

[75] X. Liu, Y. Mao, W. Zhou, and H. Li, “Motionrl: Align text-to-motion generation to human preferences with multi-reward reinforcement learning,” arXiv preprint arXiv:2410.06513, 2024.

[76] R. Ouyang, H. Li, Z. Zhang, X. Wang, Z. Zhu, G. Huang, and X. Wang, “Motion-R1: Chain-of-thought reasoning and reinforcement learning for human motion generation,” in Int. Conf. Learn. Represent., 2026.

[77] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.

[78] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020.

[79] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Adv. Neural Inform. Process. Syst., vol. 36, pp. 34 892–34 916, 2023.

[80] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Int. Conf. Mach. Learn., vol. 235, 2024, pp. 12 606–12 633.


[34] [34]

ScaMo: Exploring the scaling law in autoregressive motion generation model,

S. Lu, J. Wang, Z. Lu, L.-H. Chen, W. Dai, J. Dong, Z. Dou, B. Dai, and R. Zhang, “ScaMo: Exploring the scaling law in autoregressive motion generation model,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 27 872–27 882

work page 2025

[35] [35]

The language of motion: Unifying verbal and IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 15 non-verbal language of 3d human motion,

C. Chen, J. Zhang, S. K. Lakshmikanth, Y . Fang, R. Shao, G. Wetzstein, L. Fei-Fei, and E. Adeli, “The language of motion: Unifying verbal and IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 15 non-verbal language of 3d human motion,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 6200–6211

work page 2025

[36] [36]

ParCo: Part-coordinating text-to-motion synthesis,

Q. Zou, S. Yuan, S. Du, Y . Wang, C. Liu, Y . Xu, J. Chen, and X. Ji, “ParCo: Part-coordinating text-to-motion synthesis,” inEur. Conf. Comput. Vis., 2024, pp. 126–143

work page 2024

[37] [37]

MotionDiffuse: Text-driven human motion generation with diffusion model,

M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “MotionDiffuse: Text-driven human motion generation with diffusion model,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, pp. 4115– 4128, 2024

work page 2024

[38] [38]

Executing your commands via motion diffusion in latent space,

X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu, “Executing your commands via motion diffusion in latent space,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 18 000–18 010

work page 2023

[39] [39]

AMD: autoregressive motion diffusion,

B. Han, H. Peng, M. Dong, Y . Ren, Y . Shen, and C. Xu, “AMD: autoregressive motion diffusion,” inAAAI Conf. Artif. Intell., 2024, pp. 2022–2030

work page 2024

[40] [40]

CLoSD: Closing the loop between simulation and diffusion for multi-task character control,

G. Tevet, S. Raab, S. Cohan, D. Reda, Z. Luo, X. B. Peng, A. H. Bermano, and M. van de Panne, “CLoSD: Closing the loop between simulation and diffusion for multi-task character control,” inInt. Conf. Learn. Represent., 2025

work page 2025

[41] [41]

MotionStreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space,

L. Xiao, S. Lu, H. Pi, K. Fan, L. Pan, Y . Zhou, Z. Feng, X. Zhou, S. Peng, and J. Wang, “MotionStreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space,” inInt. Conf. Comput. Vis., 2025, pp. 10 086–10 096

work page 2025

[42] [42]

The kit motion-language dataset,

M. Plappert, C. Mandery, and T. Asfour, “The kit motion-language dataset,”Big Data, vol. 4, pp. 236–252, 2016

work page 2016

[43] [43]

Action- GPT: Leveraging large-scale language models for improved and gener- alized action generation,

S. S. Kalakonda, S. Maheshwari, and R. K. Sarvadevabhatla, “Action- GPT: Leveraging large-scale language models for improved and gener- alized action generation,” inInt. Conf. Multimedia and Expo, 2023, pp. 31–36

work page 2023

[44] [44]

Fg-T2M++: LLMs-augmented fine-grained text driven human motion generation,

Y . Wang, M. Li, J. Liu, Z. Leng, F. W. B. Li, Z. Zhang, and X. Liang, “Fg-T2M++: LLMs-augmented fine-grained text driven human motion generation,”Int. J. Comput. Vis., vol. 133, pp. 4277–4293, 2025

work page 2025

[45] [45]

Semanticboost: Ele- vating motion generation with augmented textual cues,

X. He, S. Huang, X. Zhan, C. Wen, and Y . Shan, “Semanticboost: Ele- vating motion generation with augmented textual cues,”arXiv preprint arXiv:2310.20323, 2023

work page arXiv 2023

[46] [46]

MotionScript: Natural language descriptions for expressive 3d human motions,

P. J. Yazdian, R. Lagasse, H. Mohammadi, E. Liu, L. Cheng, and A. Lim, “MotionScript: Natural language descriptions for expressive 3d human motions,” inIEEE Int. Conf. Intell. Robots Syst., 2025, pp. 21 574– 21 581

work page 2025

[47] [47]

FineMoGen: Fine-grained spatio-temporal motion generation and editing,

M. Zhang, H. Li, Z. Cai, J. Ren, L. Yang, and Z. Liu, “FineMoGen: Fine-grained spatio-temporal motion generation and editing,” inAdv. Neural Inform. Process. Syst., vol. 36, 2023, pp. 13 981–13 992

work page 2023

[48] [48]

FineMotion: A dataset and benchmark with both spatial and temporal annotation for fine-grained motion generation and editing,

B. Wu, J. Xie, M. Ding, Z. Kong, J. Ren, R. Bai, R. Qu, and L. Shen, “FineMotion: A dataset and benchmark with both spatial and temporal annotation for fine-grained motion generation and editing,” inInt. Conf. Comput. Vis., 2025, pp. 13 837–13 846

work page 2025

[49] [49]

Realtime style transfer for unlabeled heterogeneous human motion,

S. Xia, C. Wang, J. Chai, and J. Hodgins, “Realtime style transfer for unlabeled heterogeneous human motion,”ACM Trans. Graph., vol. 34, pp. 1–10, 2015

work page 2015

[50] [50]

Unpaired motion style transfer from video to animation,

K. Aberman, Y . Weng, D. Lischinski, D. Cohen-Or, and B. Chen, “Unpaired motion style transfer from video to animation,”ACM Trans. Graph., vol. 39, pp. 64:1–64:12, 2020

work page 2020

[51] [51]

Motion Puzzle: Arbitrary motion style transfer by body part,

D.-K. Jang, S. Park, and S.-H. Lee, “Motion Puzzle: Arbitrary motion style transfer by body part,”ACM Trans. Graph., vol. 41, pp. 1–16, 2022

work page 2022

[52] [52]

Style-ERD: Responsive and coherent online motion style transfer,

T. Tao, X. Zhan, Z. Chen, and M. van de Panne, “Style-ERD: Responsive and coherent online motion style transfer,” inIEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 6593–6603

work page 2022

[53] [53]

SALAD: Skeleton-aware latent diffusion for text-driven motion generation and editing,

S. Hong, C. Kim, S. Yoon, J. Nam, S. Cha, and J. Noh, “SALAD: Skeleton-aware latent diffusion for text-driven motion generation and editing,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 7158– 7168

work page 2025

[54] [54]

MaskCon- trol: Spatio-temporal control for masked motion synthesis,

E. Pinyoanuntapong, M. U. Saleem, K. Karunratanakul, P. Wang, H. Xue, C. Chen, C. Guo, J. Cao, J. Ren, and S. Tulyakov, “MaskCon- trol: Spatio-temporal control for masked motion synthesis,” inInt. Conf. Comput. Vis., 2025, pp. 9955–9965

work page 2025

[55] [55]

Iterative motion editing with natural language,

P. Goel, K. Wang, C. K. Liu, and K. Fatahalian, “Iterative motion editing with natural language,” inSIGGRAPH, 2024, pp. 1–9

work page 2024

[56] [56]

SimMotionEdit: Text-based human motion editing with motion simi- larity prediction,

Z. Li, K. Cheng, A. Ghosh, U. Bhattacharya, L. Gui, and A. Bera, “SimMotionEdit: Text-based human motion editing with motion simi- larity prediction,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 27 827–27 837

work page 2025

[57] [57]

Weakly-supervised 3d spatial reasoning for text-based visual question answering,

H. Li, J. Huang, P. Jin, G. Song, Q. Wu, and J. Chen, “Weakly-supervised 3d spatial reasoning for text-based visual question answering,”IEEE Trans. Image Process., vol. 32, pp. 3367–3382, 2023

work page 2023

[58] [58]

TEILP: Time prediction over knowledge graphs via logical reasoning,

S. Xiong, Y . Yang, A. Payani, J. C. Kerce, and F. Fekri, “TEILP: Time prediction over knowledge graphs via logical reasoning,” inAAAI Conf. Artif. Intell., vol. 38, 2024, pp. 16 112–16 119

work page 2024

[59] [59]

From system 1 to system 2: A survey of reasoning large language models,

D. Zhang, Z.-Z. Li, M.-L. Zhang, J. Zhang, Z. Liu, Y . Yao, H. Xu, J. Zheng, X. Chen, Y . Zhang, F. Yin, J. Dong, Z. Guo, L. Song, and C.-L. Liu, “From system 1 to system 2: A survey of reasoning large language models,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, pp. 3335–3354, 2026

work page 2026

[60] [60]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Adv. Neural Inform. Process. Syst., vol. 35, pp. 24 824–24 837, 2022

work page 2022

[61] [61]

EAGLE: expert-guided self-enhancement for preference alignment in pathology large vision-language model,

M. Ding, J. Zhang, W. Wang, H. Zhong, X. Wang, X. Lyu, W. Chen, and L. Shen, “EAGLE: expert-guided self-enhancement for preference alignment in pathology large vision-language model,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, vol. 1, 2025, pp. 14 603–14 619

work page 2025

[62] [62]

AtomThink: Multimodal slow thinking with atomic step reasoning,

K. Xiang, Z. Liu, T. J. Zhang, Y . Huang, Y . Nie, K. Cai, Y . Yin, R. Huang, H. Li, Y . Zeng, Y .-J. Yuan, J. Han, L. Hong, H. Xu, and X. Liang, “AtomThink: Multimodal slow thinking with atomic step reasoning,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, pp. 5725– 5741, 2026

work page 2026

[63] [63]

EvolveNav: Empow- ering llm-based vision-language navigation via self-improving embodied reasoning,

B. Lin, Y . Nie, K. L. Zai, Z. Wei, M. Han, R. Xu, M. Niu, J. Han, H. Zhang, L. Lin, B. Chen, C. Lu, and X. Liang, “EvolveNav: Empow- ering llm-based vision-language navigation via self-improving embodied reasoning,”IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–15, 2026

work page 2026

[64] [64]

Motion question answering via modular motion programs,

M. Endo, J. Hsu, J. Li, and J. Wu, “Motion question answering via modular motion programs,” inInt. Conf. Mach. Learn., 2023, pp. 9312– 9328

work page 2023

[65] [65]

IMoRe: Implicit program-guided reasoning for human motion q&a,

C. Li, C. Sugandhika, Y . K. Ee, E. Peh, H. Zhang, H. Yang, D. Rajan, and B. Fernando, “IMoRe: Implicit program-guided reasoning for human motion q&a,” inInt. Conf. Comput. Vis., 2025, pp. 12 987–12 996

work page 2025

[66] [66]

ChatGPT (Mar 14 version) [Large Language Model],

OpenAI, “ChatGPT (Mar 14 version) [Large Language Model],” https: //chat.openai.com/chat/, 2023

work page 2023

[67] [67]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “GPT-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[68] [68]

Mo- tionChain: Conversational motion controllers via multimodal prompts,

B. Jiang, X. Chen, C. Zhang, F. Yin, Z. Li, G. Yu, and J. Fan, “Mo- tionChain: Conversational motion controllers via multimodal prompts,” inEur. Conf. Comput. Vis., 2024, pp. 54–74

[69] Z. Zhou, Y. Wan, and B. Wang, “AvatarGPT: All-in-one framework for motion understanding planning generation and beyond,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 1357–1366

[70] J. Park, S. Choi, and S. Yun, “A unified framework for motion reasoning and generation in human interaction,” in Int. Conf. Comput. Vis., 2025, pp. 10 698–10 707

[71] Q. Wu, Y. Zhao, Y. Wang, X. Liu, Y. Tai, and C. Tang, “Motion-Agent: A conversational framework for human motion generation with LLMs,” in Int. Conf. Learn. Represent., 2025

[72] G. Han, S. Huang, M. Gong, and J. Tang, “HuTuMotion: Human-tuned navigation of latent motion diffusion models with minimal feedback,” in AAAI Conf. Artif. Intell., vol. 38, 2024, pp. 2031–2039

[73] H. Wang, W. Zhu, L. Miao, Y. Xu, F. Gao, Q. Tian, and Y. Wang, “Aligning human motion generation with human perceptions,” in Int. Conf. Learn. Represent., 2025

[74] Y. Mao, X. Liu, W. Zhou, Z. Lu, and H. Li, “Learning generalizable human motion generator with reinforcement learning,” arXiv preprint arXiv:2405.15541, 2024

[75] X. Liu, Y. Mao, W. Zhou, and H. Li, “MotionRL: Align text-to-motion generation to human preferences with multi-reward reinforcement learning,” arXiv preprint arXiv:2410.06513, 2024

[76] R. Ouyang, H. Li, Z. Zhang, X. Wang, Z. Zhu, G. Huang, and X. Wang, “Motion-R1: Chain-of-thought reasoning and reinforcement learning for human motion generation,” in Int. Conf. Learn. Represent., 2026

[77] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

[78] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020

[79] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Adv. Neural Inform. Process. Syst., vol. 36, pp. 34 892–34 916, 2023

[80] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Int. Conf. Mach. Learn., vol. 235, 2024, pp. 12 606–12 633
