
arxiv: 2605.18956 · v1 · pith:M5XYYPEJ · submitted 2026-05-18 · 💻 cs.CV

MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

Pith reviewed 2026-05-20 10:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords human motion · motion editing · fine-grained control · language model · motion reasoning · pre-training · chain-of-thought · dataset

The pith

MotionMERGE lets a single language model control and reason about human motions at the level of specific body parts and time steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current motion-language models stay coarse and cannot isolate changes to one limb or one moment, which blocks detailed animation and interaction work. The paper claims that explicitly modeling motion at part and temporal levels inside one LLM, then training with joint supervision on alignment, grounding, coherency, and motion-grounded chain-of-thought reasoning, gives the model the priors needed for precise control. The authors back this with a new large dataset of fine-grained corrective instructions and reasoning traces. If the claim holds, text instructions can now drive localized edits without disturbing the rest of the motion sequence.

Core claim

MotionMERGE bridges the granularity gap by explicitly modeling motion at part and temporal levels within a single LLM and applying Reasoning-Aware Granularity-Synergy (RAGS) pre-training. This pre-training supplies joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought reasoning. The work also releases the MotionFineEdit dataset of 837K atomic and 144K complex triplets that carry fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations. Experiments show the resulting model produces more precise generation, understanding, and editing while generalizing zero-shot to other complex motion tasks.

What carries the argument

Reasoning-Aware Granularity-Synergy pre-training that jointly supervises cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought reasoning.
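
One way to picture joint supervision over the five signals is as a task mixture: each training example is assigned one of the five objectives, and the model sees all of them interleaved. The sketch below illustrates only that idea. The uniform weights, the task identifiers, and the function names are assumptions made for illustration; the source does not describe the actual sampling schedule.

```python
import random

# The five supervision signals named above (identifiers are illustrative).
TASKS = [
    "cross_granularity_alignment",
    "temporal_grounding",
    "localized_alignment",
    "motion_coherency",
    "motion_grounded_cot",
]


def sample_task_mixture(num_examples, weights=None, seed=0):
    """Draw one task label per training example.

    Uniform weights are an assumption; the source states no schedule.
    """
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * len(TASKS)
    return rng.choices(TASKS, weights=weights, k=num_examples)


def task_counts(labels):
    """Count how many examples were assigned to each task."""
    counts = {task: 0 for task in TASKS}
    for label in labels:
        counts[label] += 1
    return counts


mixture = sample_task_mixture(1000)
counts = task_counts(mixture)
```

Passing a non-uniform `weights` list would let one objective, for example the chain-of-thought task, be over- or under-sampled relative to the others.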

If this is right

  • The model performs more precise motion generation, understanding, and editing at fine granularity.
  • It exhibits compelling zero-shot generalization to other complex motion tasks.
  • A new benchmark is created for fine-grained text-driven motion editing and motion-grounded reasoning.
  • The model acquires fine-grained motion-language alignment and explicit reasoning ability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Animators could issue natural-language instructions that change only a chosen limb or moment without rewriting the full sequence.
  • The same multi-level structure might support interactive editing loops where a user refines a motion step by step.
  • Robotics pipelines that plan human-like actions could adopt similar part-and-time supervision to improve safety and precision.

Load-bearing premise

The assumption that explicitly modeling motion at part and temporal levels inside one LLM plus joint supervision across granularities and reasoning tasks will create robust priors for precise localized control.

What would settle it

A controlled test set of instructions that ask the model to edit only one named body part or time interval; if the output motion changes other parts or times at rates comparable to coarse baselines, the fine-grained claim fails.
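
The test described above can be made concrete with a simple leakage score: the share of total motion change that lands outside the body part and time interval named in the instruction. The following is a minimal sketch under stated assumptions, not the paper's evaluation protocol. It assumes source and edited motions have the same number of frames, are stored as joint positions of shape (frames, joints, 3), and that the targeted joints and frames are given as boolean masks; the function name `edit_leakage` is hypothetical.

```python
import numpy as np


def edit_leakage(source, edited, part_mask, time_mask):
    """Fraction of total squared change that falls outside the targeted region.

    source, edited: arrays of shape (frames, joints, 3) with equal length.
    part_mask: boolean array (joints,), True for joints the instruction targets.
    time_mask: boolean array (frames,), True for frames the instruction targets.
    """
    source = np.asarray(source, dtype=float)
    edited = np.asarray(edited, dtype=float)
    part_mask = np.asarray(part_mask, dtype=bool)
    time_mask = np.asarray(time_mask, dtype=bool)

    # Squared displacement per frame and joint: shape (frames, joints).
    change = np.sum((edited - source) ** 2, axis=-1)
    # True only where both the frame and the joint are targeted.
    target = time_mask[:, None] & part_mask[None, :]

    total = change.sum()
    if total == 0.0:
        return 0.0  # nothing changed, so nothing leaked
    return float(change[~target].sum() / total)


# Toy example: 4 frames, 3 joints, edit aimed at joint 0 during frames 1-2.
source = np.zeros((4, 3, 3))
part_mask = np.array([True, False, False])
time_mask = np.array([False, True, True, False])

localized = source.copy()
localized[1:3, 0] += 1.0      # change confined to the targeted region
global_shift = source + 1.0   # change spread over every joint and frame

localized_leak = edit_leakage(source, localized, part_mask, time_mask)
global_leak = edit_leakage(source, global_shift, part_mask, time_mask)
```

A fine-grained editor should score close to zero on such a measure, while a coarse baseline that regenerates the whole sequence would score close to the fraction of the motion lying outside the target. Edits that change sequence duration would need an alignment step first, which this sketch omits.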

Figures

Figures reproduced from arXiv: 2605.18956 by Bizhu Wu, Jianfeng Ren, Jinheng Xie, Linlin Shen, Rong Qu, Ruibin Bai, Wenting Chen, Zhe Kong.

Figure 1: MotionMERGE unifies diverse motion-related tasks across granulari…
Figure 2: Overview of MotionMERGE. Motions are converted into special text, allowing all tasks to be formulated as conditional text generation. The framework comprises a motion VQ-VAE that transforms continuous motion into discrete tokens, and a T5-based language model that processes interleaved text and motion tokens. It explicitly handles diverse motion-language tasks (e.g., generation, editing) at both global and…
Figure 3: Construction pipeline of the MotionFineEdit dataset. The pipeline consists of an atomic triplet stage, a quality control stage, and an enrichment and…
Figure 4: Examples of text-driven fine-grained human motion editing from the MotionFineEdit dataset. Top: Atomic edits targeting spatial or temporal dimensions. Bottom: Complex edit with chain-of-thought annotations, illustrating intermediate states across both dimensions. Colored text marks temporal intervals and body parts; circles indicate edited body parts, rectangles edited time segments. Motions are sampled at…
Figure 5: Motion statistics of MotionFineEdit. Left: Distribution of cosine similarity between source and target motions. MotionFineEdit pairs exhibit higher similarity than MotionFix, confirming more localized, fine-grained edits. Middle: Distribution of temporal length differences within pairs, demonstrating support for flexible duration changes. Right: Distribution of step counts in complex (CoT) edits, reflecti…
Figure 6: Qualitative results of text-driven fine-grained human motion editing on MotionFineEdit. Rows illustrate atomic (top: spatial, middle: temporal) and complex (bottom: combined) editing tasks. Motions are sampled every 0.5 seconds.
… model maintains a strong lead in snippet-level metrics (e.g., 41.81% vs. 7.16% R@1 on atomic edits), proving its edits are locally accurate. Its advantage is even more pronounced o…
Figure 7: Qualitative text-to-motion results. MotionMERGE generates motions that accurately match textual descriptions, including complex multi-action sequences
Figure 8: Novel motion-language tasks. Examples include chain-of-thought motion generation (top), a zero-shot emergent reasoning capability, fine-grained captioning of partial sequences (middle), and motion localization via textual description (bottom).
… validation that our framework achieves not only semantic but also temporal precision. E. More Results of Novel Applications. Our RAGS pre-training enables zero-shot …
Figure 10: Impact of chain-of-thought on complex editing. Lower values are better. Ins.: directly instruction-tuned. Pre.: RAGS pre-trained.
… variant without temporal tasks across all fine-grained evaluations, with gains most pronounced in tasks requiring precise temporal control: fine-grained generation ((T+DT)2M, R-Top3: +3.92) and detailed captioning (M2DT, Bleu@4: +4.68). Improvements are smaller for atomic edi…
Figure 11: Our annotation platform.
… side by side, allowing annotators to directly compare motion realism and consistency. Video snapshots are also provided, with the start and end frames of the edited temporal interval explicitly highlighted to facilitate temporal comparison. Annotators are required to judge whether each pair is acceptable (i.e., good pair) or unacceptable (i.e., bad pair). A pair is considered acc…
Figure 12: Visualizations of the 200 most frequent words in our textual descriptions. From left to right, the figure presents the word clouds for basic corrective instructions and their rewritten counterparts for the atomic editing, followed by those for the complex editing.
… Table XIII: Statistics of the MotionFineEdit textual data. Columns: Atomic Editing (basic, rewritten, all), Complex Editing (basic, rewritten, all). First row: Total #texts 9…
Figure 13: Qualitative reasoning processes and results for complex fine-grained text-driven human motion editing. Dashed shapes (circles or rectangles) denote deletion operations in the spatial or temporal dimensions, while solid shapes (curly brackets or rectangles) indicate addition or repetition operations. The results show that MotionMERGE can precisely decompose complex fine-grained corrective instructions into…
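
The Figure 2 caption describes a motion VQ-VAE that turns continuous motion into discrete tokens, which are then written out as special text. The core quantization step can be sketched as a nearest-neighbor lookup in a codebook. This is a toy illustration under assumptions, not the paper's model: the learned encoder and decoder are left out, the codebook values are made up, and the `<motion_i>` token format is hypothetical.

```python
import numpy as np


def quantize(features, codebook):
    """Map each frame feature to the index of its nearest codebook entry.

    features: array of shape (frames, dim).
    codebook: array of shape (codes, dim).
    """
    # Squared Euclidean distance from every frame to every code: (frames, codes).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def to_motion_text(indices):
    """Render code indices as special text tokens (hypothetical token format)."""
    return "".join(f"<motion_{int(i)}>" for i in indices)


# Toy codebook with three 2-D entries and three frame features.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
features = np.array([[0.1, -0.1], [1.9, 2.2], [0.9, 1.2]])

indices = quantize(features, codebook)
motion_text = to_motion_text(indices)
```

Once motion is expressed as a token string like this, it can be interleaved with ordinary text so that generation, editing, and captioning all become conditional text generation, as the caption states.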
Original abstract

Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can't focus on motion's localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents MotionMERGE, a unified multi-granular framework for human motion tasks including editing, reasoning, generation, and explanation. It explicitly models motion at part and temporal levels in a single LLM and introduces Reasoning-Aware Granularity-Synergy pre-training with joint supervision across cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought reasoning. A new dataset, MotionFineEdit, is curated with 837K atomic and 144K complex triplets featuring fine-grained spatio-temporal corrective instructions and CoT annotations. The authors report that extensive experiments validate the framework's ability for precise motion generation, understanding, and editing, along with strong zero-shot generalization to complex tasks.

Significance. If substantiated, this work could significantly advance the field by addressing the granularity gap in motion-language models, enabling finer control over body parts and temporal aspects for applications in animation and human-computer interaction. The curation of a specialized dataset and the multi-objective pre-training approach are notable contributions that could serve as benchmarks for future research in fine-grained motion modeling. The emphasis on motion-grounded reasoning adds a valuable dimension to LLM-based motion systems.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts 'extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization' but supplies no quantitative metrics, error bars, baseline comparisons, or details on how the pre-training affects specific outputs, so the data-to-claim link cannot be verified.
  2. [Reasoning-Aware Granularity-Synergy pre-training] Reasoning-Aware Granularity-Synergy pre-training description: The central claim requires that explicitly modeling part- and temporal-level motion inside one LLM plus the five-way joint supervision will produce robust priors for fine-grained control and motion-grounded reasoning, yet no ablation isolates whether gains come from the joint schedule versus the new dataset or base LLM capacity, leaving the synergy assumption untested at the level needed to underwrite the headline results on precise editing and zero-shot generalization.
minor comments (2)
  1. The manuscript would benefit from additional details on the exact architecture for part- and temporal-level modeling within the LLM to support reproducibility.
  2. [Dataset] Clarify how the MotionFineEdit dataset triplets were generated and validated for annotation quality, particularly the motion-grounded CoT annotations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating revisions where we agree changes are warranted to strengthen the presentation.

  1. Referee: [Abstract] Abstract: The abstract asserts 'extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization' but supplies no quantitative metrics, error bars, baseline comparisons, or details on how the pre-training affects specific outputs, so the data-to-claim link cannot be verified.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative highlights to make the claims more verifiable at a glance. In the revised version we will update the abstract to briefly report key metrics from our experiments, such as relative improvements in fine-grained editing accuracy and zero-shot generalization performance over strong baselines, while preserving the abstract's concise style. revision: yes

  2. Referee: [Reasoning-Aware Granularity-Synergy pre-training] Reasoning-Aware Granularity-Synergy pre-training description: The central claim requires that explicitly modeling part- and temporal-level motion inside one LLM plus the five-way joint supervision will produce robust priors for fine-grained control and motion-grounded reasoning, yet no ablation isolates whether gains come from the joint schedule versus the new dataset or base LLM capacity, leaving the synergy assumption untested at the level needed to underwrite the headline results on precise editing and zero-shot generalization.

    Authors: The referee correctly identifies that our current set of experiments, while demonstrating overall gains from the multi-granular modeling and pre-training objectives, does not include a fully isolated ablation that disentangles the joint supervision schedule from the contributions of the MotionFineEdit dataset or the base LLM capacity. We will add targeted ablation studies in the revision, training controlled variants that disable subsets of the five supervision signals while keeping the dataset and base model fixed, to more rigorously substantiate the synergy effects. revision: yes
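
The controlled variants promised in the second response can be enumerated mechanically. The sketch below lists ablation configurations that disable subsets of the five supervision signals while everything else stays fixed. It is an illustration of the experimental design only: the signal identifiers and the `ablation_variants` helper are assumptions, and the source does not say which subsets the authors will actually train.

```python
from itertools import combinations

# The five supervision signals named in the report (identifiers are illustrative).
SIGNALS = (
    "cross_granularity_alignment",
    "temporal_grounding",
    "localized_alignment",
    "motion_coherency",
    "motion_grounded_cot",
)


def ablation_variants(signals=SIGNALS, max_disabled=1):
    """List training variants that disable up to `max_disabled` signals.

    The first variant is the full model with nothing disabled.
    """
    variants = [{"disabled": (), "enabled": tuple(signals)}]
    for k in range(1, max_disabled + 1):
        for disabled in combinations(signals, k):
            enabled = tuple(s for s in signals if s not in disabled)
            variants.append({"disabled": disabled, "enabled": enabled})
    return variants


leave_one_out = ablation_variants(max_disabled=1)
up_to_pairs = ablation_variants(max_disabled=2)
```

Leave-one-out gives six runs (the full model plus five single-signal removals). Allowing pairs raises the count to sixteen, which is the usual trade-off between coverage of interactions and training cost.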

Circularity Check

0 steps flagged

No significant circularity in MotionMERGE derivation chain

The paper introduces a new multi-granular LLM framework, a custom Reasoning-Aware Granularity-Synergy pre-training strategy with joint supervision objectives, and a newly curated MotionFineEdit dataset containing fine-grained triplets and CoT annotations. Performance claims on precise editing, understanding, and zero-shot generalization are presented as outcomes of extensive experiments on this benchmark rather than quantities derived by construction from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are shown reducing the central synergy assumption or control priors to tautological inputs; the design choices remain independent empirical hypotheses that the reported experiments test rather than presuppose.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the domain assumption that current LLMs lack focus on localized motion patterns and that the proposed joint supervision strategy plus new dataset will supply the missing fine-grained alignment and reasoning ability.

axioms (2)
  • domain assumption: LLMs can acquire robust priors for precise localized motion control when motion is explicitly modeled at part and temporal levels inside a single model.
    Invoked to justify endowing the model with fine-grained control capabilities.
  • domain assumption: Joint supervision across cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded CoT reasoning produces cross-granularity synergy and explicit reasoning ability.
    Central to the Reasoning-Aware Granularity-Synergy pre-training description.

pith-pipeline@v0.9.0 · 5847 in / 1476 out tokens · 62649 ms · 2026-05-20T10:33:15.900749+00:00 · methodology


Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 5 internal anchors

  1. [1]

    MotionLLM: Multimodal motion-language learning with large language models,

    Q. Wu, Y . Zhao, Y . Wang, Y .-W. Tai, and C.-K. Tang, “MotionLLM: Multimodal motion-language learning with large language models,” arXiv preprint arXiv:2405.17013, 2024

  2. [2]

    Human motion generation: A survey,

    W. Zhu, X. Ma, D. Ro, H. Ci, J. Zhang, J. Shi, F. Gao, Q. Tian, and Y . Wang, “Human motion generation: A survey,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, pp. 2430–2449, 2024

  3. [3]

    MotionGPT: Human motion as a foreign language,

    B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen, “MotionGPT: Human motion as a foreign language,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 20 067–20 079, 2023

  4. [4]

    TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts,

    C. Guo, X. Zuo, S. Wang, and L. Cheng, “TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts,” inEur. Conf. Comput. Vis., 2022, pp. 580–597

  5. [5]

    Human motion diffusion model,

    G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” inInt. Conf. Learn. Rep- resent., 2023

  6. [6]

    Generating human motion from textual descriptions with discrete representations,

    J. Zhang, Y . Zhang, X. Cun, Y . Zhang, H. Zhao, H. Lu, X. Shen, and Y . Shan, “Generating human motion from textual descriptions with discrete representations,” inIEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 14 730–14 740

  7. [7]

    Generating diverse and natural 3d human motions from text,

    C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” inIEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 5152–5161

  8. [8]

    MoMask: Generative masked modeling of 3d human motions,

    C. Guo, Y . Mu, M. G. Javed, S. Wang, and L. Cheng, “MoMask: Generative masked modeling of 3d human motions,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 1900–1910

  9. [9]

    MotionFix: Text-driven 3D human motion editing,

    N. Athanasiou, A. Cseke, M. Diomataris, M. J. Black, and G. Varol, “MotionFix: Text-driven 3D human motion editing,” inSIGGRAPH Asia, 2024, pp. 1–11

  10. [10]

    FLAME: Free-form language-based motion synthesis & editing,

    J. Kim, J. Kim, and S. Choi, “FLAME: Free-form language-based motion synthesis & editing,” inAAAI Conf. Artif. Intell., vol. 37, 2023, pp. 8255–8263

  11. [11]

    MotionCLIP: Exposing human motion generation to clip space,

    G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “MotionCLIP: Exposing human motion generation to clip space,” in Eur. Conf. Comput. Vis., 2022, pp. 358–374

  12. [12]

    EDGE: Editable dance generation from music,

    J. Tseng, R. Castellon, and K. Liu, “EDGE: Editable dance generation from music,” inIEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 448–458

  13. [13]

    MotionLab: Unified hu- man motion generation and editing via the motion-condition-motion paradigm,

    Z. Guo, Z. Hu, D. W. Soh, and N. Zhao, “MotionLab: Unified hu- man motion generation and editing via the motion-condition-motion paradigm,” inInt. Conf. Comput. Vis., 2025, pp. 13 869–13 879

  14. [14]

    M3GPT: An advanced multimodal, multitask framework for motion comprehension and generation,

    M. Luo, R. Hou, Z. Li, H. Chang, Z. Liu, Y . Wang, and S. Shan, “M3GPT: An advanced multimodal, multitask framework for motion comprehension and generation,” inAdv. Neural Inform. Process. Syst., vol. 37, 2024, pp. 28 051–28 077

  15. [15]

    MotionGPT: Finetuned LLMs are general- purpose motion generators,

    Y . Zhang, D. Huang, B. Liu, S. Tang, Y . Lu, L. Chen, L. Bai, Q. Chu, N. Yu, and W. Ouyang, “MotionGPT: Finetuned LLMs are general- purpose motion generators,” inAAAI Conf. Artif. Intell., vol. 38, 2024, pp. 7368–7376

  16. [16]

    MG-MotionLLM: A unified framework for motion comprehension and generation across multiple granularities,

    B. Wu, J. Xie, K. Shen, Z. Kong, J. Ren, R. Bai, R. Qu, and L. Shen, “MG-MotionLLM: A unified framework for motion comprehension and generation across multiple granularities,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 27 849–27 858

  17. [17]

    Dynamic motion blending for versatile motion editing,

    N. Jiang, H. Li, Z. Yuan, Z. He, Y . Chen, T. Liu, Y . Zhu, and S. Huang, “Dynamic motion blending for versatile motion editing,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 22 735–22 745

  18. [18]

    Action-conditioned 3d human motion synthesis with transformer vae,

    M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” inInt. Conf. Comput. Vis., 2021, pp. 10 985–10 995

  19. [19]

    MoDi: Unconditional motion synthesis from diverse data,

    S. Raab, I. Leibovitch, P. Li, K. Aberman, O. Sorkine-Hornung, and D. Cohen-Or, “MoDi: Unconditional motion synthesis from diverse data,” inIEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 13 873– 13 883

  20. [20]

    DanceFormer: Music conditioned 3d dance generation with parametric motion transformer,

    B. Li, Y . Zhao, S. Zhelun, and L. Sheng, “DanceFormer: Music conditioned 3d dance generation with parametric motion transformer,” inAAAI Conf. Artif. Intell., vol. 36, 2022, pp. 1272–1279

  21. [21]

    Lodge++: High-quality and long dance generation with robust choreography patterns,

    R. Li, H. Zhang, Y . Zhang, Y . Zhang, Y . Zhang, J. Guo, Y . Zhang, X. Li, and Y . Liu, “Lodge++: High-quality and long dance generation with robust choreography patterns,”IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–15, 2025

  22. [22]

    Lagrangian motion fields for long-term motion generation,

    Y . Yang, Z. Huang, C. Xu, and S. He, “Lagrangian motion fields for long-term motion generation,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, pp. 1171–1184, 2026

  23. [23]

    MNET++: Music-driven plural- istic dancing toward multiple dance genre synthesis,

    J. Kim, B. Kwon, J. Kim, and S. Lee, “MNET++: Music-driven plural- istic dancing toward multiple dance genre synthesis,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 15 036– 15 050, 2023

  24. [24]

    Combo: Co-speech holistic 3d human motion generation and efficient customizable adaptation in harmony,

    C. Xu, M. Sun, Z.-Q. Cheng, F. Wang, Y . Liu, B. Sun, R. Huang, and A. Hauptmann, “Combo: Co-speech holistic 3d human motion generation and efficient customizable adaptation in harmony,”IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–18, 2025

  25. [25]

    Audio2Gestures: Generating diverse gestures from audio,

    J. Li, D. Kang, W. Pei, X. Zhe, Y . Zhang, L. Bao, and Z. He, “Audio2Gestures: Generating diverse gestures from audio,”IEEE Trans. Vis. Comput. Graph., vol. 30, pp. 4752–4766, 2023

  26. [26]

    From audio to photoreal embodiment: Synthesizing humans in conversations,

    E. Ng, J. Romero, T. Bagautdinov, S. Bai, T. Darrell, A. Kanazawa, and A. Richard, “From audio to photoreal embodiment: Synthesizing humans in conversations,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 1001–1010

  27. [27]

    TEMOS: Generating diverse human motions from textual descriptions,

    M. Petrovich, M. J. Black, and G. Varol, “TEMOS: Generating diverse human motions from textual descriptions,” inEur. Conf. Comput. Vis., 2022, pp. 480–497

  28. [28]

    DrawMotion: Generating 3d human motions by freehand drawing,

    T. Wang, L. Jin, Z. Wu, Q. He, J. Chu, Y . Cheng, J. Xing, J. Zhao, S. Yan, and L. Wang, “DrawMotion: Generating 3d human motions by freehand drawing,”IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–17, 2026

  29. [29]

    Seamless human motion composition with blended positional encodings,

    G. Barquero, S. Escalera, and C. Palmero, “Seamless human motion composition with blended positional encodings,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 457–469

  30. [30]

    CoMo: Controllable motion generation through language guided pose code editing,

    Y . Huang, W. Wan, Y . Yang, C. Callison-Burch, M. Yatskar, and L. Liu, “CoMo: Controllable motion generation through language guided pose code editing,” inEur. Conf. Comput. Vis., 2024, p. 180–196

  31. [31]

    Multi-track timeline control for text-driven 3d human motion generation,

    M. Petrovich, O. Litany, U. Iqbal, M. J. Black, G. Varol, X. Bin Peng, and D. Rempe, “Multi-track timeline control for text-driven 3d human motion generation,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 1911–1921

  32. [32]

    Fg-T2M: Fine- grained text-driven human motion generation via diffusion model,

    Y . Wang, Z. Leng, F. W. Li, S.-C. Wu, and X. Liang, “Fg-T2M: Fine- grained text-driven human motion generation via diffusion model,” in Int. Conf. Comput. Vis., 2023, pp. 22 035–22 044

  33. [33]

    LaMP: Language-motion pretraining for motion generation, retrieval, and captioning,

    Z. Li, W. Yuan, Y . He, L. Qiu, S. Zhu, X. Gu, W. Shen, Y . Dong, Z. Dong, and L. T. Yang, “LaMP: Language-motion pretraining for motion generation, retrieval, and captioning,” inInt. Conf. Learn. Represent., 2025

  34. [34]

    ScaMo: Exploring the scaling law in autoregressive motion generation model,

    S. Lu, J. Wang, Z. Lu, L.-H. Chen, W. Dai, J. Dong, Z. Dou, B. Dai, and R. Zhang, “ScaMo: Exploring the scaling law in autoregressive motion generation model,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 27 872–27 882

  35. [35]

    The language of motion: Unifying verbal and IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 15 non-verbal language of 3d human motion,

    C. Chen, J. Zhang, S. K. Lakshmikanth, Y . Fang, R. Shao, G. Wetzstein, L. Fei-Fei, and E. Adeli, “The language of motion: Unifying verbal and IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 15 non-verbal language of 3d human motion,” inIEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 6200–6211

  36. [36]

    ParCo: Part-coordinating text-to-motion synthesis,

    Q. Zou, S. Yuan, S. Du, Y . Wang, C. Liu, Y . Xu, J. Chen, and X. Ji, “ParCo: Part-coordinating text-to-motion synthesis,” inEur. Conf. Comput. Vis., 2024, pp. 126–143

  37. [37]

    MotionDiffuse: Text-driven human motion generation with diffusion model,

    M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “MotionDiffuse: Text-driven human motion generation with diffusion model,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, pp. 4115– 4128, 2024

  38. [38]

    Executing your commands via motion diffusion in latent space,

    X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu, “Executing your commands via motion diffusion in latent space,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 18 000–18 010

  39. [39]

    AMD: autoregressive motion diffusion,

    B. Han, H. Peng, M. Dong, Y . Ren, Y . Shen, and C. Xu, “AMD: autoregressive motion diffusion,” inAAAI Conf. Artif. Intell., 2024, pp. 2022–2030

  40. [40]

    CLoSD: Closing the loop between simulation and diffusion for multi-task character control,

    G. Tevet, S. Raab, S. Cohan, D. Reda, Z. Luo, X. B. Peng, A. H. Bermano, and M. van de Panne, “CLoSD: Closing the loop between simulation and diffusion for multi-task character control,” inInt. Conf. Learn. Represent., 2025

  41. [41]

    MotionStreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space,

    L. Xiao, S. Lu, H. Pi, K. Fan, L. Pan, Y . Zhou, Z. Feng, X. Zhou, S. Peng, and J. Wang, “MotionStreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space,” inInt. Conf. Comput. Vis., 2025, pp. 10 086–10 096

  42. [42]

    The kit motion-language dataset,

    M. Plappert, C. Mandery, and T. Asfour, “The kit motion-language dataset,”Big Data, vol. 4, pp. 236–252, 2016

  43. [43]

    Action- GPT: Leveraging large-scale language models for improved and gener- alized action generation,

    S. S. Kalakonda, S. Maheshwari, and R. K. Sarvadevabhatla, “Action- GPT: Leveraging large-scale language models for improved and gener- alized action generation,” inInt. Conf. Multimedia and Expo, 2023, pp. 31–36

  44. [44]

    Fg-T2M++: LLMs-augmented fine-grained text driven human motion generation,

    Y . Wang, M. Li, J. Liu, Z. Leng, F. W. B. Li, Z. Zhang, and X. Liang, “Fg-T2M++: LLMs-augmented fine-grained text driven human motion generation,”Int. J. Comput. Vis., vol. 133, pp. 4277–4293, 2025

  45. [45]

    Semanticboost: Ele- vating motion generation with augmented textual cues,

    X. He, S. Huang, X. Zhan, C. Wen, and Y . Shan, “Semanticboost: Ele- vating motion generation with augmented textual cues,”arXiv preprint arXiv:2310.20323, 2023

  46. [46]

    MotionScript: Natural language descriptions for expressive 3d human motions,

    P. J. Yazdian, R. Lagasse, H. Mohammadi, E. Liu, L. Cheng, and A. Lim, “MotionScript: Natural language descriptions for expressive 3d human motions,” inIEEE Int. Conf. Intell. Robots Syst., 2025, pp. 21 574– 21 581

  47. [47]

    FineMoGen: Fine-grained spatio-temporal motion generation and editing,

    M. Zhang, H. Li, Z. Cai, J. Ren, L. Yang, and Z. Liu, “FineMoGen: Fine-grained spatio-temporal motion generation and editing,” inAdv. Neural Inform. Process. Syst., vol. 36, 2023, pp. 13 981–13 992

  48. [48]

    FineMotion: A dataset and benchmark with both spatial and temporal annotation for fine-grained motion generation and editing,

    B. Wu, J. Xie, M. Ding, Z. Kong, J. Ren, R. Bai, R. Qu, and L. Shen, “FineMotion: A dataset and benchmark with both spatial and temporal annotation for fine-grained motion generation and editing,” inInt. Conf. Comput. Vis., 2025, pp. 13 837–13 846

  49. [49]

    Realtime style transfer for unlabeled heterogeneous human motion,

    S. Xia, C. Wang, J. Chai, and J. Hodgins, “Realtime style transfer for unlabeled heterogeneous human motion,” ACM Trans. Graph., vol. 34, pp. 1–10, 2015

  50. [50]

    Unpaired motion style transfer from video to animation,

    K. Aberman, Y. Weng, D. Lischinski, D. Cohen-Or, and B. Chen, “Unpaired motion style transfer from video to animation,” ACM Trans. Graph., vol. 39, pp. 64:1–64:12, 2020

  51. [51]

    Motion Puzzle: Arbitrary motion style transfer by body part,

    D.-K. Jang, S. Park, and S.-H. Lee, “Motion Puzzle: Arbitrary motion style transfer by body part,” ACM Trans. Graph., vol. 41, pp. 1–16, 2022

  52. [52]

    Style-ERD: Responsive and coherent online motion style transfer,

    T. Tao, X. Zhan, Z. Chen, and M. van de Panne, “Style-ERD: Responsive and coherent online motion style transfer,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 6593–6603

  53. [53]

    SALAD: Skeleton-aware latent diffusion for text-driven motion generation and editing,

    S. Hong, C. Kim, S. Yoon, J. Nam, S. Cha, and J. Noh, “SALAD: Skeleton-aware latent diffusion for text-driven motion generation and editing,” in IEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 7158–7168

  54. [54]

    MaskControl: Spatio-temporal control for masked motion synthesis,

    E. Pinyoanuntapong, M. U. Saleem, K. Karunratanakul, P. Wang, H. Xue, C. Chen, C. Guo, J. Cao, J. Ren, and S. Tulyakov, “MaskControl: Spatio-temporal control for masked motion synthesis,” in Int. Conf. Comput. Vis., 2025, pp. 9955–9965

  55. [55]

    Iterative motion editing with natural language,

    P. Goel, K. Wang, C. K. Liu, and K. Fatahalian, “Iterative motion editing with natural language,” in SIGGRAPH, 2024, pp. 1–9

  56. [56]

    SimMotionEdit: Text-based human motion editing with motion similarity prediction,

    Z. Li, K. Cheng, A. Ghosh, U. Bhattacharya, L. Gui, and A. Bera, “SimMotionEdit: Text-based human motion editing with motion similarity prediction,” in IEEE Conf. Comput. Vis. Pattern Recog., 2025, pp. 27 827–27 837

  57. [57]

    Weakly-supervised 3d spatial reasoning for text-based visual question answering,

    H. Li, J. Huang, P. Jin, G. Song, Q. Wu, and J. Chen, “Weakly-supervised 3d spatial reasoning for text-based visual question answering,” IEEE Trans. Image Process., vol. 32, pp. 3367–3382, 2023

  58. [58]

    TEILP: Time prediction over knowledge graphs via logical reasoning,

    S. Xiong, Y. Yang, A. Payani, J. C. Kerce, and F. Fekri, “TEILP: Time prediction over knowledge graphs via logical reasoning,” in AAAI Conf. Artif. Intell., vol. 38, 2024, pp. 16 112–16 119

  59. [59]

    From system 1 to system 2: A survey of reasoning large language models,

    D. Zhang, Z.-Z. Li, M.-L. Zhang, J. Zhang, Z. Liu, Y. Yao, H. Xu, J. Zheng, X. Chen, Y. Zhang, F. Yin, J. Dong, Z. Guo, L. Song, and C.-L. Liu, “From system 1 to system 2: A survey of reasoning large language models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, pp. 3335–3354, 2026

  60. [60]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Adv. Neural Inform. Process. Syst., vol. 35, pp. 24 824–24 837, 2022

  61. [61]

    EAGLE: expert-guided self-enhancement for preference alignment in pathology large vision-language model,

    M. Ding, J. Zhang, W. Wang, H. Zhong, X. Wang, X. Lyu, W. Chen, and L. Shen, “EAGLE: expert-guided self-enhancement for preference alignment in pathology large vision-language model,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, vol. 1, 2025, pp. 14 603–14 619

  62. [62]

    AtomThink: Multimodal slow thinking with atomic step reasoning,

    K. Xiang, Z. Liu, T. J. Zhang, Y. Huang, Y. Nie, K. Cai, Y. Yin, R. Huang, H. Li, Y. Zeng, Y.-J. Yuan, J. Han, L. Hong, H. Xu, and X. Liang, “AtomThink: Multimodal slow thinking with atomic step reasoning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, pp. 5725–5741, 2026

  63. [63]

    EvolveNav: Empowering llm-based vision-language navigation via self-improving embodied reasoning,

    B. Lin, Y. Nie, K. L. Zai, Z. Wei, M. Han, R. Xu, M. Niu, J. Han, H. Zhang, L. Lin, B. Chen, C. Lu, and X. Liang, “EvolveNav: Empowering llm-based vision-language navigation via self-improving embodied reasoning,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–15, 2026

  64. [64]

    Motion question answering via modular motion programs,

    M. Endo, J. Hsu, J. Li, and J. Wu, “Motion question answering via modular motion programs,” in Int. Conf. Mach. Learn., 2023, pp. 9312–9328

  65. [65]

    IMoRe: Implicit program-guided reasoning for human motion q&a,

    C. Li, C. Sugandhika, Y. K. Ee, E. Peh, H. Zhang, H. Yang, D. Rajan, and B. Fernando, “IMoRe: Implicit program-guided reasoning for human motion q&a,” in Int. Conf. Comput. Vis., 2025, pp. 12 987–12 996

  66. [66]

    ChatGPT (Mar 14 version) [Large Language Model],

    OpenAI, “ChatGPT (Mar 14 version) [Large Language Model],” https://chat.openai.com/chat/, 2023

  67. [67]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  68. [68]

    MotionChain: Conversational motion controllers via multimodal prompts,

    B. Jiang, X. Chen, C. Zhang, F. Yin, Z. Li, G. Yu, and J. Fan, “MotionChain: Conversational motion controllers via multimodal prompts,” in Eur. Conf. Comput. Vis., 2024, pp. 54–74

  69. [69]

    AvatarGPT: All-in-one framework for motion understanding planning generation and beyond,

    Z. Zhou, Y. Wan, and B. Wang, “AvatarGPT: All-in-one framework for motion understanding planning generation and beyond,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 1357–1366

  70. [70]

    A unified framework for motion reasoning and generation in human interaction,

    J. Park, S. Choi, and S. Yun, “A unified framework for motion reasoning and generation in human interaction,” in Int. Conf. Comput. Vis., 2025, pp. 10 698–10 707

  71. [71]

    Motion-Agent: A conversational framework for human motion generation with LLMs,

    Q. Wu, Y. Zhao, Y. Wang, X. Liu, Y. Tai, and C. Tang, “Motion-Agent: A conversational framework for human motion generation with LLMs,” in Int. Conf. Learn. Represent., 2025

  72. [72]

    HuTuMotion: Human-tuned navigation of latent motion diffusion models with minimal feedback,

    G. Han, S. Huang, M. Gong, and J. Tang, “HuTuMotion: Human-tuned navigation of latent motion diffusion models with minimal feedback,” in AAAI Conf. Artif. Intell., vol. 38, 2024, pp. 2031–2039

  73. [73]

    Aligning human motion generation with human perceptions,

    H. Wang, W. Zhu, L. Miao, Y. Xu, F. Gao, Q. Tian, and Y. Wang, “Aligning human motion generation with human perceptions,” in Int. Conf. Learn. Represent., 2025

  74. [74]

    Learning generalizable human motion generator with reinforcement learning,

    Y. Mao, X. Liu, W. Zhou, Z. Lu, and H. Li, “Learning generalizable human motion generator with reinforcement learning,” arXiv preprint arXiv:2405.15541, 2024

  75. [75]

    Motionrl: Align text-to-motion generation to human preferences with multi-reward reinforcement learning,

    X. Liu, Y. Mao, W. Zhou, and H. Li, “Motionrl: Align text-to-motion generation to human preferences with multi-reward reinforcement learning,” arXiv preprint arXiv:2410.06513, 2024

  76. [76]

    Motion-R1: Chain-of-thought reasoning and reinforcement learning for human motion generation,

    R. Ouyang, H. Li, Z. Zhang, X. Wang, Z. Zhu, G. Huang, and X. Wang, “Motion-R1: Chain-of-thought reasoning and reinforcement learning for human motion generation,” in Int. Conf. Learn. Represent., 2026

  77. [77]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  78. [78]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020

  79. [79]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Adv. Neural Inform. Process. Syst., vol. 36, pp. 34 892–34 916, 2023

  80. [80]

    Scaling rectified flow transformers for high-resolution image synthesis,

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Int. Conf. Mach. Learn., vol. 235, 2024, pp. 12 606–12 633.
