MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation
Pith reviewed 2026-05-20 10:33 UTC · model grok-4.3
The pith
MotionMERGE lets a single language model control and reason about human motions at the level of specific body parts and time steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MotionMERGE bridges the granularity gap by explicitly modeling motion at part and temporal levels within a single LLM and applying ReasoningAware Granularity-Synergy pre-training. This pre-training supplies joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought reasoning. The work also releases the MotionFineEdit dataset of 837K atomic and 144K complex triplets that carry fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations. Experiments show the resulting model produces more precise generation, understanding, and editing while generalizing zero-shot to other complex motion tasks.
What carries the argument
ReasoningAware Granularity-Synergy pre-training that jointly supervises cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought reasoning.
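As a minimal sketch of what joint supervision over five objectives could look like, assume the per-objective losses are combined as a weighted sum. This page does not say how the paper combines them, so the weighted-sum form, the function name `joint_loss`, and the default weights are all hypothetical; only the objective names come from the description above.

```python
# Hypothetical sketch: combine five per-objective losses into one training loss.
# The weighted-sum form and the uniform default weights are assumptions, not
# details taken from the paper.
OBJECTIVES = (
    "cross_granularity_alignment",
    "temporal_grounding",
    "localized_alignment",
    "motion_coherency",
    "motion_grounded_cot",
)


def joint_loss(losses, weights=None):
    """Return the weighted sum of the five objective losses.

    `losses` maps each objective name to a scalar loss value.
    `weights` optionally maps objective names to non-negative weights;
    objectives missing from `weights` default to 1.0.
    """
    missing = [name for name in OBJECTIVES if name not in losses]
    if missing:
        raise KeyError(f"missing losses for: {missing}")
    weights = weights or {}
    return sum(weights.get(name, 1.0) * losses[name] for name in OBJECTIVES)


# Usage: five objectives, each with loss 0.5 and uniform weights.
example = {name: 0.5 for name in OBJECTIVES}
total = joint_loss(example)
```

Setting a weight to zero removes that objective from the sum, which is one simple way to express the kind of ablation the referee report asks for.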
If this is right
- The model performs more precise motion generation, understanding, and editing at fine granularity.
- It exhibits compelling zero-shot generalization to other complex motion tasks.
- A new benchmark is created for fine-grained text-driven motion editing and motion-grounded reasoning.
- The model acquires fine-grained motion-language alignment and explicit reasoning ability.
Where Pith is reading between the lines
- Animators could issue natural-language instructions that change only a chosen limb or moment without rewriting the full sequence.
- The same multi-level structure might support interactive editing loops where a user refines a motion step by step.
- Robotics pipelines that plan human-like actions could adopt similar part-and-time supervision to improve safety and precision.
Load-bearing premise
The assumption that explicitly modeling motion at part and temporal levels inside one LLM plus joint supervision across granularities and reasoning tasks will create robust priors for precise localized control.
What would settle it
A controlled test set of instructions that ask the model to edit only one named body part or time interval; if the output motion changes other parts or times at rates comparable to coarse baselines, the fine-grained claim fails.
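One way to score such a test is a leakage ratio: the share of total motion change that lands outside the requested body part or time interval. The sketch below is illustrative only. The pose encoding (a list of frames, each mapping a part name to a list of floats) and the function name `edit_leakage` are assumptions, and this is not a metric reported by the paper.

```python
def edit_leakage(source, edited, target_part, start, end):
    """Fraction of total absolute change that falls outside the requested edit.

    `source` and `edited` are equal-length lists of frames; each frame maps a
    body-part name to a list of floats (a hypothetical pose encoding).
    The requested edit covers `target_part` on frames start <= t < end.
    Returns 0.0 when nothing changed at all.
    """
    if len(source) != len(edited):
        raise ValueError("motions must have the same number of frames")
    inside = outside = 0.0
    for t, (src_frame, new_frame) in enumerate(zip(source, edited)):
        for part, src_values in src_frame.items():
            change = sum(abs(a - b) for a, b in zip(src_values, new_frame[part]))
            if part == target_part and start <= t < end:
                inside += change
            else:
                outside += change
    total = inside + outside
    return outside / total if total else 0.0


# Usage: a four-frame toy motion with two parts, all values initially zero.
source = [{"left_arm": [0.0], "right_leg": [0.0]} for _ in range(4)]

# Clean edit: only the left arm moves, only on frames 1 and 2.
clean = [{"left_arm": [1.0 if t in (1, 2) else 0.0], "right_leg": [0.0]}
         for t in range(4)]

# Leaky edit: the right leg also moves on frame 2.
leaky = [{"left_arm": [1.0 if t in (1, 2) else 0.0],
          "right_leg": [1.0 if t == 2 else 0.0]}
         for t in range(4)]

clean_score = edit_leakage(source, clean, "left_arm", 1, 3)
leaky_score = edit_leakage(source, leaky, "left_arm", 1, 3)
```

Under this sketch, a model with genuinely localized control should score near zero on single-part instructions, while a coarse baseline would be expected to score noticeably higher.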
Original abstract
Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can't focus on motion's localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design ReasoningAware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MotionMERGE, a unified multi-granular framework for human motion tasks including editing, reasoning, generation, and explanation. It explicitly models motion at part and temporal levels in a single LLM and introduces ReasoningAware Granularity-Synergy pre-training with joint supervision across cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought reasoning. A new dataset, MotionFineEdit, is curated with 837K atomic and 144K complex triplets featuring fine-grained spatio-temporal corrective instructions and CoT annotations. The authors report that extensive experiments validate the framework's ability for precise motion generation, understanding, and editing, along with strong zero-shot generalization to complex tasks.
Significance. If substantiated, this work could significantly advance the field by addressing the granularity gap in motion-language models, enabling finer control over body parts and temporal aspects for applications in animation and human-computer interaction. The curation of a specialized dataset and the multi-objective pre-training approach are notable contributions that could serve as benchmarks for future research in fine-grained motion modeling. The emphasis on motion-grounded reasoning adds a valuable dimension to LLM-based motion systems.
major comments (2)
- [Abstract] Abstract: The abstract asserts 'extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization' but supplies no quantitative metrics, error bars, baseline comparisons, or details on how the pre-training affects specific outputs, so the data-to-claim link cannot be verified.
- [ReasoningAware Granularity-Synergy pre-training] ReasoningAware Granularity-Synergy pre-training description: The central claim requires that explicitly modeling part- and temporal-level motion inside one LLM plus the five-way joint supervision will produce robust priors for fine-grained control and motion-grounded reasoning, yet no ablation isolates whether gains come from the joint schedule versus the new dataset or base LLM capacity, leaving the synergy assumption untested at the level needed to underwrite the headline results on precise editing and zero-shot generalization.
minor comments (2)
- The manuscript would benefit from additional details on the exact architecture for part- and temporal-level modeling within the LLM to support reproducibility.
- [Dataset] Clarify how the MotionFineEdit dataset triplets were generated and validated for annotation quality, particularly the motion-grounded CoT annotations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating revisions where we agree changes are warranted to strengthen the presentation.
- Referee: [Abstract] Abstract: The abstract asserts 'extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization' but supplies no quantitative metrics, error bars, baseline comparisons, or details on how the pre-training affects specific outputs, so the data-to-claim link cannot be verified.
  Authors: We agree that the abstract would be strengthened by including concrete quantitative highlights to make the claims more verifiable at a glance. In the revised version we will update the abstract to briefly report key metrics from our experiments, such as relative improvements in fine-grained editing accuracy and zero-shot generalization performance over strong baselines, while preserving the abstract's concise style.
  Revision: yes
- Referee: [ReasoningAware Granularity-Synergy pre-training] ReasoningAware Granularity-Synergy pre-training description: The central claim requires that explicitly modeling part- and temporal-level motion inside one LLM plus the five-way joint supervision will produce robust priors for fine-grained control and motion-grounded reasoning, yet no ablation isolates whether gains come from the joint schedule versus the new dataset or base LLM capacity, leaving the synergy assumption untested at the level needed to underwrite the headline results on precise editing and zero-shot generalization.
  Authors: The referee correctly identifies that our current set of experiments, while demonstrating overall gains from the multi-granular modeling and pre-training objectives, does not include a fully isolated ablation that disentangles the joint supervision schedule from the contributions of the MotionFineEdit dataset or the base LLM capacity. We will add targeted ablation studies in the revision, training controlled variants that disable subsets of the five supervision signals while keeping the dataset and base model fixed, to more rigorously substantiate the synergy effects.
  Revision: yes
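The ablation promised in the second response can be pictured as a small grid of training variants. The sketch below is a hypothetical illustration, not the authors' protocol: it enumerates configurations that disable up to `max_disabled` of the five supervision signals while the dataset and base model stay fixed. The function name, the config keys, and the `"fixed"` placeholder for the base model are assumptions.

```python
from itertools import combinations

# The five supervision signals named in the review text.
SIGNALS = (
    "cross_granularity_alignment",
    "temporal_grounding",
    "localized_alignment",
    "motion_coherency",
    "motion_grounded_cot",
)


def ablation_variants(max_disabled=1):
    """List training configs that switch off up to `max_disabled` signals.

    The first variant disables nothing (the full model); every variant keeps
    the dataset and base model identical so differences can be attributed to
    the supervision signals alone.
    """
    variants = []
    for k in range(max_disabled + 1):
        for disabled in combinations(SIGNALS, k):
            variants.append({
                "disabled": disabled,
                "enabled": tuple(s for s in SIGNALS if s not in disabled),
                "dataset": "MotionFineEdit",  # held fixed across variants
                "base_model": "fixed",        # placeholder, not a real model name
            })
    return variants


leave_one_out = ablation_variants(max_disabled=1)
pairs_too = ablation_variants(max_disabled=2)
```

Leave-one-out already gives one full run plus five ablated runs; allowing pairs adds the ten two-signal combinations, which is where any interaction between signals would be expected to show up.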
Circularity Check
No significant circularity in MotionMERGE derivation chain
The paper introduces a new multi-granular LLM framework, a custom ReasoningAware Granularity-Synergy pre-training strategy with joint supervision objectives, and a newly curated MotionFineEdit dataset containing fine-grained triplets and CoT annotations. Performance claims on precise editing, understanding, and zero-shot generalization are presented as outcomes of extensive experiments on this benchmark rather than quantities derived by construction from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are shown reducing the central synergy assumption or control priors to tautological inputs; the design choices remain independent empirical hypotheses validated externally to the reported results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can acquire robust priors for precise localized motion control when motion is explicitly modeled at part and temporal levels inside a single model.
- domain assumption: Joint supervision across cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded CoT reasoning produces cross-granularity synergy and explicit reasoning ability.