pith. machine review for the scientific record. sign in

arxiv: 2602.12370 · v2 · submitted 2026-02-12 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

Authors on Pith no claims yet

Pith reviewed 2026-05-16 04:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-motion generationmotion-to-text captioningpretrained LLMsMixture-of-Transformerscontinuous tokensflow matchingzero-shot generationunified multimodal model
0
0 comments X

The pith

LLaMo extends pretrained language models with a Mixture-of-Transformers design to unify text-to-motion generation and motion captioning using continuous tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build one model that both creates human motions from text descriptions and writes captions for given motions by starting from existing large language models rather than training from scratch. It avoids the usual loss of language ability during motion training through a modality-specific Mixture-of-Transformers structure that keeps the original language pathways intact while adding motion handling. Motion is represented as continuous latent values instead of discrete tokens to eliminate jitter, and a lightweight flow-matching head lets the decoder-only backbone predict the next token in a streaming way that runs above 30 frames per second. Experiments show strong results on general text-to-motion tasks and especially on zero-shot generation of unseen motions, along with motion-to-text captioning.

Core claim

LLaMo extends pretrained LLMs through a modality-specific Mixture-of-Transformers architecture that encodes human motion into a causal continuous latent space while preserving the next-token prediction paradigm via a lightweight flow-matching head, enabling real-time streaming motion generation above 30 FPS and delivering high-fidelity text-to-motion generation plus motion-to-text captioning without catastrophic forgetting of linguistic capabilities.

What carries the argument

Modality-specific Mixture-of-Transformers (MoT) architecture paired with continuous autoregressive tokens and a flow-matching head for next-token prediction.

If this is right

  • Real-time text-to-motion generation runs above 30 FPS in a streaming manner.
  • Zero-shot motion generation works in general settings without task-specific fine-tuning.
  • Motion-to-text captioning and text-to-motion generation are handled inside the same unified model.
  • Continuous latent encoding removes jitter artifacts that come from discrete motion tokenization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same continuous-token plus flow-matching pattern could be tested on other sequential modalities such as audio or 3D video.
  • Scaling the base LLM size further might improve zero-shot performance on long or complex motion sequences.
  • The architecture might generalize to joint training on multiple motion datasets without separate quantization steps for each.

Load-bearing premise

The modality-specific Mixture-of-Transformers structure keeps the base language model's understanding intact while still allowing effective adaptation to motion data.

What would settle it

Measure language-only task performance on the base LLM before and after full LLaMo training on motion-text pairs; a large drop would falsify the preservation claim.

Figures

Figures reproduced from arXiv: 2602.12370 by Abhay Mittal, Amy Zhao, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Lingling Tao, Linguang Zhang, Sizhe An, Srinath Sridhar, Zekun Li.

Figure 1
Figure 1. Figure 1: We introduce LLaMo, the first large-scale motion-language model supporting unified motion understanding and generation [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Framework overview of LLaMo. We utilize modality-specific Mixture-of-Transformer (MoT) to process text and motion tokens separately, while enabling cross-modal interactions through shared self-attention. To preserve the language performance of the base model, text-related modules are frozen. The [BOM] and [EOM] tokens denote the start and end of the motion sequence, respectively. An additional exit head al… view at source ↗
Figure 3
Figure 3. Figure 3: Dataset Composition. We gather a large-scale human motion dataset by combining high quality Mocap datasets with large-scale HMR estimated datasets. cannot rely on the traditional strategy to end the autoregres￾sive generation, i.e. terminate the motion generation when end of motion token [EOM] appeared. To address this, following the approach used in TransformerTTS [31] and SpeechT5 [1], we introduce a bin… view at source ↗
Figure 4
Figure 4. Figure 4: Zero-shot Text-to-Motion Generation Results on MotionMillion-Eval [ [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: User Study of Zero-shot Text-to-Motion Generation. We use the prompts from MotionMillion-Eval [12] to evaluate our model against MotionMillion [12]. Results show that users signif￾icantly prefer our model across all the evaluation axes. Methods Motion Activated #Params Text Activated #Params Total #Params MotionMillion-3B 3B 4.2B 4.2B MotionMillion-7B 7B 8.2B 8.2B LLaMo-1B 1B 1B 2B LLaMo-3B 3B 3B 6B LLaMo-… view at source ↗
Figure 7
Figure 7. Figure 7: Semantic distribution visualization. We apply t-SNE [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative Comparison about Motion Reconstruction. The blue motion is ground truth motion and the orange motion is the [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
read the original abstract

Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes LLaMo, a unified framework extending pretrained LLMs via a modality-specific Mixture-of-Transformers (MoT) architecture. Motion is encoded into a causal continuous latent space with a lightweight flow-matching head to preserve the next-token prediction paradigm, enabling real-time streaming generation. The central claims are that this design inherently avoids catastrophic forgetting of linguistic capabilities during large-scale motion-text pretraining and achieves high-fidelity text-to-motion generation plus motion-to-text captioning, with particular strength in zero-shot settings.

Significance. If the experimental claims are substantiated, the work would advance unified motion-language modeling by demonstrating scalable multimodal adaptation of LLMs without discrete quantization artifacts or loss of base capabilities, with practical benefits for real-time (>30 FPS) generation. The continuous autoregressive token approach and MoT routing represent a potentially generalizable direction for other modalities, though the absence of reported language-benchmark retention metrics limits assessment of the preservation claim.

major comments (3)
  1. [Abstract / §4] Abstract and §4 (Experiments): The claim that the MoT architecture 'inherently preserves the language understanding of the base model' is presented as a design property but is unsupported by any quantitative comparison (e.g., GLUE, MMLU, or zero-shot reasoning scores) between the original LLM and the post-adaptation LLaMo model. Large-scale motion-text pretraining could still induce forgetting even with routing, and no such results are provided.
  2. [Abstract] Abstract: The statements of 'high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation' are made without any reported metrics, baselines, ablation studies, or dataset details. This leaves the central performance claims without verifiable support in the manuscript as described.
  3. [§3] §3 (Method): The continuous flow-matching head is introduced to maintain the decoder-only next-token paradigm, but no derivation or analysis is given showing how this head interacts with the discrete language token predictions to guarantee retention of original LLM behavior after joint training.
minor comments (1)
  1. [§3] Notation for the continuous latent space and flow-matching head should be defined more explicitly with equations to clarify the autoregressive generation process.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and constructive suggestions. We will revise the manuscript to provide quantitative evidence for our claims and additional analysis as requested.

read point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (Experiments): The claim that the MoT architecture 'inherently preserves the language understanding of the base model' is presented as a design property but is unsupported by any quantitative comparison (e.g., GLUE, MMLU, or zero-shot reasoning scores) between the original LLM and the post-adaptation LLaMo model. Large-scale motion-text pretraining could still induce forgetting even with routing, and no such results are provided.

    Authors: We agree that empirical validation is important to substantiate the preservation claim. Although the MoT design routes language inputs exclusively through the frozen base LLM parameters, we will add quantitative comparisons on language benchmarks such as GLUE, MMLU, and zero-shot reasoning tasks in the revised §4 to demonstrate retention of linguistic capabilities post-adaptation. revision: yes

  2. Referee: [Abstract] Abstract: The statements of 'high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation' are made without any reported metrics, baselines, ablation studies, or dataset details. This leaves the central performance claims without verifiable support in the manuscript as described.

    Authors: The full manuscript in §4 provides detailed experimental results, including quantitative metrics (e.g., FID, R-Precision for text-to-motion; BLEU, CIDEr for motion-to-text), comparisons against baselines like MDM and MotionGPT, ablation studies on the MoT and flow-matching components, and dataset details (HumanML3D, KIT-Motion-Language). We will revise the abstract to include key numerical results and explicit references to these sections for better verifiability. revision: partial

  3. Referee: [§3] §3 (Method): The continuous flow-matching head is introduced to maintain the decoder-only next-token paradigm, but no derivation or analysis is given showing how this head interacts with the discrete language token predictions to guarantee retention of original LLM behavior after joint training.

    Authors: We will expand §3 with a formal analysis and derivation. Specifically, we will show that the flow-matching head predicts continuous motion latents in an autoregressive manner using a separate output projection, while language token prediction uses the original LLM head and loss. The MoT architecture ensures that language token embeddings and attention are processed only by the base experts, isolating gradients and preserving the original next-token prediction objective for text. This will include equations demonstrating the separation of modalities during joint training. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rely on architectural proposal and empirical results, not self-referential reductions

full rationale

The paper introduces LLaMo via a modality-specific Mixture-of-Transformers extension to pretrained LLMs, continuous latent motion encoding, and a flow-matching head for next-token prediction. No equations, derivations, or self-citations are shown that reduce the central claims (preservation of language capabilities, zero-shot performance) to fitted parameters or prior self-work by construction. The 'inherently preserves' statement is presented as a design property of the new architecture rather than a derived equivalence to inputs. This is a standard model-extension approach without internal circular logic in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Limited information from abstract only; the central claim rests on the assumption that continuous latent encoding plus flow-matching preserves autoregressive next-token prediction without introducing new inconsistencies, plus standard transformer training assumptions.

axioms (1)
  • domain assumption Next-token prediction paradigm remains valid when a flow-matching head is added to a decoder-only LLM backbone for motion sequences
    Invoked to enable streaming generation at >30 FPS while keeping the original training objective.
invented entities (1)
  • Mixture-of-Transformers (MoT) architecture no independent evidence
    purpose: Modality-specific extension of pretrained LLMs that preserves language capabilities during multimodal adaptation
    New architectural component introduced to solve catastrophic forgetting when fine-tuning on motion-text pairs.

pith-pipeline@v0.9.0 · 5568 in / 1370 out tokens · 57637 ms · 2026-05-16T04:58:05.784753+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IAM: Identity-Aware Human Motion and Shape Joint Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 1 Pith paper · 21 internal anchors

  1. [1]

    Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing

    Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, et al. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. InProceedings of the 60th an- nual meeting of the association for computational linguistics (volume 1: Long papers), pages 5723–5738, 2022. 5, 1

  2. [2]

    Black, and G ¨ul Varol

    Nikos Athanasiou, Alp ´ar Ceske, Markos Diomataris, Michael J. Black, and G ¨ul Varol. MotionFix: Text-driven 3d human motion editing. InSIGGRAPH Asia 2024 Confer- ence Papers, 2024. 8

  3. [3]

    Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020. 2

  4. [4]

    Motionctrl: A real-time controllable vision-language-motion model

    Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, and Zongqing Lu. Motionctrl: A real-time controllable vision-language-motion model. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 12253–12262, 2025. 3, 8

  5. [5]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023. 7

  6. [6]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

  7. [7]

    Dis- cord: Discrete tokens to continuous motion via rectified flow decoding

    Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, and Youngjae Yu. Dis- cord: Discrete tokens to continuous motion via rectified flow decoding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14602–14612, 2025. 2, 6

  8. [8]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 5

  9. [9]

    Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xing- hang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583,

  10. [10]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 1, 3, 6, 8

  11. [11]

    Motion question answering via modular motion programs

    Mark Endo, Joy Hsu, Jiaman Li, and Jiajun Wu. Motion question answering via modular motion programs. InIn- ternational Conference on Machine Learning, pages 9312–

  12. [12]

    Go to zero: Towards zero-shot motion generation with million-scale data.arXiv preprint arXiv:2507.07095,

    Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data.arXiv preprint arXiv:2507.07095,

  13. [13]

    Humocon: Concept discovery for hu- man motion understanding

    Qihang Fang, Chengcheng Tang, Bugra Tekin, Shugao Ma, and Yanchao Yang. Humocon: Concept discovery for hu- man motion understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7179–7190, 2025. 2

  14. [14]

    Learning complex 3d human self-contact

    Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, and Cristian Sminchisescu. Learning complex 3d human self-contact. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1343– 1351, 2021. 5

  15. [15]

    Generating diverse and natural 3d human motions from text

    Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022. 2, 3, 5, 7, 8

  16. [16]

    Tm2t: Stochastic and tokenized modeling for the reciprocal gener- ation of 3d human motions and texts

    Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal gener- ation of 3d human motions and texts. InECCV, 2022. 7

  17. [17]

    Tm2t: Stochastic and tokenized modeling for the reciprocal genera- tion of 3d human motions and texts

    Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal genera- tion of 3d human motions and texts. InEuropean Conference on Computer Vision, pages 580–597. Springer, 2022. 2, 7, 8

  18. [18]

    Momask: Generative masked model- ing of 3d human motions

    Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked model- ing of 3d human motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024. 3, 7

  19. [19]

    Snapmogen: Human motion generation from expressive texts.arXiv preprint arXiv:2507.09122, 2025

    Chuan Guo, Inwoo Hwang, Jian Wang, and Bing Zhou. Snapmogen: Human motion generation from expressive texts.arXiv preprint arXiv:2507.09122, 2025. 8, 2

  20. [20]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2

  21. [21]

    Gaussian Error Linear Units (GELUs)

    D Hendrycks. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016. 1

  22. [22]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020. 8 9

  23. [23]

    Hmvlm: Human motion-vision-lanuage model via moe lora.arXiv preprint arXiv:2511.01463, 2025

    Lei Hu, Yongjing Ye, and Shihong Xia. Hmvlm: Human motion-vision-lanuage model via moe lora.arXiv preprint arXiv:2511.01463, 2025. 2, 3

  24. [24]

    Motiongpt: Human motion as a foreign lan- guage.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

    Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign lan- guage.Advances in Neural Information Processing Systems, 36:20067–20079, 2023. 2, 3, 5, 7, 8

  25. [25]

    Motionchain: Conversational motion controllers via multimodal prompts

    Biao Jiang, Xin Chen, Chi Zhang, Fukun Yin, Zhuoyuan Li, Gang Yu, and Jiayuan Fan. Motionchain: Conversational motion controllers via multimodal prompts. InEuropean Conference on Computer Vision, pages 54–74. Springer,

  26. [26]

    Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters

    Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, and Ziwei Liu. Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26887– 26898, 2025. 2

  27. [27]

    Scaling up dynamic human-scene interaction mod- eling

    Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan Huang. Scaling up dynamic human-scene interaction mod- eling. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 1737–1747,

  28. [28]

    Hyperspherical latents improve continuous-token autoregressive generation.arXiv preprint arXiv:2509.24335, 2025

    Guolin Ke and Hui Xue. Hyperspherical latents improve continuous-token autoregressive generation.arXiv preprint arXiv:2509.24335, 2025. 4, 2

  29. [29]

    Imore: Implicit program-guided reasoning for human mo- tion q&a.arXiv preprint arXiv:2508.01984, 2025

    Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, and Basura Fernando. Imore: Implicit program-guided reasoning for human mo- tion q&a.arXiv preprint arXiv:2508.01984, 2025. 8

  30. [30]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InPro- ceedings of the 40th International Conference on Machine Learning, pages 19730–19742. PMLR, 2023. 2

  31. [31]

    Neural speech synthesis with transformer network

    Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. InProceedings of the AAAI conference on artificial intelli- gence, pages 6706–6713, 2019. 5, 1

  32. [32]

    Finedance: A fine-grained choreography dataset for 3d full body dance generation

    Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 10234– 10243, 2023. 5

  33. [33]

    Autoregressive image generation without vec- tor quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vec- tor quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024. 2, 3, 5, 6, 1

  34. [34]

    Lamp: Language-motion pretraining for motion generation, retrieval, and captioning

    Zhe Li, Weihao Yuan, Yisheng He, Lingteng Qiu, Shen- hao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong, Zi- long Dong, and Laurence T Yang. Lamp: Language-motion pretraining for motion generation, retrieval, and captioning. arXiv preprint arXiv:2410.07093, 2024. 7

  35. [35]

    Intergen: Diffusion-based multi-human motion genera- tion under complex interactions.International Journal of Computer Vision, 132(9):3463–3483, 2024

    Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi-human motion genera- tion under complex interactions.International Journal of Computer Vision, 132(9):3463–3483, 2024. 5

  36. [36]

    Mogao: An omni foundation model for interleaved multi-modal generation

    Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025. 3

  37. [37]

    Animationgpt:an aigc tool for generating game combat motion assets.https : / / github

    Yihao Liao, Yiyu Fu, Ziming Cheng, and Jiangfeiyang Wang. Animationgpt:an aigc tool for generating game combat motion assets.https : / / github . com / fyyakaxyy/AnimationGPT, 2024. 5

  38. [38]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 7

  39. [39]

    Motion-x: A large-scale 3d expressive whole-body human motion dataset

    Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36: 25268–25280, 2023. 3, 5

  40. [40]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 2, 5

  41. [41]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2, 6

  42. [42]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 5

  43. [43]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 1

  44. [44]

    arXiv preprint arXiv:2206.08916 , year=

    Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mot- taghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks.arXiv preprint arXiv:2206.08916, 2022. 2, 3

  45. [45]

    Scamo: Exploring the scaling law in au- toregressive motion generation model

    Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, and Ruimao Zhang. Scamo: Exploring the scaling law in au- toregressive motion generation model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27872–27882, 2025. 3

  46. [46]

    Janusflow: Harmonizing autore- gression and rectified flow for unified multimodal under- standing and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autore- gression and rectified flow for unified multimodal under- standing and generation. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 7739–7751,

  47. [47]

    Ian Mason, Sebastian Starke, and Taku Komura. Real-time style modelling of human locomotion via feature-wise trans- formations and local motion phases.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1– 18, 2022. 5

  48. [48]

    Em- body 3d: A large-scale multimodal motion and behavior dataset.arXiv preprint arXiv:2510.16258, 2025

    Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Em- body 3d: A large-scale multimodal motion and behavior dataset.arXiv preprint arXiv:2510.16258, 2025. 5 10

  49. [49]

    Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression

    Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, and Huaizu Jiang. Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27859–27871, 2025. 3

  50. [50]

    Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025

    Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, and Xingang Wang. Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025. 2

  51. [51]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

  52. [52]

    Continuous autoregressive models with noise augmentation avoid error accumulation.arXiv preprint arXiv:2411.18447, 2024

    Marco Pasini, Javier Nistal, Stefan Lattner, and George Fazekas. Continuous autoregressive models with noise augmentation avoid error accumulation.arXiv preprint arXiv:2411.18447, 2024. 5

  53. [53]

    Babel: Bodies, action and behavior with english la- bels

    Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. Babel: Bodies, action and behavior with english la- bels. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 722–731, 2021. 5

  54. [54]

    Searching for Activation Functions

    Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017. 1

  55. [55]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InInternational confer- ence on machine learning, pages 8821–8831. Pmlr, 2021. 2

  56. [56]

    Continuous autoregressive language models.arXiv preprint arXiv:2510.27688, 2025

    Chenze Shao, Darren Li, Fandong Meng, and Jie Zhou. Continuous autoregressive language models.arXiv preprint arXiv:2510.27688, 2025. 4, 2

  57. [57]

    World-grounded human motion recovery via gravity-view coordinates

    Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 5

  58. [58]

    Lmfusion: Adapting pretrained language models for multimodal gener- ation.arXiv preprint arXiv:2412.15188, 2024

    Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili Yu. Lmfusion: Adapting pretrained language models for multimodal gener- ation.arXiv preprint arXiv:2412.15188, 2024. 2, 3

  59. [59]

    Mul- timodal latent language modeling with next-token diffusion

    Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Mul- timodal latent language modeling with next-token diffusion. arXiv preprint arXiv:2412.08635, 2024. 4, 2

  60. [60]

    Omni-video: Democratizing uni- fied video understanding and generation.arXiv preprint arXiv:2507.06119, 2025

    Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Meng- ping Yang, and Hao Li. Omni-video: Democratizing uni- fied video understanding and generation.arXiv preprint arXiv:2507.06119, 2025. 1

  61. [61]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 1, 2, 3

  62. [62]

    Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711, 2025

    NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711, 2025. 1, 2, 3, 4, 5, 6

  63. [63]

    Human Motion Diffusion Model

    Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion dif- fusion model.arXiv preprint arXiv:2209.14916, 2022. 3, 7

  64. [64]

    Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer

    Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, et al. Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. arXiv preprint arXiv:2401.10208, 2024. 2, 3

  65. [65]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2, 4

  66. [66]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2, 4

  67. [67]

    Cider: Consensus-based image description evalua- tion

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 7

  68. [68]

    Unirl-zero: Reinforcement learning on unified models with joint language model and diffusion model experts.arXiv preprint arXiv:2510.17937, 2025

    Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, and Taesung Park. Unirl-zero: Reinforcement learning on unified models with joint language model and diffusion model experts.arXiv preprint arXiv:2510.17937, 2025. 2

  69. [69]

    You think, you act: The new task of arbitrary text to motion generation

    Runqi Wang, Caoyuan Ma, Guopeng Li, Hanrui Xu, Yuke Li, and Zheng Wang. You think, you act: The new task of arbitrary text to motion generation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12012–12022, 2025. 2

  70. [70]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 1, 2, 3

  71. [71]

    Motiongpt-2: A general-purpose motion- language model for motion generation and understanding

    Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Yan Zhou, Pengfei Wan, Shixiang Tang, and Dan Xu. Motiongpt-2: A general-purpose motion- language model for motion generation and understanding. arXiv preprint arXiv:2410.21747, 2024. 2, 3, 8

  72. [72]

    Univideo: Unified understanding, generation, and editing for videos

    Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. Univideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377, 2025. 1

  73. [73]

    Janus: Decoupling visual encod- ing for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encod- ing for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025. 1, 2, 3

  74. [74]

    Motion-agent: A conversational framework for human motion generation with llms.arXiv preprint arXiv:2405.17013, 2024

    Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. Motion-agent: A conversational framework for human motion generation with llms.arXiv preprint arXiv:2405.17013, 2024. 3 11

  75. [75]

    Mote: Learning motion-text diffusion model for multiple generation tasks.arXiv preprint arXiv:2411.19786,

    Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, and Dong Xu. Mote: Learning motion-text diffusion model for multiple generation tasks.arXiv preprint arXiv:2411.19786,

  76. [76]

    Motionstreamer: Streaming motion genera- tion via diffusion-based autoregressive model in causal latent space.arXiv preprint arXiv:2503.15451, 2025

    Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, and Jingbo Wang. Motionstreamer: Streaming motion genera- tion via diffusion-based autoregressive model in causal latent space.arXiv preprint arXiv:2503.15451, 2025. 3, 7, 1, 2

  77. [77]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024. 1, 2, 3

  78. [78]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show- o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025. 1, 2, 6

  79. [79]

    X-streamer: Unified human world modeling with audiovisual interaction.arXiv preprint arXiv:2509.21574, 2025

    You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guox- ian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, and Linjie Luo. X-streamer: Unified human world modeling with audiovisual interaction.arXiv preprint arXiv:2509.21574, 2025. 1

  80. [80]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report.arXiv preprint arXiv:2503.20215, 2025

Showing first 80 references.