arxiv: 2602.12370 · v2 · submitted 2026-02-12 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

Zekun Li , Sizhe An , Chengcheng Tang , Chuan Guo , Ivan Shugurov , Linguang Zhang , Amy Zhao , Srinath Sridhar

show 2 more authors

Lingling Tao Abhay Mittal

Authors on Pith no claims yet

Pith reviewed 2026-05-16 04:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-motion generationmotion-to-text captioningpretrained LLMsMixture-of-Transformerscontinuous tokensflow matchingzero-shot generationunified multimodal model

0 comments

The pith

LLaMo extends pretrained language models with a Mixture-of-Transformers design to unify text-to-motion generation and motion captioning using continuous tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build one model that both creates human motions from text descriptions and writes captions for given motions by starting from existing large language models rather than training from scratch. It avoids the usual loss of language ability during motion training through a modality-specific Mixture-of-Transformers structure that keeps the original language pathways intact while adding motion handling. Motion is represented as continuous latent values instead of discrete tokens to eliminate jitter, and a lightweight flow-matching head lets the decoder-only backbone predict the next token in a streaming way that runs above 30 frames per second. Experiments show strong results on general text-to-motion tasks and especially on zero-shot generation of unseen motions, along with motion-to-text captioning.

Core claim

LLaMo extends pretrained LLMs through a modality-specific Mixture-of-Transformers architecture that encodes human motion into a causal continuous latent space while preserving the next-token prediction paradigm via a lightweight flow-matching head, enabling real-time streaming motion generation above 30 FPS and delivering high-fidelity text-to-motion generation plus motion-to-text captioning without catastrophic forgetting of linguistic capabilities.

What carries the argument

Modality-specific Mixture-of-Transformers (MoT) architecture paired with continuous autoregressive tokens and a flow-matching head for next-token prediction.

If this is right

Real-time text-to-motion generation runs above 30 FPS in a streaming manner.
Zero-shot motion generation works in general settings without task-specific fine-tuning.
Motion-to-text captioning and text-to-motion generation are handled inside the same unified model.
Continuous latent encoding removes jitter artifacts that come from discrete motion tokenization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same continuous-token plus flow-matching pattern could be tested on other sequential modalities such as audio or 3D video.
Scaling the base LLM size further might improve zero-shot performance on long or complex motion sequences.
The architecture might generalize to joint training on multiple motion datasets without separate quantization steps for each.

Load-bearing premise

The modality-specific Mixture-of-Transformers structure keeps the base language model's understanding intact while still allowing effective adaptation to motion data.

What would settle it

Measure language-only task performance on the base LLM before and after full LLaMo training on motion-text pairs; a large drop would falsify the preservation claim.

Figures

Figures reproduced from arXiv: 2602.12370 by Abhay Mittal, Amy Zhao, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Lingling Tao, Linguang Zhang, Sizhe An, Srinath Sridhar, Zekun Li.

**Figure 1.** Figure 1: We introduce LLaMo, the first large-scale motion-language model supporting unified motion understanding and generation [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Framework overview of LLaMo. We utilize modality-specific Mixture-of-Transformer (MoT) to process text and motion tokens separately, while enabling cross-modal interactions through shared self-attention. To preserve the language performance of the base model, text-related modules are frozen. The [BOM] and [EOM] tokens denote the start and end of the motion sequence, respectively. An additional exit head al… view at source ↗

**Figure 3.** Figure 3: Dataset Composition. We gather a large-scale human motion dataset by combining high quality Mocap datasets with large-scale HMR estimated datasets. cannot rely on the traditional strategy to end the autoregressive generation, i.e. terminate the motion generation when end of motion token [EOM] appeared. To address this, following the approach used in TransformerTTS [31] and SpeechT5 [1], we introduce a bin… view at source ↗

**Figure 4.** Figure 4: Zero-shot Text-to-Motion Generation Results on MotionMillion-Eval [ [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: User Study of Zero-shot Text-to-Motion Generation. We use the prompts from MotionMillion-Eval [12] to evaluate our model against MotionMillion [12]. Results show that users significantly prefer our model across all the evaluation axes. Methods Motion Activated #Params Text Activated #Params Total #Params MotionMillion-3B 3B 4.2B 4.2B MotionMillion-7B 7B 8.2B 8.2B LLaMo-1B 1B 1B 2B LLaMo-3B 3B 3B 6B LLaMo-… view at source ↗

**Figure 7.** Figure 7: Semantic distribution visualization. We apply t-SNE [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative Comparison about Motion Reconstruction. The blue motion is ground truth motion and the orange motion is the [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

read the original abstract

Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLaMo uses continuous causal motion latents and a flow-matching head on a MoT-extended LLM to sidestep discrete jitter and forgetting, but the abstract supplies no metrics or language-benchmark results to show it works.

read the letter

The paper's actual advance is replacing quantized motion tokens with a continuous autoregressive latent space while keeping the decoder-only next-token prediction loop via a lightweight flow-matching head. The MoT routing is meant to let the base LLM absorb large-scale motion-text data without losing its original language skills, and the design targets real-time streaming output above 30 FPS. This is a direct response to the jitter and forgetting problems that quantization-based fine-tuning runs into on small paired datasets.

Referee Report

3 major / 1 minor

Summary. The paper proposes LLaMo, a unified framework extending pretrained LLMs via a modality-specific Mixture-of-Transformers (MoT) architecture. Motion is encoded into a causal continuous latent space with a lightweight flow-matching head to preserve the next-token prediction paradigm, enabling real-time streaming generation. The central claims are that this design inherently avoids catastrophic forgetting of linguistic capabilities during large-scale motion-text pretraining and achieves high-fidelity text-to-motion generation plus motion-to-text captioning, with particular strength in zero-shot settings.

Significance. If the experimental claims are substantiated, the work would advance unified motion-language modeling by demonstrating scalable multimodal adaptation of LLMs without discrete quantization artifacts or loss of base capabilities, with practical benefits for real-time (>30 FPS) generation. The continuous autoregressive token approach and MoT routing represent a potentially generalizable direction for other modalities, though the absence of reported language-benchmark retention metrics limits assessment of the preservation claim.

major comments (3)

[Abstract / §4] Abstract and §4 (Experiments): The claim that the MoT architecture 'inherently preserves the language understanding of the base model' is presented as a design property but is unsupported by any quantitative comparison (e.g., GLUE, MMLU, or zero-shot reasoning scores) between the original LLM and the post-adaptation LLaMo model. Large-scale motion-text pretraining could still induce forgetting even with routing, and no such results are provided.
[Abstract] Abstract: The statements of 'high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation' are made without any reported metrics, baselines, ablation studies, or dataset details. This leaves the central performance claims without verifiable support in the manuscript as described.
[§3] §3 (Method): The continuous flow-matching head is introduced to maintain the decoder-only next-token paradigm, but no derivation or analysis is given showing how this head interacts with the discrete language token predictions to guarantee retention of original LLM behavior after joint training.

minor comments (1)

[§3] Notation for the continuous latent space and flow-matching head should be defined more explicitly with equations to clarify the autoregressive generation process.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and constructive suggestions. We will revise the manuscript to provide quantitative evidence for our claims and additional analysis as requested.

read point-by-point responses

Referee: [Abstract / §4] Abstract and §4 (Experiments): The claim that the MoT architecture 'inherently preserves the language understanding of the base model' is presented as a design property but is unsupported by any quantitative comparison (e.g., GLUE, MMLU, or zero-shot reasoning scores) between the original LLM and the post-adaptation LLaMo model. Large-scale motion-text pretraining could still induce forgetting even with routing, and no such results are provided.

Authors: We agree that empirical validation is important to substantiate the preservation claim. Although the MoT design routes language inputs exclusively through the frozen base LLM parameters, we will add quantitative comparisons on language benchmarks such as GLUE, MMLU, and zero-shot reasoning tasks in the revised §4 to demonstrate retention of linguistic capabilities post-adaptation. revision: yes
Referee: [Abstract] Abstract: The statements of 'high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation' are made without any reported metrics, baselines, ablation studies, or dataset details. This leaves the central performance claims without verifiable support in the manuscript as described.

Authors: The full manuscript in §4 provides detailed experimental results, including quantitative metrics (e.g., FID, R-Precision for text-to-motion; BLEU, CIDEr for motion-to-text), comparisons against baselines like MDM and MotionGPT, ablation studies on the MoT and flow-matching components, and dataset details (HumanML3D, KIT-Motion-Language). We will revise the abstract to include key numerical results and explicit references to these sections for better verifiability. revision: partial
Referee: [§3] §3 (Method): The continuous flow-matching head is introduced to maintain the decoder-only next-token paradigm, but no derivation or analysis is given showing how this head interacts with the discrete language token predictions to guarantee retention of original LLM behavior after joint training.

Authors: We will expand §3 with a formal analysis and derivation. Specifically, we will show that the flow-matching head predicts continuous motion latents in an autoregressive manner using a separate output projection, while language token prediction uses the original LLM head and loss. The MoT architecture ensures that language token embeddings and attention are processed only by the base experts, isolating gradients and preserving the original next-token prediction objective for text. This will include equations demonstrating the separation of modalities during joint training. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rely on architectural proposal and empirical results, not self-referential reductions

full rationale

The paper introduces LLaMo via a modality-specific Mixture-of-Transformers extension to pretrained LLMs, continuous latent motion encoding, and a flow-matching head for next-token prediction. No equations, derivations, or self-citations are shown that reduce the central claims (preservation of language capabilities, zero-shot performance) to fitted parameters or prior self-work by construction. The 'inherently preserves' statement is presented as a design property of the new architecture rather than a derived equivalence to inputs. This is a standard model-extension approach without internal circular logic in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Limited information from abstract only; the central claim rests on the assumption that continuous latent encoding plus flow-matching preserves autoregressive next-token prediction without introducing new inconsistencies, plus standard transformer training assumptions.

axioms (1)

domain assumption Next-token prediction paradigm remains valid when a flow-matching head is added to a decoder-only LLM backbone for motion sequences
Invoked to enable streaming generation at >30 FPS while keeping the original training objective.

invented entities (1)

Mixture-of-Transformers (MoT) architecture no independent evidence
purpose: Modality-specific extension of pretrained LLMs that preserves language capabilities during multimodal adaptation
New architectural component introduced to solve catastrophic forgetting when fine-tuning on motion-text pairs.

pith-pipeline@v0.9.0 · 5568 in / 1370 out tokens · 57637 ms · 2026-05-16T04:58:05.784753+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

modality-specific Mixture-of-Transformers (MoT) architecture... freezing the text-related modules and updating the motion-specific parameters only... continuous causal latent space and models the next-token distribution... through a flow-matching head

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

IAM: Identity-Aware Human Motion and Shape Joint Generation
cs.CV 2026-04 unverdicted novelty 6.0

IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 1 Pith paper · 21 internal anchors

[1]

Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing

Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, et al. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. InProceedings of the 60th an- nual meeting of the association for computational linguistics (volume 1: Long papers), pages 5723–5738, 2022. 5, 1

work page 2022
[2]

Black, and G ¨ul Varol

Nikos Athanasiou, Alp ´ar Ceske, Markos Diomataris, Michael J. Black, and G ¨ul Varol. MotionFix: Text-driven 3d human motion editing. InSIGGRAPH Asia 2024 Confer- ence Papers, 2024. 8

work page 2024
[3]

Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020. 2

work page 1901
[4]

Motionctrl: A real-time controllable vision-language-motion model

Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, and Zongqing Lu. Motionctrl: A real-time controllable vision-language-motion model. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 12253–12262, 2025. 3, 8

work page 2025
[5]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023. 7

work page 2023
[6]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Dis- cord: Discrete tokens to continuous motion via rectified flow decoding

Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, and Youngjae Yu. Dis- cord: Discrete tokens to continuous motion via rectified flow decoding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14602–14612, 2025. 2, 6

work page 2025
[8]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xing- hang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583,

work page arXiv
[10]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 1, 3, 6, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Motion question answering via modular motion programs

Mark Endo, Joy Hsu, Jiaman Li, and Jiajun Wu. Motion question answering via modular motion programs. InIn- ternational Conference on Machine Learning, pages 9312–

work page
[12]

Go to zero: Towards zero-shot motion generation with million-scale data.arXiv preprint arXiv:2507.07095,

Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data.arXiv preprint arXiv:2507.07095,

work page arXiv
[13]

Humocon: Concept discovery for hu- man motion understanding

Qihang Fang, Chengcheng Tang, Bugra Tekin, Shugao Ma, and Yanchao Yang. Humocon: Concept discovery for hu- man motion understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7179–7190, 2025. 2

work page 2025
[14]

Learning complex 3d human self-contact

Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, and Cristian Sminchisescu. Learning complex 3d human self-contact. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1343– 1351, 2021. 5

work page 2021
[15]

Generating diverse and natural 3d human motions from text

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022. 2, 3, 5, 7, 8

work page 2022
[16]

Tm2t: Stochastic and tokenized modeling for the reciprocal gener- ation of 3d human motions and texts

Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal gener- ation of 3d human motions and texts. InECCV, 2022. 7

work page 2022
[17]

Tm2t: Stochastic and tokenized modeling for the reciprocal genera- tion of 3d human motions and texts

Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal genera- tion of 3d human motions and texts. InEuropean Conference on Computer Vision, pages 580–597. Springer, 2022. 2, 7, 8

work page 2022
[18]

Momask: Generative masked model- ing of 3d human motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked model- ing of 3d human motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024. 3, 7

work page 1900
[19]

Snapmogen: Human motion generation from expressive texts.arXiv preprint arXiv:2507.09122, 2025

Chuan Guo, Inwoo Hwang, Jian Wang, and Bing Zhou. Snapmogen: Human motion generation from expressive texts.arXiv preprint arXiv:2507.09122, 2025. 8, 2

work page arXiv 2025
[20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Gaussian Error Linear Units (GELUs)

D Hendrycks. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016. 1

work page internal anchor Pith review Pith/arXiv arXiv 2016
[22]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020. 8 9

work page internal anchor Pith review Pith/arXiv arXiv 2009
[23]

Hmvlm: Human motion-vision-lanuage model via moe lora.arXiv preprint arXiv:2511.01463, 2025

Lei Hu, Yongjing Ye, and Shihong Xia. Hmvlm: Human motion-vision-lanuage model via moe lora.arXiv preprint arXiv:2511.01463, 2025. 2, 3

work page arXiv 2025
[24]

Motiongpt: Human motion as a foreign lan- guage.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign lan- guage.Advances in Neural Information Processing Systems, 36:20067–20079, 2023. 2, 3, 5, 7, 8

work page 2023
[25]

Motionchain: Conversational motion controllers via multimodal prompts

Biao Jiang, Xin Chen, Chi Zhang, Fukun Yin, Zhuoyuan Li, Gang Yu, and Jiayuan Fan. Motionchain: Conversational motion controllers via multimodal prompts. InEuropean Conference on Computer Vision, pages 54–74. Springer,

work page
[26]

Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters

Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, and Ziwei Liu. Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26887– 26898, 2025. 2

work page 2025
[27]

Scaling up dynamic human-scene interaction mod- eling

Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan Huang. Scaling up dynamic human-scene interaction mod- eling. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 1737–1747,

work page
[28]

Hyperspherical latents improve continuous-token autoregressive generation.arXiv preprint arXiv:2509.24335, 2025

Guolin Ke and Hui Xue. Hyperspherical latents improve continuous-token autoregressive generation.arXiv preprint arXiv:2509.24335, 2025. 4, 2

work page arXiv 2025
[29]

Imore: Implicit program-guided reasoning for human mo- tion q&a.arXiv preprint arXiv:2508.01984, 2025

Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, and Basura Fernando. Imore: Implicit program-guided reasoning for human mo- tion q&a.arXiv preprint arXiv:2508.01984, 2025. 8

work page arXiv 2025
[30]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InPro- ceedings of the 40th International Conference on Machine Learning, pages 19730–19742. PMLR, 2023. 2

work page 2023
[31]

Neural speech synthesis with transformer network

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. InProceedings of the AAAI conference on artificial intelli- gence, pages 6706–6713, 2019. 5, 1

work page 2019
[32]

Finedance: A fine-grained choreography dataset for 3d full body dance generation

Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 10234– 10243, 2023. 5

work page 2023
[33]

Autoregressive image generation without vec- tor quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vec- tor quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024. 2, 3, 5, 6, 1

work page 2024
[34]

Lamp: Language-motion pretraining for motion generation, retrieval, and captioning

Zhe Li, Weihao Yuan, Yisheng He, Lingteng Qiu, Shen- hao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong, Zi- long Dong, and Laurence T Yang. Lamp: Language-motion pretraining for motion generation, retrieval, and captioning. arXiv preprint arXiv:2410.07093, 2024. 7

work page arXiv 2024
[35]

Intergen: Diffusion-based multi-human motion genera- tion under complex interactions.International Journal of Computer Vision, 132(9):3463–3483, 2024

Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi-human motion genera- tion under complex interactions.International Journal of Computer Vision, 132(9):3463–3483, 2024. 5

work page 2024
[36]

Mogao: An omni foundation model for interleaved multi-modal generation

Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025. 3

work page arXiv 2025
[37]

Animationgpt:an aigc tool for generating game combat motion assets.https : / / github

Yihao Liao, Yiyu Fu, Ziming Cheng, and Jiangfeiyang Wang. Animationgpt:an aigc tool for generating game combat motion assets.https : / / github . com / fyyakaxyy/AnimationGPT, 2024. 5

work page 2024
[38]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 7

work page 2004
[39]

Motion-x: A large-scale 3d expressive whole-body human motion dataset

Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36: 25268–25280, 2023. 3, 5

work page 2023
[40]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[41]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2, 6

work page 2023
[42]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[43]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 1

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

arXiv preprint arXiv:2206.08916 , year=

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mot- taghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks.arXiv preprint arXiv:2206.08916, 2022. 2, 3

work page arXiv 2022
[45]

Scamo: Exploring the scaling law in au- toregressive motion generation model

Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, and Ruimao Zhang. Scamo: Exploring the scaling law in au- toregressive motion generation model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27872–27882, 2025. 3

work page 2025
[46]

Janusflow: Harmonizing autore- gression and rectified flow for unified multimodal under- standing and generation

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autore- gression and rectified flow for unified multimodal under- standing and generation. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 7739–7751,

work page
[47]

Ian Mason, Sebastian Starke, and Taku Komura. Real-time style modelling of human locomotion via feature-wise trans- formations and local motion phases.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1– 18, 2022. 5

work page 2022
[48]

Em- body 3d: A large-scale multimodal motion and behavior dataset.arXiv preprint arXiv:2510.16258, 2025

Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Em- body 3d: A large-scale multimodal motion and behavior dataset.arXiv preprint arXiv:2510.16258, 2025. 5 10

work page arXiv 2025
[49]

Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression

Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, and Huaizu Jiang. Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27859–27871, 2025. 3

work page 2025
[50]

Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025

Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, and Xingang Wang. Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025. 2

work page arXiv 2025
[51]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

work page
[52]

Continuous autoregressive models with noise augmentation avoid error accumulation.arXiv preprint arXiv:2411.18447, 2024

Marco Pasini, Javier Nistal, Stefan Lattner, and George Fazekas. Continuous autoregressive models with noise augmentation avoid error accumulation.arXiv preprint arXiv:2411.18447, 2024. 5

work page arXiv 2024
[53]

Babel: Bodies, action and behavior with english la- bels

Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. Babel: Bodies, action and behavior with english la- bels. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 722–731, 2021. 5

work page 2021
[54]

Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017. 1

work page internal anchor Pith review Pith/arXiv arXiv 2017
[55]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InInternational confer- ence on machine learning, pages 8821–8831. Pmlr, 2021. 2

work page 2021
[56]

Continuous autoregressive language models.arXiv preprint arXiv:2510.27688, 2025

Chenze Shao, Darren Li, Fandong Meng, and Jie Zhou. Continuous autoregressive language models.arXiv preprint arXiv:2510.27688, 2025. 4, 2

work page arXiv 2025
[57]

World-grounded human motion recovery via gravity-view coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 5

work page 2024
[58]

Lmfusion: Adapting pretrained language models for multimodal gener- ation.arXiv preprint arXiv:2412.15188, 2024

Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili Yu. Lmfusion: Adapting pretrained language models for multimodal gener- ation.arXiv preprint arXiv:2412.15188, 2024. 2, 3

work page arXiv 2024
[59]

Mul- timodal latent language modeling with next-token diffusion

Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Mul- timodal latent language modeling with next-token diffusion. arXiv preprint arXiv:2412.08635, 2024. 4, 2

work page arXiv 2024
[60]

Omni-video: Democratizing uni- fied video understanding and generation.arXiv preprint arXiv:2507.06119, 2025

Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Meng- ping Yang, and Hao Li. Omni-video: Democratizing uni- fied video understanding and generation.arXiv preprint arXiv:2507.06119, 2025. 1

work page arXiv 2025
[61]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 1, 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711, 2025

NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711, 2025. 1, 2, 3, 4, 5, 6

work page arXiv 2025
[63]

Human Motion Diffusion Model

Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion dif- fusion model.arXiv preprint arXiv:2209.14916, 2022. 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2022
[64]

Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer

Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, et al. Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. arXiv preprint arXiv:2401.10208, 2024. 2, 3

work page arXiv 2024
[65]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[66]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2, 4

work page 2017
[67]

Cider: Consensus-based image description evalua- tion

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 7

work page 2015
[68]

Unirl-zero: Reinforcement learning on unified models with joint language model and diffusion model experts.arXiv preprint arXiv:2510.17937, 2025

Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, and Taesung Park. Unirl-zero: Reinforcement learning on unified models with joint language model and diffusion model experts.arXiv preprint arXiv:2510.17937, 2025. 2

work page arXiv 2025
[69]

You think, you act: The new task of arbitrary text to motion generation

Runqi Wang, Caoyuan Ma, Guopeng Li, Hanrui Xu, Yuke Li, and Zheng Wang. You think, you act: The new task of arbitrary text to motion generation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12012–12022, 2025. 2

work page 2025
[70]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 1, 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[71]

Motiongpt-2: A general-purpose motion- language model for motion generation and understanding

Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Yan Zhou, Pengfei Wan, Shixiang Tang, and Dan Xu. Motiongpt-2: A general-purpose motion- language model for motion generation and understanding. arXiv preprint arXiv:2410.21747, 2024. 2, 3, 8

work page arXiv 2024
[72]

Univideo: Unified understanding, generation, and editing for videos

Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. Univideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377, 2025. 1

work page arXiv 2025
[73]

Janus: Decoupling visual encod- ing for unified multimodal understanding and generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encod- ing for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025. 1, 2, 3

work page 2025
[74]

Motion-agent: A conversational framework for human motion generation with llms.arXiv preprint arXiv:2405.17013, 2024

Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. Motion-agent: A conversational framework for human motion generation with llms.arXiv preprint arXiv:2405.17013, 2024. 3 11

work page arXiv 2024
[75]

Mote: Learning motion-text diffusion model for multiple generation tasks.arXiv preprint arXiv:2411.19786,

Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, and Dong Xu. Mote: Learning motion-text diffusion model for multiple generation tasks.arXiv preprint arXiv:2411.19786,

work page arXiv
[76]

Motionstreamer: Streaming motion genera- tion via diffusion-based autoregressive model in causal latent space.arXiv preprint arXiv:2503.15451, 2025

Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, and Jingbo Wang. Motionstreamer: Streaming motion genera- tion via diffusion-based autoregressive model in causal latent space.arXiv preprint arXiv:2503.15451, 2025. 3, 7, 1, 2

work page arXiv 2025
[77]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024. 1, 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[78]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show- o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025. 1, 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[79]

X-streamer: Unified human world modeling with audiovisual interaction.arXiv preprint arXiv:2509.21574, 2025

You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guox- ian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, and Linjie Luo. X-streamer: Unified human world modeling with audiovisual interaction.arXiv preprint arXiv:2509.21574, 2025. 1

work page arXiv 2025
[80]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report.arXiv preprint arXiv:2503.20215, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.