Recognition: 1 theorem link
· Lean TheoremLLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
Pith reviewed 2026-05-16 04:58 UTC · model grok-4.3
The pith
LLaMo extends pretrained language models with a Mixture-of-Transformers design to unify text-to-motion generation and motion captioning using continuous tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaMo extends pretrained LLMs through a modality-specific Mixture-of-Transformers architecture that encodes human motion into a causal continuous latent space while preserving the next-token prediction paradigm via a lightweight flow-matching head, enabling real-time streaming motion generation above 30 FPS and delivering high-fidelity text-to-motion generation plus motion-to-text captioning without catastrophic forgetting of linguistic capabilities.
What carries the argument
Modality-specific Mixture-of-Transformers (MoT) architecture paired with continuous autoregressive tokens and a flow-matching head for next-token prediction.
If this is right
- Real-time text-to-motion generation runs above 30 FPS in a streaming manner.
- Zero-shot motion generation works in general settings without task-specific fine-tuning.
- Motion-to-text captioning and text-to-motion generation are handled inside the same unified model.
- Continuous latent encoding removes jitter artifacts that come from discrete motion tokenization.
Where Pith is reading between the lines
- The same continuous-token plus flow-matching pattern could be tested on other sequential modalities such as audio or 3D video.
- Scaling the base LLM size further might improve zero-shot performance on long or complex motion sequences.
- The architecture might generalize to joint training on multiple motion datasets without separate quantization steps for each.
Load-bearing premise
The modality-specific Mixture-of-Transformers structure keeps the base language model's understanding intact while still allowing effective adaptation to motion data.
What would settle it
Measure language-only task performance on the base LLM before and after full LLaMo training on motion-text pairs; a large drop would falsify the preservation claim.
Figures
read the original abstract
Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LLaMo, a unified framework extending pretrained LLMs via a modality-specific Mixture-of-Transformers (MoT) architecture. Motion is encoded into a causal continuous latent space with a lightweight flow-matching head to preserve the next-token prediction paradigm, enabling real-time streaming generation. The central claims are that this design inherently avoids catastrophic forgetting of linguistic capabilities during large-scale motion-text pretraining and achieves high-fidelity text-to-motion generation plus motion-to-text captioning, with particular strength in zero-shot settings.
Significance. If the experimental claims are substantiated, the work would advance unified motion-language modeling by demonstrating scalable multimodal adaptation of LLMs without discrete quantization artifacts or loss of base capabilities, with practical benefits for real-time (>30 FPS) generation. The continuous autoregressive token approach and MoT routing represent a potentially generalizable direction for other modalities, though the absence of reported language-benchmark retention metrics limits assessment of the preservation claim.
major comments (3)
- [Abstract / §4] Abstract and §4 (Experiments): The claim that the MoT architecture 'inherently preserves the language understanding of the base model' is presented as a design property but is unsupported by any quantitative comparison (e.g., GLUE, MMLU, or zero-shot reasoning scores) between the original LLM and the post-adaptation LLaMo model. Large-scale motion-text pretraining could still induce forgetting even with routing, and no such results are provided.
- [Abstract] Abstract: The statements of 'high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation' are made without any reported metrics, baselines, ablation studies, or dataset details. This leaves the central performance claims without verifiable support in the manuscript as described.
- [§3] §3 (Method): The continuous flow-matching head is introduced to maintain the decoder-only next-token paradigm, but no derivation or analysis is given showing how this head interacts with the discrete language token predictions to guarantee retention of original LLM behavior after joint training.
minor comments (1)
- [§3] Notation for the continuous latent space and flow-matching head should be defined more explicitly with equations to clarify the autoregressive generation process.
Simulated Author's Rebuttal
We appreciate the referee's thorough review and constructive suggestions. We will revise the manuscript to provide quantitative evidence for our claims and additional analysis as requested.
read point-by-point responses
-
Referee: [Abstract / §4] Abstract and §4 (Experiments): The claim that the MoT architecture 'inherently preserves the language understanding of the base model' is presented as a design property but is unsupported by any quantitative comparison (e.g., GLUE, MMLU, or zero-shot reasoning scores) between the original LLM and the post-adaptation LLaMo model. Large-scale motion-text pretraining could still induce forgetting even with routing, and no such results are provided.
Authors: We agree that empirical validation is important to substantiate the preservation claim. Although the MoT design routes language inputs exclusively through the frozen base LLM parameters, we will add quantitative comparisons on language benchmarks such as GLUE, MMLU, and zero-shot reasoning tasks in the revised §4 to demonstrate retention of linguistic capabilities post-adaptation. revision: yes
-
Referee: [Abstract] Abstract: The statements of 'high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation' are made without any reported metrics, baselines, ablation studies, or dataset details. This leaves the central performance claims without verifiable support in the manuscript as described.
Authors: The full manuscript in §4 provides detailed experimental results, including quantitative metrics (e.g., FID, R-Precision for text-to-motion; BLEU, CIDEr for motion-to-text), comparisons against baselines like MDM and MotionGPT, ablation studies on the MoT and flow-matching components, and dataset details (HumanML3D, KIT-Motion-Language). We will revise the abstract to include key numerical results and explicit references to these sections for better verifiability. revision: partial
-
Referee: [§3] §3 (Method): The continuous flow-matching head is introduced to maintain the decoder-only next-token paradigm, but no derivation or analysis is given showing how this head interacts with the discrete language token predictions to guarantee retention of original LLM behavior after joint training.
Authors: We will expand §3 with a formal analysis and derivation. Specifically, we will show that the flow-matching head predicts continuous motion latents in an autoregressive manner using a separate output projection, while language token prediction uses the original LLM head and loss. The MoT architecture ensures that language token embeddings and attention are processed only by the base experts, isolating gradients and preserving the original next-token prediction objective for text. This will include equations demonstrating the separation of modalities during joint training. revision: yes
Circularity Check
No circularity: claims rely on architectural proposal and empirical results, not self-referential reductions
full rationale
The paper introduces LLaMo via a modality-specific Mixture-of-Transformers extension to pretrained LLMs, continuous latent motion encoding, and a flow-matching head for next-token prediction. No equations, derivations, or self-citations are shown that reduce the central claims (preservation of language capabilities, zero-shot performance) to fitted parameters or prior self-work by construction. The 'inherently preserves' statement is presented as a design property of the new architecture rather than a derived equivalence to inputs. This is a standard model-extension approach without internal circular logic in the provided derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Next-token prediction paradigm remains valid when a flow-matching head is added to a decoder-only LLM backbone for motion sequences
invented entities (1)
-
Mixture-of-Transformers (MoT) architecture
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
modality-specific Mixture-of-Transformers (MoT) architecture... freezing the text-related modules and updating the motion-specific parameters only... continuous causal latent space and models the next-token distribution... through a flow-matching head
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
IAM: Identity-Aware Human Motion and Shape Joint Generation
IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.
Reference graph
Works this paper leans on
-
[1]
Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing
Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, et al. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. InProceedings of the 60th an- nual meeting of the association for computational linguistics (volume 1: Long papers), pages 5723–5738, 2022. 5, 1
work page 2022
-
[2]
Nikos Athanasiou, Alp ´ar Ceske, Markos Diomataris, Michael J. Black, and G ¨ul Varol. MotionFix: Text-driven 3d human motion editing. InSIGGRAPH Asia 2024 Confer- ence Papers, 2024. 8
work page 2024
-
[3]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020. 2
work page 1901
-
[4]
Motionctrl: A real-time controllable vision-language-motion model
Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, and Zongqing Lu. Motionctrl: A real-time controllable vision-language-motion model. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 12253–12262, 2025. 3, 8
work page 2025
-
[5]
Executing your commands via motion diffusion in latent space
Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023. 7
work page 2023
-
[6]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Dis- cord: Discrete tokens to continuous motion via rectified flow decoding
Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, and Youngjae Yu. Dis- cord: Discrete tokens to continuous motion via rectified flow decoding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14602–14612, 2025. 2, 6
work page 2025
-
[8]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [9]
-
[10]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 1, 3, 6, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Motion question answering via modular motion programs
Mark Endo, Joy Hsu, Jiaman Li, and Jiajun Wu. Motion question answering via modular motion programs. InIn- ternational Conference on Machine Learning, pages 9312–
-
[12]
Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data.arXiv preprint arXiv:2507.07095,
-
[13]
Humocon: Concept discovery for hu- man motion understanding
Qihang Fang, Chengcheng Tang, Bugra Tekin, Shugao Ma, and Yanchao Yang. Humocon: Concept discovery for hu- man motion understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7179–7190, 2025. 2
work page 2025
-
[14]
Learning complex 3d human self-contact
Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, and Cristian Sminchisescu. Learning complex 3d human self-contact. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1343– 1351, 2021. 5
work page 2021
-
[15]
Generating diverse and natural 3d human motions from text
Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022. 2, 3, 5, 7, 8
work page 2022
-
[16]
Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal gener- ation of 3d human motions and texts. InECCV, 2022. 7
work page 2022
-
[17]
Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal genera- tion of 3d human motions and texts. InEuropean Conference on Computer Vision, pages 580–597. Springer, 2022. 2, 7, 8
work page 2022
-
[18]
Momask: Generative masked model- ing of 3d human motions
Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked model- ing of 3d human motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024. 3, 7
work page 1900
-
[19]
Snapmogen: Human motion generation from expressive texts.arXiv preprint arXiv:2507.09122, 2025
Chuan Guo, Inwoo Hwang, Jian Wang, and Bing Zhou. Snapmogen: Human motion generation from expressive texts.arXiv preprint arXiv:2507.09122, 2025. 8, 2
-
[20]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Gaussian Error Linear Units (GELUs)
D Hendrycks. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016. 1
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[22]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020. 8 9
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[23]
Hmvlm: Human motion-vision-lanuage model via moe lora.arXiv preprint arXiv:2511.01463, 2025
Lei Hu, Yongjing Ye, and Shihong Xia. Hmvlm: Human motion-vision-lanuage model via moe lora.arXiv preprint arXiv:2511.01463, 2025. 2, 3
-
[24]
Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign lan- guage.Advances in Neural Information Processing Systems, 36:20067–20079, 2023. 2, 3, 5, 7, 8
work page 2023
-
[25]
Motionchain: Conversational motion controllers via multimodal prompts
Biao Jiang, Xin Chen, Chi Zhang, Fukun Yin, Zhuoyuan Li, Gang Yu, and Jiayuan Fan. Motionchain: Conversational motion controllers via multimodal prompts. InEuropean Conference on Computer Vision, pages 54–74. Springer,
-
[26]
Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, and Ziwei Liu. Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26887– 26898, 2025. 2
work page 2025
-
[27]
Scaling up dynamic human-scene interaction mod- eling
Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan Huang. Scaling up dynamic human-scene interaction mod- eling. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 1737–1747,
-
[28]
Guolin Ke and Hui Xue. Hyperspherical latents improve continuous-token autoregressive generation.arXiv preprint arXiv:2509.24335, 2025. 4, 2
-
[29]
Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, and Basura Fernando. Imore: Implicit program-guided reasoning for human mo- tion q&a.arXiv preprint arXiv:2508.01984, 2025. 8
-
[30]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InPro- ceedings of the 40th International Conference on Machine Learning, pages 19730–19742. PMLR, 2023. 2
work page 2023
-
[31]
Neural speech synthesis with transformer network
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. InProceedings of the AAAI conference on artificial intelli- gence, pages 6706–6713, 2019. 5, 1
work page 2019
-
[32]
Finedance: A fine-grained choreography dataset for 3d full body dance generation
Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 10234– 10243, 2023. 5
work page 2023
-
[33]
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vec- tor quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024. 2, 3, 5, 6, 1
work page 2024
-
[34]
Lamp: Language-motion pretraining for motion generation, retrieval, and captioning
Zhe Li, Weihao Yuan, Yisheng He, Lingteng Qiu, Shen- hao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong, Zi- long Dong, and Laurence T Yang. Lamp: Language-motion pretraining for motion generation, retrieval, and captioning. arXiv preprint arXiv:2410.07093, 2024. 7
-
[35]
Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi-human motion genera- tion under complex interactions.International Journal of Computer Vision, 132(9):3463–3483, 2024. 5
work page 2024
-
[36]
Mogao: An omni foundation model for interleaved multi-modal generation
Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025. 3
-
[37]
Animationgpt:an aigc tool for generating game combat motion assets.https : / / github
Yihao Liao, Yiyu Fu, Ziming Cheng, and Jiangfeiyang Wang. Animationgpt:an aigc tool for generating game combat motion assets.https : / / github . com / fyyakaxyy/AnimationGPT, 2024. 5
work page 2024
-
[38]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 7
work page 2004
-
[39]
Motion-x: A large-scale 3d expressive whole-body human motion dataset
Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36: 25268–25280, 2023. 3, 5
work page 2023
-
[40]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 2, 5
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[41]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2, 6
work page 2023
-
[42]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 5
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 1
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[44]
arXiv preprint arXiv:2206.08916 , year=
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mot- taghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks.arXiv preprint arXiv:2206.08916, 2022. 2, 3
-
[45]
Scamo: Exploring the scaling law in au- toregressive motion generation model
Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, and Ruimao Zhang. Scamo: Exploring the scaling law in au- toregressive motion generation model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27872–27882, 2025. 3
work page 2025
-
[46]
Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autore- gression and rectified flow for unified multimodal under- standing and generation. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 7739–7751,
-
[47]
Ian Mason, Sebastian Starke, and Taku Komura. Real-time style modelling of human locomotion via feature-wise trans- formations and local motion phases.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1– 18, 2022. 5
work page 2022
-
[48]
Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Em- body 3d: A large-scale multimodal motion and behavior dataset.arXiv preprint arXiv:2510.16258, 2025. 5 10
-
[49]
Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, and Huaizu Jiang. Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27859–27871, 2025. 3
work page 2025
-
[50]
Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, and Xingang Wang. Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025. 2
-
[51]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,
-
[52]
Marco Pasini, Javier Nistal, Stefan Lattner, and George Fazekas. Continuous autoregressive models with noise augmentation avoid error accumulation.arXiv preprint arXiv:2411.18447, 2024. 5
-
[53]
Babel: Bodies, action and behavior with english la- bels
Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. Babel: Bodies, action and behavior with english la- bels. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 722–731, 2021. 5
work page 2021
-
[54]
Searching for Activation Functions
Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017. 1
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[55]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InInternational confer- ence on machine learning, pages 8821–8831. Pmlr, 2021. 2
work page 2021
-
[56]
Continuous autoregressive language models.arXiv preprint arXiv:2510.27688, 2025
Chenze Shao, Darren Li, Fandong Meng, and Jie Zhou. Continuous autoregressive language models.arXiv preprint arXiv:2510.27688, 2025. 4, 2
-
[57]
World-grounded human motion recovery via gravity-view coordinates
Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 5
work page 2024
-
[58]
Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili Yu. Lmfusion: Adapting pretrained language models for multimodal gener- ation.arXiv preprint arXiv:2412.15188, 2024. 2, 3
-
[59]
Mul- timodal latent language modeling with next-token diffusion
Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Mul- timodal latent language modeling with next-token diffusion. arXiv preprint arXiv:2412.08635, 2024. 4, 2
-
[60]
Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Meng- ping Yang, and Hao Li. Omni-video: Democratizing uni- fied video understanding and generation.arXiv preprint arXiv:2507.06119, 2025. 1
-
[61]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 1, 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711, 2025. 1, 2, 3, 4, 5, 6
-
[63]
Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion dif- fusion model.arXiv preprint arXiv:2209.14916, 2022. 3, 7
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[64]
Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer
Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, et al. Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. arXiv preprint arXiv:2401.10208, 2024. 2, 3
-
[65]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[66]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2, 4
work page 2017
-
[67]
Cider: Consensus-based image description evalua- tion
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 7
work page 2015
-
[68]
Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, and Taesung Park. Unirl-zero: Reinforcement learning on unified models with joint language model and diffusion model experts.arXiv preprint arXiv:2510.17937, 2025. 2
-
[69]
You think, you act: The new task of arbitrary text to motion generation
Runqi Wang, Caoyuan Ma, Guopeng Li, Hanrui Xu, Yuke Li, and Zheng Wang. You think, you act: The new task of arbitrary text to motion generation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12012–12022, 2025. 2
work page 2025
-
[70]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 1, 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[71]
Motiongpt-2: A general-purpose motion- language model for motion generation and understanding
Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Yan Zhou, Pengfei Wan, Shixiang Tang, and Dan Xu. Motiongpt-2: A general-purpose motion- language model for motion generation and understanding. arXiv preprint arXiv:2410.21747, 2024. 2, 3, 8
-
[72]
Univideo: Unified understanding, generation, and editing for videos
Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. Univideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377, 2025. 1
-
[73]
Janus: Decoupling visual encod- ing for unified multimodal understanding and generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encod- ing for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025. 1, 2, 3
work page 2025
-
[74]
Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. Motion-agent: A conversational framework for human motion generation with llms.arXiv preprint arXiv:2405.17013, 2024. 3 11
-
[75]
Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, and Dong Xu. Mote: Learning motion-text diffusion model for multiple generation tasks.arXiv preprint arXiv:2411.19786,
-
[76]
Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, and Jingbo Wang. Motionstreamer: Streaming motion genera- tion via diffusion-based autoregressive model in causal latent space.arXiv preprint arXiv:2503.15451, 2025. 3, 7, 1, 2
-
[77]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024. 1, 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[78]
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show- o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025. 1, 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[79]
You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guox- ian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, and Linjie Luo. X-streamer: Unified human world modeling with audiovisual interaction.arXiv preprint arXiv:2509.21574, 2025. 1
-
[80]
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report.arXiv preprint arXiv:2503.20215, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.