SnapMoGen: Human Motion Generation from Expressive Texts
7 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (7)
roles: background (2)
polarities: background (2)
representative citing papers
MSCoT uses multi-scale hierarchical token prediction, multi-scale guidance, and a token refiner to deliver SOTA text-to-motion control with 48% FID gain, 61% lower error, and 10x faster inference on HumanML3D.
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
InterCMDM proposes a block-causal latent diffusion framework with dual-stream causal transformers and multi-task attention masks for autoregressive text-conditioned two-person interaction generation and reports SOTA results on InterHuman and Inter-X.
OMG is a diffusion model for omni-modal whole-body humanoid motion generation that uses language, audio, and reference motions after large-scale data curation to achieve state-of-the-art performance and adaptation.
citing papers explorer
- MoGeFlow: Flowing Through Motion Codebook Geometry for Text-to-Motion Generation
  MoGeFlow learns text-conditioned flows over PartVQ group-specific code embeddings to generate motions, achieving SOTA R-Precision on HumanML3D and KIT-ML while preserving discrete token validity.
- Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control
  MSCoT uses multi-scale hierarchical token prediction, multi-scale guidance, and a token refiner to deliver SOTA text-to-motion control with 48% FID gain, 61% lower error, and 10x faster inference on HumanML3D.
- Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
  IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
- LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
  LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
- InterCMDM: Block-Causal Diffusion for Autoregressive Human Interaction Generation
  InterCMDM proposes a block-causal latent diffusion framework with dual-stream causal transformers and multi-task attention masks for autoregressive text-conditioned two-person interaction generation and reports SOTA results on InterHuman and Inter-X.
- OMG: Omni-Modal Motion Generation for Generalist Humanoid Control
  OMG is a diffusion model for omni-modal whole-body humanoid motion generation that uses language, audio, and reference motions after large-scale data curation to achieve state-of-the-art performance and adaptation.
- Next-Scale Autoregressive Models for Text-to-Motion Generation