ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
arXiv preprint arXiv:2506.24086 , year=
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
SCRIPT presents a scalable diffusion policy with JAST-DiT architecture, nonlinear history conditioning, and RLHR post-training that claims to outperform prior methods on text alignment, motion quality, and physical realism while scaling on a 1200-hour dataset.
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
MLA-Gen advances text-driven motion synthesis by aligning global motion patterns with fine-grained text semantics and mitigating attention sink effects via new masking techniques.
Proposes LoRA-based mixture-of-experts with autoencoder routing for continual bidirectional motion-language learning, reporting near-zero forgetting on a 5-task HumanML3D benchmark derived via semantic clustering.
UMo presents a sparse MoE-based unified model for real-time co-speech avatar animation that claims superior quality under latency constraints via keyframe-centric design and multi-stage audio-augmented training.
citing papers explorer
-
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.