SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control
read the original abstract
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
This paper has not been read by Pith yet.
Forward citations
Cited by 15 Pith papers
-
CEER: Compliant End-Effector and Root Control as a Unified Interface for Hierarchical Humanoid Loco-Manipulation
CEER proposes a compliant end-effector and root control interface that unifies loco-manipulation for humanoids via a distilled low-level policy and hierarchical planners.
-
VOFA: Visual Object Goal Pushing with Force-Adaptive Control for Humanoids
VOFA combines a high-level visuomotor policy with a low-level force-adaptive controller to let humanoids push objects up to 17 kg to arbitrary goals using only noisy onboard vision, achieving over 80% real-world success.
-
ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
-
Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
A diffusion-based motion generator combined with an RL motion tracker enables terrain-aware whole-body locomotion on a humanoid robot by adapting reference motions online from perception.
-
CLAW: Composable Language-Annotated Whole-body Motion Generation
CLAW composes motion primitives from a kinematic planner, tracks them with a low-level controller in MuJoCo to produce physically grounded trajectories, and generates segment- and trajectory-level language annotations...
-
Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation
Test-time steering of pre-trained whole-body policies via sample-based planning lets legged robots generalize dynamic loco-manipulation to varied heavy objects and tasks without additional training or tuning.
-
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.
-
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
HEX is a new framework with humanoid-aligned state representation, mixture-of-experts proprioceptive predictor, history tokens, and residual-gated fusion that achieves state-of-the-art success and generalization on re...
-
Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching
A modular system uses motion matching to compose long-horizon human skill chains, trains RL experts, and distills them into a depth-based policy that lets a Unitree G1 humanoid autonomously climb, vault, and roll over...
-
HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model
HAIC enables robust humanoid interactions with underactuated objects by predicting their dynamics from proprioceptive history and using a world model for adaptive control.
-
TeleGate: Whole-Body Humanoid Teleoperation via Gated Expert Selection with Motion Prior
TeleGate achieves high-precision real-time whole-body teleoperation of humanoid robots by dynamically gating between expert policies and using a VAE motion prior to infer future intent from history, outperforming dist...
-
HoloMotion-1 Technical Report
HoloMotion-1 trains a large Mixture-of-Experts Transformer policy on a hybrid corpus of video-reconstructed and MoCap motions to achieve robust zero-shot whole-body tracking that transfers directly to real humanoid robots.
-
HoloMotion-1 Technical Report
HoloMotion-1 trains a MoE Transformer policy on hybrid video and MoCap motion data to achieve robust zero-shot tracking that transfers directly to real humanoid robots.
-
Learning Versatile Humanoid Manipulation with Touch Dreaming
HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-r...
-
Tree Learning: A Multi-Skill Continual Learning Framework for Humanoid Robots
Tree Learning uses root-branch parameter inheritance and multi-modal adaptation to enable continual multi-skill learning in humanoid robots, achieving higher rewards and 100% retention versus joint training in Unity s...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.