SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Chenran Li; Cyrus Hogg; David Minor; David Sami; Edy Lim; Eugene Jeong; Fernando Castañeda; Haoru Xue; Jan Kautz; Jiefeng Li

arxiv: 2511.07820 · v3 · pith:T3RKL6MLnew · submitted 2025-11-11 · 💻 cs.RO · cs.AI · cs.CV · cs.GR · cs.SY · eess.SY

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Fernando Castañeda, Sirui Chen, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Jinhyung Park, David Sami, Zi Wang, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu

keywords motion · tracking · control · humanoid · scaling · compute · data · foundation

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
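
The scale figures quoted in the abstract can be related to one another with a little arithmetic. The sketch below is only a sanity check on those numbers, not anything from the paper itself: it takes "100M+ frames" as a lower bound of exactly 100 million (an assumption), which implies a capture rate of at least roughly 40 frames per second over 700 hours, and it computes the 35x spread between the smallest and largest network sizes. All variable names are illustrative.

```python
# Figures quoted in the abstract. The frame count is given as "100M+",
# so 100 million is treated here as a lower bound (an assumption).
frames_lower_bound = 100_000_000
mocap_hours = 700
params_small = 1.2e6   # smallest network size mentioned
params_large = 42e6    # largest network size mentioned

# Convert hours of motion capture to seconds.
mocap_seconds = mocap_hours * 3600

# Minimum average frame rate consistent with the quoted totals.
implied_fps = frames_lower_bound / mocap_seconds

# Ratio between the largest and smallest parameter counts.
param_ratio = params_large / params_small

print(f"implied frame rate: at least {implied_fps:.1f} frames/s")
print(f"parameter range spans about {param_ratio:.0f}x")
```

The implied rate is a floor rather than the dataset's actual frame rate, since the abstract does not state the exact frame count.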

Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CEER: Compliant End-Effector and Root Control as a Unified Interface for Hierarchical Humanoid Loco-Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

CEER proposes a compliant end-effector and root control interface that unifies loco-manipulation for humanoids via a distilled low-level policy and hierarchical planners.
VOFA: Visual Object Goal Pushing with Force-Adaptive Control for Humanoids
cs.RO 2026-05 unverdicted novelty 6.0

VOFA combines a high-level visuomotor policy with a low-level force-adaptive controller to let humanoids push objects up to 17 kg to arbitrary goals using only noisy onboard vision, achieving over 80% real-world success.
ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
cs.RO 2026-04 unverdicted novelty 6.0

ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
cs.RO 2026-04 unverdicted novelty 6.0

A diffusion-based motion generator combined with an RL motion tracker enables terrain-aware whole-body locomotion on a humanoid robot by adapting reference motions online from perception.
CLAW: Composable Language-Annotated Whole-body Motion Generation
cs.RO 2026-04 accept novelty 6.0

CLAW composes motion primitives from a kinematic planner, tracks them with a low-level controller in MuJoCo to produce physically grounded trajectories, and generates segment- and trajectory-level language annotations...
Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

Test-time steering of pre-trained whole-body policies via sample-based planning lets legged robots generalize dynamic loco-manipulation to varied heavy objects and tasks without additional training or tuning.
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

HEX is a new framework with humanoid-aligned state representation, mixture-of-experts proprioceptive predictor, history tokens, and residual-gated fusion that achieves state-of-the-art success and generalization on re...
Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching
cs.RO 2026-02 unverdicted novelty 6.0

A modular system uses motion matching to compose long-horizon human skill chains, trains RL experts, and distills them into a depth-based policy that lets a Unitree G1 humanoid autonomously climb, vault, and roll over...
HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model
cs.RO 2026-02 unverdicted novelty 6.0

HAIC enables robust humanoid interactions with underactuated objects by predicting their dynamics from proprioceptive history and using a world model for adaptive control.
TeleGate: Whole-Body Humanoid Teleoperation via Gated Expert Selection with Motion Prior
cs.RO 2026-02 unverdicted novelty 6.0

TeleGate achieves high-precision real-time whole-body teleoperation of humanoid robots by dynamically gating between expert policies and using a VAE motion prior to infer future intent from history, outperforming dist...
HoloMotion-1 Technical Report
cs.RO 2026-05 unverdicted novelty 5.0

HoloMotion-1 trains a large Mixture-of-Experts Transformer policy on a hybrid corpus of video-reconstructed and MoCap motions to achieve robust zero-shot whole-body tracking that transfers directly to real humanoid robots.
HoloMotion-1 Technical Report
cs.RO 2026-05 unverdicted novelty 5.0

HoloMotion-1 trains a MoE Transformer policy on hybrid video and MoCap motion data to achieve robust zero-shot tracking that transfers directly to real humanoid robots.
Learning Versatile Humanoid Manipulation with Touch Dreaming
cs.RO 2026-04 conditional novelty 5.0

HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-r...
Tree Learning: A Multi-Skill Continual Learning Framework for Humanoid Robots
cs.RO 2026-04 unverdicted novelty 5.0

Tree Learning uses root-branch parameter inheritance and multi-modal adaptation to enable continual multi-skill learning in humanoid robots, achieving higher rewards and 100% retention versus joint training in Unity s...