SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework
Pith reviewed 2026-05-21 07:14 UTC · model grok-4.3
The pith
SUGAR converts imperfect human videos into deployable humanoid loco-manipulation skills that transfer zero-shot to real hardware and improve with more data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SUGAR extracts human-object trajectories and contact labels from raw videos, feeds the imperfect priors into a privileged physics-based refiner that applies a unified mimic reward and progressive state pool to produce feasible high-fidelity skills, then distills those skills into a command generator plus command tracker policy. The resulting system achieves zero-shot real-world transfer, reliable closed-loop execution, autonomous failure recovery, and stable long-horizon behavior under perturbations across six representative loco-manipulation tasks while outperforming reference-tracking baselines and scaling clearly with video volume.
What carries the argument
The privileged physics-based refiner that applies a unified mimic reward and progressive state pool to convert imperfect kinematic priors from human videos into physically feasible high-fidelity skills.
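A minimal Python sketch may make the two named mechanisms easier to picture. It is not the authors' implementation: this summary gives neither the functional form of the mimic reward nor the update rule of the state pool. The exponential tracking kernel, the weights `w_body`, `w_obj`, `w_contact`, the scales `k_body`, `k_obj`, and the pool policy (store only states that a rollout actually reached, then reset from them) are all assumptions borrowed from common motion-imitation practice.

```python
import math
import random
from dataclasses import dataclass, field


def mimic_reward(body_err, obj_err, contact_match,
                 w_body=0.5, w_obj=0.3, w_contact=0.2,
                 k_body=2.0, k_obj=5.0):
    """Hypothetical unified reward: one formula shared by every task.

    body_err and obj_err are non-negative tracking errors against the
    video-derived prior; contact_match is the fraction of contact labels
    the simulated state agrees with (0..1). Weights and scales are
    illustrative, not values from the paper.
    """
    r_body = math.exp(-k_body * body_err)
    r_obj = math.exp(-k_obj * obj_err)
    return w_body * r_body + w_obj * r_obj + w_contact * contact_match


@dataclass
class ProgressiveStatePool:
    """Hypothetical pool of simulator states, keyed by reference frame.

    A frame is added only once a rollout has reached it, so resets never
    start from an infeasible pose copied straight from the noisy prior.
    """
    capacity_per_frame: int = 8
    states: dict = field(default_factory=dict)

    def add(self, frame, sim_state):
        bucket = self.states.setdefault(frame, [])
        bucket.append(sim_state)
        if len(bucket) > self.capacity_per_frame:
            bucket.pop(0)  # drop the oldest entry for this frame

    def frontier(self):
        """Latest frame reached so far (0 if the pool is empty)."""
        return max(self.states) if self.states else 0

    def sample_reset(self, rng=random):
        """Pick a stored (frame, state) pair; the pool must be non-empty."""
        frame = rng.choice(sorted(self.states))
        return frame, rng.choice(self.states[frame])
```

In this sketch the pool grows forward in time as training succeeds on earlier frames, which is one plausible reading of "progressive"; the paper may define it differently.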
If this is right
- Task performance improves steadily as the quantity of human video data grows.
- The method outperforms reference-tracking baselines on both simulation and real hardware.
- Zero-shot deployment succeeds with closed-loop execution and autonomous recovery from failures.
- Long-horizon loco-manipulation remains stable under external disturbances.
- The same pipeline works across six distinct tasks without per-task reward or reference tuning.
Where Pith is reading between the lines
- Internet-scale human videos could serve as the main training corpus for broad humanoid capabilities beyond the six tasks shown.
- The refinement stage might transfer to other robot morphologies if the mimic reward is adapted to their kinematics.
- Combining the pipeline with large public video datasets would allow continual improvement of deployed humanoid policies without new teleoperation sessions.
- Success here suggests video-driven refinement could reduce dependence on high-fidelity simulators for initial skill acquisition.
Load-bearing premise
The physics-based refiner can turn imperfect video-derived motion priors into skills that transfer reliably to real hardware without any further task-specific engineering.
What would settle it
Performance on the six tasks remains flat or declines when the volume of human video data is increased, or the distilled policies fail to execute closed-loop on real humanoid hardware under external perturbations.
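The first half of this test can be phrased as a slope check. The sketch below fits success rate against the logarithm of video count by ordinary least squares and reports whether the slope is positive. The log-linear form, the function names, and the `min_slope` threshold are assumptions made for illustration; the paper may measure scaling in another way, and no numbers from the paper are used here.

```python
import math


def log_linear_slope(video_counts, success_rates):
    """Least-squares slope of success rate against log10(video count)."""
    xs = [math.log10(n) for n in video_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(success_rates) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, success_rates))
    return sxy / sxx


def scaling_claim_survives(video_counts, success_rates, min_slope=0.0):
    """True if success rate rises with data volume; flat or falling fails."""
    return log_linear_slope(video_counts, success_rates) > min_slope
```

With placeholder inputs such as counts `[10, 100, 1000]` and rates `[0.5, 0.5, 0.5]`, the check returns `False`, which is the "remains flat" outcome described above.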
Original abstract
Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SUGAR, a three-stage data-driven framework for humanoid loco-manipulation from unstructured human videos. Stage 1 automatically extracts kinematic priors (trajectories and contact labels); Stage 2 applies a privileged physics-based refiner with a unified mimic reward and progressive state pool to convert noisy priors into feasible skills; Stage 3 distills the refined skills into a hierarchical policy (command generator + tracker) that runs autonomously at inference without reference conditioning or task-specific rewards. Evaluation on six representative tasks reports outperformance over reference-tracking baselines, clear scaling with video data volume, and zero-shot real-world transfer featuring closed-loop execution, autonomous failure recovery, and robustness to external perturbations.
Significance. If the results hold, SUGAR offers a scalable alternative to reward engineering and teleoperation by directly leveraging abundant human video data for generalizable whole-body skills. The simulation ablations, real-world success rates on the six tasks, and qualitative recovery examples provide direct support for the refiner's role in handling occlusion, contact artifacts, and retargeting errors. This strengthens the case for zero-shot transfer and long-horizon stability, addressing a key bottleneck in humanoid deployment.
major comments (1)
- The central claim that the refiner reliably converts imperfect kinematic priors into transferable skills is load-bearing; while simulation ablations and real-world rates are supplied, an explicit comparison of refiner output fidelity (e.g., contact accuracy or trajectory error) before versus after refinement on the same video set would further substantiate that the progressive state pool is the decisive mechanism rather than downstream policy training.
minor comments (2)
- Abstract: the claims 'substantially outperforms' and 'performance scales clearly' would be stronger with one or two concrete success-rate numbers or scaling slopes, even at a high level.
- The description of the command generator and tracker in the hierarchical policy could clarify the interface between them (e.g., what state is passed and at what frequency) to aid reproducibility.
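The kind of interface description the second minor comment asks for could look like the following Python sketch. Everything in it is hypothetical: the `Command` fields, the 10 Hz and 50 Hz rates, and the `run_hierarchy` loop are placeholders chosen to show what "state passed and at what frequency" means, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Command:
    """Hypothetical message passed from the generator to the tracker."""
    root_velocity: Sequence[float]
    hand_targets: Sequence[float]


def run_hierarchy(generator, tracker, observe, steps,
                  tracker_hz=50, generator_hz=10):
    """Run a slow command generator above a fast command tracker.

    generator(obs) -> Command is called once every
    tracker_hz // generator_hz ticks; tracker(obs, command) -> action
    is called on every tick with the most recent command.
    """
    if tracker_hz % generator_hz != 0:
        raise ValueError("tracker_hz must be a multiple of generator_hz")
    ticks_per_command = tracker_hz // generator_hz
    command = None
    actions = []
    for tick in range(steps):
        obs = observe(tick)
        if tick % ticks_per_command == 0:
            command = generator(obs)  # refresh the high-level command
        actions.append(tracker(obs, command))
    return actions
```

Stating the message type and the two rates explicitly, as this sketch does, is what would let a reader reimplement the hand-off between the two policies.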
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation. We address the single major comment below and will incorporate the requested analysis.
- Referee: The central claim that the refiner reliably converts imperfect kinematic priors into transferable skills is load-bearing; while simulation ablations and real-world rates are supplied, an explicit comparison of refiner output fidelity (e.g., contact accuracy or trajectory error) before versus after refinement on the same video set would further substantiate that the progressive state pool is the decisive mechanism rather than downstream policy training.
  Authors: We agree that an explicit before-versus-after fidelity comparison on the same video-derived priors would more directly isolate the refiner's contribution. In the revised manuscript we will add quantitative results (contact-state precision/recall and position/velocity MSE) computed on the identical set of extracted kinematic priors before and after the physics-based refiner. This new analysis will be placed in the ablation section alongside the existing policy-level ablations to clarify the role of the progressive state pool and unified mimic reward.
  Revision: yes
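The metrics named in the response are standard and can be sketched in a few lines of Python. The rebuttal does not say what the priors and refined skills are scored against, so the `reference` argument below (for example, manually annotated contacts or motion-capture trajectories) is an assumption, as are the function names and the convention of returning 0.0 when a denominator is empty.

```python
def contact_precision_recall(predicted, reference):
    """Per-frame contact precision and recall for boolean label sequences."""
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mse(values, reference):
    """Mean squared error between two equal-length numeric sequences."""
    return sum((v - r) ** 2 for v, r in zip(values, reference)) / len(values)


def fidelity_report(prior, refined, reference):
    """Score the same clip before and after refinement.

    Each argument is a dict with a 'contacts' list of booleans and a
    'positions' list of floats (a hypothetical flattened layout).
    """
    report = {}
    for name, clip in (("before", prior), ("after", refined)):
        p, r = contact_precision_recall(clip["contacts"],
                                        reference["contacts"])
        report[name] = {
            "precision": p,
            "recall": r,
            "position_mse": mse(clip["positions"], reference["positions"]),
        }
    return report
```

Running `fidelity_report` on identical clips before and after the refiner gives the side-by-side numbers the referee requested; velocity MSE would follow the same pattern on finite-differenced positions.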
Circularity Check
No significant circularity
The SUGAR framework is a three-stage empirical pipeline that extracts kinematic priors from external unstructured human videos, refines them via a privileged physics simulator using a unified mimic reward and progressive state pool, and distills the results into a hierarchical policy. All performance claims (zero-shot transfer, scaling with video volume, outperformance of reference-tracking baselines, and closed-loop recovery) are backed by direct simulation ablations, real-world success rates on six tasks, and qualitative examples rather than any internal equation or self-citation that reduces the outcome to a quantity defined by the paper's own fitted parameters. The derivation therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Human videos contain extractable kinematic interaction priors (trajectories and contact labels) despite occlusion and retargeting artifacts.
- domain assumption: A privileged physics-based refiner with unified mimic reward and progressive state pool can produce high-fidelity, deployable skills from imperfect priors.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: SUGAR proceeds in three stages: ... privileged physics-based refiner utilizes a unified mimic-style reward and a progressive state pool ... hierarchical policy: a high-level diffusion policy command generator ... low-level whole-body command tracker
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: performance scales clearly with the amount of human video data ... zero-shot real-world transfer
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.