arxiv: 2605.20373 · v1 · pith:JSOQ4HPSnew · submitted 2026-05-19 · 💻 cs.RO · cs.AI · cs.CV

SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

Pith reviewed 2026-05-21 07:14 UTC · model grok-4.3

classification: 💻 cs.RO · cs.AI · cs.CV
keywords: humanoid loco-manipulation · human video learning · physics-based refinement · zero-shot transfer · imitation learning · generalizable policies · robot skill from video · autonomous humanoid control

The pith

SUGAR converts imperfect human videos into deployable humanoid loco-manipulation skills that transfer zero-shot to real hardware and improve with more data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a three-stage process that extracts kinematic priors from unstructured human videos, refines them into physically valid motions via a privileged simulator, and distills the results into an autonomous hierarchical policy. This pipeline runs without task-specific reward design or reference tracking at deployment time. A sympathetic reader would care because it replaces labor-intensive teleoperation and manual engineering with scalable video data, potentially letting humanoid robots acquire diverse whole-body skills from everyday recordings. If the approach holds, performance on complex loco-manipulation tasks would grow steadily as more human videos become available while maintaining closed-loop robustness on physical robots.

Core claim

SUGAR extracts human-object trajectories and contact labels from raw videos, feeds the imperfect priors into a privileged physics-based refiner that applies a unified mimic reward and progressive state pool to produce feasible high-fidelity skills, then distills those skills into a command generator plus command tracker policy. The resulting system achieves zero-shot real-world transfer, reliable closed-loop execution, autonomous failure recovery, and stable long-horizon behavior under perturbations across six representative loco-manipulation tasks while outperforming reference-tracking baselines and scaling clearly with video volume.
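
To make the data flow between the three stages concrete, here is a minimal Python sketch. It is not the authors' code: the class names, field layouts, and the `run_pipeline` helper are hypothetical, and the choice to drop priors that the refiner cannot make feasible is an assumption of this sketch rather than something the paper states.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class KinematicPrior:
    """Stage 1 output: video-derived motion, possibly physically infeasible."""
    human_traj: List[List[float]]   # one pose vector per frame (layout is illustrative)
    object_traj: List[List[float]]  # one object pose vector per frame
    contacts: List[bool]            # one human-object contact label per frame


@dataclass
class RefinedSkill:
    """Stage 2 output: a rollout produced by the physics-based refiner."""
    states: List[List[float]]
    actions: List[List[float]]


def run_pipeline(
    videos: Sequence[str],
    extract: Callable[[str], KinematicPrior],
    refine: Callable[[KinematicPrior], Optional[RefinedSkill]],
    distill: Callable[[List[RefinedSkill]], object],
):
    """Chain the three stages; priors the refiner rejects (None) are dropped."""
    priors = [extract(v) for v in videos]                        # stage 1
    skills = [s for s in map(refine, priors) if s is not None]   # stage 2
    return distill(skills)                                       # stage 3
```

Passing the stages in as callables keeps the sketch independent of any particular pose estimator, simulator, or policy class, none of which are specified on this page.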

What carries the argument

The privileged physics-based refiner that applies a unified mimic reward and progressive state pool to convert imperfect kinematic priors from human videos into physically feasible high-fidelity skills.
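
The page does not give the form of the unified mimic reward. As a rough illustration only, the sketch below uses a generic imitation-style reward: a product of exponentiated tracking errors over body pose, object pose, and contact labels. The function name, the three-term structure, and the weights are all assumptions, not the paper's definition.

```python
import math
from typing import Sequence


def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mimic_reward(
    body: Sequence[float], body_ref: Sequence[float],
    obj: Sequence[float], obj_ref: Sequence[float],
    contact: Sequence[bool], contact_ref: Sequence[bool],
    w_body: float = 1.0, w_obj: float = 1.0, w_contact: float = 1.0,
) -> float:
    """Hypothetical tracking reward in (0, 1]; equals 1.0 on a perfect match.

    Multiplying the terms means a large error in any one of them (for
    example, missing a contact) pulls the whole reward toward zero.
    """
    contact_mismatches = sum(c != r for c, r in zip(contact, contact_ref))
    return (
        math.exp(-w_body * squared_error(body, body_ref))
        * math.exp(-w_obj * squared_error(obj, obj_ref))
        * math.exp(-w_contact * contact_mismatches)
    )
```

A reward of this shape would penalize a policy that copies the body motion but never touches the object, which is the failure mode the Figure 5 caption attributes to training without interaction rewards.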

If this is right

  • Task performance improves steadily as the quantity of human video data grows.
  • The method outperforms reference-tracking baselines on both simulation and real hardware.
  • Zero-shot deployment succeeds with closed-loop execution and autonomous recovery from failures.
  • Long-horizon loco-manipulation remains stable under external disturbances.
  • The same pipeline works across six distinct tasks without per-task reward or reference tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Internet-scale human videos could serve as the main training corpus for broad humanoid capabilities beyond the six tasks shown.
  • The refinement stage might transfer to other robot morphologies if the mimic reward is adapted to their kinematics.
  • Combining the pipeline with large public video datasets would allow continual improvement of deployed humanoid policies without new teleoperation sessions.
  • Success here suggests video-driven refinement could reduce dependence on high-fidelity simulators for initial skill acquisition.

Load-bearing premise

The physics-based refiner can turn imperfect video-derived motion priors into skills that transfer reliably to real hardware without any further task-specific engineering.

What would settle it

Performance on the six tasks remains flat or declines when the volume of human video data is increased, or the distilled policies fail to execute closed-loop on real humanoid hardware under external perturbations.
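
One simple way to operationalize the first half of this test is to fit a least-squares slope of success rate against the logarithm of the number of training videos and check its sign. The sketch below is an assumption about how such a check could be run; the helper name and the use of a log2 axis are illustrative choices, and any numbers fed to it would have to come from the paper's own tables.

```python
import math
from typing import Sequence


def scaling_slope(num_videos: Sequence[int], success_rates: Sequence[float]) -> float:
    """Least-squares slope of success rate versus log2(number of videos).

    A clearly positive slope is consistent with the scaling claim;
    a slope near zero or below would count against it.
    """
    if len(num_videos) != len(success_rates) or len(num_videos) < 2:
        raise ValueError("need at least two paired measurements")
    xs = [math.log2(n) for n in num_videos]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(success_rates) / len(success_rates)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all data sizes are identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, success_rates))
    return cov_xy / var_x
```

With this parameterization the slope reads as the change in success rate per doubling of the video count.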

Figures

Figures reproduced from arXiv: 2605.20373 by Hang Ye, Hao Dong, Jia Li, Qize Yu, Tianshu Wu, Xiangqi Kong, Yizhou Wang, Yue Chen.

Figure 1: SUGAR enables generalizable real-world humanoid loco-manipulation from diverse human videos. We deploy SUGAR on a Unitree G1 humanoid across six representative whole-body interaction tasks: (a) Push Box, (b) Pick Bottle, (c) Carry Box, (d) Sit Chair, (e) Kick Box, and (f1, f2) Pick Bottle under external human disturbances. Diverse human videos [Wang et al., 2026a, Mao et al., 2024, Yang et al., 2026a, Weng…
Figure 2: Overview of SUGAR. Our approach consists of three stages: (1) extracting kinematic interaction priors from unstructured human videos through a fully automated pipeline; (2) refining the priors into physically feasible skills with a privileged RL policy; and (3) training a hierarchical autonomous policy on the refined demonstrations for robust humanoid loco-manipulation.
Figure 3: The Training Pipeline of SUGAR. (Left) The Refiner π_r transforms noisy kinematic priors τ̂ ∈ P into physically feasible expert demonstrations τ ∈ R using privileged RL. (Middle) The Tracker π_t distills motor skills from the Refiner via behavior cloning and reinforcement learning to achieve robust command-tracking. (Right) The Generator π_g is trained via imitation learning on the rollout dataset D to predic…
Figure 4: Performance with different training data sizes. Success rates, evaluated on both the train and test datasets, consistently improve as the amount of training data increases.
Figure 5: Qualitative results: Carry Box. (a) Our method stably lifts the box. (b) Without interaction rewards (w/o IR), the policy only imitates the bending motion and fails to lift the box. (c) Without interaction robustness enhancement (w/o IRE), the interaction is less robust and causes failure.
Figure 6: Recover from failure. Interference*1, Interference*2, Interference*3.
Figure 7: Robustness to external disturbances in the real world. (a) Carry Box, (b) Sit Chair, (c) Kick Box.
Figure 8: Zero-shot generalization to different objects in the real world.
Original abstract

Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents SUGAR, a three-stage data-driven framework for humanoid loco-manipulation from unstructured human videos. Stage 1 automatically extracts kinematic priors (trajectories and contact labels); Stage 2 applies a privileged physics-based refiner with a unified mimic reward and progressive state pool to convert noisy priors into feasible skills; Stage 3 distills the refined skills into a hierarchical policy (command generator + tracker) that runs autonomously at inference without reference conditioning or task-specific rewards. Evaluation on six representative tasks reports outperformance over reference-tracking baselines, clear scaling with video data volume, and zero-shot real-world transfer featuring closed-loop execution, autonomous failure recovery, and robustness to external perturbations.

Significance. If the results hold, SUGAR offers a scalable alternative to reward engineering and teleoperation by directly leveraging abundant human video data for generalizable whole-body skills. The simulation ablations, real-world success rates on the six tasks, and qualitative recovery examples provide direct support for the refiner's role in handling occlusion, contact artifacts, and retargeting errors. This strengthens the case for zero-shot transfer and long-horizon stability, addressing a key bottleneck in humanoid deployment.

major comments (1)
  1. The central claim that the refiner reliably converts imperfect kinematic priors into transferable skills is load-bearing; while simulation ablations and real-world rates are supplied, an explicit comparison of refiner output fidelity (e.g., contact accuracy or trajectory error) before versus after refinement on the same video set would further substantiate that the progressive state pool is the decisive mechanism rather than downstream policy training.
minor comments (2)
  1. Abstract: the statements of 'substantially outperforms' and 'performance scales clearly' would be strengthened by including one or two concrete success-rate numbers or scaling slopes even at high level.
  2. The description of the command generator and tracker in the hierarchical policy could clarify the interface between them (e.g., what state is passed and at what frequency) to aid reproducibility.
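
The interface question in the second minor comment can be made concrete with a two-rate control loop, in which the command generator runs once every few tracker steps. The sketch below is a minimal illustration under assumed conventions: the function name, the callable signatures, and the `steps_per_command` ratio are hypothetical, since the page does not state what is passed between the two modules or at what frequency.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def run_hierarchy(
    generator: Callable[[Vector], Vector],        # high level: observation -> command
    tracker: Callable[[Vector, Vector], Vector],  # low level: (observation, command) -> action
    observe: Callable[[], Vector],
    apply_action: Callable[[Vector], None],
    n_steps: int,
    steps_per_command: int,
) -> None:
    """Run the tracker every step and refresh the command every k steps."""
    if steps_per_command < 1:
        raise ValueError("steps_per_command must be at least 1")
    command: Optional[Vector] = None
    for step in range(n_steps):
        obs = observe()
        if step % steps_per_command == 0:
            command = generator(obs)  # closed loop: re-plan from the current observation
        apply_action(tracker(obs, command))
```

Because the generator conditions on the current observation rather than on a fixed reference clip, a loop of this shape can re-issue commands after a disturbance, which is the kind of behavior the paper reports as autonomous failure recovery.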

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the single major comment below and will incorporate the requested analysis.

Point-by-point responses
  1. Referee: The central claim that the refiner reliably converts imperfect kinematic priors into transferable skills is load-bearing; while simulation ablations and real-world rates are supplied, an explicit comparison of refiner output fidelity (e.g., contact accuracy or trajectory error) before versus after refinement on the same video set would further substantiate that the progressive state pool is the decisive mechanism rather than downstream policy training.

    Authors: We agree that an explicit before-versus-after fidelity comparison on the same video-derived priors would more directly isolate the refiner's contribution. In the revised manuscript we will add quantitative results (contact-state precision/recall and position/velocity MSE) computed on the identical set of extracted kinematic priors before and after the physics-based refiner. This new analysis will be placed in the ablation section alongside the existing policy-level ablations to clarify the role of the progressive state pool and unified mimic reward. revision: yes
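
The metrics named in this response are standard and can be sketched in a few lines of Python. The helpers below are illustrative, not the authors' evaluation code; in particular, what serves as the reference (for example simulator contact flags or manual annotation) is not specified on this page and is left to the caller.

```python
from typing import Sequence, Tuple


def contact_precision_recall(
    pred: Sequence[bool], ref: Sequence[bool]
) -> Tuple[float, float]:
    """Per-frame contact precision and recall against a reference labeling."""
    tp = sum(p and r for p, r in zip(pred, ref))          # contact predicted and present
    fp = sum(p and not r for p, r in zip(pred, ref))      # contact predicted but absent
    fn = sum((not p) and r for p, r in zip(pred, ref))    # contact missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def position_mse(
    traj: Sequence[Sequence[float]], ref: Sequence[Sequence[float]]
) -> float:
    """Mean squared error over all frames and coordinates."""
    total, count = 0.0, 0
    for frame, ref_frame in zip(traj, ref):
        for x, y in zip(frame, ref_frame):
            total += (x - y) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty trajectory")
    return total / count
```

Computing both functions once on the raw priors and once on the refiner output, over the same clips, would give the before-versus-after comparison the referee asks for.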

Circularity Check

0 steps flagged

No significant circularity

full rationale

The SUGAR framework is a three-stage empirical pipeline that extracts kinematic priors from external unstructured human videos, refines them via a privileged physics simulator using a unified mimic reward and progressive state pool, and distills the results into a hierarchical policy. All performance claims (zero-shot transfer, scaling with video volume, outperformance of reference-tracking baselines, and closed-loop recovery) are backed by direct simulation ablations, real-world success rates on six tasks, and qualitative examples rather than any internal equation or self-citation that reduces the outcome to a quantity defined by the paper's own fitted parameters. The derivation therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that human video data contains extractable kinematic interaction priors sufficient for downstream refinement, and that simulation-to-real transfer is feasible once motions are made physically consistent. No free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption: Human videos contain extractable kinematic interaction priors (trajectories and contact labels) despite occlusion and retargeting artifacts.
    Invoked in the first stage of the pipeline described in the abstract.
  • domain assumption: A privileged physics-based refiner with unified mimic reward and progressive state pool can produce high-fidelity, deployable skills from imperfect priors.
    Central to the second stage; if false, the entire distillation step fails.

pith-pipeline@v0.9.0 · 5831 in / 1555 out tokens · 34650 ms · 2026-05-21T07:14:20.730625+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 3 internal anchors

  1. Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer. 2025.
  2. VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation. 2025.
  3. Visual Whole-Body Control for Legged Loco-Manipulation. 2024.
  4. Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots. 2025.
  5. Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams. 2025.
  6. ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning. 2025.
  7. HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos. 2025.
  8. TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System. 2025.
  9. SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control. 2025.
  10. CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks. 2025.
  11. HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit. 2025.
  12. AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control. 2025.
  13. ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills. 2025.
  14. BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion. 2025.
  15. ExBody2: Advanced Expressive Humanoid Whole-Body Control. 2025.
  16. EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration. 2026.
  17. Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations. 2026.
  18. HumDex: Humanoid Dexterous Manipulation Made Easy. 2026.
  19. Deep Whole-body Parkour. 2026.
  20. AMASS: Archive of Motion Capture as Surface Shapes. 2019.
  21. PHUMA: Physically-Grounded Humanoid Locomotion Dataset. 2025.
  22. Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation. 2024.
  23. PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System. 2025.
  24. InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions. 2026.
  25. MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting. 2024.
  26. CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control. 2024.
  27. HumanX: Toward Agile and Generalizable Humanoid Interaction Skills from Human Videos. 2026.
  28. InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions. arXiv preprint arXiv:2602.06035.
  29. InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation. CVPR.
  30. LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations. arXiv preprint arXiv:2602.21723.
  31. Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data. 2026.
  32. HumanPlus: Humanoid Shadowing and Imitation from Humans. 2024.
  33. _0 : An Open Foundation Model Towards Universal Humanoid Loco-Manipulation. 2026.
  34. PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction. 2023.
  35. OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction. 2025.
  36. Learning from Massive Human Videos for Universal Humanoid Pose Control. 2024.
  37. Visual Imitation Enables Contextual Humanoid Control. 2025.
  38. ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video. 2026.
  39. DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos. 2026.
  40. Masquerade: Learning from In-the-wild Human Videos using Data-Editing. 2025.
  41. MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos. 2025.
  42. OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation. 2024.
  43. Vision-based Manipulation from Single Human Video with Open-World Object Graphs. 2025.
  44. Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation. 2026.
  45. EqvAfford: SE(3) Equivariance for Point-Level Affordance Learning. 2024.
  46. SAM 3D Body: Robust Full-Body Human Mesh Recovery. arXiv preprint arXiv:2602.15989, 2026.
  47. SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos. arXiv preprint arXiv:2512.08406.
  48. SAM 3D: 3Dfy Anything in Images. 2025, eprint 2511.16624.
  49. Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. CVPR.
  50. Method for registration of 3-D shapes. Sensor Fusion IV: Control Paradigms and Data Structures, 1992.
  51. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.
  52. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2025.
  53. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  54. DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping. 2025.
  55. VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation. 2025.
  56. WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control. arXiv preprint arXiv:2512.11047, 2025.
  57. ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation. arXiv preprint arXiv:2603.03279.
  58. Wang, Yinhuai; Zhao, Qihan; Yu, Runyi; Tsui, Hok Wai; Zeng, Ailing; Lin, Jing; Luo, Zhengyi; Yu, Jiwen; Li, Xiu; Chen, Qifeng; Zhang, Jian; Zhang, Lei; Tan, Ping. Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025.
  59. SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations. 2025.
  60. OmniXtreme: Breaking the Generality Barrier in High-Dynamic Humanoid Control. 2026.
  61. KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills. Advances in Neural Information Processing Systems.
  62. KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control. arXiv preprint arXiv:2509.16638, 2025.