SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

Hang Ye; Hao Dong; Jia Li; Qize Yu; Tianshu Wu; Xiangqi Kong; Yizhou Wang; Yue Chen

arxiv: 2605.20373 · v1 · pith:JSOQ4HPSnew · submitted 2026-05-19 · 💻 cs.RO · cs.AI · cs.CV

Tianshu Wu, Xiangqi Kong, Yue Chen, Qize Yu, Hang Ye, Jia Li, Yizhou Wang, Hao Dong

Pith reviewed 2026-05-21 07:14 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV

keywords humanoid loco-manipulation · human video learning · physics-based refinement · zero-shot transfer · imitation learning · generalizable policies · robot skill from video · autonomous humanoid control

The pith

SUGAR converts imperfect human videos into deployable humanoid loco-manipulation skills that transfer zero-shot to real hardware and improve with more data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a three-stage process that extracts kinematic priors from unstructured human videos, refines them into physically valid motions via a privileged simulator, and distills the results into an autonomous hierarchical policy. This pipeline runs without task-specific reward design or reference tracking at deployment time. A sympathetic reader would care because it replaces labor-intensive teleoperation and manual engineering with scalable video data, potentially letting humanoid robots acquire diverse whole-body skills from everyday recordings. If the approach holds, performance on complex loco-manipulation tasks would grow steadily as more human videos become available while maintaining closed-loop robustness on physical robots.

Core claim

SUGAR extracts human-object trajectories and contact labels from raw videos, feeds the imperfect priors into a privileged physics-based refiner that applies a unified mimic reward and progressive state pool to produce feasible high-fidelity skills, then distills those skills into a command generator plus command tracker policy. The resulting system achieves zero-shot real-world transfer, reliable closed-loop execution, autonomous failure recovery, and stable long-horizon behavior under perturbations across six representative loco-manipulation tasks while outperforming reference-tracking baselines and scaling clearly with video volume.

What carries the argument

The privileged physics-based refiner that applies a unified mimic reward and progressive state pool to convert imperfect kinematic priors from human videos into physically feasible high-fidelity skills.
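
To make the term "mimic reward" concrete: tracking rewards in physics-based imitation are often written as a product of exponentiated errors, so the reward is 1 under perfect tracking and decays as any error grows. The sketch below illustrates that generic pattern only. It is not SUGAR's reward; the function name, the three error terms (body, object, contact), and the weights `k_body`, `k_obj`, and `k_contact` are assumptions made for illustration.

```python
import math


def mimic_reward(ref_body, sim_body, ref_obj, sim_obj, ref_contact, sim_contact,
                 k_body=2.0, k_obj=2.0, k_contact=1.0):
    """Generic multiplicative tracking reward in (0, 1].

    All inputs are flat sequences of floats for a single time step.
    The weights are illustrative, not values from the paper.
    """
    # Squared error between reference and simulated body state.
    body_err = sum((r - s) ** 2 for r, s in zip(ref_body, sim_body))
    # Squared error between reference and simulated object state.
    obj_err = sum((r - s) ** 2 for r, s in zip(ref_obj, sim_obj))
    # Absolute mismatch between reference and simulated contact labels (0 or 1).
    contact_err = sum(abs(r - s) for r, s in zip(ref_contact, sim_contact))
    return (math.exp(-k_body * body_err)
            * math.exp(-k_obj * obj_err)
            * math.exp(-k_contact * contact_err))
```

Because the terms multiply, a policy cannot score well by imitating the body motion while ignoring the object or the contacts, which is the failure the paper's "w/o IR" ablation in Figure 5 describes qualitatively.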

If this is right

Task performance improves steadily as the quantity of human video data grows.
The method outperforms reference-tracking baselines on both simulation and real hardware.
Zero-shot deployment succeeds with closed-loop execution and autonomous recovery from failures.
Long-horizon loco-manipulation remains stable under external disturbances.
The same pipeline works across six distinct tasks without per-task reward or reference tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Internet-scale human videos could serve as the main training corpus for broad humanoid capabilities beyond the six tasks shown.
The refinement stage might transfer to other robot morphologies if the mimic reward is adapted to their kinematics.
Combining the pipeline with large public video datasets would allow continual improvement of deployed humanoid policies without new teleoperation sessions.
Success here suggests video-driven refinement could reduce dependence on high-fidelity simulators for initial skill acquisition.

Load-bearing premise

The physics-based refiner can turn imperfect video-derived motion priors into skills that transfer reliably to real hardware without any further task-specific engineering.

What would settle it

Performance on the six tasks remains flat or declines when the volume of human video data is increased, or the distilled policies fail to execute closed-loop on real humanoid hardware under external perturbations.
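
The first half of this test can be stated as a check on the slope of success rate against data volume. The sketch below fits an ordinary least-squares slope of success rate versus log10(number of videos) and treats a non-positive slope as evidence against the scaling claim. The function names, the log-scale choice, and the `min_slope` threshold are assumptions for illustration; the paper does not specify such a procedure, and any numbers passed in would have to come from its Figure 4.

```python
import math


def scaling_slope(num_videos, success_rates):
    """Least-squares slope of success rate against log10(number of videos)."""
    xs = [math.log10(n) for n in num_videos]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(success_rates) / n
    # Standard closed-form OLS slope: covariance over variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, success_rates))
    return sxy / sxx


def scaling_claim_survives(num_videos, success_rates, min_slope=0.0):
    """True if the fitted slope is strictly above the (hypothetical) threshold."""
    return scaling_slope(num_videos, success_rates) > min_slope
```

A flat or declining curve returns `False`; a positive slope alone does not establish the claim, since it says nothing about variance across seeds or tasks.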

Figures

Figures reproduced from arXiv: 2605.20373 by Hang Ye, Hao Dong, Jia Li, Qize Yu, Tianshu Wu, Xiangqi Kong, Yizhou Wang, Yue Chen.

**Figure 1.** SUGAR enables generalizable real-world humanoid loco-manipulation from diverse human videos. We deploy SUGAR on a Unitree G1 humanoid across six representative whole-body interaction tasks: (a) Push Box, (b) Pick Bottle, (c) Carry Box, (d) Sit Chair, (e) Kick Box, and (f1, f2) Pick Bottle under external human disturbances. Diverse human videos [Wang et al., 2026a, Mao et al., 2024, Yang et al., 2026a, Weng…

**Figure 2.** Overview of SUGAR. Our approach consists of three stages: (1) extracting kinematic interaction priors from unstructured human videos through a fully automated pipeline; (2) refining the priors into physically feasible skills with a privileged RL policy; and (3) training a hierarchical autonomous policy on the refined demonstrations for robust humanoid loco-manipulation.

**Figure 3.** The Training Pipeline of SUGAR. (Left) The Refiner πr transforms noisy kinematic priors τˆ ∈ P into physically feasible expert demonstrations τ ∈ R using privileged RL. (Middle) The Tracker πt distills motor skills from the Refiner via behavior cloning and reinforcement learning to achieve robust command-tracking. (Right) The Generator πg is trained via imitation learning on the rollout dataset D to predic…

**Figure 4.** Performance with different training data sizes. Success rates, evaluated on both the train and test datasets, consistently improve as the amount of training data increases.

**Figure 5.** Qualitative results: Carry Box. (a) Our method stably lifts the box. (b) Without interaction rewards (w/o IR), the policy only imitates the bending motion and fails to lift the box. (c) Without interaction robustness enhancement (w/o IRE), the interaction is less robust and causes failure.

**Figure 6.** Recover from failure. Panel labels: Interference*1, Interference*2, Interference*3.

**Figure 7.** Robustness to external disturbances in the real world. (a) Carry Box, (b) Sit Chair, (c) Kick Box.

**Figure 8.** Zero-shot generalization to different objects in the real world.

Abstract

Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SUGAR turns unstructured human videos into zero-shot real humanoid loco-manipulation policies via an automated extraction-refine-distill pipeline that scales with data volume.

The core takeaway is that this three-stage pipeline extracts kinematic priors and contact labels from ordinary videos, refines them into feasible skills inside a physics simulator using a unified mimic reward and progressive state pooling, then distills the result into a hierarchical policy that runs closed-loop without references or task rewards at inference time. The real-world zero-shot transfer on six tasks plus the clear data-scaling curves are what stand out from the abstract and results summary.

Referee Report

1 major / 2 minor

Summary. The manuscript presents SUGAR, a three-stage data-driven framework for humanoid loco-manipulation from unstructured human videos. Stage 1 automatically extracts kinematic priors (trajectories and contact labels); Stage 2 applies a privileged physics-based refiner with a unified mimic reward and progressive state pool to convert noisy priors into feasible skills; Stage 3 distills the refined skills into a hierarchical policy (command generator + tracker) that runs autonomously at inference without reference conditioning or task-specific rewards. Evaluation on six representative tasks reports outperformance over reference-tracking baselines, clear scaling with video data volume, and zero-shot real-world transfer featuring closed-loop execution, autonomous failure recovery, and robustness to external perturbations.

Significance. If the results hold, SUGAR offers a scalable alternative to reward engineering and teleoperation by directly leveraging abundant human video data for generalizable whole-body skills. The simulation ablations, real-world success rates on the six tasks, and qualitative recovery examples provide direct support for the refiner's role in handling occlusion, contact artifacts, and retargeting errors. This strengthens the case for zero-shot transfer and long-horizon stability, addressing a key bottleneck in humanoid deployment.

major comments (1)

The central claim that the refiner reliably converts imperfect kinematic priors into transferable skills is load-bearing; while simulation ablations and real-world rates are supplied, an explicit comparison of refiner output fidelity (e.g., contact accuracy or trajectory error) before versus after refinement on the same video set would further substantiate that the progressive state pool is the decisive mechanism rather than downstream policy training.

minor comments (2)

Abstract: the statements of 'substantially outperforms' and 'performance scales clearly' would be strengthened by including one or two concrete success-rate numbers or scaling slopes even at high level.
The description of the command generator and tracker in the hierarchical policy could clarify the interface between them (e.g., what state is passed and at what frequency) to aid reproducibility.
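
The interface question in the second minor comment can be made concrete with a two-rate control loop: a high-level generator that emits a command every few steps, and a low-level tracker that converts the latest command into an action at every step. The sketch below is a generic illustration under stated assumptions, not the paper's implementation. The names `run_hierarchical`, `generator`, `tracker`, and `step`, the choice to pass the full state to both levels, and the `replan_every` interval are all hypothetical; these are exactly the details the comment asks the authors to specify.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Command = Sequence[float]
Action = Sequence[float]


def run_hierarchical(generator: Callable[[State], Command],
                     tracker: Callable[[State, Command], Action],
                     step: Callable[[State, Action], State],
                     state: State,
                     num_steps: int,
                     replan_every: int = 10) -> List[State]:
    """Roll out a two-level policy and return the visited states."""
    command = None
    visited = []
    for t in range(num_steps):
        # High level: refresh the command at a lower rate (t = 0 included).
        if t % replan_every == 0:
            command = generator(state)
        # Low level: track the most recent command at every step.
        action = tracker(state, command)
        state = step(state, action)
        visited.append(state)
    return visited
```

Stating the two rates and the contents of `state` and `command` would be enough for a reader to reproduce the loop structure, whatever the networks behind each callable are.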

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the single major comment below and will incorporate the requested analysis.

Referee: The central claim that the refiner reliably converts imperfect kinematic priors into transferable skills is load-bearing; while simulation ablations and real-world rates are supplied, an explicit comparison of refiner output fidelity (e.g., contact accuracy or trajectory error) before versus after refinement on the same video set would further substantiate that the progressive state pool is the decisive mechanism rather than downstream policy training.

Authors: We agree that an explicit before-versus-after fidelity comparison on the same video-derived priors would more directly isolate the refiner's contribution. In the revised manuscript we will add quantitative results (contact-state precision/recall and position/velocity MSE) computed on the identical set of extracted kinematic priors before and after the physics-based refiner. This new analysis will be placed in the ablation section alongside the existing policy-level ablations to clarify the role of the progressive state pool and unified mimic reward.
Revision: yes
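
The metrics named in this response are standard and can be sketched briefly. The code below computes precision and recall over per-frame binary contact labels, and mean squared error over per-frame position (or velocity) vectors. The function names and the input layout are assumptions for illustration; the response does not say what serves as the reference signal, so `truth` here is a placeholder for whatever labels the authors choose to compare against.

```python
def contact_precision_recall(pred, truth):
    """Precision and recall for binary per-frame contact labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    # Return 0.0 when a denominator is empty rather than dividing by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def trajectory_mse(pred, truth):
    """Mean squared error over all coordinates of all frames."""
    total = 0.0
    count = 0
    for p_frame, t_frame in zip(pred, truth):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count
```

Running both functions once on the raw priors and once on the refiner's output, against the same reference, would give the before-versus-after comparison the referee requests.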

Circularity Check

0 steps flagged

No significant circularity

The SUGAR framework is a three-stage empirical pipeline that extracts kinematic priors from external unstructured human videos, refines them via a privileged physics simulator using a unified mimic reward and progressive state pool, and distills the results into a hierarchical policy. All performance claims (zero-shot transfer, scaling with video volume, outperformance of reference-tracking baselines, and closed-loop recovery) are backed by direct simulation ablations, real-world success rates on six tasks, and qualitative examples rather than any internal equation or self-citation that reduces the outcome to a quantity defined by the paper's own fitted parameters. The derivation therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that human video data contains extractable kinematic interaction priors sufficient for downstream refinement, and that simulation-to-real transfer is feasible once motions are made physically consistent. No free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption: Human videos contain extractable kinematic interaction priors (trajectories and contact labels) despite occlusion and retargeting artifacts.
Invoked in the first stage of the pipeline described in the abstract.

domain assumption: A privileged physics-based refiner with unified mimic reward and progressive state pool can produce high-fidelity, deployable skills from imperfect priors.
Central to the second stage; if false the entire distillation step fails.

pith-pipeline@v0.9.0 · 5831 in / 1555 out tokens · 34650 ms · 2026-05-21T07:14:20.730625+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: SUGAR proceeds in three stages: ... privileged physics-based refiner utilizes a unified mimic-style reward and a progressive state pool ... hierarchical policy: a high-level diffusion policy command generator ... low-level whole-body command tracker

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: performance scales clearly with the amount of human video data ... zero-shot real-world transfer

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 3 internal anchors

[1] Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer. 2025.
[2] VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation. 2025.
[3] Visual Whole-Body Control for Legged Loco-Manipulation. 2024.
[4] Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots. 2025.
[5] Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams. 2025.
[6] ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning. 2025.
[7] HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos. 2025.
[8] TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System. 2025.
[9] SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control. 2025.
[10] CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks. 2025.
[11] HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit. 2025.
[12] AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control. 2025.
[13] ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills. 2025.
[14] BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion. 2025.
[15] ExBody2: Advanced Expressive Humanoid Whole-Body Control. 2025.
[16] EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration. 2026.
[17] Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations. 2026.
[18] HumDex: Humanoid Dexterous Manipulation Made Easy. 2026.
[19] Deep Whole-body Parkour. 2026.
[20] AMASS: Archive of Motion Capture as Surface Shapes. 2019.
[21] PHUMA: Physically-Grounded Humanoid Locomotion Dataset. 2025.
[22] Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation. 2024.
[23] PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System. 2025.
[24] InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions. 2026.
[25] MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting. 2024.
[26] CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control. 2024.
[27] HumanX: Toward Agile and Generalizable Humanoid Interaction Skills from Human Videos. 2026.
[28] InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions. arXiv preprint arXiv:2602.06035.
[29] InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation. CVPR.
[30] LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations. arXiv preprint arXiv:2602.21723.
[31] Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data. 2026.
[32] HumanPlus: Humanoid Shadowing and Imitation from Humans. 2024.
[33] _0 : An Open Foundation Model Towards Universal Humanoid Loco-Manipulation. 2026.
[34] PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction. 2023.
[35] OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction. 2025.
[36] Learning from Massive Human Videos for Universal Humanoid Pose Control. 2024.
[37] Visual Imitation Enables Contextual Humanoid Control. 2025.
[38] ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video. 2026.
[39] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos. 2026.
[40] Masquerade: Learning from In-the-wild Human Videos using Data-Editing. 2025.
[41] MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos. 2025.
[42] OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation. 2024.
[43] Vision-based Manipulation from Single Human Video with Open-World Object Graphs. 2025.
[44] Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation. 2026.
[45] EqvAfford: SE(3) Equivariance for Point-Level Affordance Learning. 2024.
[46] SAM 3D Body: Robust Full-Body Human Mesh Recovery. arXiv preprint arXiv:2602.15989, 2026.
[47] SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos. arXiv preprint arXiv:2512.08406.
[48] SAM 3D: 3Dfy Anything in Images. 2025. eprint 2511.16624.
[49] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. CVPR.
[50] Method for registration of 3-D shapes. Sensor fusion IV: control paradigms and data structures, 1992.
[51] Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.
[52] Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2025.
[53] Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[54] DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping. 2025.
[55] VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation. 2025.
[56] WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control. arXiv preprint arXiv:2512.11047, 2025.
[57] ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation. arXiv preprint arXiv:2603.03279.
[58] Yinhuai Wang, Qihan Zhao, Runyi Yu, Hok Wai Tsui, Ailing Zeng, Jing Lin, Zhengyi Luo, Jiwen Yu, Xiu Li, Qifeng Chen, Jian Zhang, Lei Zhang, and Ping Tan. Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025.
[59] SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations. 2025.
[60] OmniXtreme: Breaking the Generality Barrier in High-Dynamic Humanoid Control. 2026.
[61] KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills. Advances in Neural Information Processing Systems.
[62] KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control. arXiv preprint arXiv:2509.16638, 2025.

[1] [1]

2025 , eprint=

Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer , author=. 2025 , eprint=

work page 2025

[2] [2]

2025 , eprint=

VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation , author=. 2025 , eprint=

work page 2025

[3] [3]

2024 , eprint=

Visual Whole-Body Control for Legged Loco-Manipulation , author=. 2024 , eprint=

work page 2024

[4] [4]

2025 , eprint=

Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots , author=. 2025 , eprint=

work page 2025

[5] [5]

2025 , eprint=

Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams , author=. 2025 , eprint=

work page 2025

[6] [6]

2025 , eprint=

ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning , author=. 2025 , eprint=

work page 2025

[7] [7]

2025 , eprint=

HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos , author=. 2025 , eprint=

work page 2025

[8] [8]

2025 , eprint=

TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System , author=. 2025 , eprint=

work page 2025

[9] [9]

2025 , eprint=

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control , author=. 2025 , eprint=

work page 2025

[10] [10]

2025 , eprint=

CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks , author=. 2025 , eprint=

work page 2025

[11] [11]

2025 , eprint=

HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit , author=. 2025 , eprint=

work page 2025

[12] [12]

2025 , eprint=

AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control , author=. 2025 , eprint=

work page 2025

[13] [13]

2025 , eprint=

ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills , author=. 2025 , eprint=

work page 2025

[14] [14]

2025 , eprint=

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion , author=. 2025 , eprint=

work page 2025

[15] [15]

2025 , eprint=

ExBody2: Advanced Expressive Humanoid Whole-Body Control , author=. 2025 , eprint=

work page 2025

[16] [16]

2026 , eprint=

EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration , author=. 2026 , eprint=

work page 2026

[17] [17]

2026 , eprint=

Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations , author=. 2026 , eprint=

work page 2026

[18] [18]

2026 , eprint=

HumDex: Humanoid Dexterous Manipulation Made Easy , author=. 2026 , eprint=

work page 2026

[19] [19]

2026 , eprint=

Deep Whole-body Parkour , author=. 2026 , eprint=

work page 2026

[20] [20]

2019 , eprint=

AMASS: Archive of Motion Capture as Surface Shapes , author=. 2019 , eprint=

work page 2019

[21] [21]

2025 , eprint=

PHUMA: Physically-Grounded Humanoid Locomotion Dataset , author=. 2025 , eprint=

work page 2025

[22] [22]

2024 , eprint=

Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation , author=. 2024 , eprint=

work page 2024

[23] [23]

2025 , eprint=

PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System , author=. 2025 , eprint=

work page 2025

[24] [24]

2026 , eprint=

InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions , author=. 2026 , eprint=

work page 2026

[25] [25]

2024 , eprint=

MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting , author=. 2024 , eprint=

work page 2024

[26] [26]

2024 , eprint=

CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control , author=. 2024 , eprint=

work page 2024

[27] [27]

2026 , eprint=

HumanX: Toward Agile and Generalizable Humanoid Interaction Skills from Human Videos , author=. 2026 , eprint=

work page 2026

[28] [28]

arXiv preprint arXiv:2602.06035 , year=

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions , author=. arXiv preprint arXiv:2602.06035 , year=

work page arXiv

[29] [29]

CVPR , year =

InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation , author =. CVPR , year =

work page

[30] [30]

arXiv preprint arXiv:2602.21723 , year=

LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations , author=. arXiv preprint arXiv:2602.21723 , year=

work page arXiv

[31] [31]

2026 , eprint=

Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data , author=. 2026 , eprint=

work page 2026

[32] [32]

2024 , eprint=

HumanPlus: Humanoid Shadowing and Imitation from Humans , author=. 2024 , eprint=

work page 2024

[33] [33]

2026 , eprint=

_0 : An Open Foundation Model Towards Universal Humanoid Loco-Manipulation , author=. 2026 , eprint=

work page 2026

[34] [34]

2023 , eprint=

PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction , author=. 2023 , eprint=

work page 2023

[35] OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction. 2025.

[36] Learning from Massive Human Videos for Universal Humanoid Pose Control. 2024.

[37] Visual Imitation Enables Contextual Humanoid Control. 2025.

[38] ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video. 2026.

[39] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos. 2026.

[40] Masquerade: Learning from In-the-wild Human Videos using Data-Editing. 2025.

[41] MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos. 2025.

[42] OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation. 2024.

[43] Vision-based Manipulation from Single Human Video with Open-World Object Graphs. 2025.

[44] Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation. 2026.

[45] EqvAfford: SE(3) Equivariance for Point-Level Affordance Learning. 2024.

[46] SAM 3D Body: Robust Full-Body Human Mesh Recovery. arXiv preprint arXiv:2602.15989, 2026.

[47] SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos. arXiv preprint arXiv:2512.08406.

[48] SAM 3D: 3Dfy Anything in Images. arXiv:2511.16624, 2025.

[49] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. CVPR.

[50] Method for registration of 3-D shapes. Sensor fusion IV: control paradigms and data structures, 1992.

[51] Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.

[52] Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2025.

[53] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.

[54] DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping. 2025.

[55] VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation. 2025.

[56] WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control. arXiv preprint arXiv:2512.11047, 2025.

[57] ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation. arXiv preprint arXiv:2603.03279.

[58] Yinhuai Wang, Qihan Zhao, Runyi Yu, Hok Wai Tsui, Ailing Zeng, Jing Lin, Zhengyi Luo, Jiwen Yu, Xiu Li, Qifeng Chen, Jian Zhang, Lei Zhang, and Ping Tan. Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025.

[59] SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations. 2025.

[60] OmniXtreme: Breaking the Generality Barrier in High-Dynamic Humanoid Control. 2026.

[61] KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills. Advances in Neural Information Processing Systems.

[62] KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control. arXiv preprint arXiv:2509.16638, 2025.