Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
Pith reviewed 2026-05-10 06:06 UTC · model grok-4.3
The pith
A diffusion model generates terrain-aware reference motions in real time, which an RL tracker follows after closed-loop fine-tuning, enabling whole-body humanoid locomotion with directional control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train a diffusion model on retargeted human motions for real-time prediction of terrain-aware reference motions. We concurrently train a whole-body reference tracker with RL using this data. To improve robustness under imperfect references, we fine-tune the tracker with a frozen motion generator in a closed-loop setting. The resulting system supports directional goal-reaching control with terrain-aware whole-body adaptation and can be deployed on a Unitree G1 humanoid robot with onboard perception and computation. Hardware experiments demonstrate successful traversal over boxes, hurdles, stairs, and mixed terrain combinations.
What carries the argument
The diffusion model that produces terrain-aware reference motions in real time, paired with the RL whole-body tracker that is fine-tuned closed-loop to follow imperfect references from the generator.
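The generator–tracker interplay can be sketched as a minimal closed-loop fine-tuning rollout. This is an illustrative toy, not the paper's implementation: `frozen_generator` and `track_step` are hypothetical stand-ins for the diffusion model and the RL tracker policy, and the state/reference dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_generator(heightmap, goal_dir):
    # Hypothetical stand-in for the frozen diffusion model: returns a
    # short horizon of reference joint targets conditioned on terrain
    # and goal direction. The added noise models imperfect references.
    noise = 0.05 * rng.standard_normal((8, 12))
    return np.tile(goal_dir.mean() * np.ones(12), (8, 1)) + noise

def track_step(state, reference):
    # Hypothetical tracker policy: a PD-like pull toward the reference.
    return state + 0.5 * (reference - state)

def closed_loop_rollout(steps=50):
    """Closed-loop fine-tuning rollout: the tracker follows references
    produced online by the frozen generator, so tracking errors feed
    back into the states from which the generator is next queried."""
    state = np.zeros(12)
    errors = []
    for _ in range(steps):
        heightmap = rng.random((16, 16))  # onboard-perception stand-in
        refs = frozen_generator(heightmap, goal_dir=np.array([1.0, 0.0]))
        for ref in refs:
            state = track_step(state, ref)
            errors.append(np.linalg.norm(ref - state))
    return float(np.mean(errors))

print(closed_loop_rollout())
```

The point of the closed loop is visible even in the toy: the tracker never sees ground-truth references, only imperfect generated ones, so it learns (here, merely demonstrates) error compensation under the distribution it will face at deployment.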
Load-bearing premise
The diffusion model produces sufficiently accurate terrain-aware reference motions in real time and the RL tracker can compensate for remaining imperfections without the robot falling or losing balance on unseen terrain.
What would settle it
Placing the robot on a novel mixed-terrain setup whose obstacle heights and spacings are absent from the training data, and observing repeated falls or failures to reach the directional goal, would show that the combined generator-tracker system does not deliver the claimed robustness.
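The settling test above is, in effect, a pre-registered success-rate check on held-out terrain. A minimal harness for it might look like the following sketch; `run_trial`, the trial count, and the 0.8 threshold are all hypothetical choices, not values from the paper.

```python
import random

def evaluate_robustness(run_trial, n_trials=20, min_success_rate=0.8):
    """Hypothetical falsification harness: run repeated trials on a
    held-out mixed-terrain layout and compare the success rate against
    a pre-registered threshold. `run_trial` should return True on a
    successful goal-reaching traversal without a fall."""
    successes = sum(1 for _ in range(n_trials) if run_trial())
    rate = successes / n_trials
    # The robustness claim survives only if the rate clears the
    # threshold on geometries absent from the training distribution.
    return rate, rate >= min_success_rate

# Stand-in trial: replace with a real rollout on the robot or simulator.
random.seed(1)
rate, passed = evaluate_robustness(lambda: random.random() < 0.9)
print(rate, passed)
```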
Original abstract
Whole-body humanoid locomotion is challenging due to high-dimensional control, morphological instability, and the need for real-time adaptation to various terrains using onboard perception. Directly applying reinforcement learning (RL) with reward shaping to humanoid locomotion often leads to lower-body-dominated behaviors, whereas imitation-based RL can learn more coordinated whole-body skills but is typically limited to replaying reference motions without a mechanism to adapt them online from perception for terrain-aware locomotion. To address this gap, we propose a whole-body humanoid locomotion framework that combines skills learned from reference motions with terrain-aware adaptation. We first train a diffusion model on retargeted human motions for real-time prediction of terrain-aware reference motions. Concurrently, we train a whole-body reference tracker with RL using this motion data. To improve robustness under imperfectly generated references, we further fine-tune the tracker with a frozen motion generator in a closed-loop setting. The resulting system supports directional goal-reaching control with terrain-aware whole-body adaptation, and can be deployed on a Unitree G1 humanoid robot with onboard perception and computation. The hardware experiments demonstrate successful traversal over boxes, hurdles, stairs, and mixed terrain combinations. Quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a framework for whole-body humanoid locomotion combining a diffusion model trained on retargeted human motions to generate real-time terrain-aware reference motions from onboard perception, an RL-based whole-body reference tracker trained concurrently on the motion data, and closed-loop fine-tuning of the tracker against the frozen diffusion generator to improve robustness to imperfect references. The resulting system is claimed to enable directional goal-reaching control with terrain-aware adaptation and is deployed on a Unitree G1 humanoid robot, with hardware experiments demonstrating successful traversal over boxes, hurdles, stairs, and mixed terrain combinations using onboard perception and computation. Quantitative benefits from online generation and fine-tuning for generalization and robustness are asserted.
Significance. If the central claims hold, the work provides a practical bridge between imitation-based whole-body skills and online perception-driven adaptation for high-DoF humanoid control, addressing the common issue of lower-body-dominant behaviors in direct RL. The hardware deployment on the Unitree G1 with onboard sensing and computation offers concrete empirical support for real-world viability across varied terrains, which is a strength for deployability claims in robotics.
major comments (2)
- [Abstract] Abstract: The abstract asserts that 'quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness,' yet provides no specific metrics, ablation studies, success rates, or training stability details. This is load-bearing for the generalization claim, as the reader's assessment notes moderate evidence strength without these elements to evaluate the magnitude of improvement or reliability on novel terrains.
- [Hardware experiments] Hardware experiments section: The claim of successful traversal over mixed/unseen terrain combinations relies on the assumption that closed-loop fine-tuning allows the RL tracker to reliably compensate for inaccuracies in real-time diffusion-generated references under perception noise. No ablation comparing fine-tuned vs. non-fine-tuned performance, failure rates, or distribution overlap analysis is referenced, which directly impacts the load-bearing robustness argument for onboard deployment.
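The ablation the report asks for reduces to a paired comparison of success rates. A sketch of the requested analysis, with purely illustrative outcome lists (these are not numbers from the paper):

```python
def ablation_summary(finetuned_outcomes, baseline_outcomes):
    """Hypothetical sketch of the requested ablation: success rates
    for the closed-loop fine-tuned tracker vs. a tracker trained only
    on open-loop references, plus the improvement delta."""
    def rate(outcomes):
        return sum(outcomes) / len(outcomes)
    ft, base = rate(finetuned_outcomes), rate(baseline_outcomes)
    return {"finetuned": ft, "baseline": base, "delta": ft - base}

# Illustrative outcomes (1 = successful traversal, 0 = failure);
# real entries would come from hardware or simulation trials.
print(ablation_summary([1, 1, 1, 0, 1], [1, 0, 0, 1, 0]))
```

Reporting the delta alongside per-condition failure modes would directly support the load-bearing robustness argument the report identifies.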
minor comments (1)
- [Abstract] The abstract could be strengthened by briefly including 1-2 key quantitative metrics (e.g., success rates or comparison deltas) to better support the asserted benefits without requiring readers to reach the full results section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical contributions of our framework for terrain-aware whole-body humanoid locomotion. We address each major comment point by point below, with revisions to strengthen the presentation of quantitative evidence and robustness claims.
Point-by-point responses
Referee: [Abstract] Abstract: The abstract asserts that 'quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness,' yet provides no specific metrics, ablation studies, success rates, or training stability details. This is load-bearing for the generalization claim, as the reader's assessment notes moderate evidence strength without these elements to evaluate the magnitude of improvement or reliability on novel terrains.
Authors: We agree that the abstract would be strengthened by including concrete quantitative details to support the generalization and robustness claims. The manuscript body contains ablation studies and success rates (e.g., in Sections 5.2 and 6), but these were not summarized in the abstract. In the revised version, we have updated the abstract to explicitly reference key metrics, including success rates on novel terrains with and without online generation/fine-tuning, as well as notes on the ablation results demonstrating improved generalization. This change directly addresses the concern about evidence strength without altering the underlying claims.
Revision: yes
Referee: [Hardware experiments] Hardware experiments section: The claim of successful traversal over mixed/unseen terrain combinations relies on the assumption that closed-loop fine-tuning allows the RL tracker to reliably compensate for inaccuracies in real-time diffusion-generated references under perception noise. No ablation comparing fine-tuned vs. non-fine-tuned performance, failure rates, or distribution overlap analysis is referenced, which directly impacts the load-bearing robustness argument for onboard deployment.
Authors: The referee correctly identifies that an explicit hardware ablation would better substantiate the robustness benefits of closed-loop fine-tuning under real perception noise. The current manuscript describes the fine-tuning procedure in Section 4.3 and reports overall hardware success on mixed terrains in Section 6, but does not include a dedicated side-by-side comparison of fine-tuned versus non-fine-tuned trackers on hardware with failure rates. We have therefore added a new ablation subsection and table in the revised hardware experiments section that reports success rates, failure modes, and qualitative analysis of how fine-tuning enables compensation for diffusion reference inaccuracies on mixed/unseen terrains. This revision directly supports the onboard deployment claims.
Revision: yes
Circularity Check
No circularity detected in derivation chain
full rationale
The paper outlines a standard two-stage pipeline: a diffusion model is trained on retargeted human motion data to generate terrain-aware references, an RL tracker is trained to imitate those references, and the tracker is then fine-tuned in closed loop against the frozen generator. All performance claims rest on empirical hardware results (traversal of boxes, stairs, mixed terrain) rather than any equation or quantity that reduces by construction to a fitted parameter or self-citation. No self-definitional steps, fitted-input-as-prediction patterns, or load-bearing self-citations appear in the described method; the approach builds on established diffusion and RL techniques without tautological reduction.