Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
Pith reviewed 2026-05-10 06:06 UTC · model grok-4.3
The pith
A diffusion model generates terrain-aware reference motions in real time, which an RL tracker follows after closed-loop fine-tuning, enabling whole-body humanoid locomotion with directional control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train a diffusion model on retargeted human motions for real-time prediction of terrain-aware reference motions. We concurrently train a whole-body reference tracker with RL using this data. To improve robustness under imperfect references, we fine-tune the tracker with a frozen motion generator in a closed-loop setting. The resulting system supports directional goal-reaching control with terrain-aware whole-body adaptation and can be deployed on a Unitree G1 humanoid robot with onboard perception and computation. Hardware experiments demonstrate successful traversal over boxes, hurdles, stairs, and mixed terrain combinations.
What carries the argument
The diffusion model that produces terrain-aware reference motions in real time, paired with the RL whole-body tracker that is fine-tuned closed-loop to follow imperfect references from the generator.
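The generator–tracker interplay can be sketched as a minimal closed-loop fine-tuning rollout. This is an illustrative toy, not the paper's implementation: `frozen_generator` and `track_step` are hypothetical stand-ins for the diffusion model and the RL tracker policy, and the state/reference dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_generator(heightmap, goal_dir):
    # Hypothetical stand-in for the frozen diffusion model: returns a
    # short horizon of reference joint targets conditioned on terrain
    # and goal direction. The added noise models imperfect references.
    noise = 0.05 * rng.standard_normal((8, 12))
    return np.tile(goal_dir.mean() * np.ones(12), (8, 1)) + noise

def track_step(state, reference):
    # Hypothetical tracker policy: a PD-like pull toward the reference.
    return state + 0.5 * (reference - state)

def closed_loop_rollout(steps=50):
    """Closed-loop fine-tuning rollout: the tracker follows references
    produced online by the frozen generator, so tracking errors feed
    back into the states from which the generator is next queried."""
    state = np.zeros(12)
    errors = []
    for _ in range(steps):
        heightmap = rng.random((16, 16))  # onboard-perception stand-in
        refs = frozen_generator(heightmap, goal_dir=np.array([1.0, 0.0]))
        for ref in refs:
            state = track_step(state, ref)
            errors.append(np.linalg.norm(ref - state))
    return float(np.mean(errors))

print(closed_loop_rollout())
```

The point of the closed loop is visible even in the toy: the tracker never sees ground-truth references, only imperfect generated ones, so it learns (here, merely demonstrates) error compensation under the distribution it will face at deployment.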
Load-bearing premise
The diffusion model produces sufficiently accurate terrain-aware reference motions in real time and the RL tracker can compensate for remaining imperfections without the robot falling or losing balance on unseen terrain.
What would settle it
Placing the robot on a novel mixed-terrain setup whose obstacle heights and spacings are absent from the training data, and observing repeated falls or failures to reach the directional goal, would show that the combined generator-tracker system does not deliver the claimed robustness.
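The settling test above is, in effect, a pre-registered success-rate check on held-out terrain. A minimal harness for it might look like the following sketch; `run_trial`, the trial count, and the 0.8 threshold are all hypothetical choices, not values from the paper.

```python
import random

def evaluate_robustness(run_trial, n_trials=20, min_success_rate=0.8):
    """Hypothetical falsification harness: run repeated trials on a
    held-out mixed-terrain layout and compare the success rate against
    a pre-registered threshold. `run_trial` should return True on a
    successful goal-reaching traversal without a fall."""
    successes = sum(1 for _ in range(n_trials) if run_trial())
    rate = successes / n_trials
    # The robustness claim survives only if the rate clears the
    # threshold on geometries absent from the training distribution.
    return rate, rate >= min_success_rate

# Stand-in trial: replace with a real rollout on the robot or simulator.
random.seed(1)
rate, passed = evaluate_robustness(lambda: random.random() < 0.9)
print(rate, passed)
```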
Original abstract
Whole-body humanoid locomotion is challenging due to high-dimensional control, morphological instability, and the need for real-time adaptation to various terrains using onboard perception. Directly applying reinforcement learning (RL) with reward shaping to humanoid locomotion often leads to lower-body-dominated behaviors, whereas imitation-based RL can learn more coordinated whole-body skills but is typically limited to replaying reference motions without a mechanism to adapt them online from perception for terrain-aware locomotion. To address this gap, we propose a whole-body humanoid locomotion framework that combines skills learned from reference motions with terrain-aware adaptation. We first train a diffusion model on retargeted human motions for real-time prediction of terrain-aware reference motions. Concurrently, we train a whole-body reference tracker with RL using this motion data. To improve robustness under imperfectly generated references, we further fine-tune the tracker with a frozen motion generator in a closed-loop setting. The resulting system supports directional goal-reaching control with terrain-aware whole-body adaptation, and can be deployed on a Unitree G1 humanoid robot with onboard perception and computation. The hardware experiments demonstrate successful traversal over boxes, hurdles, stairs, and mixed terrain combinations. Quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a framework for whole-body humanoid locomotion combining a diffusion model trained on retargeted human motions to generate real-time terrain-aware reference motions from onboard perception, an RL-based whole-body reference tracker trained concurrently on the motion data, and closed-loop fine-tuning of the tracker against the frozen diffusion generator to improve robustness to imperfect references. The resulting system is claimed to enable directional goal-reaching control with terrain-aware adaptation and is deployed on a Unitree G1 humanoid robot, with hardware experiments demonstrating successful traversal over boxes, hurdles, stairs, and mixed terrain combinations using onboard perception and computation. Quantitative benefits from online generation and fine-tuning for generalization and robustness are asserted.
Significance. If the central claims hold, the work provides a practical bridge between imitation-based whole-body skills and online perception-driven adaptation for high-DoF humanoid control, addressing the common issue of lower-body-dominant behaviors in direct RL. The hardware deployment on the Unitree G1 with onboard sensing and computation offers concrete empirical support for real-world viability across varied terrains, which is a strength for deployability claims in robotics.
major comments (2)
- [Abstract] Abstract: The abstract asserts that 'quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness,' yet provides no specific metrics, ablation studies, success rates, or training stability details. This is load-bearing for the generalization claim, as the reader's assessment notes moderate evidence strength without these elements to evaluate the magnitude of improvement or reliability on novel terrains.
- [Hardware experiments] Hardware experiments section: The claim of successful traversal over mixed/unseen terrain combinations relies on the assumption that closed-loop fine-tuning allows the RL tracker to reliably compensate for inaccuracies in real-time diffusion-generated references under perception noise. No ablation comparing fine-tuned vs. non-fine-tuned performance, failure rates, or distribution overlap analysis is referenced, which directly impacts the load-bearing robustness argument for onboard deployment.
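The ablation the report asks for reduces to a paired comparison of success rates. A sketch of the requested analysis, with purely illustrative outcome lists (these are not numbers from the paper):

```python
def ablation_summary(finetuned_outcomes, baseline_outcomes):
    """Hypothetical sketch of the requested ablation: success rates
    for the closed-loop fine-tuned tracker vs. a tracker trained only
    on open-loop references, plus the improvement delta."""
    def rate(outcomes):
        return sum(outcomes) / len(outcomes)
    ft, base = rate(finetuned_outcomes), rate(baseline_outcomes)
    return {"finetuned": ft, "baseline": base, "delta": ft - base}

# Illustrative outcomes (1 = successful traversal, 0 = failure);
# real entries would come from hardware or simulation trials.
print(ablation_summary([1, 1, 1, 0, 1], [1, 0, 0, 1, 0]))
```

Reporting the delta alongside per-condition failure modes would directly support the load-bearing robustness argument the report identifies.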
minor comments (1)
- [Abstract] The abstract could be strengthened by briefly including 1-2 key quantitative metrics (e.g., success rates or comparison deltas) to better support the asserted benefits without requiring readers to reach the full results section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical contributions of our framework for terrain-aware whole-body humanoid locomotion. We address each major comment point by point below, with revisions to strengthen the presentation of quantitative evidence and robustness claims.
Point-by-point responses
Referee: [Abstract] Abstract: The abstract asserts that 'quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness,' yet provides no specific metrics, ablation studies, success rates, or training stability details. This is load-bearing for the generalization claim, as the reader's assessment notes moderate evidence strength without these elements to evaluate the magnitude of improvement or reliability on novel terrains.
Authors: We agree that the abstract would be strengthened by including concrete quantitative details to support the generalization and robustness claims. The manuscript body contains ablation studies and success rates (e.g., in Sections 5.2 and 6), but these were not summarized in the abstract. In the revised version, we have updated the abstract to explicitly reference key metrics, including success rates on novel terrains with and without online generation/fine-tuning, as well as notes on the ablation results demonstrating improved generalization. This change directly addresses the concern about evidence strength without altering the underlying claims.
Revision: yes
Referee: [Hardware experiments] Hardware experiments section: The claim of successful traversal over mixed/unseen terrain combinations relies on the assumption that closed-loop fine-tuning allows the RL tracker to reliably compensate for inaccuracies in real-time diffusion-generated references under perception noise. No ablation comparing fine-tuned vs. non-fine-tuned performance, failure rates, or distribution overlap analysis is referenced, which directly impacts the load-bearing robustness argument for onboard deployment.
Authors: The referee correctly identifies that an explicit hardware ablation would better substantiate the robustness benefits of closed-loop fine-tuning under real perception noise. The current manuscript describes the fine-tuning procedure in Section 4.3 and reports overall hardware success on mixed terrains in Section 6, but does not include a dedicated side-by-side comparison of fine-tuned versus non-fine-tuned trackers on hardware with failure rates. We have therefore added a new ablation subsection and table in the revised hardware experiments section that reports success rates, failure modes, and qualitative analysis of how fine-tuning enables compensation for diffusion reference inaccuracies on mixed/unseen terrains. This revision directly supports the onboard deployment claims.
Revision: yes
Circularity Check
No circularity detected in derivation chain
full rationale
The paper outlines a standard two-stage pipeline: a diffusion model is trained on retargeted human motion data to generate terrain-aware references, an RL tracker is trained to imitate those references, and the tracker is then fine-tuned in closed loop against the frozen generator. All performance claims rest on empirical hardware results (traversal of boxes, stairs, mixed terrain) rather than any equation or quantity that reduces by construction to a fitted parameter or self-citation. No self-definitional steps, fitted-input-as-prediction patterns, or load-bearing self-citations appear in the described method; the approach builds on established diffusion and RL techniques without tautological reduction.