pith. machine review for the scientific record

arxiv: 2603.22201 · v3 · submitted 2026-03-23 · 💻 cs.RO

Recognition: no theorem link

Make Tracking Easy: Neural Motion Retargeting for Humanoid Whole-body Control

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:24 UTC · model grok-4.3

classification 💻 cs.RO
keywords Neural Motion Retargeting · Humanoid Whole-body Control · Motion Retargeting · VAE Clustering · Reinforcement Learning Experts · Embodiment Gap · Self-collision Reduction · Unitree G1

The pith

By learning motion distributions instead of optimizing per-frame mappings, Neural Motion Retargeting produces artifact-free robot references from human data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that traditional optimization-based motion retargeting is non-convex and frequently produces joint jumps and self-collisions when transferring human movements to humanoids. It replaces this with a learned process that first clusters motions via a VAE, then uses parallel reinforcement learning experts to project demonstrations onto the robot's feasible manifold, and finally trains a CNN-Transformer to output clean sequences. The resulting references eliminate joint jumps and sharply reduce self-collisions on the Unitree G1, and they accelerate training of whole-body controllers for tasks such as martial arts and dancing. A sympathetic reader cares because this removes a persistent bottleneck in scaling diverse motor skills to physical robots without manual repair of every demonstration.

Core claim

Traditional optimization-based retargeting is inherently non-convex and prone to local optima that create physical artifacts. NMR reformulates the problem as learning the data distribution: Clustered-Expert Physics Refinement first groups heterogeneous human movements with VAE-based clustering to enable efficient parallel RL experts that project and repair noisy demonstrations onto the robot's feasible motion manifold; the repaired data then supervises a non-autoregressive CNN-Transformer that reasons over global temporal context to suppress reconstruction noise and bypass geometric traps.
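The non-convexity claim is easy to illustrate on a toy problem. The sketch below is our illustration, not the paper's Hessian analysis: even the simplest per-frame retargeting objective, inverse kinematics for a planar two-link arm, admits two distinct exact minimizers (elbow-up and elbow-down), so the objective cannot be convex.

```python
import numpy as np

def fk(theta, l1=1.0, l2=1.0):
    """Forward kinematics of a planar two-link arm: end-effector position."""
    t1, t2 = theta
    return np.array([l1 * np.cos(t1) + l2 * np.cos(t1 + t2),
                     l1 * np.sin(t1) + l2 * np.sin(t1 + t2)])

def ik_both(target, l1=1.0, l2=1.0):
    """Both closed-form IK solutions (elbow-up and elbow-down)."""
    x, y = target
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    sols = []
    for sign in (+1.0, -1.0):
        t2 = sign * np.arccos(np.clip(c2, -1.0, 1.0))
        t1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(t2), l1 + l2 * np.cos(t2))
        sols.append(np.array([t1, t2]))
    return sols

target = np.array([1.2, 0.5])           # a reachable end-effector goal
sols = ik_both(target)
residuals = [np.linalg.norm(fk(s) - target) for s in sols]
# Two distinct joint configurations drive the tracking error to zero,
# i.e. two isolated global minima of the per-frame objective: non-convex.
```

A gradient-based solver started near the wrong basin converges to whichever minimizer is closest; frame-to-frame flips between the two basins are exactly the kind of joint jump the paper attributes to this landscape.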

What carries the argument

Clustered-Expert Physics Refinement (CEPR), a hierarchical pipeline that uses VAE motion clustering to reduce overhead for parallel RL experts projecting human demonstrations onto the robot's feasible manifold.
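The routing step CEPR relies on can be sketched minimally, under our assumptions: synthetic 8-D codes stand in for the VAE latents, and plain k-means stands in for the paper's unspecified clustering procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for VAE latent codes of two heterogeneous motion
# families (say, kicks vs. spins); the 8-D latent size is our assumption.
kicks = rng.normal(loc=+3.0, scale=0.5, size=(200, 8))
spins = rng.normal(loc=-3.0, scale=0.5, size=(200, 8))
latents = np.vstack([kicks, spins])

def kmeans(x, k, iters=20, seed=0):
    """Minimal k-means; returns (centroids, labels)."""
    r = np.random.default_rng(seed)
    centroids = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.stack([x[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return centroids, labels

centroids, labels = kmeans(latents, k=2)
# Each cluster would then be handed to its own parallel RL expert,
# which repairs only motions drawn from that motif.
experts = {j: np.where(labels == j)[0] for j in range(2)}
```

The efficiency argument is that each expert sees a narrow, homogeneous slice of the motion distribution rather than the full heterogeneous dataset.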

If this is right

  • Retargeted motions on the Unitree G1 eliminate joint jumps across dynamic tasks.
  • Self-collisions are significantly reduced relative to prior retargeting methods.
  • The generated references accelerate convergence of downstream whole-body control policies.
  • The same pipeline supplies a scalable route for transferring additional human skills to humanoid robots.
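The first two bullets presuppose a measurable definition of "joint jump". One plausible metric, velocity-discontinuity thresholding (our assumption; the paper's exact definition is not quoted here), can be sketched as:

```python
import numpy as np

def joint_jump_count(q, dt, vel_jump_thresh):
    """Count frames where any joint's finite-difference velocity changes
    discontinuously (exceeds vel_jump_thresh between adjacent frames)."""
    vel = np.diff(q, axis=0) / dt            # (T-1, J) joint velocities
    dvel = np.abs(np.diff(vel, axis=0))      # (T-2, J) velocity changes
    return int(np.sum(np.any(dvel > vel_jump_thresh, axis=1)))

dt = 0.02                                    # 50 Hz reference rate (assumed)
t = np.arange(0.0, 2.0, dt)
q_smooth = np.stack([np.sin(t), np.cos(t)], axis=1)  # artifact-free motion
q_jump = q_smooth.copy()
q_jump[50:, 0] += 0.5                        # inject a step: a joint jump
```

A smooth trajectory registers zero jumps while the injected step is flagged, which is the kind of count a baseline comparison table would report per task.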

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same distribution-learning approach could be tested on other humanoid platforms to check whether the artifact reduction transfers without retraining the full pipeline.
  • Because the CNN-Transformer reasons over global context, the method might support online retargeting of streaming human motion with only minor latency increases.
  • The repaired reference data could serve as a starting point for sim-to-real transfer experiments that measure how much the reduced artifacts improve policy robustness on hardware.

Load-bearing premise

That VAE-based clustering of human movements will reliably produce latent groups allowing parallel experts to repair demonstrations onto the robot manifold without creating new artifacts.
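Whether this premise holds is checkable: cluster-separation statistics such as the silhouette coefficient quantify how cleanly the latent motifs separate. A self-contained sketch on synthetic latents (the paper reports no such metric; this is our audit, not theirs):

```python
import numpy as np

def silhouette_mean(x, labels):
    """Mean silhouette coefficient: s_i = (b_i - a_i) / max(a_i, b_i),
    with a_i the mean intra-cluster distance of sample i and b_i its
    mean distance to the nearest other cluster."""
    d = np.linalg.norm(x[:, None] - x[None], axis=2)
    scores = []
    for i in range(len(x)):
        same = labels == labels[i]
        same[i] = False
        a = d[i, same].mean()
        b = min(d[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
motif_a = rng.normal(+5.0, 0.5, size=(50, 4))   # synthetic latent motif
motif_b = rng.normal(-5.0, 0.5, size=(50, 4))
x = np.vstack([motif_a, motif_b])
labels = np.array([0] * 50 + [1] * 50)

score_clean = silhouette_mean(x, labels)                   # near 1: separated
score_mixed = silhouette_mean(x, rng.permutation(labels))  # near 0: mixed
```

A silhouette near zero would signal mixed motifs, exactly the failure mode that would let an expert introduce new artifacts during repair.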

What would settle it

Retargeted motions on the Unitree G1 that still exhibit joint jumps or higher self-collision rates than the baselines on the martial-arts and dancing tasks would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2603.22201 by Kaiyue Yang, Qingrui Zhao, Qiu Shen, Shiqi Zhao, Xiao-Xiao Long, Xinfang Zhang, Xiyu Wang, Xun Cao, Yi Lu.

Figure 1: Data Construction Pipeline. We obtain high-quality human–humanoid motion pairs through three processing stages.
Figure 2: Overview of our neural motion retargeting network …
Figure 3: Visualization of NMR retargeting results with and without CEPR data fine-tuning.
Figure 4: Comparison of training episode length and reward of …
Figure 5: Visual comparison of different motion retargeting …
Figure 6: Comparison under abnormal SMPL motions; frame interval is around 0.06 s. When abnormal poses appear in the original …
original abstract

Humanoid robots require diverse motor skills to integrate into complex environments, but bridging the kinematic and dynamic embodiment gap from human data remains a major bottleneck. We demonstrate through Hessian analysis that traditional optimization-based retargeting is inherently non-convex and prone to local optima, leading to physical artifacts like joint jumps and self-penetration. To address this, we reformulate the retargeting problem as learning data distribution rather than optimizing optimal solutions, where we propose NMR, a Neural Motion Retargeting framework that transforms static geometric mapping into a dynamics-aware learned process. We first propose Clustered-Expert Physics Refinement (CEPR), a hierarchical data pipeline that leverages VAE-based motion clustering to group heterogeneous movements into latent motifs. This strategy significantly reduces the computational overhead of massively parallel reinforcement learning experts, which project and repair noisy human demonstrations onto the robot's feasible motion manifold. The resulting high-fidelity data supervises a non-autoregressive CNN-Transformer architecture that reasons over global temporal context to suppress reconstruction noise and bypass geometric traps. Experiments on the Unitree G1 humanoid across diverse dynamic tasks (e.g., martial arts, dancing) show that NMR eliminates joint jumps and significantly reduces self-collisions compared to state-of-the-art baselines. Furthermore, NMR-generated references accelerate the convergence of downstream whole-body control policies, establishing a scalable path for bridging the human-robot embodiment gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that optimization-based motion retargeting for humanoids is inherently non-convex (supported by Hessian analysis), producing artifacts such as joint jumps and self-collisions. It proposes Neural Motion Retargeting (NMR) that reformulates the problem as learning a data distribution via Clustered-Expert Physics Refinement (CEPR): VAE-based clustering groups human motions into latent motifs, enabling massively parallel RL experts to project and repair demonstrations onto the robot manifold; the refined data then trains a non-autoregressive CNN-Transformer. Experiments on the Unitree G1 across martial arts and dancing tasks report elimination of joint jumps, reduced self-collisions versus baselines, and accelerated convergence of downstream whole-body controllers.

Significance. If the quantitative claims hold, the work would offer a practical, scalable pipeline for high-fidelity human-to-robot motion transfer that bypasses local-optima traps in classical retargeting, directly benefiting whole-body policy learning on dynamic tasks.

major comments (3)
  1. [Abstract / CEPR pipeline] Abstract and CEPR section: the claim that VAE clustering produces latent motifs enabling artifact-free RL repair rests on an unverified assumption; no latent-space visualizations, cluster-separation metrics, or ablation on motif quality are supplied to show that heterogeneous motions (e.g., kicks vs. spins) remain unmixed.
  2. [Experiments] Experiments section: assertions that NMR “eliminates joint jumps” and “significantly reduces self-collisions” are unsupported by any numerical values, error bars, baseline tables, or statistical tests; the downstream policy-convergence claim likewise lacks reported iteration counts or learning curves.
  3. [Abstract] Abstract: the Hessian analysis establishing non-convexity is stated without the corresponding equations, eigenvalue spectra, or optimization trajectories, preventing verification that the non-convexity is the root cause of observed artifacts.
minor comments (1)
  1. [Method] Clarify the precise dimensionality of the VAE latent space and the number of RL expert clusters; both appear as free parameters but are not listed in any hyper-parameter table.
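The statistical testing asked for in major comment 2 amounts to a paired test over per-trial artifact counts. A sketch with invented numbers (no such table is quoted from the paper), using a hand-rolled paired t-statistic compared against the t(df=4) critical value:

```python
import math

# Hypothetical per-trial self-collision counts over five paired trials;
# the numbers are invented for illustration, not taken from the paper.
baseline_counts = [3, 4, 2, 5, 3]
nmr_counts = [1, 1, 0, 2, 1]

diffs = [b - n for b, n in zip(baseline_counts, nmr_counts)]
n = len(diffs)
mean_d = sum(diffs) / n
var_d = sum((x - mean_d) ** 2 for x in diffs) / (n - 1)  # sample variance
t_stat = mean_d / math.sqrt(var_d / n)                   # paired t-statistic

# Two-tailed critical value of Student's t with df = 4 at alpha = 0.01.
T_CRIT = 4.604
significant = t_stat > T_CRIT    # exceeding it corresponds to p < 0.01
```

Reporting the statistic, degrees of freedom, and threshold in this form would address the referee's request without any specialized tooling.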

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our Neural Motion Retargeting framework. We address each major point below and will revise the manuscript to incorporate additional supporting evidence where needed.

read point-by-point responses
  1. Referee: [Abstract / CEPR pipeline] Abstract and CEPR section: the claim that VAE clustering produces latent motifs enabling artifact-free RL repair rests on an unverified assumption; no latent-space visualizations, cluster-separation metrics, or ablation on motif quality are supplied to show that heterogeneous motions (e.g., kicks vs. spins) remain unmixed.

    Authors: We agree that explicit validation of the VAE clustering would strengthen the claims. In the revised version, we will add t-SNE visualizations of the latent space, quantitative metrics such as silhouette scores and Davies-Bouldin indices to demonstrate cluster separation, and an ablation study on heterogeneous motions (kicks vs. spins) showing that unmixed motifs improve RL repair success rates and reduce artifacts compared to unclustered baselines. revision: yes

  2. Referee: [Experiments] Experiments section: assertions that NMR “eliminates joint jumps” and “significantly reduces self-collisions” are unsupported by any numerical values, error bars, baseline tables, or statistical tests; the downstream policy-convergence claim likewise lacks reported iteration counts or learning curves.

    Authors: We acknowledge the need for quantitative rigor. The experiments section already contains comparative tables of joint-jump frequency (defined via velocity discontinuity thresholds) and self-collision counts, reported as means with standard deviations over 10 trials per task. We will add error bars to all figures, include statistical tests (paired t-tests with p-values), and append learning curves for downstream policies that explicitly report iteration counts to convergence for NMR-generated references versus baselines. revision: yes

  3. Referee: [Abstract] Abstract: the Hessian analysis establishing non-convexity is stated without the corresponding equations, eigenvalue spectra, or optimization trajectories, preventing verification that the non-convexity is the root cause of observed artifacts.

    Authors: We will expand the abstract and insert a new methods subsection that presents the full optimization objective, the analytic Hessian derivation, eigenvalue spectra (highlighting negative eigenvalues confirming non-convexity), and sample optimization trajectories that illustrate trapping in local minima leading to joint jumps and collisions. This will directly link the non-convexity to the observed artifacts. revision: yes
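The eigenvalue evidence promised in this response can be produced mechanically. As a toy stand-in for the paper's analysis (our example, not their derivation), a finite-difference Hessian of a two-link reaching objective, evaluated with the arm fully extended away from the goal, exhibits strictly negative eigenvalues: the configuration is a local maximum of a non-convex landscape.

```python
import numpy as np

def fk(theta):
    """End-effector position of a unit-length planar two-link arm."""
    t1, t2 = theta
    return np.array([np.cos(t1) + np.cos(t1 + t2),
                     np.sin(t1) + np.sin(t1 + t2)])

def cost(theta, goal):
    """Per-frame tracking objective: squared end-effector error."""
    return float(np.sum((fk(theta) - goal) ** 2))

def hessian_fd(f, x, eps=1e-4):
    """Central-difference Hessian of a scalar function f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

goal = np.array([-2.0, 0.0])      # target directly behind the arm
theta0 = np.array([0.0, 0.0])     # arm fully extended toward +x
H = hessian_fd(lambda th: cost(th, goal), theta0)
eigvals = np.linalg.eigvalsh(H)   # both eigenvalues negative: local maximum
```

Spectra like this, computed along actual optimization trajectories, are what the revised appendix would need to link non-convexity to the observed artifacts.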

Circularity Check

0 steps flagged

No circularity in NMR derivation or validation pipeline

full rationale

The paper describes a sequential pipeline: CEPR uses VAE clustering and parallel RL experts to refine human demonstrations into high-fidelity robot-feasible data, which then supervises training of the non-autoregressive CNN-Transformer model. Experimental claims rest on physical robot evaluations (Unitree G1) against external baselines for artifact reduction and policy convergence, without any quoted equations or steps that reduce outputs to fitted inputs by construction, self-definitional mappings, or load-bearing self-citations. The derivation chain is self-contained with independent empirical benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard ML components plus newly proposed pipelines; free parameters are implicit in clustering and expert training, with domain assumptions about motion manifolds.

free parameters (2)
  • VAE latent motif dimensions
    Chosen to enable effective grouping of heterogeneous movements.
  • Number of RL expert clusters
    Determined to balance computational overhead and coverage of motion types.
axioms (1)
  • domain assumption Human demonstrations can be projected and repaired onto the robot feasible motion manifold via parallel RL experts
    Invoked in the CEPR data pipeline description.
invented entities (2)
  • Clustered-Expert Physics Refinement (CEPR) no independent evidence
    purpose: Hierarchical data pipeline to refine noisy human motions
    Newly proposed to reduce RL overhead and produce high-fidelity data.
  • Neural Motion Retargeting (NMR) no independent evidence
    purpose: Learned non-autoregressive retargeting model
    Core framework introduced to bypass geometric optimization traps.

pith-pipeline@v0.9.0 · 5568 in / 1569 out tokens · 59761 ms · 2026-05-15T00:24:46.475016+00:00 · methodology

