CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation
Pith reviewed 2026-06-26 07:59 UTC · model grok-4.3
The pith
CoorDex distills body and hand motion teachers into latent priors so a high-DoF humanoid can grasp and manipulate while walking without stopping.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By freezing proprioception-conditioned latent priors distilled from privileged motion-tracking teachers and composing them through a coordinated residual policy with shared task context and separate body-hand heads, high-dimensional contact-rich loco-manipulation becomes trainable on a 20-DoF hand mounted on a walking humanoid.
What carries the argument
The coordinated latent residual policy that composes frozen body and hand priors through shared task context and separate residual heads.
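To make the structure concrete, here is a minimal sketch of how a shared task-context trunk and separate body and hand residual heads might be composed with frozen priors. Everything specific in it is an assumption made for illustration: the class names, the layer sizes, the `tanh` squashing, the additive composition, and the `residual_scale` bound are hypothetical and are not taken from the paper. It uses plain Python lists instead of a deep-learning library so that it runs without dependencies.

```python
import math
import random

def linear(weights, bias, x):
    """Affine map y = W x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(rng, n_out, n_in, scale=0.1):
    """Small random weights and zero bias (illustrative initialisation only)."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class FrozenPrior:
    """Stand-in for a distilled prior: proprioception -> base latent. Never updated."""
    def __init__(self, rng, proprio_dim, latent_dim):
        self.w, self.b = init_layer(rng, latent_dim, proprio_dim)

    def __call__(self, proprio):
        return [math.tanh(v) for v in linear(self.w, self.b, proprio)]

class CoordinatedResidualPolicy:
    """Shared trunk encodes task context; two heads emit bounded latent residuals."""
    def __init__(self, rng, obs_dim, ctx_dim, body_latent, hand_latent, residual_scale=0.2):
        self.trunk = init_layer(rng, ctx_dim, obs_dim)
        self.body_head = init_layer(rng, body_latent, ctx_dim)
        self.hand_head = init_layer(rng, hand_latent, ctx_dim)
        self.residual_scale = residual_scale  # hypothetical bound on each residual entry

    def __call__(self, task_obs):
        ctx = [math.tanh(v) for v in linear(*self.trunk, task_obs)]  # shared context
        dz_b = [self.residual_scale * math.tanh(v) for v in linear(*self.body_head, ctx)]
        dz_h = [self.residual_scale * math.tanh(v) for v in linear(*self.hand_head, ctx)]
        return dz_b, dz_h

def compose(prior_latent, residual):
    """Additive composition of a frozen prior latent and a learned residual (assumed)."""
    return [z + dz for z, dz in zip(prior_latent, residual)]

# Usage with made-up dimensions.
rng = random.Random(0)
body_prior = FrozenPrior(rng, proprio_dim=6, latent_dim=4)
hand_prior = FrozenPrior(rng, proprio_dim=5, latent_dim=3)
policy = CoordinatedResidualPolicy(rng, obs_dim=8, ctx_dim=16, body_latent=4, hand_latent=3)

dz_b, dz_h = policy([0.1] * 8)
z_b = compose(body_prior([0.2] * 6), dz_b)
z_h = compose(hand_prior([-0.1] * 5), dz_h)
print(len(z_b), len(z_h))
```

In this sketch only the trunk and the two heads would receive gradient updates during residual RL; the priors stay fixed. A frozen low-level decoder that turns `z_b` and `z_h` into joint targets is omitted, and how the paper wires that stage is not specified on this page.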
If this is right
- The same latent-prior interface can be reused across multiple loco-manipulation tasks without retraining the priors.
- Separate residual heads for body and hand allow the policy to improve contact without disrupting the natural gait learned by the teacher.
- Freezing the priors reduces the effective action space so that standard PPO can solve contact-rich problems that otherwise remain unsolved under the same reward budget.
Where Pith is reading between the lines
- If the priors capture general coordination, the approach may transfer to new objects or environments without new demonstrations.
- The method suggests that other high-DoF humanoid skills could be decomposed into body and end-effector priors rather than trained monolithically.
- Success on continuous fridge opening implies the framework may extend to longer-horizon tasks that alternate locomotion and manipulation without explicit mode switches.
Load-bearing premise
Distilling the privileged motion-tracking teachers into proprioception-conditioned latent priors will keep whole-body motion natural while making finger contacts reliable enough for the residual RL stage to succeed under the same reward budget.
What would settle it
Run the same walk-grasp-carry task with the latent priors replaced by direct joint-space actions or a single monolithic latent head and observe whether success rate drops to near zero while locomotion remains stable.
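The test above can be phrased as a small decision rule over rollout records. The sketch below is one hypothetical way to do that: the `Rollout` fields, the `success_floor` and `fall_ceiling` thresholds, and the toy records are all invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    task_success: bool  # object grasped and carried to the goal
    fell: bool          # robot lost balance during the episode

def summarize(rollouts):
    """Success rate and fall rate over a list of rollouts."""
    n = len(rollouts)
    if n == 0:
        raise ValueError("need at least one rollout")
    return {
        "success_rate": sum(r.task_success for r in rollouts) / n,
        "fall_rate": sum(r.fell for r in rollouts) / n,
    }

def supports_claim(full, ablated, success_floor=0.05, fall_ceiling=0.1):
    """True when the ablated variant keeps walking but almost never completes
    the task, while the full method completes it more often."""
    return (
        ablated["success_rate"] <= success_floor
        and ablated["fall_rate"] <= fall_ceiling
        and full["success_rate"] > ablated["success_rate"]
    )

# Toy placeholder records, NOT measurements from the paper.
full_method = summarize([Rollout(True, False)] * 8 + [Rollout(False, False)] * 2)
joint_space = summarize([Rollout(False, False)] * 10)
print(supports_claim(full_method, joint_space))
```

Tracking the fall rate separately matters here: a baseline that fails because the robot falls over says little about the action interface, whereas one that walks stably but never secures the grasp isolates the manipulation side.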
Original abstract
Humanoid loco-manipulation is often simplified into a stop-and-go process: walking to an object, stopping to manipulate it, and then resuming locomotion. It also commonly relies on low degree-of-freedom (DoF) end effectors that behave like an open-close grasp primitive. We introduce CoorDex, a learning pipeline that converts high-dimensional body and dexterous hand control into coordinated latent residual control, enabling high-DoF dexterous loco-manipulation on the move. Starting from simulated whole-body and hand demonstrations, CoorDex trains privileged motion tracking teachers for the humanoid body and dexterous hand, distills them into proprioception-conditioned latent priors, and uses the frozen priors as the action space for downstream residual reinforcement learning. A coordinated latent residual policy composes these priors through shared task context and separate body-hand residual heads, preserving natural whole-body motion while improving finger-level contact reliability. CoorDex enables a Unitree G1 humanoid with a 20-DoF WUJI hand to execute dexterous manipulation while in motion, including non-stop bottle grasping and carrying, fridge door opening on the move, and cube pick-and-turn. Ablations on the walk-grasp-carry task show that joint-space PPO, joint-space hand control, and monolithic latent prediction all fail under the same reward budget, while the latent-prior interface and coordinated residual structure make high-dimensional contact-rich loco-manipulation trainable. Project Page: https://skevinci.github.io/coordex/
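The distillation step in the abstract (privileged teacher to proprioception-conditioned prior) can be illustrated with a toy regression. This is only a sketch under strong assumptions: the scalar teacher, its coefficients, the correlation between privileged and proprioceptive signals, and the plain squared-error SGD loop are all invented here, and the paper's actual distillation objective is not stated on this page.

```python
import random

def teacher(proprio, privileged):
    """Hypothetical privileged teacher: uses state the deployed robot cannot observe."""
    return 0.8 * proprio + 0.5 * privileged

def make_samples(n=500, seed=1):
    """Toy data in which the privileged signal is partly inferable from proprioception."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        proprio = rng.uniform(-1.0, 1.0)
        privileged = 0.6 * proprio + rng.gauss(0.0, 0.05)
        samples.append((proprio, privileged))
    return samples

def distill(samples, steps=2000, lr=0.05, seed=0):
    """Fit student(proprio) = w * proprio + b to teacher outputs by SGD on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        proprio, privileged = rng.choice(samples)
        err = (w * proprio + b) - teacher(proprio, privileged)
        w -= lr * err * proprio
        b -= lr * err
    return w, b

def mean_squared_gap(samples, w, b):
    """Average squared difference between student and teacher on the samples."""
    return sum((w * p + b - teacher(p, q)) ** 2 for p, q in samples) / len(samples)

samples = make_samples()
w, b = distill(samples)
print(round(w, 2), round(b, 2), mean_squared_gap(samples, w, b))
```

The student recovers the teacher only to the extent that the privileged signal is predictable from proprioception. Whatever cannot be inferred shows up as an irreducible gap, which is the quantity the referee's major comments ask the authors to measure.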
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. CoorDex introduces a pipeline that trains privileged motion-tracking teachers on simulated whole-body and dexterous-hand demonstrations, distills them into proprioception-conditioned latent priors, and employs the frozen priors as the action space for a coordinated residual RL policy with shared task context and separate body/hand residual heads. This enables continuous high-DoF loco-manipulation on a Unitree G1 with a 20-DoF WUJI hand, demonstrated on non-stop bottle grasping and carrying, fridge-door opening on the move, and cube pick-and-turn. Ablations on the walk-grasp-carry task show that joint-space PPO, joint-space hand control, and monolithic latent prediction fail under the same reward budget while the proposed latent-prior interface succeeds.
Significance. If the distillation step preserves the necessary finger-level coordination, the method would meaningfully advance humanoid loco-manipulation beyond stop-and-go or low-DoF primitives. The coordinated residual structure and real-robot validation on multiple contact-rich tasks while walking constitute the primary strengths; the approach is reproducible via the linked project page and relies on standard RL rather than ad-hoc heuristics.
major comments (2)
- [Abstract / §4] Abstract and §4 (ablations): the claim that the latent priors retain sufficient information for reliable high-DoF finger contacts rests on the distillation step, yet the reported ablations compare only against non-latent baselines and do not quantify preservation relative to the privileged teachers (e.g., no contact-success-rate or trajectory-deviation metrics between teacher and distilled prior). This comparison is load-bearing for the central claim that the frozen proprioception-conditioned priors enable downstream residual RL to succeed under the same reward budget.
- [§3.2] §3.2 (distillation): the paper does not report information-preservation diagnostics (mutual information, reconstruction error on ground-truth contacts/object states, or finger-joint error) after compressing privileged signals into the latent space conditioned only on proprioception. Without these, it remains unclear whether the observed failures of monolithic latent prediction are due to the interface itself or to loss of coordination details during distillation.
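Two of the diagnostics requested in these comments, finger-joint error and trajectory deviation between teacher and distilled prior, are simple to compute once matched rollouts exist. The sketch below assumes trajectories are stored as lists of per-timestep joint-angle lists (radians) and per-timestep end-effector positions (metres); those storage conventions and the function names are assumptions, not the paper's.

```python
import math

def finger_joint_rmse(teacher_traj, prior_traj):
    """RMSE over all timesteps and joints between two joint-angle trajectories."""
    if len(teacher_traj) != len(prior_traj):
        raise ValueError("trajectories must have the same number of timesteps")
    squared, count = 0.0, 0
    for q_teacher, q_prior in zip(teacher_traj, prior_traj):
        if len(q_teacher) != len(q_prior):
            raise ValueError("joint counts must match at every timestep")
        for a, b in zip(q_teacher, q_prior):
            squared += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty trajectory")
    return math.sqrt(squared / count)

def mean_trajectory_deviation(teacher_xyz, prior_xyz):
    """Mean Euclidean distance between time-matched end-effector positions."""
    if len(teacher_xyz) != len(prior_xyz) or not teacher_xyz:
        raise ValueError("need two non-empty trajectories of equal length")
    distances = [math.dist(p, q) for p, q in zip(teacher_xyz, prior_xyz)]
    return sum(distances) / len(distances)

# Tiny hand-checkable example.
rmse = finger_joint_rmse([[0.0, 0.0], [0.0, 0.0]], [[0.1, 0.1], [0.1, 0.1]])
deviation = mean_trajectory_deviation([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)])
print(rmse, deviation)
```

Both metrics compare time-aligned samples, so they implicitly assume the teacher and the prior are rolled out from the same initial state against the same reference motion.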
minor comments (2)
- [§5] Figure captions and §5 (real-robot results) should explicitly state the number of successful trials and failure modes for each task to allow direct comparison with the simulated ablations.
- [§3] Notation for the latent prior (e.g., z_b, z_h) and residual heads should be introduced once with a clear diagram reference rather than being redefined inline in multiple sections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger evidence on information preservation during distillation. We address both major comments below and will incorporate quantitative diagnostics in the revision to better support the central claims.
- Referee: [Abstract / §4] Abstract and §4 (ablations): the claim that the latent priors retain sufficient information for reliable high-DoF finger contacts rests on the distillation step, yet the reported ablations compare only against non-latent baselines and do not quantify preservation relative to the privileged teachers (e.g., no contact-success-rate or trajectory-deviation metrics between teacher and distilled prior). This comparison is load-bearing for the central claim that the frozen proprioception-conditioned priors enable downstream residual RL to succeed under the same reward budget.
  Authors: We agree that direct metrics comparing the privileged teachers to the distilled priors would strengthen the evidence for information retention. The current ablations demonstrate that the full pipeline succeeds where joint-space and monolithic baselines fail under identical reward budgets, implying the priors provide usable coordination; however, this is indirect. In the revised manuscript we will add explicit preservation metrics (finger-joint RMSE, contact success rate on object interactions, and end-effector trajectory deviation) evaluated on held-out demonstration sequences, reported in §3.2 and §4. These will quantify how much coordination is retained after distillation into the proprioception-conditioned latent space.
  revision: yes
- Referee: [§3.2] §3.2 (distillation): the paper does not report information-preservation diagnostics (mutual information, reconstruction error on ground-truth contacts/object states, or finger-joint error) after compressing privileged signals into the latent space conditioned only on proprioception. Without these, it remains unclear whether the observed failures of monolithic latent prediction are due to the interface itself or to loss of coordination details during distillation.
  Authors: We concur that explicit preservation diagnostics would help isolate whether monolithic latent prediction fails due to the prediction interface or due to information loss in distillation. Note that the monolithic baseline employs the identical distillation procedure and latent dimensionality as the proposed method; its failure therefore points primarily to the value of the coordinated residual structure rather than distillation quality alone. Nevertheless, to address the concern directly we will include in the revision: (i) reconstruction error on ground-truth contacts and object states, (ii) average finger-joint position error, and (iii) mutual-information estimates between privileged teacher actions and latent prior outputs, all conditioned only on proprioception. These will appear in §3.2 alongside the existing training details.
  revision: yes
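For the mutual-information diagnostic promised in the second response, one simple option is a histogram-based plug-in estimate over paired scalar samples, for example one teacher action dimension against one prior output dimension. The sketch below shows that estimator only as an illustration; the choice of estimator, the bin count, and the function names are assumptions, and the rebuttal does not say which estimator the authors intend to use.

```python
import math
from collections import Counter

def _bin_index(value, lo, hi, bins):
    """Map a value in [lo, hi] to an equal-width bin index in [0, bins - 1]."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)

def mutual_information(xs, ys, bins=8):
    """Plug-in mutual information estimate (in nats) from a 2D histogram."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    pairs = [
        (_bin_index(x, x_lo, x_hi, bins), _bin_index(y, y_lo, y_hi, bins))
        for x, y in zip(xs, ys)
    ]
    joint = Counter(pairs)
    marginal_x = Counter(i for i, _ in pairs)
    marginal_y = Counter(j for _, j in pairs)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i, j) * log( p(i, j) / (p(i) * p(j)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (marginal_x[i] * marginal_y[j]))
    return mi

# Sanity checks on a full 8 x 8 grid: independent pairs versus identical pairs.
grid_x = [i for i in range(8) for _ in range(8)]
grid_y = [j for _ in range(8) for j in range(8)]
mi_independent = mutual_information(grid_x, grid_y)
mi_identical = mutual_information(grid_x, grid_x)
print(mi_independent, mi_identical)
```

Plug-in histogram estimates are biased upward when samples are scarce relative to the number of bins, so a value computed on a small held-out set is better read as a relative comparison between variants than as an absolute quantity.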
Circularity Check
No circularity: derivation relies on external demonstrations and standard RL pipeline
The paper's chain begins with external simulated whole-body and hand demonstrations, trains privileged motion-tracking teachers, distills to proprioception-conditioned latent priors, and applies frozen priors in residual RL. No equation or step reduces by construction to a fitted parameter renamed as prediction, nor does any load-bearing claim rest on a self-citation chain that itself lacks independent verification. The ablations compare against non-latent baselines under the same reward budget, but the core method remains self-contained against those external benchmarks and does not exhibit self-definitional, fitted-input, or uniqueness-imported circularity.