arxiv: 2606.23680 · v1 · pith:NYNPGDB6new · submitted 2026-06-22 · 💻 cs.RO · cs.AI · cs.LG

CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation

Pith reviewed 2026-06-26 07:59 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords humanoid loco-manipulation · dexterous manipulation · latent priors · residual reinforcement learning · motion tracking · whole-body control · proprioceptive control

The pith

CoorDex distills body and hand motion teachers into latent priors so a high-DoF humanoid can grasp and manipulate while walking without stopping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a pipeline that first trains separate privileged teachers on whole-body and dexterous-hand demonstrations, then distills those teachers into proprioception-conditioned latent priors. These frozen priors become the action space for a residual reinforcement-learning policy whose body and hand heads share task context but keep separate residuals. The resulting controller keeps natural locomotion while making finger contacts reliable enough for continuous tasks such as carrying a bottle or opening a fridge door on the move. Ablations indicate that joint-space PPO, direct hand control, and monolithic latent prediction all fail under identical reward budgets, whereas the coordinated latent-residual structure succeeds.
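
The distillation step described above can be pictured with a deliberately simplified sketch. It assumes a linear stand-in teacher and a linear student fitted by least squares on synthetic data; the paper's actual priors are latent models whose architecture and loss are not given here, so every size, name, and data-generating choice below is hypothetical. The point is only the information flow: the teacher sees privileged state, while the student must reproduce its actions from proprioception alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N, PROPRIO_DIM, PRIV_DIM, ACTION_DIM = 2000, 6, 3, 4

# Synthetic data: privileged signals are mostly predictable from proprioception.
proprio = rng.normal(size=(N, PROPRIO_DIM))
C = rng.normal(size=(PROPRIO_DIM, PRIV_DIM))
privileged = proprio @ C + 0.01 * rng.normal(size=(N, PRIV_DIM))

# Stand-in "teacher": a fixed linear map over proprioception and privileged state.
A = rng.normal(size=(PROPRIO_DIM, ACTION_DIM))
B = rng.normal(size=(PRIV_DIM, ACTION_DIM))
teacher_actions = proprio @ A + privileged @ B

# Distillation: fit a student that sees proprioception only,
# by least-squares regression onto the teacher's actions.
student_W, *_ = np.linalg.lstsq(proprio, teacher_actions, rcond=None)

# In-sample gap between student and teacher; a real check would use held-out data.
student_actions = proprio @ student_W
distill_mse = float(np.mean((student_actions - teacher_actions) ** 2))
print(f"student-vs-teacher MSE: {distill_mse:.2e}")
```

The gap stays small here only because the synthetic privileged state is nearly a function of proprioception. When that stops being true, the student cannot recover the teacher's behaviour, which is the concern the referee report raises about finger contacts.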

Core claim

By freezing proprioception-conditioned latent priors distilled from privileged motion-tracking teachers and composing them through a coordinated residual policy with shared task context and separate body-hand heads, high-dimensional contact-rich loco-manipulation becomes trainable on a 20-DoF hand mounted on a walking humanoid.

What carries the argument

The coordinated latent residual policy that composes frozen body and hand priors through shared task context and separate residual heads.
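
A minimal NumPy sketch of that composition follows, based on the Figure 2 description: a residual policy reads task context and the prior means, predicts separate body and hand latent residuals, and frozen decoders map the corrected latents to joint-position targets. The class names, layer sizes, and single-layer networks are assumptions made for illustration; only the 20 hand joints come from the source text.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Randomly initialised weights and zero bias for one linear layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

class FrozenPrior:
    """Stand-in for one distilled prior (body or hand); weights are never updated."""
    def __init__(self, proprio_dim, latent_dim, joint_dim):
        self.enc_W, self.enc_b = linear(proprio_dim, latent_dim)
        self.dec_W, self.dec_b = linear(latent_dim, joint_dim)

    def mean(self, proprio):
        # Proprioception-conditioned latent mean.
        return np.tanh(proprio @ self.enc_W + self.enc_b)

    def decode(self, z):
        # Latent -> joint-position targets.
        return z @ self.dec_W + self.dec_b

class CoordinatedResidualPolicy:
    """Shared trunk over task context and both prior means, then separate heads."""
    def __init__(self, task_dim, body_latent, hand_latent, hidden=32):
        self.trunk_W, self.trunk_b = linear(task_dim + body_latent + hand_latent, hidden)
        self.body_W, self.body_b = linear(hidden, body_latent)
        self.hand_W, self.hand_b = linear(hidden, hand_latent)

    def __call__(self, task_ctx, mu_body, mu_hand):
        x = np.concatenate([task_ctx, mu_body, mu_hand])
        h = np.tanh(x @ self.trunk_W + self.trunk_b)
        return h @ self.body_W + self.body_b, h @ self.hand_W + self.hand_b

def control_step(policy, body_prior, hand_prior, body_obs, hand_obs, task_ctx):
    """One control tick: prior means, latent residuals, then frozen decoding."""
    mu_b, mu_h = body_prior.mean(body_obs), hand_prior.mean(hand_obs)
    dz_b, dz_h = policy(task_ctx, mu_b, mu_h)
    return body_prior.decode(mu_b + dz_b), hand_prior.decode(mu_h + dz_h)

# Placeholder dimensions; only HAND_JOINTS = 20 is taken from the source text.
BODY_OBS, HAND_OBS, TASK, BODY_Z, HAND_Z, BODY_JOINTS, HAND_JOINTS = 48, 40, 12, 16, 8, 29, 20
body_prior = FrozenPrior(BODY_OBS, BODY_Z, BODY_JOINTS)
hand_prior = FrozenPrior(HAND_OBS, HAND_Z, HAND_JOINTS)
policy = CoordinatedResidualPolicy(TASK, BODY_Z, HAND_Z)

q_body, q_hand = control_step(
    policy, body_prior, hand_prior,
    rng.normal(size=BODY_OBS), rng.normal(size=HAND_OBS), rng.normal(size=TASK),
)
print(q_body.shape, q_hand.shape)
```

With both residual heads at zero the controller reduces to decoding the prior means, so the RL stage starts from the priors' behaviour and only learns corrections to it.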

If this is right

  • The same latent-prior interface can be reused across multiple loco-manipulation tasks without retraining the priors.
  • Separate residual heads for body and hand allow the policy to improve contact without disrupting the natural gait learned by the teacher.
  • Freezing the priors reduces the effective action space so that standard PPO can solve contact-rich problems that otherwise remain unsolved under the same reward budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the priors capture general coordination, the approach may transfer to new objects or environments without new demonstrations.
  • The method suggests that other high-DoF humanoid skills could be decomposed into body and end-effector priors rather than trained monolithically.
  • Success on continuous fridge opening implies the framework may extend to longer-horizon tasks that alternate locomotion and manipulation without explicit mode switches.

Load-bearing premise

Distilling the privileged motion-tracking teachers into proprioception-conditioned latent priors will keep whole-body motion natural while making finger contacts reliable enough for the residual RL stage to succeed under the same reward budget.

What would settle it

Run the same walk-grasp-carry task with the latent priors replaced by direct joint-space actions or a single monolithic latent head and observe whether success rate drops to near zero while locomotion remains stable.
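
One way to make that test mechanical is sketched below. It assumes each rollout is logged as two booleans (task success, robot fell); the variant names, both thresholds, and the example outcomes are hypothetical and are not taken from the paper.

```python
def rate(flags):
    """Fraction of True entries in a list of per-episode booleans."""
    return sum(flags) / len(flags)

def prediction_holds(variants, near_zero=0.05, max_fall_rate=0.10):
    """Check, per ablated variant, that the task fails while the gait stays stable.

    variants maps a variant name to {"success": [...], "fell": [...]}.
    Both thresholds are hypothetical choices, not values from the paper.
    """
    verdict = {}
    for name, episodes in variants.items():
        task_fails = rate(episodes["success"]) <= near_zero
        gait_stable = rate(episodes["fell"]) <= max_fall_rate
        verdict[name] = task_fails and gait_stable
    return verdict

# Made-up episode outcomes for illustration only; these are not results from the paper.
example = {
    "joint_space_actions": {"success": [False] * 20, "fell": [False] * 19 + [True]},
    "monolithic_latent_head": {"success": [False] * 20, "fell": [False] * 20},
}
print(prediction_holds(example))
```

A variant that returns False here would count against the claim: either the ablated controller still solves the task, or it fails for the less interesting reason that the robot falls over.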

Figures

Figures reproduced from arXiv: 2606.23680 by Chenran Li, Mingyu Ding, Shuning Li, Sikai Li, Yunchao Yao, Zhenyu Wei.

Figure 1: Dexterous loco-manipulation on the move. CoorDex enables a humanoid equipped with high-DoF dexterous hands to perform continuous loco-manipulation tasks that require simultaneous coordination between locomotion and dexterous hand control, such as walk-grasp-carry, fridge opening while stepping back, and walk-pick-turn.
Figure 2: Overview of CoorDex. Body and hand reference motions are tracked by privileged teachers and distilled into separate proprioception-conditioned latent priors. During downstream RL, a coordinated residual policy uses task context and prior means to predict body and hand latent residuals. The frozen decoders map the corrected latents to joint-position targets for loco-manipulation.
Figure 3: Qualitative comparison on WALKGRAB. Each column shows sequential key frames from one rollout of the corresponding method. All Joint Space produces unstable whole-body motion. Body Prior + Hand Joint Space reaches the bottle but fails to learn a reliable grasp. Monolithic Latent Residual reaches the interaction region but produces less natural body motion and fails to complete the task. CoorDex completes t…
Figure 4: Non-stop locomotion on WALKGRAB.
Figure 5: WALKPICKTURN real-world demo.
Figure 6: WALKGRAB real-world demo.
C Real-World Demos: This section provides additional qualitative hardware visualizations and clarifies the hardware variant used for real-robot replay. The quantitative simulation experiments in the main paper are conducted on a Unitree G1 humanoid equipped with a 20-DoF WUJI dexterous hand. In contrast, the physical robot available for our hardware visualization uses a Unitree G…
Figure 7: OPENFRIDGE real-world demo. Due to facility constraints, we use a simplified mock-up instead of a full refrigerator door, focusing on the core behavior of maintaining a grasp while stepping backward to pull the object open.
… specific to the dexterous hand morphology. When replacing the hand, the same pipeline can be instantiated by training a hand-specific tracking teacher and distilling it into a hand-spec…
Original abstract

Humanoid loco-manipulation is often simplified into a stop-and-go process: walking to an object, stopping to manipulate it, and then resuming locomotion. It also commonly relies on low degree-of-freedom (DoF) end effectors that behave like an open-close grasp primitive. We introduce CoorDex, a learning pipeline that converts high-dimensional body and dexterous hand control into coordinated latent residual control, enabling high-DoF dexterous loco-manipulation on the move. Starting from simulated whole-body and hand demonstrations, CoorDex trains privileged motion tracking teachers for the humanoid body and dexterous hand, distills them into proprioception-conditioned latent priors, and uses the frozen priors as the action space for downstream residual reinforcement learning. A coordinated latent residual policy composes these priors through shared task context and separate body-hand residual heads, preserving natural whole-body motion while improving finger-level contact reliability. CoorDex enables a Unitree G1 humanoid with a 20-DoF WUJI hand to execute dexterous manipulation while in motion, including non-stop bottle grasping and carrying, fridge door opening on the move, and cube pick-and-turn. Ablations on the walk-grasp-carry task show that joint-space PPO, joint-space hand control, and monolithic latent prediction all fail under the same reward budget, while the latent-prior interface and coordinated residual structure make high-dimensional contact-rich loco-manipulation trainable. Project Page: https://skevinci.github.io/coordex/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. CoorDex introduces a pipeline that trains privileged motion-tracking teachers on simulated whole-body and dexterous-hand demonstrations, distills them into proprioception-conditioned latent priors, and employs the frozen priors as the action space for a coordinated residual RL policy with shared task context and separate body/hand residual heads. This enables continuous high-DoF loco-manipulation on a Unitree G1 with a 20-DoF WUJI hand, demonstrated on non-stop bottle grasping and carrying, fridge-door opening on the move, and cube pick-and-turn. Ablations on the walk-grasp-carry task show that joint-space PPO, joint-space hand control, and monolithic latent prediction fail under the same reward budget while the proposed latent-prior interface succeeds.

Significance. If the distillation step preserves the necessary finger-level coordination, the method would meaningfully advance humanoid loco-manipulation beyond stop-and-go or low-DoF primitives. The coordinated residual structure and real-robot validation on multiple contact-rich tasks while walking constitute the primary strengths; the approach is reproducible via the linked project page and relies on standard RL rather than ad-hoc heuristics.

major comments (2)
  1. [Abstract / §4] Abstract and §4 (ablations): the claim that the latent priors retain sufficient information for reliable high-DoF finger contacts rests on the distillation step, yet the reported ablations compare only against non-latent baselines and do not quantify preservation relative to the privileged teachers (e.g., no contact-success-rate or trajectory-deviation metrics between teacher and distilled prior). This comparison is load-bearing for the central claim that the frozen proprioception-conditioned priors enable downstream residual RL to succeed under the same reward budget.
  2. [§3.2] §3.2 (distillation): the paper does not report information-preservation diagnostics (mutual information, reconstruction error on ground-truth contacts/object states, or finger-joint error) after compressing privileged signals into the latent space conditioned only on proprioception. Without these, it remains unclear whether the observed failures of monolithic latent prediction are due to the interface itself or to loss of coordination details during distillation.
minor comments (2)
  1. [§5] Figure captions and §5 (real-robot results) should explicitly state the number of successful trials and failure modes for each task to allow direct comparison with the simulated ablations.
  2. [§3] Notation for the latent prior (e.g., z_b, z_h) and residual heads should be introduced once with a clear diagram reference rather than being redefined inline in multiple sections.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger evidence on information preservation during distillation. We address both major comments below and will incorporate quantitative diagnostics in the revision to better support the central claims.

point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (ablations): the claim that the latent priors retain sufficient information for reliable high-DoF finger contacts rests on the distillation step, yet the reported ablations compare only against non-latent baselines and do not quantify preservation relative to the privileged teachers (e.g., no contact-success-rate or trajectory-deviation metrics between teacher and distilled prior). This comparison is load-bearing for the central claim that the frozen proprioception-conditioned priors enable downstream residual RL to succeed under the same reward budget.

    Authors: We agree that direct metrics comparing the privileged teachers to the distilled priors would strengthen the evidence for information retention. The current ablations demonstrate that the full pipeline succeeds where joint-space and monolithic baselines fail under identical reward budgets, implying the priors provide usable coordination; however, this is indirect. In the revised manuscript we will add explicit preservation metrics (finger-joint RMSE, contact success rate on object interactions, and end-effector trajectory deviation) evaluated on held-out demonstration sequences, reported in §3.2 and §4. These will quantify how much coordination is retained after distillation into the proprioception-conditioned latent space. revision: yes

  2. Referee: [§3.2] §3.2 (distillation): the paper does not report information-preservation diagnostics (mutual information, reconstruction error on ground-truth contacts/object states, or finger-joint error) after compressing privileged signals into the latent space conditioned only on proprioception. Without these, it remains unclear whether the observed failures of monolithic latent prediction are due to the interface itself or to loss of coordination details during distillation.

    Authors: We concur that explicit preservation diagnostics would help isolate whether monolithic latent prediction fails due to the prediction interface or due to information loss in distillation. Note that the monolithic baseline employs the identical distillation procedure and latent dimensionality as the proposed method; its failure therefore points primarily to the value of the coordinated residual structure rather than distillation quality alone. Nevertheless, to address the concern directly we will include in the revision: (i) reconstruction error on ground-truth contacts and object states, (ii) average finger-joint position error, and (iii) mutual-information estimates between privileged teacher actions and latent prior outputs, all conditioned only on proprioception. These will appear in §3.2 alongside the existing training details. revision: yes
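
Three of the diagnostics named in these responses (finger-joint RMSE, end-effector trajectory deviation, contact success rate) are simple enough to sketch. The version below assumes teacher and distilled-prior rollouts are stored as aligned NumPy arrays over the same held-out sequences; the function names and array layouts are hypothetical, and the mutual-information estimate is left out because it depends on an estimator the text does not specify.

```python
import numpy as np

def finger_joint_rmse(teacher_q, prior_q):
    """RMSE over all timesteps and finger joints; inputs shaped (T, n_joints)."""
    diff = np.asarray(teacher_q) - np.asarray(prior_q)
    return float(np.sqrt(np.mean(diff ** 2)))

def trajectory_deviation(teacher_xyz, prior_xyz):
    """Mean Euclidean distance between end-effector paths; inputs shaped (T, 3)."""
    diff = np.asarray(teacher_xyz) - np.asarray(prior_xyz)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))

def contact_success_rate(contact_ok):
    """Fraction of held-out sequences in which the intended contact was achieved."""
    return float(np.mean(np.asarray(contact_ok, dtype=float)))

# Synthetic stand-in rollouts (not data from the paper): 50 steps, 20 finger joints.
teacher_q = np.zeros((50, 20))
prior_q = teacher_q + 0.1              # constant 0.1 offset on every joint
teacher_xyz = np.zeros((50, 3))
prior_xyz = teacher_xyz + np.array([0.03, 0.04, 0.0])

print(finger_joint_rmse(teacher_q, prior_q))
print(trajectory_deviation(teacher_xyz, prior_xyz))
print(contact_success_rate([True, False, True, True]))
```

Reporting these numbers for the distilled prior against its own teacher, before any residual RL, would separate distillation loss from the interface effects that the ablations measure.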

Circularity Check

0 steps flagged

No circularity: derivation relies on external demonstrations and standard RL pipeline

full rationale

The paper's chain begins with external simulated whole-body and hand demonstrations, trains privileged motion-tracking teachers, distills to proprioception-conditioned latent priors, and applies frozen priors in residual RL. No equation or step reduces by construction to a fitted parameter renamed as prediction, nor does any load-bearing claim rest on a self-citation chain that itself lacks independent verification. The ablations compare against non-latent baselines under the same reward budget, but the core method remains self-contained against those external benchmarks and does not exhibit self-definitional, fitted-input, or uniqueness-imported circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical details remain at the level of high-level pipeline description.

pith-pipeline@v0.9.1-grok · 5826 in / 1078 out tokens · 28546 ms · 2026-06-26T07:59:06.217359+00:00 · methodology

